Asp Forum - firefox html, my downloaded html and firebug html different?

Adam Akhtar

8/16/2008 8:15:00 AM

Hi, I'm a relatively new Rubyist and programmer in general. I'm currently
reading Everyday Scripting and trying out web scraping, using Amazon as a
target.

To work out suitable regular expressions, I first viewed the page
source in Firefox. Shortly after, I found Firebug. I noticed some
differences between the source shown by Firefox and the source shown by
Firebug: each seems to contain code that the other lacks.
Some of my regular expressions would work but they definitely matched
in the Firefox source view; I know because I copied the source into a
regex editor, applied my regex, and it highlighted the match. This made me
question the HTML I was grabbing, so I saved it to a text file from my
code. When I viewed the text file, it too was different from the
Firefox source, which is why my regexes were not matching.

For the moment, assume my regexes are right and that I'm more concerned
with why there are differences. Can anyone explain why this is
happening? Which version is the real source HTML?

Here is my code

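require 'open-uri'

# Download the page and return the raw response body as a string.
# On Ruby 3.0 and later, call URI.open(a_url) instead of open(a_url).
def get_web_page_text(a_url)
  page = open(a_url)
  page.read
end

html = get_web_page_text('http://www.amazon.com/gp/product/09745...')

# Save exactly what was downloaded so the regexes can be checked against it.
File.open('html.txt', 'w') do |out|
  out << html
end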
--
Posted via http://www.ruby-....

9 Answers

Adam Akhtar

8/16/2008 8:17:00 AM

>Some of my regular expressions would work but they definately mathched
> in the firefox source view, I know because i copied the source into a
> regex editor and applied my reg ex and it highlighted.

Should read

Some of my regular expressions would NOT work ....
--
Posted via http://www.ruby-....

Michael Morin

8/16/2008 10:12:00 AM

Adam Akhtar wrote:
> Hi Im a relatively new rubyist and programmer in general and currently
> reading Everyday scripting and trying out webscraping using amazon as a
> target.
>
> To determine suitable regular expressions i first just viewed the page
> source via firefox. Shortly after i found firebug. I noticed that there
> were some differences in the source code between firefoxs source code
> and firbugs. Firebug seems to add and maybe lack code and vice versa.
> Some of my regular expressions would work but they definately mathched
> in the firefox source view, I know because i copied the source into a
> regex editor and applied my reg ex and it highlighted. So this made
> question the html i was grabbing so i saved it to a text file in my
> code. When i viewed the text file this too was different than the
> firefox code hence why my reg exs were not matching
>
> For the moment assume my regexs are right and that im more concerened
> with why there are differences. Can anyone explain why this is
> happening? Which version is the real source html???
>
> Here is my code
>
> def get_web_page_text(a_url)
> page = open(a_url)
> text = page.read
> end
>
> html = get_web_page_text('http://www.amazon.com/gp/product/09745...)
>
>
> File.open('html.txt','w') do |out|
> out << html
> end

The "view source" function of Firefox shows you the source code of the
page as it was downloaded from the server. Firebug is more
sophisticated: it shows you the DOM tree of the document. JavaScript
can alter the DOM tree (which is essentially what AJAX does), so you
might be seeing the DOM tree after it has been modified by some JavaScript
code.

--
Michael Morin
Guide to Ruby
http://ruby....
Become an About.com Guide: beaguide.about.com
About.com is part of the New York Times Company

TPReal

8/16/2008 10:14:00 AM

Hello there.

I don't know why the HTML differs, but I think the browser and your
script must send different data to the server in their requests. Maybe
one of them has cookies for the page and the other doesn't, or
maybe the Accept header is different. You can check what your Firefox
sends using Tamper Data:
https://addons.mozilla.org/en-US/firefox... .

Now if you want your own requests from Ruby to be more flexible, use
Net::HTTP instead.

require 'net/http'

Net::HTTP::start("www.amazon.com") { |http|
  header = {
    "Accept" => "*/*",
    "User-Agent" => "MyRubyProgram",
  }
  h, b = *http.get("/")
  p h
  p b

  p h.code
  p h.message
  p h.to_hash
}
--
Posted via http://www.ruby-....

TPReal

8/16/2008 10:18:00 AM

Thomas Bl. wrote:
> header=
> {
> "Accept"=>"*/*",
> "User-Agent"=>"MyRubyProgram",
> }
> h,b=*http.get("/")

Sorry, it should be h,b=*http.get("/",header) of course.
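
For anyone trying this on a current Ruby, here is a rough sketch of the corrected request. As far as I know, http.get on newer Ruby versions returns only the response object, so the body is read from response.body rather than unpacked with the splat. The names REQUEST_HEADERS, build_request and fetch are made up for this example, and the header values are just the ones from the post above.

```ruby
require 'net/http'

# Header values copied from the post above; they are only examples.
REQUEST_HEADERS = {
  "Accept"     => "*/*",
  "User-Agent" => "MyRubyProgram"
}.freeze

# Build a GET request that carries the custom headers (no network access yet).
def build_request(path, headers = REQUEST_HEADERS)
  Net::HTTP::Get.new(path, headers)
end

# Send the request over plain HTTP and return the response and its body.
def fetch(host, path)
  Net::HTTP.start(host, 80) do |http|
    response = http.request(build_request(path))
    [response, response.body]
  end
end

# Usage (needs network access):
#   h, b = fetch("www.amazon.com", "/")
#   p h.code
#   p h.message
#   p h.to_hash
```

Splitting the request building from the sending makes it easy to inspect exactly which headers go out before touching the network.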
--
Posted via http://www.ruby-....

Thomas Wieczorek

8/16/2008 3:07:00 PM

Every browser cleans up invalid markup, and each one has a different way
of doing it. Firefox, for example, adds a <tbody> to every <table> that
doesn't have one. Firebug shows you the cleaned-up source.
I once had to download a website because its markup was so crappy, and I
searched for the table entry by hand. It had a path like
"\html\body\table\tr\td\tr\center\font\b\font". Quite annoying, but it
sped up the scraping.

You could try the hpricot gem to get data from websites if the regexes
become too complex.

Adam Akhtar

8/16/2008 4:14:00 PM

Thank you everyone for your help so far.

I tackled the problem by not viewing the Firefox source or Firebug's view.
Instead, I just saved and viewed the HTML downloaded via open(a_url) with
open-uri. I wasn't sure whether this did any `tidy up` like Firefox or
Firebug, but after various tries I could be confident that what it
downloaded was the real deal.

That meant, though, that I'd have to view the HTML in something like
Notepad, and it isn't easy to read. I really wish I could rely on the code
Firebug displays, as it makes it so easy to find the areas you need to
restrict your searches to.

As a side note, is there a plugin or library that sits in between your
code and the target's webpage? The plugin would `clean` the code as it's
downloaded, perhaps in an identical way to Firebug.

The upside would be that it would be easier to grab what you want, as
there would be a more regular structure. The downside, I guess, would be
longer run times. Just a thought, though.

I used hpricot a while ago and it wasn't so great on badly designed
webpages, so I ended up resorting to regexps. But if I find a nice
website I think I'll give it another try!
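
One low-tech way to make the saved file easier to scan is to put each tag on its own line before opening it in an editor. This is only a sketch: the helper name one_tag_per_line and the sample markup are invented for the example, and it inserts line breaks without repairing or reindenting anything.

```ruby
# Put each tag on its own line so raw markup is easier to scan in an editor.
# This only inserts a newline between a closing ">" and the next "<";
# it does not repair, reindent, or otherwise change the markup.
def one_tag_per_line(html)
  html.gsub(/>\s*</, ">\n<")
end

sample = "<table><tr><td>Price</td><td>$10</td></tr></table>"
puts one_tag_per_line(sample)
```

Because it is a plain text substitution, it will also break lines inside <script> or <pre> blocks, so it is better suited to a viewing copy than to the text the regexes run against.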

--
Posted via http://www.ruby-....

Phlip

8/16/2008 5:18:00 PM

Adam Akhtar wrote:

> As a side note is there a plugin or libary that sits in between your
> code and the targets webpage. The plugin would `clean` the code as its
> downloaded? Perhaps in an identical way to firebug?

Why advertise your HTML is sloppy?

At work, we use assert_xpath, assert_tidy, and LibXML in all our functional
tests. They scream bloody murder if we have a single ill-formed ID. Then we
clean up our html.erb and keep going.

--
Phlip

Adam Akhtar

8/16/2008 5:41:00 PM

>
> Why advertise your HTML is sloppy?

Hi Phlip,

It's not my HTML, though. It's a third party's website that I'm scraping, so I
can't fix the HTML.

--
Posted via http://www.ruby-....

Florian Gilcher

8/16/2008 7:55:00 PM

On Aug 16, 2008, at 5:07 PM, Thomas Wieczorek wrote:

> Every browser cleans up invalid markup. Each one has a different way
> to do it. Firefox, for example, adds to every <table> a <tbody>, when
> it doesn't exist. Firebug shows you the cleaned up source.

Actually, if the table has no header, no footer, and only one body, the
tbody tag is not required but implicitly assumed.

So it always exists in the DOM displayed by Firebug (as it is added
and thus existing), but it does not exist when you work on the document
with a tool that does not build the DOM beforehand (the source viewer).
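
A small sketch of what this means for regex-based scraping: a pattern written while looking at the Firebug tree (with its implied tbody) will not match the raw source, while a pattern that treats tbody as optional fits both. The markup, the constant names and the helper first_cell are invented for illustration.

```ruby
# Raw markup as a server might send it: there is no <tbody> in the source.
RAW_HTML = "<table><tr><td>Price</td><td>$10</td></tr></table>"

# Pattern written against the DOM view, which shows an implied <tbody>.
DOM_PATTERN = %r{<table>\s*<tbody>\s*<tr>\s*<td>(.*?)</td>}m

# Pattern that treats <tbody> as optional, so it fits both views.
TOLERANT_PATTERN = %r{<table>\s*(?:<tbody>\s*)?<tr>\s*<td>(.*?)</td>}m

# Return the text of the first table cell, or nil if the pattern fails.
def first_cell(html, pattern)
  match = html.match(pattern)
  match && match[1]
end

puts first_cell(RAW_HTML, DOM_PATTERN).inspect       # prints nil
puts first_cell(RAW_HTML, TOLERANT_PATTERN).inspect  # prints "Price"
```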

Regards,
Florian Gilcher

comp.lang.ruby
