comp.lang.ruby

Re: hpricot and regexp?

Dan Diebolt

5/14/2008 6:54:00 AM

[Note: parts of this message were removed to make it a legal post.]

Google cached pages have this structure:

<body>
<table width="100%" border="1">
</table>
<hr/>
<div style="position: relative;">
</div>
</body>

where the first <table> contains boilerplate cache text and a copy of the page is in the <div>.

This is what I would use to clip out the date:

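require 'hpricot'
require 'open-uri'

url="http://64.233.167.104/search?q=cache:hydO8fs-rmQJ:en.wikipedia.org/wiki/Court-martial+court+martial&hl=en&ct=clnk&cd=1&gl=us&client=firef...
doc = Hpricot(open(url))
# inner_text of the first table is the cache boilerplate
a = doc.search("/table").inner_text
# String#[] with a regexp and a group index returns only the captured date
a[/retrieved on (.*?) GMT/, 1]
=> May 13, 2008 11:37:34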
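
For anyone who wants to try the regexp step without fetching a page, here is a minimal sketch of just that part. It assumes the boilerplate text looks like the banner quoted further down in this thread (the URL inside it is elided there and stays elided here); the variable names are only for illustration.

```ruby
# Sample banner copied from the output quoted later in this thread.
# The elided URL is left exactly as it appears in the post.
cache_text = "This is G o o g l e's cache of http://www.... as retrieved on May 11, 2008 01:09:29 GMT.\n"

# String#[] with a regexp and a capture index returns only that group.
# The non-greedy (.*?) stops at the first " GMT".
cache_date = cache_text[/retrieved on (.*?) GMT/, 1]
puts cache_date

# When the pattern does not match, the result is nil rather than an error,
# so callers should check for nil before using the value.
missing = "no cache banner here"[/retrieved on (.*?) GMT/, 1]
p missing
```

The second call is there because a cached page without the banner would otherwise fail silently further down the script.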


Feng Tien <pood.forums@gmail.com> wrote:
> Feng Tien wrote:
>> I'm trying to grab the "cache date" off of the google search.
>>
>> using Mechanize (and built in hpricot)
>>
>>
>> agent = WWW::Mechanize.new
>> agent.user_agent_alias = 'Mac Safari'
>> page = agent.get("http://www.google....)
>> search_form = page.forms.with.name("f").first
>> search_form.q = "Hello"
>> search_results = agent.submit(search_form)
>> cache_date = agent.click search_results.links.text('Cached')
>>
>> date = cache_date.search('table table > td').inner_html
>>
>>
>> How do I grab the date like on this page:
>> http://209.85.173.104/search?q=cache%3Ashacknews.com&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client...
>>
>> the part that's right after "as retrieved on" (the date)
>> Is there a built-in hpricot method that can search by regexp?
>> Or will I have to use something like gsub?
>
>
> oops, I mean, grep.
>
> oh, i got it down to this:
>
> date = cache_date.search('table table > td').inner_text.grep(/retrieved on (.+)./)
>
>
> which outputs:["This is G o o g l e's cache of http://www.... as
> retrieved on May 11, 2008 01:09:29 GMT.\n"]
>
> How do I get rid of everything before the date?


Now I have this:

date = cache_date.search('table table > td').inner_text.grep(/retrieved on (.+)./).to_s.gsub(/.+as retrieved on /,"").gsub(/.\n/,"")

which gives me exactly what I need. Is there a better way of doing this?
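
On why grep hands back the whole sentence: grep selects elements that match, it does not extract capture groups. A rough sketch of the difference follows, assuming a current Ruby where String no longer has grep (so it goes through String#lines) and using made-up surrounding lines around the banner quoted above.

```ruby
# Hypothetical multi-line text; only the middle line comes from this thread.
text = "Some boilerplate\n" \
       "This is G o o g l e's cache of http://www.... as retrieved on May 11, 2008 01:09:29 GMT.\n" \
       "More boilerplate\n"

# grep keeps every line that matches, in full; the (.+) group is ignored.
matching_lines = text.lines.grep(/retrieved on (.+)\./)
p matching_lines.length

# scan returns the captures themselves, one array per match,
# so no gsub cleanup is needed afterwards.
dates = text.scan(/retrieved on (.*?) GMT/).flatten
p dates
```

With scan (or a single capture lookup) the two gsub calls become unnecessary, since only the text inside the parentheses comes back.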
--
Posted via http://www.ruby-....



1 Answer

Feng Tien

5/14/2008 4:08:00 PM


Dan Diebolt,

wow, much shorter, thanks!
