Dan Zwell
8/27/2007 4:26:00 AM
Mark Gallop wrote:
> Hi Charles,
>
> Charles Pareto wrote:
>> links = page_content.scan(/<a class=1.*?href=\"(.*?)\"/).flatten
>>
> I don't think that regular expression (regexp) works. Maybe google has
> changed their code since the book was written. I think it goes "href"
> then "class".
>
> If you work out the correct regexp, let us know.
>
> Cheers,
> Mark
>
>
As Mark said, Google has changed its markup somewhat. If you work out
the correct regular expression and it still seems to give erratic
results, here is a hint: the naive solution uses ".*?" in a certain
place, but that can still match too much, because a lazy ".*?" will run
past the closing quote whenever that is the only way for the rest of
the pattern to match. Try [^"]*? instead, because you probably don't
want to match quotes. (I just tried this, and that was the problem I
encountered.)
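
To illustrate the hint, here is a small sketch. The markup below is made
up for the demonstration (it is not Google's actual HTML), and the names
html, lazy and strict are only placeholders. It assumes href comes
before class, as Mark suggested:

```ruby
# Hypothetical markup for illustration only, not Google's real output.
# Two links on one line; only the second has class=l.
html = '<a href="http://a.example/" class=x>A</a> ' \
       '<a href="http://b.example/" class=l>B</a>'

# Lazy dot: the match starts at the first link and keeps growing,
# quotes and all, until it finally finds '" class=l' in the second one.
lazy = html.scan(/<a href="(.*?)" class=l/).flatten

# Negated class: the capture cannot cross a double quote, so the first
# link fails outright and only the second one is returned.
strict = html.scan(/<a href="([^"]*?)" class=l/).flatten

p lazy.size   # => 1 (one long capture spanning both links)
p strict      # => ["http://b.example/"]
```

The single lazy capture begins with the first URL and ends with the
second, which is the kind of "erratic" result described above.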
By the way, a robust regex to match all HTML links looks kind of nasty,
but perhaps you should try writing one--it's a good exercise. (Of
course, that's not what you want here--you want to match only the links
with class=l.)
Regards,
Dan