Alex Young
3/7/2007 2:09:00 PM
Jenda Krynicky wrote:
> J. mp wrote:
>> I'm really bad with these things called regular expressions, so I'm
>> looking for help again.
>>
>> Now, if I have a String like
>> "some string some content <title>this I want</title>"
>>
>> And I want to use the scan function to extract what is between <title>
>> and </title>. How can I build my regular expression? The final result
>> should be:
>> this I want
>>
>> Thanks
>
> You generally want to use an HTML parser ... provided that Wuby has one.
>
I wonder what the first hit from googling "ruby html parser" is? Ah
yes, hpricot. A perfectly valid approach.
Personally, in the past I've run HTML through libtidy to get XML and then
used REXML's stream parser. This has the rather wonderful benefit of actually
being able to fix some fairly broken HTML, and failing early if it can't.
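A rough sketch of the REXML half of that, assuming the markup has already been tidied into well-formed XML (the libtidy step is left out here). The TitleListener class name and the sample input are made up for illustration; only REXML::StreamListener and REXML::Document.parse_stream come from REXML itself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text found inside each <title> element.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Called for every opening tag; start a new buffer when we enter <title>.
  def tag_start(name, _attrs)
    return unless name == 'title'

    @in_title = true
    @titles << String.new
  end

  # Called for text nodes; only keep text seen while inside <title>.
  def text(data)
    @titles.last << data if @in_title
  end

  # Called for every closing tag; stop collecting at </title>.
  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

# Assumed input: already well-formed, as it would be after tidying.
xml = '<html><head><title>this I want</title></head>' \
      '<body><p>some content</p></body></html>'

listener = TitleListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.titles.first
```

If the input is not well-formed, parse_stream should raise a parse exception rather than quietly returning nothing, which is the "failing early" part.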
> You may be lucky with <title> since it's likely to not include any
> attributes, but still there might be some whitespace INSIDE the tags,
> there may be a comment inside the <title>...</title> that you may or may
> not want, there may be a <title> or </title> inside a comment etc. etc.
> etc.
>
> In (censored) I'd use HTML::Parser from CPAN, but shhhh ... this is a
> Wuby site, we don't speak of such things here.
>
It's a mailing list, not a site... Easy to confuse, possibly, but the
mailing list is the primary interface.
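As for the original question, a minimal sketch with String#scan, assuming the simple case: no attributes on the tag, and no comments or stray <title> text elsewhere in the input. The TITLE_RE constant name is just for illustration. It tolerates whitespace before the closing ">" and around the content, but none of the other cases Jenda lists.

```ruby
html = 'some string some content <title>this I want</title>'

# Non-greedy capture between the tags; /i ignores case and /m lets "."
# match newlines in case the title is split across lines.
TITLE_RE = %r{<title\s*>(.*?)</title\s*>}im

# With one capture group, scan returns an array of one-element arrays,
# so flatten it and trim surrounding whitespace from each match.
titles = html.scan(TITLE_RE).flatten.map(&:strip)

puts titles.first
```

For anything beyond a quick one-off on input you control, a real parser is still the safer route.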
--
Alex