Jörg W Mittag
5/16/2007 4:18:00 AM
Phrogz wrote:
> On May 15, 9:50 am, "M. R." <r...@schwingerverband.ch> wrote:
>> I want to filter the content of a body-Tag in html. How can I do this
>> with regular expression?
>>
>> @h = Net::HTTP.new(url, 80)
>> @response = @h.get(file, nil)
>>
>> if response.message == "OK"
>> @body_content = response.scan(/..................../).to_s
>> end
> Assuming your HTML is valid, then simply:
> @body_content = response[ /<body[^>]*>(.+?)<\/body>/m, 1 ]
Whenever someone asks me how to parse HTML with regular expressions, I
usually tell them: don't. HTML is an extremely complex language; if
you want to parse HTML, use an HTML parser. For example, the
following snippet is a perfectly well-formed and valid HTML document,
but none of the regexps posted in this thread so far are able to
correctly parse it:
<HTML/
<HEAD/
<TITLE/>/
<P/>
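Oh, and no, there is nothing missing there. Well, except for the
DOCTYPE declaration, which I left out for brevity: with the matching
DOCTYPE, this snippet is valid HTML 2.0, HTML 3.2 and HTML 4.01. It
relies on SGML's SHORTTAG feature (null end tags, as in <TITLE/>/) and
on the start and end tags that the HTML DTDs allow you to omit, so it
actually is a complete, well-formed and valid HTML document.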
The content of the above document's body element, flattened to a
string, should be something like this: '<P>></P>'.
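To make the failure concrete, here is a minimal sketch that runs
Phrogz's regexp against a conventional document and against the
snippet above. The variable names and the conventional sample document
are made up for illustration, and the /i flag is an addition so that
the upper-case tags cannot be blamed for the miss.

```ruby
# Phrogz's regexp from earlier in the thread, written with %r{} so the
# slash in </body> needs no escaping. The /i flag is an addition.
body_re = %r{<body[^>]*>(.+?)</body>}mi

# A conventional document with explicit tags (hypothetical sample).
explicit = "<html><head><title>t</title></head><body><p>hi</p></body></html>"

# The SHORTTAG document from above, as a plain string.
shorttag = "<HTML/\n<HEAD/\n<TITLE/>/\n<P/>\n"

# String#[] with a regexp and a group index returns that capture group,
# or nil when the regexp does not match at all.
explicit_body = explicit[body_re, 1]
shorttag_body = shorttag[body_re, 1]

puts explicit_body.inspect   # "<p>hi</p>"
puts shorttag_body.inspect   # nil
```

The second result is nil because the body element's start and end tags
are both omitted in the snippet, so there is no literal <body> for the
regexp to anchor on.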
Using an actual HTML parser like Hpricot might be a much better
choice. Actually, I just checked, and Hpricot doesn't seem to parse
this document correctly either; neither does RubyfulSoup. Strange.
What other Ruby HTML parsers are there that I could try?
jwm