Asp Forum - trying to use regex

merrittr

6/20/2007 8:11:00 AM

Hi, I am trying to strip out the text between body tags, but when I run it I
get:

rob@rob-laptop:~/ruby$ ./html2.rb
./html2.rb:14: unknown regexp options - bdy
./html2.rb:14: unterminated string meets end of file
./html2.rb:14: parse error, unexpected tSTRING_END, expecting
tSTRING_CONTENT or tREGEXP_END or tSTRING_DBEG or tSTRING_DVAR

#! /usr/bin/ruby

@h = File.open "test.html"
@response = @h.gets

text = @response.scan(/<body[^>]*>(.+?)</body>/)[0]
puts text

3 Answers

Alex Gutteridge

6/20/2007 8:42:00 AM


You need to escape the '/' in your regexp, and unless your html file
is one line you may need to also add the multiline option:

text = @response.scan(/<body[^>]*>(.+?)<\/body>/m)[0]
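One caveat with the original script: gets returns only the first line of the file, so the m option on its own will not help when the body spans several lines. Below is a minimal sketch that works on the whole document as one string. The helper name extract_body and the sample HTML are made up for illustration; in the real script the string would presumably come from File.read("test.html").

```ruby
# Hypothetical helper (not from the original post): returns the text between
# <body ...> and </body>, or nil when no body element is found.
def extract_body(html)
  # The '/' in </body> is escaped so it does not end the regexp literal.
  # Unescaped, the literal ends right after the '<' of </body>, and Ruby
  # then tries to read "body" as option letters, which is where
  # "unknown regexp options - bdy" comes from.
  # The m option lets . match newlines as well.
  match = html.match(/<body[^>]*>(.+?)<\/body>/m)
  match && match[1]
end

# Sample input standing in for File.read("test.html"), which reads the
# whole file; gets would return only the first line.
html = "<html>\n<body class=\"main\">\nHello,\nworld\n</body>\n</html>\n"

puts extract_body(html)
```

Using match with an index of 1 gives the captured string directly, whereas scan(...)[0] gives a one-element array because the pattern contains a capture group.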

Alex Gutteridge

Bioinformatics Center
Kyoto University

Rob Biedenharn

6/20/2007 2:37:00 PM


Or you can use the %r{} form of a Regexp literal:

text = @response.scan(%r{<body\b.*?>(.*?)</body>}mi)[0]

\b matches a "word boundary"
m is the multi-line option that causes . to match newlines, too
i is the case insensitive option (so BODY would also be matched)
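To see what each of those three pieces does, here is a small sketch using the same pattern. The sample strings and the variable name body_re are invented for the illustration.

```ruby
# Same pattern as above, kept in a local variable (the name is illustrative).
# Inside %r{...} the '/' in </body> needs no escaping.
body_re = %r{<body\b.*?>(.*?)</body>}mi

# i: upper-case tags match; m: the captured text may span several lines.
upper = "<BODY BGCOLOR=\"white\">\nline one\nline two\n</BODY>"
p upper.scan(body_re)[0]

# \b: "<bodyguard>" is not treated as a body tag, because there is no word
# boundary between "y" and "g", so nothing matches here.
p "<bodyguard>x</body>".scan(body_re)[0]
```

Because the pattern has a capture group, scan returns an array of arrays, so [0] is a one-element array (or nil when nothing matches) rather than a plain string.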

-Rob

Rob Biedenharn http://agileconsult...
Rob@AgileConsultingLLC.com

Drew Olson

6/20/2007 2:43:00 PM

merrittr wrote:
> hi i am trying to strip out text between body tags but when run it i
> get:

HTML parsing can get quite complicated. Why not use a library? I've
heard great things about http://code.whytheluckystiff.ne...


comp.lang.ruby
