Asp Forum - regexp problem

Joao Silva

2/9/2009 11:39:00 AM

How can I extract a number from this:

<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>

I need this number: 123313. I tried to match it in many ways, but I still
have problems with escape characters.
--
Posted via http://www.ruby-....

19 Answers

msnews.microsoft.com

2/9/2009 11:52:00 AM

Of course that depends upon how general this needs to be. If it will
always be the first part of the first parameter to a call to
Math.ceil and negated, then:

======================================================================
text = <<EOS
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

m = text.match(/Math\.ceil\(\-(\d+)/)
puts m[1] if m
======================================================================

Of course, it seems suspicious that you don't want to pick up the
minus, and this seems to take a lot of consistency for granted. For a
good answer, you'll need to specify what conditions will always be the
same.
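
On the question of the minus sign, here is a minimal sketch that captures the sign along with the digits and then decides what to keep. The variable names (signed, value, magnitude) are only illustrative, and it assumes the number is always the first argument of Math.ceil, as in the sample.

```ruby
# Sample fragment taken from the original post.
text = 'document.write(setzeTT(""+Math.ceil(-123313/1000)));'

# Capture the optional minus together with the digits.
# String#[] with a regexp and a group index returns that group, or nil.
signed = text[/Math\.ceil\((-?\d+)/, 1]

# Convert to an Integer, then drop the sign only if it is really unwanted.
value = signed.to_i
magnitude = value.abs

puts magnitude
```

Keeping the sign in the capture and discarding it afterwards makes the decision explicit instead of hiding it inside the pattern.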

On Feb 9, 2009, at 6:39 AM, Joao Silva wrote:

> How can I extract a number from this:
>
> <td>Traffic left:</td><td
> align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
> MB</b></td>
>
> I need this number: 123313. I tried to match it in many ways, but I
> still have problems with escape characters.
> --
> Posted via http://www.ruby-....
>

Joao Silva

2/9/2009 12:50:00 PM

> m = text.match(/Math\.ceil\(\-(\d+)/)

I cannot use that regexp. I need one that matches this whole phrase
(<td>Traffic left:</td>.....), because the document is full of strings like
this.
--
Posted via http://www.ruby-....

msnews.microsoft.com

2/10/2009 12:12:00 AM

If you're only trying to pull out the single number, this REGEX will
work for the whole phrase you provided.

One of the things you want to do with a REGEX is to avoid any more
detail than is necessary to find what you're looking for. The REGEX
does not need to "match" the whole string.

On Feb 9, 2009, at 7:49 AM, Joao Silva wrote:

>
>> m = text.match(/Math\.ceil\(\-(\d+)/)
>
> I cannot use that regexp. I need one that matches this whole phrase
> (<td>Traffic left:</td>.....), because the document is full of strings
> like this.
> --
> Posted via http://www.ruby-....
>

7stud --

2/10/2009 5:13:00 AM

Mike Cargal wrote:
> If you're only trying to pull out the single number, this REGEX will
> work for the whole phrase you provided.
>

The problem is that your regex will also retrieve 9999999 in this html:

<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>

and the op is trying to tell you that he doesn't want that number.

Parsing HTML with regexes is a bad strategy.
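
To make the objection concrete, here is a small sketch that runs the short pattern against both cells and then compares it with a pattern anchored on the label cell. The anchored pattern is an illustration, not something posted in this thread, and it assumes the label cell is written exactly as `<td>Traffic left:</td>` (apart from whitespace) and is immediately followed by the value cell.

```ruby
html = <<'EOS'
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>
<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
EOS

# The short pattern picks up every Math.ceil call, wanted or not.
all_numbers = html.scan(/Math\.ceil\(-(\d+)/).flatten

# Anchor on the label cell, then look only inside the cell that follows it.
# (?:(?!</td>).)*? advances one character at a time but refuses to step
# over a closing </td>, so the match cannot leak into a later cell.
anchored_re = %r{<td>\s*Traffic left:\s*</td>\s*<td[^>]*>(?:(?!</td>).)*?Math\.ceil\(-?(\d+)}m
anchored = html[anchored_re, 1]

p all_numbers
p anchored
```

This is still a regular expression working on HTML, so it stays fragile if the markup changes (extra attributes on the label cell, different casing, nested tables).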
--
Posted via http://www.ruby-....

William James

2/10/2009 8:19:00 AM

Joao Silva wrote:

> How can I extract a number from this:
>
> <td>Traffic left:</td><td
> align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
> MB</b></td>
>
> I need this number: 123313. I tried to match it in many ways, but I
> still have problems with escape characters.

list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten

# Walk the cells pairwise: a is a candidate label, b is the cell after it.
list.each_cons(2){|a,b|
  if "Traffic left:" == a and b =~ /Math\.ceil\((-?\d+)/
    p $1
  end
}

__END__

<td>NOT TRAFFIC LEFT:</td><td
align=right><b>
<script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
</script>
MB</b></td>

<td> Traffic left:
</td><td
align=right><b><script>
document.write(setzeTT(""+Math.ceil(-123313/1000)));
</script>
MB</b></td>
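
The same idea can be sketched as a method, for anyone who wants to call it on a string rather than on DATA. This is a variant, not the code as posted: the method name traffic_left_numbers and the sample heredoc are made up for illustration, and the optional minus sits outside the capture group so that only the digits come back, as the original question asked.

```ruby
# Hypothetical helper: returns the digits found in every cell that
# directly follows a cell whose trimmed content is "Traffic left:".
def traffic_left_numbers(html)
  # Collect the trimmed contents of every <td>...</td> cell, in order.
  cells = html.scan(%r{<td.*?>\s*(.*?)\s*</td>}im).flatten

  # Pair each cell with its successor and keep the number from the
  # successor only when the first cell is the wanted label.
  cells.each_cons(2).map { |label, value|
    value[/Math\.ceil\(-?(\d+)/, 1] if label == "Traffic left:"
  }.compact
end

sample = <<'EOS'
<td>NOT TRAFFIC LEFT:</td><td
align=right><b>
<script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
</script>
MB</b></td>

<td> Traffic left:
</td><td
align=right><b><script>
document.write(setzeTT(""+Math.ceil(-123313/1000)));
</script>
MB</b></td>
EOS

p traffic_left_numbers(sample)
```

Calling each_cons without a block returns an enumerator, which needs Ruby 1.8.7 or later.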

Rick DeNatale

2/10/2009 2:20:00 PM

[Note: parts of this message were removed to make it a legal post.]

On Tue, Feb 10, 2009 at 3:19 AM, William James <w_a_x_man@yahoo.com> wrote:

> Joao Silva wrote:
>
> > How can I extract a number from this:
> >
> > <td>Traffic left:</td><td
> > align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
> > MB</b></td>
> >
> > I need this number: 123313. I tried to match it in many ways, but I
> > still have problems with escape characters.
>
>
> list = DATA.read.scan( %r{<td.*?>\s*(.*?)\s*</td>}im ).flatten
>
> list.each_cons(2){|a,b|
> if "Traffic left:" == a and b =~ /Math.ceil\((-?\d+)/
> p $1
> end
> }
>
>
> __END__
>
> <td>NOT TRAFFIC LEFT:</td><td
> align=right><b>
> <script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));
> </script>
> MB</b></td>
>
> <td> Traffic left:
> </td><td
> align=right><b><script>
> document.write(setzeTT(""+Math.ceil(-123313/1000)));
> </script>
> MB</b></td>
>
>

As 7stud pointed out, a toolbox with only regular expressions inside is
often a poor choice for dealing with XML/HTML.

Here's a rather verbose and commented program using a combination of hpricot
and a regular expression to do something like what I think you are looking
for:

require 'rubygems'
require 'hpricot'

def get_traffic_left_numbers(string)
  doc = Hpricot(string)
  results = []
  # iterate over all of the td elements in the document
  doc.search("td").each do |td1|
    # check to see if the td contents is "Traffic left:"
    if td1.inner_text == "Traffic left:"
      # if yes, get the next sibling
      td2 = td1.next_sibling
      # and then for each script tag inside
      td2.search("script").each do |script|
        # get the script_tag text
        script_text = script.inner_text
        # Use a regexp to capture the number
        number = /Math\.ceil\(-?(\d+)/.match(script_text)
        # add the number we found, if any, to the results array
        results << number[1] if number
      end
    end
  end
  results
end

# Single quotes keep the "" inside the JavaScript as literal characters.
p get_traffic_left_numbers('<td>Traffic left:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-123313/1000)));</script>
MB</b></td>
<td>NOT TRAFFIC LEFT:</td><td
align=right><b><script>document.write(setzeTT(""+Math.ceil(-9999999/1000)));</script>
MB</b></td>')

When run this outputs:

["123313"]

In other words it produces an array of strings representing the target
numbers in a script tag within a td tag which follows another td tag whose
inner text is "Traffic left:"

HTH

--
Rick DeNatale

Blog: http://talklikeaduck.denh...
Twitter: http://twitter.com/Ri...

Igor Pirnovar

2/10/2009 6:15:00 PM

Rick Denatale wrote:
> On Tue, Feb 10, 2009 at 3:19 AM, William James wrote:
>
> As 7Stud pointed out, a toolbox with only regular expressions
> inside is often a poor choice for dealing with xml/html
>
> Here's a rather verbose and commented program using a
> combination of hpricot and a regular expression to do
> something like what I think you are looking for:
>
> require 'rubygems'
> require 'hpricot'
> . . .
>
> When run this outputs: ["123313"]
>
> In other words it produces an array of strings representing
> the target numbers in a script tag within a td tag which
> follows another td tag whose inner text is "Traffic left:"

Rick, your solution is swell, and it is probably worth while considering
by someone whose day job is parsing html/xml documents. However, purely
from a language and/or from a programmer's perspective William's
solution is far more appealing, much shorter, easier to understand and
requires virtually no additional learning effort. It nullifies or
"flattens" the comment started out by 7Stud that you also elevated to an
undeserving height.
--
Posted via http://www.ruby-....

Rick DeNatale

2/10/2009 9:47:00 PM

[Note: parts of this message were removed to make it a legal post.]

On Tue, Feb 10, 2009 at 1:15 PM, Igor Pirnovar <gooigpi@gmail.com> wrote:

> Rick Denatale wrote:
> > On Tue, Feb 10, 2009 at 3:19 AM, William James wrote:
> >
> > As 7Stud pointed out, a toolbox with only regular expressions
> > inside is often a poor choice for dealing with xml/html
> >
> > Here's a rather verbose and commented program using a
> > combination of hpricot and a regular expression to do
> > something like what I think you are looking for:
> >
> > require 'rubygems'
> > require 'hpricot'
> > . . .
> >
> > When run this outputs: ["123313"]
> >
> > In other words it produces an array of strings representing
> > the target numbers in a script tag within a td tag which
> > follows another td tag whose inner text is "Traffic left:"
>
> Rick, your solution is swell, and it is probably worth while considering
> by someone whose day job is parsing html/xml documents. However, purely
> from a language and/or from a programmer's perspective William's
> solution is far more appealing,

subjective.

> much shorter,

certainly, particularly with my pedagogical comments,

> easier to understand and

I'd be quite willing to argue that.

>
> requires virtually no additional learning effort.

Yes, we wouldn't want to expend any unnecessary effort on learning would we.

And by the way to get that to work (in Ruby 1.8) a nuby rubyist would have
to learn that you'd need to include 'enumerable' to get the cons method.

> It nullifies or
> "flattens" the comment started out by 7Stud that you also elevated to an
> undeserving height.
>

You can treat regular expressions as a Maslovian hammer, but I've had enough
experiences with xml to realize that that hammer is often a very poor tool
for parsing html. I'd rather expend my learning budget in learning how to
apply a tool like Hpricot than to debug my own low-level attempts.

But, as they say, to each his own.

--
Rick DeNatale

Blog: http://talklikeaduck.denh...
Twitter: http://twitter.com/Ri...

Igor Pirnovar

2/11/2009 4:34:00 AM

Rick Denatale wrote:
> On Tue, Feb 10, 2009 at 1:15 PM, Igor Pirnovar <gooigpi@gmail.com>
> wrote:
>
>> > require 'rubygems'
>> by someone whose day job is parsing html/xml documents. However, purely
>> from a language and/or from a programmer's perspective William's
>> solution is far more appealing,
>
> subjective.
>> much shorter,
> certainly, particularly with my pedagogical comments,

and much nicer as well as more elegant, I should add. But more
importantly William's solution is inherently packed with its own
semantics that needs no pedagogue to explain its purpose or meaning!
True, beauty is in the eyes of the beholder, but if you think of all
those engineering accomplishments that defy ageing you will certainly
notice none of them need any pedagogic, aesthetic or any other comments.

> Yes, we wouldn't want to expend any unnecessary effort on learning
> would we.

No, we most certainly would not, especially when there's absolutely no
need for it! This is why Java is such a drag. There, the number of
classes that appear to be relevant to the Java environment itself has
been growing prolifically, to the point that programmers are suffocated
in "alpha.beta.gamma..." notations, never mind the unnecessary clutter
they have to memorize in order to be able to assign semantic value to
each token. You may as well write tons of pedagogic comments for every
line. In the end you cannot see the forest for the trees.
Besides, since when is a long learning curve a desirable attribute?

> ... work (in Ruby 1.8) a nuby rubyist would have to learn that
> you'd need to include 'enumerable' to get the cons method.

What can I say, any language is a constantly evolving thing, but at least
in Ruby's case the change to "enumerable" represents a shift towards better
quality, which for the user means less unnecessary overhead and a smaller
learning curve. I seriously doubt that nowadays any astute Ruby newbie
seeks to learn Ruby 1.8 while ignoring Ruby 1.9; I'd much rather say it's
just the opposite, precisely because one would try to avoid learning too
much clutter.

> I've had enough experiences with xml to realize that that
> hammer is often a very poor tool for parsing html. I'd rather
> expend my learning budget in learning how to apply a tool like
> Hpricot than to debug my own low-level attempts.

Precisely, if your life revolves around xml and html, Hpricot may be the
better way. However, for an occasional brush with a Markup Language my
old Perl book and core Ruby should do just fine.

Cheers,
igor :)
--
Posted via http://www.ruby-....

William James

2/11/2009 7:54:00 AM

Rick DeNatale wrote:

>
> And by the way to get that to work (in Ruby 1.8) a nuby rubyist would
> have to learn that you'd need to include 'enumerable' to get the cons
> method.

I didn't need to, and I'm using

ruby 1.8.7 (2008-05-31 patchlevel 0) [i386-mswin32]
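
For code that has to run on older interpreters as well, a guarded require is one way to sidestep the question. This sketch assumes, as far as I recall, that each_cons came from the 'enumerator' standard library in releases before 1.8.7 and is built in from 1.8.7 on; the pairs array is just for illustration.

```ruby
# Load the old standard library only when each_cons is not already available.
require 'enumerator' unless [].respond_to?(:each_cons)

# each_cons(2) yields every pair of adjacent elements.
pairs = []
%w[a b c d].each_cons(2) { |x, y| pairs << [x, y] }

p pairs
```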

comp.lang.ruby
