comp.lang.ruby

Regex extraction

Scott Rubin

12/15/2004 2:39:00 PM

Hello,

I'm writing an application that parses log files, specifically Gaim HTML log
files, extracts any links it finds, and creates an RSS feed of those links. I
have a working program that's about 60 lines of Ruby, but it is far from
perfect. Most of the necessary fixes and improvements are things I know how to
do; they just take time. But there are a couple of things I need help with.

First, in Ruby, how do I extract parts of a regex? Let's use the example from
my program. Normally I could use an expression like the following:

href\s*=\s*?:(\"?<url>[^\"]*)\")

And this would allow me to get the <url> out of the expression. But this
doesn't seem to work in Ruby, or at least I don't know how to make it work in
Ruby. What I would really like to do is match the entire <a href tag structure.
I would want to extract the protocol (ftp, http), the URL (www.website.com),
and the text which appears between the <a> and the </a> into three string
variables. And I have to extract this entire structure from any random line of
text in which the structure may or may not exist. I'm guaranteed that it
won't be partial, i.e. an <a> without a </a>.

The other thing I don't know how to do is replace things like &amp; with &. Is
there anything in the Ruby standard library, maybe in REXML, that automatically
takes care of all those standard entities for me? I looked, but I couldn't find
one.

Thanks a lot,

Scott Rubin
3 Answers

Robert Klemme

12/15/2004 3:00:00 PM


"Scott Rubin" <slr2777@cs.rit.edu> wrote in message
news:41c04ca2$1@buckaroo.cs.rit.edu...
> Hello,
>
> I'm writing an application that parses log files, specifically Gaim HTML
> log files, extracts any links it finds, and creates an RSS feed of those
> links. I have a working program that's about 60 lines of Ruby, but it is
> far from perfect. Most of the necessary fixes and improvements are things
> I know how to do; they just take time. But there are a couple of things I
> need help with.
>
> First, in Ruby, how do I extract parts of a regex? Let's use the example
> from my program. Normally I could use an expression like the following:
>
> href\s*=\s*?:(\"?<url>[^\"]*)\")
>
> And this would allow me to get the <url> out of the expression. But this
> doesn't seem to work in Ruby, or at least I don't know how to make it work
> in Ruby. What I would really like to do is match the entire <a href tag
> structure. I would want to extract the protocol (ftp, http), the URL
> (www.website.com), and the text which appears between the <a> and the </a>
> into three string variables. And I have to extract this entire structure
> from any random line of text in which the structure may or may not exist.
> I'm guaranteed that it won't be partial, i.e. an <a> without a </a>.

You need grouping. As a first shot:

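if %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i =~ text
  # $1 = protocol, $2 = address, $3 = text between <a> and </a>
  # (link_text is used so the input variable text is not overwritten)
  proto, url, link_text = $1, $2, $3
end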
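
On Ruby 1.9 and later, the named-group syntax from the question is also available as (?<name>...), and String#scan picks up every link on a line rather than only the first. A minimal sketch, assuming double-quoted href values and no nested tags inside the link; LINK_RE and extract_links are hypothetical names, not part of the original program:

```ruby
# Hypothetical pattern: named groups for protocol, address and link text.
# Assumes href values are double-quoted and the link text contains no tags.
LINK_RE = %r{<a\s+href="(?<proto>\w+)://(?<url>[^"]+)"[^>]*>(?<label>[^<]*)</a>}i

# Returns an array of [proto, url, label] triples; empty if the line has no link.
def extract_links(line)
  line.scan(LINK_RE)
end

line = 'see <a href="http://www.website.com/page">the site</a> and ' \
       '<a href="ftp://files.example.org/pub">files</a>'

extract_links(line).each do |proto, url, label|
  puts "#{proto} | #{url} | #{label}"
end

# A single match exposes the groups by name.
if (m = LINK_RE.match(line))
  puts m[:url]
end
```

Because scan returns an empty array when nothing matches, lines without a link need no special handling.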

> The other thing I don't know how to do is replace things like &amp; with
> &. Is there anything in the Ruby standard library, maybe in REXML, that
> automatically takes care of all those standard entities for me? I looked,
> but I couldn't find one.

Dunno. But you can easily create that on your own:

ENT = {
  "amp" => "&",
  "gt" => ">",
  # ...
}

text.gsub!(%r{&(\w+);}i) {|m| ENT[$1] || m}
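
The same table-driven idea can be stretched to cover decimal numeric references such as &#169;. A minimal sketch, assuming UTF-8 strings; the ENTITIES table and decode_entities helper are hypothetical names, the table is deliberately incomplete, and no range check is done on the numeric value:

```ruby
# Hypothetical, deliberately incomplete table of named entities.
ENTITIES = {
  "amp"  => "&",
  "lt"   => "<",
  "gt"   => ">",
  "quot" => '"'
}.freeze

# Replaces known named entities and decimal numeric references.
# Anything unrecognized is left exactly as it was.
def decode_entities(text)
  text.gsub(/&(#\d+|\w+);/) do |whole|
    name = $1
    if name.start_with?("#")
      [name[1..-1].to_i].pack("U")   # code point to UTF-8 character
    else
      ENTITIES.fetch(name, whole)
    end
  end
end

puts decode_entities("Tom &amp; Jerry &#169; 2004 &unknown;")
```

Leaving unknown entities untouched, rather than deleting them, makes gaps in the table easy to spot in the output.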

Kind regards

robert

Craig Moran

12/15/2004 3:34:00 PM


> > The other thing I don't know how to do is replace things like &amp; with
> > &. Is there anything in the Ruby standard library, maybe in REXML, that
> > automatically takes care of all those standard entities for me? I looked,
> > but I couldn't find one.

On replacing &amp;: this code handles that case along with a few other
entities that begin with & (ampersand) and end with ; (semicolon). Note that
it deletes any entity it does not recognize.

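# Match only entity-shaped tokens so that a stray "&" followed later by a ";"
# does not swallow the text in between.
text.gsub!(/&#?\w+;/) { |i|
  case i
  when "&amp;"
    "&"
  when "&nbsp;"
    ""
  when "&copy;"
    ""
  when "&#174;"
    ""
  when "&#147;"
    "`"
  when "&#148;"
    "'"
  when "&#183;"
    "-"
  when "&middot;"
    "-"
  when "&#8212;"
    "--"
  else
    ""  # unrecognized entities are dropped
  end # case
}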


awushu

12/16/2004 12:53:00 AM


Scott Rubin wrote:
> First, in Ruby, how do I extract parts of a regex?
>
> The other thing I don't know how to do is replace things like &amp;
> with &.

require 'cgi'
CGI.unescapeHTML("&amp;") # => "&"
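
As a slightly longer illustration, CGI.unescapeHTML decodes the basic markup entities in a single call. As far as I know it does not cover the full named-entity list (for example &nbsp; or &copy;), so a small lookup table may still be needed for those. The sample string is made up:

```ruby
require 'cgi'

# Made-up sample containing the entities most likely to appear in HTML logs.
raw = 'Tom &amp; Jerry &lt;b&gt;said&lt;/b&gt; &quot;hi&quot;'
decoded = CGI.unescapeHTML(raw)
puts decoded
```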

For extracting parts of the match, try
a, b, c = /(.)(.)(.)/.match("abc").captures
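
One caveat with this form: match returns nil when the pattern does not match, so calling captures directly raises NoMethodError on such input. A minimal sketch of a guarded version; the first_three helper name is hypothetical:

```ruby
# Returns the three captured characters, or nil when the input does not match.
def first_three(line)
  m = /(.)(.)(.)/.match(line)
  m && m.captures
end

a, b, c = first_three("abc")
puts [a, b, c].inspect

# Too short to match: the helper returns nil instead of raising.
p first_three("ab")
```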

-awu