comp.lang.ruby

super-newbee Ruby regex help?

Aaron Reimann

8/1/2006 9:00:00 PM

This is pretty complex considering that I am just now reading "Learn to
Program" by Chris Pine (a book teaching you how to program in
Ruby). It is very basic. I am somewhat good with PHP and am wanting
to move into RoR, so I want to learn Ruby before I learn Rails.

Anyway, I found a real life situation where I think Ruby could do this
very quickly (and if I need to do it again, I can just run the script).
I need to remove some stuff from a text file. Simple huh? Here is
the site that I need the list from:

http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work...

In that page there is one line of "code" that has all of the
links...here is part of it:
<a href="http://www.3proxy.com&... Proxy</a> || <a
href="http://www.3proxy.net&... Proxy</a> || <a
href="http://www.3proxy.org&... Proxy</a>

I have taken just that line and saved that as a text file.

I need to strip everything where I wind up with this:
3proxy\.com
3proxy\.net
3proxy\.org
4proxy\.com

I will be taking that list (all 300 of them) and adding them to my
content filtering box. That way, all of these sites will be blocked.

Do you guys know of any sites that might have a similar situation where
I can see the code? Or have any of you done something similar? I can
probably modify stuff to make it fit my needs, but stuff like
http://www.regular-expressions.info... doesn't give me enough
info to start.

What I have right now is: file = File.open("list.txt","w")

lol

Sorry, I'm a newbie... :)

thanks,
aaron

5 Answers

Vincent Fourmond

8/1/2006 9:29:00 PM



Hello !

> In that page there is one line of "code" that has all of the
> links...here is part of it:
> <a href="http://www.3proxy.com&... Proxy</a> || <a
> href="http://www.3proxy.net&... Proxy</a> || <a
> href="http://www.3proxy.org&... Proxy</a>
>
> I have taken just that line and saved that as a text file.
>
> I need to strip everything where I wind up with this:
> 3proxy\.com
> 3proxy\.net
> 3proxy\.org
> 4proxy\.com
>

OK, what you need is to extract the part 3proxy.com from the String
<a href="http://www.3proxy.com&... Proxy</a>

For that, a RE like the following should do:

/http:\/\/www\.([^"]+)/

You can read it this way: find substrings that start with http://www.
(don't forget to escape / in the RE, or Ruby will think the regex is
ending there; you also need to escape the dot, although in this case it
shouldn't matter much), followed by some text that doesn't contain ".
The parentheses say you're interested in that part; you'll be able to
retrieve what it matched through the $1 variable. Note that [^"]+ is
greedy, so it will match everything up to the closing quote, which is
exactly what you want.

Then a possible way to do what you want would be

proxies = [] # array where the proxies will be
f = File.open('your_file_with_the_list_youre_reading')
f.readlines.each do |l|              # iterate on each line
  l.scan(/http:\/\/www\.([^"]+)/) do # scan the line for the pattern
    proxies << $1                    # add the content of $1 to the list
  end
end
f.close
p proxies

This should work...
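[Editor's note: for reference, the same extraction works without the $1 global. String#scan with a capture group returns an array of group arrays, which can be flattened. A minimal sketch on a made-up sample string:]

```ruby
# Sketch (made-up sample string): scan with a capture group returns
# an array of group arrays; flatten gives a plain list, no $1 needed.
sample = '<a href="http://www.3proxy.com">3 Proxy</a> || ' \
         '<a href="http://www.3proxy.net">3 Proxy</a>'
proxies = sample.scan(/http:\/\/www\.([^"]+)/).flatten
p proxies  # => ["3proxy.com", "3proxy.net"]
```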

Have a good time with Ruby !

Vince


William James

8/2/2006 7:13:00 AM


Aaron Reimann wrote:
> [...]
> what i have right now is: file = File.open("list.txt","w")

If the file already exists, you'll destroy it by using the "w" option.
Since some of the anchor tags span more than one line,
let's read the whole file at once:

p IO.read('list.txt').
    scan(%r{<a \s+ href="http://www\.([^"]*)"}x).flatten
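[Editor's note: in the %r{...}x form, the x flag makes literal whitespace in the pattern insignificant, so the explicit \s+ is what actually matches the gap between <a and href, including across line breaks. A small self-contained sketch of the same scan on an in-memory string (the sample HTML here is made up):]

```ruby
# Sketch (made-up sample HTML): the /x flag ignores literal whitespace
# in the pattern, so \s+ is what matches the gap -- even across lines.
html = %Q{<a\n  href="http://www.3proxy.com">3 Proxy</a> || } +
       %Q{<a href="http://www.4proxy.com">4 Proxy</a>}
hosts = html.scan(%r{<a \s+ href="http://www\.([^"]*)"}x).flatten
p hosts  # => ["3proxy.com", "4proxy.com"]
```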

Aaron Reimann

8/2/2006 2:11:00 PM


Thank you guys. I have not tried all that has been suggested, but I
got this code emailed to me:

###
require 'rubygems'
require 'mechanize'

url = "http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-sc..."
agent = WWW::Mechanize.new
page = agent.get(url)

page.body.scan(/http:\/\/www\.([^"]+)/) do
  p $1
end
###

I had to install the 'mechanize' gem, but it works...overall. I still have
to figure out how to "write" the output into a text file, but this is
pretty cool.

I will be trying the one below too.

thanks!
aaron

Vincent Fourmond wrote:
> [...]
> proxies = [] # array where the proxies will be
> f = File.open('your_file_with_the_list_youre_reading')
> f.readlines.each do |l|              # iterate on each line
>   l.scan(/http:\/\/www\.([^"]+)/) do # scan the line for the pattern
>     proxies << $1                    # add the content of $1 to your list
>   end
> end
> p proxies

Cliff Cyphers

8/2/2006 2:47:00 PM


Aaron Reimann wrote:
> I had to install the 'mechanize' gem, but it works...overall. I have
> to figure out how to "write" the output into a text file. but this is
> pretty cool.
>

Update filename and you are set.

url = "http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-sc..."
filename="/tmp/tmp2.txt"
agent = WWW::Mechanize.new
page = agent.get(url)

session_fd = File.open(filename, "w")
page.body.scan(/http:\/\/www\.([^"]+)/) do
  session_fd.puts $1
end
session_fd.close
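[Editor's note: the block form of File.open closes the file automatically when the block exits, so the explicit close can be dropped. A minimal sketch, with page.body replaced by a hard-coded sample string and a made-up /tmp path:]

```ruby
# Sketch: the block form of File.open closes the file automatically,
# even if an exception is raised mid-write. "body" stands in for
# page.body; the /tmp path is made up.
body = '<a href="http://www.3proxy.com">3 Proxy</a>'
File.open("/tmp/proxies.txt", "w") do |f|
  body.scan(/http:\/\/www\.([^"]+)/) { f.puts $1 }
end
p File.read("/tmp/proxies.txt")  # => "3proxy.com\n"
```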

Daniel Harple

8/2/2006 3:01:00 PM


On Aug 2, 2006, at 10:15 AM, Aaron Reimann wrote:

> [...]
> I had to install the 'mechanize' gem, but it works...overall. I have
> to figure out how to "write" the output into a text file. but this is
> pretty cool.

Mechanize has a method to get all the links for a Page:

require "rubygems"
require "mechanize"

url = "http://edge.i-hacked.com/250-working-proxies-for...access-from-work-or-school"
links = WWW::Mechanize.new.get(url).links.map { |a| a.uri rescue nil }.flatten
File.open('links.txt', 'w') { |f| f.puts(links) }

Note that this saves all the relative links as well, however.
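[Editor's note: one way to drop those relative links — a sketch on a made-up list of link strings, not Mechanize link objects. URI.parse classifies absolute http/https URLs as URI::HTTP, so filtering on that class keeps only absolute web links:]

```ruby
require "uri"

# Sketch (made-up list): keep only absolute http(s) links.
# URI::HTTPS subclasses URI::HTTP, so one is_a? check covers both.
links = ["http://www.3proxy.com", "/about", "mailto:x@y.z",
         "https://www.4proxy.com", "index.html"]
absolute = links.select do |l|
  u = URI.parse(l) rescue nil
  u.is_a?(URI::HTTP)
end
p absolute  # => ["http://www.3proxy.com", "https://www.4proxy.com"]
```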

-- Daniel