comp.lang.ruby

Challenge: Extract episode descriptions.

Stedwick

1/19/2008 4:18:00 AM

This is just a whimsical question, really. I've been working on a
website where people can vote on episodes of TV shows (and I happen to
be a big Star Trek fan, so I'm starting there ha ha). By the way, the
website is, literally, 40 lines of code. I'm loving Ruby on Rails so
far.

http://brocoum.com/voter/startrekvoyage...

Anyway, I need to extract the episode descriptions for the tool tips,
and the descriptions come from TV.com. Unfortunately, this has turned
out to be rather harder than it looks!

http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?season=0&tag=season_dropdown;...

If any of you feel up to the challenge, see if you can streamline my
code below, or write better code yourself. I can't help but think that
there's an easier way to do this!

# open html file
f = File.read("episode_guide.html")

# keep track of the number of descriptions found
count = 0

# each description is enclosed in a multiline <p> </p> tag
f.scan(/<p>.*?<\/p>/m) do |match|
  # start with a blank description
  desc = ''
  # i want to condense each desc into a single line, and remove the stardate info
  match.each_line {|m|
    # remove stardate...<br /> because the stardate is not always on its own line
    m.sub!(/^.*<br \/>/,'')
    # remove unnecessary whitespace from beginning
    m.sub!(/^\s*/,'')
    # add non-stardate and non-blank lines to the desc and remove trailing \n
    desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
  }

  # remove html tags
  desc.gsub!(/<.*?>/,'')
  # fix periods ie. "Hi there.I love you." => "Hi there. I love you."
  # these period problems were caused by concatenating the paragraphs above into one line
  desc.gsub!(/(\w\.)(\w)/,'\1 \2')
  # fix stupid html &nbsp; type stuff
  desc.gsub!(/&nbsp;/," ")
  desc.gsub!(/&#39;/,"'")
  # make all spaces single
  desc.gsub!(/ {2,}/,' ')

  # output finished description followed by blank line and increment counter
  puts desc + "\n\n"
  count += 1
end

# make sure i got all 176 episode descriptions
puts count

Philip
6 Answers

yermej

1/19/2008 6:36:00 AM



Look into Hpricot - http://code.whytheluckystiff.ne... - or
another HTML parser. It makes things like this much easier - no need
for regexes.

Jean-François Trân

1/19/2008 6:42:00 AM


2008/1/19, Stedwick <philip.brocoum@gmail.com>:

> If any of you feel up to the challenge, see if you can streamline my
> code below, or write better code yourself. I can't help but think that
> there's an easier way to do this!
>
> # open html file
> f = File.read("episode_guide.html")
>
> # keep track of the number of descriptions found
> count = 0
>
> # each description is enclosed in a multiline <p> </p> tag
> f.scan(/<p>.*?<\/p>/m) do |match|

[...]

You should take a look at the Hpricot gem to make the
HTML scraping easier.

-- Jean-François.

lrlebron@gmail.com

1/19/2008 10:58:00 PM



This is not exactly what you want, but you may find it helpful:

require 'hpricot'
require 'open-uri'

url = 'http://www.tv.com/star-trek-deep-space-nine...episode_guide.html?printable=1'
@doc = Hpricot(open(url))

@doc.search("/html/body/div[1]/div").each do |div|
  # print the text of each heading link (presumably the episode title)
  div.search("h1/a") do |h1|
    puts h1.inner_text.strip().squeeze(" ").gsub("\n"," ")
  end

  # print the text of each matching div, followed by a blank line
  div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc|
    puts desc.inner_text.strip().squeeze(" ").gsub("\n"," ")
    puts
  end
end
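
A side note on the whitespace cleanup used above: the order of the steps matters. Squeezing spaces before replacing newlines can leave a run of spaces wherever a newline sat between spaces. The following is only a minimal sketch on a made-up string (the sample text and variable names are assumptions, not real TV.com output), and it does not need Hpricot:

```ruby
# Hypothetical scraped text; real inner_text output will vary.
raw = "  Emissary \n   Part I  "

# Squeeze runs of spaces first, then turn newlines into spaces.
# The newline sat between two spaces, so a run of spaces survives.
squeeze_first = raw.strip.squeeze(" ").gsub("\n", " ")

# Collapse every run of whitespace (spaces, tabs, newlines) in one pass.
collapsed = raw.gsub(/\s+/, " ").strip

p squeeze_first
p collapsed
```

Collapsing with /\s+/ in a single pass is the same step the regex-based replies in this thread use, and it removes the need for a separate squeeze.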

William James

1/20/2008 4:39:00 AM





text = IO.read("episode_guide.html")
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
map{|s|
s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
gsub(/\s+/, " ") }
puts a.join("\n\n")
puts
puts a.size

William James

1/20/2008 9:39:00 AM


On Jan 19, 10:39 pm, William James <w_a_x_...@yahoo.com> wrote:

> text = IO.read("episode_guide.html")
> a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
> map{|s|
> s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
> gsub(/\s+/, " ") }
> puts a.join("\n\n")
> puts
> puts a.size

Corrected:

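text = IO.read("episode_guide.html")
# capture the text after each stardate value, up to the closing </p>
a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
  map{|s|
    # decode entities, strip tags, then collapse whitespace
    s.gsub(/&nbsp;/," ").gsub(/<.*?>/m,"").gsub("&#39;","'").
      gsub(/\s+/, " ").strip }
puts a.join("\n\n")
puts
puts a.size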
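
For anyone who wants to try the corrected snippet without the saved page, here is a minimal sketch that runs the same steps against an inline sample. The sample markup, the constant SAMPLE_HTML, and the method name extract_descriptions are all made up for illustration; the real TV.com page may be structured differently.

```ruby
# Hypothetical sample standing in for episode_guide.html; the real
# TV.com markup may differ.
SAMPLE_HTML = <<~HTML
  <p>
    Stardate: 46379.1<br />
    Sisko takes command.&nbsp; It&#39;s a
    <b>new</b> beginning.
  </p>
  <p>
    Stardate: Unknown<br />
    Odo investigates.
  </p>
HTML

# Same steps as the corrected snippet, wrapped in a method
# (the method name is an assumption for this example).
def extract_descriptions(html)
  html.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.map do |s|
    s.gsub(/&nbsp;/, " ")   # non-breaking spaces become plain spaces
     .gsub(/<.*?>/m, "")    # drop remaining tags such as <br /> and <b>
     .gsub("&#39;", "'")    # decode the apostrophe entity
     .gsub(/\s+/, " ")      # collapse newlines and runs of whitespace
     .strip
  end
end

descriptions = extract_descriptions(SAMPLE_HTML)
puts descriptions.join("\n\n")
puts
puts descriptions.size
```

Because whitespace is collapsed with /\s+/ after the tags are removed, lines are joined with a single space, so no separate fix-up for periods is needed.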

Stedwick

1/21/2008 10:33:00 PM


On Jan 20, 4:38 am, William James <w_a_x_...@yahoo.com> wrote:
> On Jan 19, 10:39 pm, William James <w_a_x_...@yahoo.com> wrote:
>
> > text = IO.read("episode_guide.html")
> > a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
> > map{|s|
> > s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
> > gsub(/\s+/, " ") }
> > puts a.join("\n\n")
> > puts
> > puts a.size
>
> Corrected:
>
> text = IO.read("episode_guide.html")
> a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
> map{|s|
> s.gsub(/&nbsp;/," ").gsub(/<.*?>/m,"").gsub("&#39;","'").
> gsub(/\s+/, " ").strip }
> puts a.join("\n\n")
> puts
> puts a.size

I'm liking yours so far William :-) It's pretty elegant.