Todd Benson
3/28/2008 9:35:00 AM
On Fri, Mar 28, 2008 at 2:11 AM, Adam Akhtar <adamtemporary@gmail.com> wrote:
> Hi, I'm starting to use hpricot and I'm having problems extracting
> descriptions of various concert events from a page. Here is a sample of
> the HTML:
>
>
> <p>
> <a name="concerts"/>
> <span class="heading">Concerts</span>
> <br/>
> <span class="subheading">POPULAR</span>
> <br/>
> <br/>
> <span class="textbold">Middle Field! Vol.4</span >
> <br/>
> Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
> 28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
> <br/>
> <br/>
> <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
> <br/>
> Japanese pianist and soul singer performing with Andy Wulf and Kaori
> Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
> Tel: 03-3215-1555.
> <br/>
> ...
> ...
> ...
> etc
>
> I can get the artist/band names fine using
> names = doc.search("//span[@class='textbold']")
>
> but I can't get the descriptions. In fact, the descriptions aren't
> individually wrapped in any tags but rather just clumped together
> under the paragraph tag, separated by line breaks (<br/>).
>
> So I thought I'd just try
> descriptions = doc.search("/html/body/div/table/tbody/tr[4]/td/table/tbody/tr/td[2]/table/tbody/tr/td/span/p")
> but when I try to puts descriptions, nothing is printed to the screen.
>
> How would I go about getting this info? Any tips or ideas?
>
> Thanks
Wow! The page looks nice, but the HTML is really ugly. This would be
pretty hard to scrape on a regular basis. For artists, there is a
mix of <strong></strong> tags, <span class="textbold"></span> tags,
and I noticed one artist with no surrounding tags at all (Ex-press
Ver.2).
It can be really hard to work with inconsistent HTML, but I suppose it
could be done to some degree of accuracy. Any hpricot masters out
there? I'm sure you'd have to attack it with regexps as well. Maybe
converting the page to text and then parsing that is a better idea after all.
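
For what it's worth, here is a minimal sketch of the text-and-regexp route in
plain Ruby, without hpricot. It assumes the names always sit in
<span class="textbold"> or <strong> tags with double-quoted attributes, and
that each description is whatever text follows a name up to the next name.
SAMPLE_HTML, NAME_TAG and extract_events are names made up for illustration,
and an untagged artist such as Ex-press Ver.2 would still be missed.

```ruby
# Sample markup taken from the question above.
SAMPLE_HTML = <<~HTML
  <p>
  <a name="concerts"/>
  <span class="heading">Concerts</span>
  <br/>
  <span class="subheading">POPULAR</span>
  <br/>
  <br/>
  <span class="textbold">Middle Field! Vol.4</span >
  <br/>
  Featuring electric-pop band The Stealth, Mac and Masaru, and others. Mar
  28, 7pm, ¥2,500 (adv)/ ¥3,000 (door). Shibuya O-Nest. Tel: 03-3498-9999.
  <br/>
  <br/>
  <span class="textbold">Philip Woo featuring Brenda Vaughn</span>
  <br/>
  Japanese pianist and soul singer performing with Andy Wulf and Kaori
  Kobayashi. Mar 28 & 29, 7 & 9:30pm, ¥3,150. Cotton Club, Marunouchi.
  Tel: 03-3215-1555.
  <br/>
  </p>
HTML

# Matches a name wrapped in <span class="textbold"> or <strong>.
# The single capture group holds the name itself.
NAME_TAG = %r{<(?:span\s+class="textbold"|strong)\s*>(.*?)</(?:span|strong)\s*>}m

# Returns an array of { name:, description: } hashes.
def extract_events(html)
  # String#split keeps captured groups in the result, so the array is:
  # [text before the first name, name, text after it, name, text after it, ...]
  parts = html.split(NAME_TAG)

  parts.drop(1).each_slice(2).map do |name, rest|
    description = rest.to_s
                      .gsub(/<[^>]+>/, " ") # drop leftover tags such as <br/>
                      .gsub(/\s+/, " ")     # collapse line breaks and runs of spaces
                      .strip
    { name: name.strip, description: description }
  end
end

extract_events(SAMPLE_HTML).each do |event|
  puts "#{event[:name]}: #{event[:description]}"
end
```

The split-on-names approach never looks at the nesting of the page at all, so
it should survive the table soup, but it is only as reliable as the NAME_TAG
pattern. Any other markup used for names would need another alternative in
that regexp.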
Todd