Asp Forum - Regular Expression interesting problem

Arun Kumar

3/28/2009 9:07:00 AM

Hi,
I'm learning about regular expressions right now for an HTML-scraping
assignment, but I've run into a problem. Below are two different HTML
tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss...
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2...

What I want is to capture the href URL when type =
"application/rss+xml". It seems simple, but the position of the 'type'
attribute creates the problem: in the first tag 'type' comes after
href, and in the second it comes before. It seems to me an
interesting problem, but I need help solving it. Please help me.

Regards
Arun Kumar
--
Posted via http://www.ruby-....

8 Answers

Eric Hodel

3/28/2009 10:01:00 AM

On Mar 28, 2009, at 02:07, Arun Kumar wrote:
> Hi,
> I'm learning about regular expressions right now for a html scraping
> based assignment. But now I've reached a problem. Given below are two
> different html tags.
>
> <link
> href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/r...
> "
> rel="alternate" type="application/rss+xml" title="BBC NEWS | Help |
> RSS"
> />
>
> <link rel="alternate" type="application/rss+xml" title="YouTube - Top
> Favorites Today"
> href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=tod...
> ">
>
> Now what i want is to capture the href-url if the type =
> "application/rss+xml". It seems to be simple but it is the position of
> the 'type' that creates the problem. In first tag the 'type' is after
> href and in the second the 'type' is before it. It seems to me as an
> interesting problem, but i need help for solving it. Please help me.

I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more
appropriate like StringScanner from strscan.rb. `ri StringScanner`
will get you started.

PS: You'll probably want to do something like scan for <, then scan
for a tag name, then scan for attributes, then scan for >, etc.

PPS: There's no need to post twice.
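
The staged scan described in the PS could be sketched roughly as follows. This is only an illustration, not a full HTML parser: the helper name `link_attributes`, the sample markup, and the example.com URLs are made up for the example, and only quoted attribute values are handled.

```ruby
require 'strscan'

# Hypothetical helper: returns one hash of attributes per <link> tag.
def link_attributes(html)
  scanner = StringScanner.new(html)
  links = []
  # Jump to just after the next "<link" that is followed by whitespace.
  while scanner.skip_until(/<link(?=\s)/i)
    attrs = {}
    loop do
      scanner.skip(/\s+/)
      # Stop at the end of the tag (">" or "/>") or at end of input.
      break if scanner.scan(/\/?>/) || scanner.eos?
      if scanner.scan(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/)
        attrs[scanner[1].downcase] = scanner[2] || scanner[3]
      else
        # Skip anything we do not recognise (e.g. a bare attribute).
        scanner.skip(/[^\s>]+/)
      end
    end
    links << attrs
  end
  links
end

# Made-up sample input: the attribute order differs between the tags.
html = <<~HTML
  <link href="http://example.com/a.xml"
        rel="alternate" type="application/rss+xml" title="A" />
  <link rel="alternate" type="application/rss+xml" title="B
  two lines" href="http://example.com/b.xml">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

feeds = link_attributes(html)
          .select { |a| a['type'] == 'application/rss+xml' }
          .map { |a| a['href'] }
```

Because every attribute is collected into a hash before the type is checked, the order of the attributes in the tag no longer matters.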

Arun Kumar

3/28/2009 10:08:00 AM

Eric Hodel wrote:
> On Mar 28, 2009, at 02:07, Arun Kumar wrote:
>> />
>> interesting problem, but i need help for solving it. Please help me.
> I suggest you use Nokogiri.
>
> Barring that, don't use regular expressions, use something more
> appropriate like StringScanner from strscan.rb. `ri StringScanner`
> will get you started.
>
Nokogiri is a good option. But I want to use net/http for my assignment,
and it is compulsory.

> PS: You'll probably want to do something like scan for <, then scan
> for a tag name, then scan for attributes, then scan for >, etc.

As you said, I have to check for the tag, i.e. '<link', first and then
check the attributes. But the position of the type attribute is still
the problem.

Thanks for your quick reply.

Regards
Arun Kumar
--
Posted via http://www.ruby-....

James Coglan

3/28/2009 10:22:00 AM


2009/3/28 Arun Kumar <arunkumar@innovaturelabs.com>

> Eric Hodel wrote:
> > On Mar 28, 2009, at 02:07, Arun Kumar wrote:
> >> />
> >> interesting problem, but i need help for solving it. Please help me.
> > I suggest you use Nokogiri.
> >
> > Barring that, don't use regular expressions, use something more
> > appropriate like StringScanner from strscan.rb. `ri StringScanner`
> > will get you started.
> >
> Nokogiri is a good option. But i want to use net/http for my assignment
> and it is compulsory.
>
> > PS: You'll probably want to do something like scan for <, then scan
> > for a tag name, then scan for attributes, then scan for >, etc.
>
> As you said I have to check the tag ie'<link' first and then check for
> the attributes. But still the position of the type attribute is the
> problem.

Looks like you will need to parse in stages -- I can't get String#scan to
capture everything using a single regex, though there's every chance I've
screwed up the expression somehow:

'<link type="application" href="http://google... rel="alternate" />'.scan
/<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i
#=> [['link', 'rel', 'alternate']]
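
A likely reason for that result is that a capture group inside a repeated `(?:...)*` is overwritten on every repetition, so only the last attribute survives. The sketch below reproduces the effect and then parses in two stages instead. The tag and its example.com URL are made up for illustration, and the attribute pattern assumes double-quoted values.

```ruby
tag = '<link type="application/rss+xml" href="http://example.com/feed" rel="alternate" />'

# Single regex with a repeated group: groups 2 and 3 are overwritten on each
# repetition, so only the final attribute is reported.
single = tag.scan(/<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i)

# Two stages: first isolate the tag, then scan the attributes inside it.
attrs = tag[/<link\b[^>]*>/i].scan(/([\w:-]+)="([^"]*)"/).to_h

href = attrs['href'] if attrs['type'] == 'application/rss+xml'
```

Here `single` holds just the tag name and the last attribute pair, while `attrs` contains all three attributes regardless of their order.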

Robert Klemme

3/28/2009 10:50:00 AM

On 28.03.2009 11:00, Eric Hodel wrote:

> PS: You'll probably want to do something like scan for <, then scan
> for a tag name, then scan for attributes, then scan for >, etc.

I'd probably rather scan for each <link> tag and then analyze it, i.e.

doc.scan %r{<link[^>]*>}i do |link|
  if %r{(?i:type)=["']application/rss\+xml["']} =~ link
    ...
  end
end

Note that the scanning RX is weak.
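
One way the elided body might be filled in is sketched here. The method name `rss_hrefs` and the sample document are assumptions made for the example. The tag pattern is still weak in the same sense: a `>` inside an attribute value would cut the tag short.

```ruby
# Hypothetical helper: collects the href of every <link> tag whose type is
# application/rss+xml, wherever the type attribute sits in the tag.
def rss_hrefs(doc)
  hrefs = []
  doc.scan(%r{<link\b[^>]*>}i) do |link|
    # Stage 1: keep only tags with the wanted type.
    next unless %r{\btype\s*=\s*["']application/rss\+xml["']}i =~ link
    # Stage 2: pull the href value out of the same tag.
    href = link[%r{\bhref\s*=\s*["']([^"']*)["']}i, 1]
    hrefs << href if href
  end
  hrefs
end

# Made-up sample input with the two attribute orders from the question.
doc = <<~HTML
  <link href="http://example.com/a.xml" rel="alternate" type="application/rss+xml" />
  <link rel="alternate" type="application/rss+xml" href="http://example.com/b.xml">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

found = rss_hrefs(doc)
```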

But I agree, rather use the proper tool for the job.

Cheers

robert

Sean O'Halpin

3/28/2009 11:33:00 AM

On Sat, Mar 28, 2009 at 9:07 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:
> Hi,
> I'm learning about regular expressions right now for a html scraping
> based assignment. But now I've reached a problem. Given below are two
> different html tags.
>
> <link
> href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss...
> rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
> />
>
> <link rel="alternate" type="application/rss+xml" title="YouTube - Top
> Favorites Today"
> href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2...
>
> Now what i want is to capture the href-url if the type =
> "application/rss+xml". It seems to be simple but it is the position of
> the 'type' that creates the problem. In first tag the 'type' is after
> href and in the second the 'type' is before it. It seems to me as an
> interesting problem, but i need help for solving it. Please help me.
>
> Regards
> Arun Kumar

In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.

Arun Kumar

3/28/2009 11:45:00 AM

Sean O'halpin wrote:
> On Sat, Mar 28, 2009 at 9:07 AM, Arun Kumar
> <arunkumar@innovaturelabs.com> wrote:
>> <link rel="alternate" type="application/rss+xml" title="YouTube - Top
>> Arun Kumar
> In your last post you were telling us about a strict 'boss' who
> wouldn't let you use REXML or any XML parsing libraries. I take it
> this 'boss' is your teacher.
>
> This isn't an interesting problem. Do your own homework and don't lie
> to try to get others to do it for you.

Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem that I had has been solved; this
is a new assignment. To be frank, I have just 2 weeks of experience
in Ruby and there is nobody here who has knowledge of Ruby.
That is why I'm asking a favour of this community. I'm sorry if I'm
troubling you guys so much.

Regards
Arun Kumar . C. M.
--
Posted via http://www.ruby-....

7stud --

3/28/2009 12:55:00 PM

Sean O'Halpin

3/28/2009 1:06:00 PM

On Sat, Mar 28, 2009 at 11:44 AM, Arun Kumar
<arunkumar@innovaturelabs.com> wrote:
> Hi,
> You have completely misunderstood me. I'm working as a software engineer
> trainee right now. The first problem that i had has been solved. Now it
> is a new assignment. To tell frankly. I have just 2 weeks of experience
> in ruby and there is nobody right here that have knwledge about ruby.
> That is why i'm asking a favour through this community. I'm sorry if i'm
> troubling u guys so much.

If I have misrepresented you, then you have my sincerest apologies.
However, you have not really represented your own position terribly
well. Now that we know you are a trainee with little experience who is
currently specifically being trained in regular expressions, it makes
more sense that you cannot use REXML, etc. But this was not clear from
your previous posts.

By the way, you are more likely to get a positive response if you at
least show how far you have got with the problem yourself before
coming to the list.

And to make up for my grouchy mood this morning, here's my contribution:

hashes = []
data.scan(/<link[^>]+?>/) do |link|
  hashes << Hash[*link.scan(/([a-z]+)=["']?([^"]+)["']?/).flatten]
end
require 'pp'
pp hashes.select{ |hash| hash["type"] == "application/rss+xml" }

But I have no idea if this will meet the requirements of your
assignment or if you will understand it.
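
For comparison with the hash-based approach, the order independence can probably also be expressed in a single pattern by using a lookahead for the type attribute. This is a sketch under assumptions: the constant name `RSS_LINK`, the sample data, and the example.com URLs are invented here, attribute values are assumed to be quoted, and no `>` may appear inside a value.

```ruby
RSS_LINK = %r{
  <link\b
  (?=[^>]*\btype\s*=\s*["']application/rss\+xml["'])  # type may sit anywhere in the tag
  [^>]*?\bhref\s*=\s*["']([^"']*)["']                 # capture the href value
}xi

# Made-up sample input: type after href, type before href, and a non-RSS link.
data = <<~HTML
  <link
  href="http://example.com/a.xml"
  rel="alternate" type="application/rss+xml" title="A"
  />
  <link rel="alternate" type="application/rss+xml" title="B"
  href="http://example.com/b.xml">
  <link rel="stylesheet" type="text/css" href="/style.css">
HTML

urls = data.scan(RSS_LINK).flatten
```

The lookahead checks the type without consuming any text, so the href capture that follows still starts from the beginning of the tag.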

Regards,
Sean

comp.lang.ruby
