Asp Forum - Re: * Alan Silver wrote, On 24-2-2010 19:16:Try the HTMLAgilityPack, it is much

Josh a

2/7/2011 1:54:00 PM

I tried using HTMLAgilityPack in one of my projects, but performance slowed down dramatically, whereas with regex it was a piece of cake...

For the curious: I am trying to crawl approx. 1m domains to extract selected information from links found on each homepage and on other pages linked from it.

> On Wednesday, February 24, 2010 1:16 PM Alan Silver wrote:

> Hello,
>
> I am trying to write some code to check for a link in some HTML that has
> been pulled from a web site. I think this should be easy with a RegEx,
> but I cannot get my head round it.
>
> To make sure it is clear, a normal HTML link looks like...
>
> <a href="http://www.microsoft.com/sompage.aspx"... page</a>
>
> ...but can also look like...
>
> <a href="http://www.microsoft.com/sompage.... rel="nofollow">some
> page</a>
>
> There are loads of other variations, but this is all that interests me
> right now.
>
> I want to check the HTML to see...
>
> 1) Is there a link to my target URL (which will be given), and
> 2) Does that link have the rel="nofollow" part or not?
>
> Anyone any ideas how I would do this? I have tried all sorts of things,
> but not got anything that works.
>
> Just to throw a spanner in the works, the rel="nofollow" bit could
> appear before or after the href="whatever" bit.
>
> I would be really grateful for any help here.
>
> TIA
>
> --
> Alan Silver
> (anything added below this line is nothing to do with me)

>> On Wednesday, February 24, 2010 2:01 PM Alan Silver wrote:

>> Just to follow up on my own post, I have finally got something that nearly
>> works, but it is not quite there.
>>
>> Say I want to look for a link to the domain www.fred.com, then the
>> regex...
>>
>> <a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>
>>
>> ...will match the following...
>>
>> <a rel="nofollow" href="http://www.fred.com">fred...
>>
>> ...which is right, but it will also match...
>>
>> <a href="http://www.cnn.... rel="nofollow">CNN</a><a
>> href="http://www.fred.com">fred...
>>
>> ...which I do not want. It seems that the regex is matching the nofollow
>> part to the first link, and so telling me that the whole HTML fragment
>> contains a nofollow link to www.fred.com. This is wrong.
>>
>> So, how do I modify this regex so that it will not look at the nofollow
>> part in another link?
>>
>> Thanks for any help
>>
>> --
>> Alan Silver
>> (anything added below this line is nothing to do with me)

>>> On Thursday, February 25, 2010 5:17 AM Jesse Houwing wrote:

>>> * Alan Silver wrote, On 24-2-2010 19:16:
>>>
>>> Try the HTMLAgilityPack, it is much better for getting the information
>>> you want.
>>>
>>> See Codeplex.com/HtmlAgilityPack
>>>
>>> Jesse
>>>
>>> --
>>> Jesse Houwing
>>> jesse.houwing at sogeti.nl

>>>> On Thursday, February 25, 2010 9:07 AM eBob.com wrote:

>>>> I have not played with the HTMLAgilityPack or any other HTML parser so I
>>>> cannot compare that approach to RegEx.
>>>>
>>>> I highly recommend Expresso from UltraPico for experimenting with regular
>>>> expressions. (it is free.)
>>>>
>>>> I think your problem is that .*? is sucking up too many characters and
>>>> overflowing into another tag. So instead of matching . (any character) you
>>>> could try matching any character other than "<".
>>>>
>>>> Based on what you have told us, and just off the top of my head, I think my
>>>> expression would look for, in pseudo regex,
>>>>
>>>> <a optional nofollow http://www\.fred\.com optional nofollow </a>
>>>>
>>>> That would match some dumb html which had nofollow before and after the url,
>>>> but I'd guess that does not matter. I do not know if there is a way in regex
>>>> to insist that the nofollow can appear in one place or another but not both.
>>>> But using "named groups" (I think that is the right terminology) you could
>>>> determine where the nofollows had occurred.
>>>>
>>>> Good Luck, Bob
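
A minimal sketch of the point made above about ".*?" overflowing into another tag, using the two-link fragment from earlier in the thread. The class name and the sample HTML string are assumptions made for illustration; only the two patterns come from the discussion.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagBoundaryDemo
{
    // Two adjacent links: only the first one (cnn) carries rel="nofollow".
    public const string Html =
        "<a href=\"http://www.cnn.com\" rel=\"nofollow\">CNN</a>" +
        "<a href=\"http://www.fred.com\">fred</a>";

    // Lazy ".*?" still expands across tag boundaries when that is the only way to match.
    public static readonly Regex Lazy = new Regex(
        @"<a .*?nofollow.*?http://www\.fred\.com.*?>", RegexOptions.Singleline);

    // "[^<>]*" cannot leave the tag it started in.
    public static readonly Regex Bounded = new Regex(
        @"<a [^<>]*nofollow[^<>]*http://www\.fred\.com[^<>]*>");

    public static void Main()
    {
        Console.WriteLine(Lazy.IsMatch(Html));    // True  (false positive: nofollow belongs to cnn)
        Console.WriteLine(Bounded.IsMatch(Html)); // False (fred's tag has no nofollow)
    }
}
```

The only difference between the two patterns is the character class, which is what keeps each part of the match inside a single tag.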

>>>>> On Thursday, February 25, 2010 3:10 PM Alan Silver wrote:

>>>>> Hello,
>>>>>
>>>>> Thanks for the reply. I have Expresso, which is very good, but does not
>>>>> necessarily tell you how to build the regex you want.
>>>>>
>>>>> However, after some playing around, I came up with something that
>>>>> worked. As you pointed out, the regex was greedy, and was matching with
>>>>> stuff outside of the current tag. I added some bits to stop that, and it
>>>>> worked fine.
>>>>>
>>>>> I had to do two regexes, one to catch the nofollow before the href, and
>>>>> one when it was after. The code I ended up with was...
>>>>>
>>>>> Regex regLink = new Regex(@"<a .*?http://" + targetUrl.Replace(".",
>>>>> @"\.") + @".*?>.*?</a>", RegexOptions.Singleline);
>>>>>
>>>>> Regex regLinkNofollowL = new Regex(@"<a [^<>]+nofollow[^<>]+http://" +
>>>>> targetUrl.Replace(".", @"\.") + @"[^<>]+>", RegexOptions.Singleline);
>>>>>
>>>>> Regex regLinkNofollowR = new Regex(@"<a [^<>]+http://" +
>>>>> targetUrl.Replace(".", @"\.") + @"[^<>]+nofollow[^<>]+>",
>>>>> RegexOptions.Singleline);
>>>>>
>>>>> The string variable targetUrl contains the domain name of the link I
>>>>> want to look for.
>>>>>
>>>>> regLink.IsMatch(html) will be true if a link is found
>>>>>
>>>>> regLinkNofollowL.IsMatch(html) will be true if the link has a nofollow
>>>>> before the href
>>>>>
>>>>> regLinkNofollowR.IsMatch(html) will be true if the link has a nofollow
>>>>> after the href
>>>>>
>>>>> Hope this is of some use to someone.
>>>>>
>>>>> Thanks again for the reply.
>>>>>
>>>>>
>>>>> --
>>>>> Alan Silver
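
The three checks above could be wrapped in one helper, roughly as sketched here. This is a variation, not the code from the post: it assumes Regex.Escape in place of Replace(".", @"\."), uses "[^<>]*" throughout (including for the plain link check), and ignores case. The names LinkCheck and Check are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkCheck
{
    // Returns whether a link to targetUrl exists and whether that link's tag contains "nofollow".
    public static (bool Found, bool NoFollow) Check(string html, string targetUrl)
    {
        // Regex.Escape escapes every regex metacharacter in the domain, not only ".".
        string url = "http://" + Regex.Escape(targetUrl);

        // Any anchor tag that mentions the target URL.
        var link = new Regex(@"<a [^<>]*" + url + @"[^<>]*>", RegexOptions.IgnoreCase);
        // nofollow before the URL, inside the same tag.
        var nofollowLeft = new Regex(
            @"<a [^<>]*nofollow[^<>]*" + url + @"[^<>]*>", RegexOptions.IgnoreCase);
        // nofollow after the URL, inside the same tag.
        var nofollowRight = new Regex(
            @"<a [^<>]*" + url + @"[^<>]*nofollow[^<>]*>", RegexOptions.IgnoreCase);

        bool found = link.IsMatch(html);
        bool noFollow = nofollowLeft.IsMatch(html) || nofollowRight.IsMatch(html);
        return (found, noFollow);
    }

    public static void Main()
    {
        var result = Check("<a rel=\"nofollow\" href=\"http://www.fred.com\">fred</a>", "www.fred.com");
        Console.WriteLine(result.Found + " " + result.NoFollow); // True True
    }
}
```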

>>>>>> On Thursday, February 25, 2010 3:15 PM Alan Silver wrote:

>>>>>> Harumph! If I'd seen that earlier, I could have saved a good few hours
>>>>>> of frustration!
>>>>>>
>>>>>> Mind you, what I ended up with was very neat and compact, so I cannot
>>>>>> complain I suppose.
>>>>>>
>>>>>> Thanks for pointing that one out. It certainly deserves a close look.
>>>>>>
>>>>>> --
>>>>>> Alan Silver
>>>>>> (anything added below this line is nothing to do with me)

>>>>>>> On Tuesday, March 02, 2010 12:18 PM Michael Wojcik wrote:

>>>>>>> Alan Silver wrote:
>>>>>>>
>>>>>>> It is difficult to do this completely reliably, because implementing
>>>>>>> the entire HTML DTD *plus* violations of it that are accepted by
>>>>>>> common UAs (browsers and such) in a DFA is very complicated.
>>>>>>>
>>>>>>> If we make some assumptions about the quality of the HTML you are
>>>>>>> dealing with, though, we can simplify it considerably. Let's say that
>>>>>>> it has to be well-formed, and that there is no whitespace between "<"
>>>>>>> and "a" of an anchor tag.
>>>>>>>
>>>>>>> Then you can prevent your regex above from spanning multiple anchor
>>>>>>> elements by:
>>>>>>>
>>>>>>> - Ensuring you do not span the end of the <a> tag when matching the
>>>>>>> attributes you are looking for within it. Change ".*" in that part of
>>>>>>> the regex to "[^>]*", so the subexpression will stop at the closing ">".
>>>>>>>
>>>>>>> - Ensuring we do not capture "</a" between the "<a>" tag and the
>>>>>>> closing "</a>" tag - that is, that we stop at the first "</a>" and
>>>>>>> do not continue on to a later one, swallowing additional entire anchor
>>>>>>> elements in the process. You can do that with a regex that matches:
>>>>>>>
>>>>>>> - any number of:
>>>>>>>   - any number of characters that are not "<", then
>>>>>>>   - either:
>>>>>>>     - "<" followed by a character that is not "/", or
>>>>>>>     - "</" followed by a character that is not "a"
>>>>>>>
>>>>>>> That can be expressed by this regular expression:
>>>>>>>
>>>>>>> ([^<]*((<[^/])|(</[^a]))*)*
>>>>>>>
>>>>>>> (Read it from the inside out. "(<[^/])" is "'<' followed by a
>>>>>>> character that is not '/'", and so on.)
>>>>>>>
>>>>>>> That gives us:
>>>>>>>
>>>>>>> <a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>
>>>>>>>
>>>>>>> (That's probably going to be wrapped. It should be all on one line,
>>>>>>> obviously.)
>>>>>>>
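>>>>>>> Also note that you do not need the "?" operator after the "*" here. In
>>>>>>> ".*?" the "?" makes the match lazy (as few characters as possible); with
>>>>>>> "[^>]*" the character class already stops the match at the closing ">".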
>>>>>>>
>>>>>>> This works with your examples above. It also correctly handles child
>>>>>>> elements of the anchor element (other than <a> within <a>, which is not
>>>>>>> well-formed):
>>>>>>>
>>>>>>> <a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed...
>>>>>>>
>>>>>>> It seems to me that there ought to be a way to handle the second half
>>>>>>> of that regex with negative lookahead, which might be simpler, but I
>>>>>>> could not get that to work with a couple of quick tries.
>>>>>>>
>>>>>>> All this is assuming you actually need to match the entire anchor
>>>>>>> element in the HTML source for some reason. If you just want to verify
>>>>>>> whether the <a> tag is present with those attributes, you can ignore
>>>>>>> what comes after the closing ">" and greatly simplify the regex.
>>>>>>>
>>>>>>> --
>>>>>>> Michael Wojcik
>>>>>>> Micro Focus
>>>>>>> Rhetoric & Writing, Michigan State University
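
One possible way to write the second half with negative lookahead, as wondered about above, is sketched here. This is not the regex from the post: it assumes the .NET construct "(?:(?!</a>).)*" (match any character, one at a time, as long as "</a>" does not start at that position) together with RegexOptions.Singleline. The class name and the sample HTML are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorElementDemo
{
    public static readonly Regex NoFollowElement = new Regex(
        @"<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>" + // opening tag, confined by [^>]
        @"(?:(?!</a>).)*" +                                     // content: stop before the first "</a>"
        @"</a>",                                                // the closing tag itself
        RegexOptions.Singleline);

    public static void Main()
    {
        string html =
            "<a rel=\"nofollow\" href=\"http://www.fred.com\">f<b>r</b>ed</a>" +
            "<a href=\"http://www.cnn.com\">CNN</a>";

        // Prints: <a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>
        Console.WriteLine(NoFollowElement.Match(html).Value);
    }
}
```

The lookahead is tested before every content character, so the match ends at the first closing tag and child elements such as <b> are still allowed.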

>>>>>>>> On Wednesday, March 03, 2010 9:37 AM Alan Silver wrote:

>>>>>>>> Wow, what a comprehensive reply! Comments below...
>>>>>>>>
>>>>>>>>
>>>>>>>> Yup, I was assuming (perhaps foolishly) that such a simple thing as an
>>>>>>>> anchor tag might be generally well-formed ;-)
>>>>>>>>
>>>>>>>> <snip>
>>>>>>>> <Snip>
>>>>>>>>
>>>>>>>> I realised after posting that I was only interested in the opening part
>>>>>>>> of the tag, as my interest here is whether or not the link is there, and
>>>>>>>> if there is a nofollow value set. I ignored the anchor text and closing
>>>>>>>> tag.
>>>>>>>>
>>>>>>>> So, how does your regex compare with the one I posted a couple of days
>>>>>>>> ago? I solved the problem I had in a similar way to yours (I think), and
>>>>>>>> ended up with...
>>>>>>>>
>>>>>>>> <a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>
>>>>>>>>
>>>>>>>> This one only matches if there is a nofollow. I need to detect that, so
>>>>>>>> I had one regex to check for an anchor tag...
>>>>>>>>
>>>>>>>> <a .*?http://www\.fred\.com.*?>.*?</a>
>>>>>>>>
>>>>>>>> ...and then the previous regex to match a nofollow before the href and a
>>>>>>>> similar one for when the nofollow is after the href.
>>>>>>>>
>>>>>>>> Is there anything to choose between your method and mine? I am a rank
>>>>>>>> beginner at regexes, so if yours has some distinct advantage, please
>>>>>>>> explain what. It could just be that they are two slightly different ways
>>>>>>>> of doing the same thing, I do not know.
>>>>>>>>
>>>>>>>> Thanks very much for the reply
>>>>>>>>
>>>>>>>> --
>>>>>>>> Alan Silver
>>>>>>>> (anything added below this line is nothing to do with me)

>>>>>>>>> On Wednesday, March 03, 2010 12:48 PM Michael Wojcik wrote:

>>>>>>>>> Alan Silver wrote:
>>>>>>>>>
>>>>>>>>> Alas, with HTML, you never know (unless you validate the HTML). User
>>>>>>>>> Agents will accept all sorts of garbage, so many authors do not feel
>>>>>>>>> any need to create valid markup. But usually you can get by with some
>>>>>>>>> assumptions and live with a small probability of encountering bogus
>>>>>>>>> markup that does not work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you remove the part of mine that captures the element content and
>>>>>>>>> closing tag, your regex and mine have a few differences, but in
>>>>>>>>> practice they should be equally usable.
>>>>>>>>>
>>>>>>>>> You're eliminating "<" from inside the a tag. It should not appear
>>>>>>>>> there (unless the page uses the SGML short tag syntax, but I have never
>>>>>>>>> seen anyone do so), so in practice my "[^>]" and your "[^<>]" will
>>>>>>>>> produce the same results. Use whichever you prefer. (Some people might
>>>>>>>>> find yours more readable, due to its visual symmetry.)
>>>>>>>>>
>>>>>>>>> You're using the + operator where I use the * operator. We expect that
>>>>>>>>> at least one character will be matched in all of those places, so
>>>>>>>>> again this should not make any difference in practice.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You could combine all three of these into a single expression, but
>>>>>>>>> frankly if all you are looking for is whether you have a match - you are
>>>>>>>>> not capturing groups or anything like that - I'd stick with the three
>>>>>>>>> regexes you have now. They work, and they are easier to read,
>>>>>>>>> understand, and maintain.
>>>>>>>>>
>>>>>>>>> People who write a lot of regexes tend to start viewing them as an
>>>>>>>>> opportunity for cleverness to the point of obscurity, like TECO macros
>>>>>>>>> were back in the day. Personally, I am a fan of readability and
>>>>>>>>> maintainability. Where I have hard-coded regexes in my code, I usually
>>>>>>>>> split the string up into component parts with comments, so the reader
>>>>>>>>> can see what I am doing.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Michael Wojcik
>>>>>>>>> Micro Focus
>>>>>>>>> Rhetoric & Writing, Michigan State University
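
As a rough sketch of the two ideas above (a single combined expression, and a pattern split into commented parts), the three regexes might be merged as follows. The group names "before" and "after", the class name, and the Build helper are assumptions made for this example and do not appear in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedLinkRegex
{
    public static Regex Build(string targetUrl)
    {
        string pattern =
            @"<a\s" +                        // start of an anchor tag
            @"(?<before>[^<>]*nofollow)?" +  // optional nofollow before the href
            @"[^<>]*http://" +               // rest of the tag up to the URL scheme
            Regex.Escape(targetUrl) +        // target domain with metacharacters escaped
            @"(?<after>[^<>]*nofollow)?" +   // optional nofollow after the href
            @"[^<>]*>";                      // remainder of the opening tag
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Match m = Build("www.fred.com")
            .Match("<a href=\"http://www.fred.com\" rel=\"nofollow\">fred</a>");
        Console.WriteLine(m.Success);                  // True
        Console.WriteLine(m.Groups["before"].Success); // False
        Console.WriteLine(m.Groups["after"].Success);  // True
    }
}
```

Whether this is easier to maintain than three separate regexes is a judgment call; the per-line comments are what keep it readable.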

>>>>>>>>>> On Sunday, March 07, 2010 2:12 PM Alan Silver wrote:

>>>>>>>>>> Hee hee, I am with you. I remember my (fairly brief) foray into Perl. I got
>>>>>>>>>> the same impression there - some people were only interested in how
>>>>>>>>>> short (and therefore unreadable) they could make their coding.
>>>>>>>>>>
>>>>>>>>>> Anyway, I am glad what I did is basically the same as yours. I understand
>>>>>>>>>> it a lot better now. Thanks very much for the help.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Alan Silver
>>>>>>>>>> (anything added below this line is nothing to do with me)


1 Answer

Alan Silver

2/25/2010 8:15:00 PM

>Try the HTMLAgilityPack, it's much better for getting the information
>you want.
>
>See Codeplex.com/HtmlAgilityPack

Harumph! If I'd seen that earlier, I could have saved a good few hours
of frustration!

Mind you, what I ended up with was very neat and compact, so I can't
complain I suppose.

Thanks for pointing that one out. It certainly deserves a close look.

--
Alan Silver
(anything added below this line is nothing to do with me)

microsoft.public.dotnet.framework
