comp.lang.ruby

String replacing help

Jeremy Woertink

4/19/2009 4:59:00 AM

I'm working with Mechanize doing some screen scraping. Because of the
project, I have to use an older version of Mechanize for now: 0.8.4.

The goal of what I'm trying to do is take a string and insert pipes '|'
before words that are *not* inside of <a></a>.

I have:
>> template_body.class
=> Hpricot::Elements
>> template_body.to_html
=> "<pre><a href=\"javascript:document.f6.SLID.value='F36';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">HEAD</a>&nbsp;\n <a
href=\"javascript:document.f6.SLID.value='F37'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">TITLE</a>&nbsp;<a
href=\"javascript:document.f6.SLID.value='F38'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">\"test\"</a>\n<a
href=\"javascript:document.f6.SLID.value='F39'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">BODY</a>&nbsp;\n
<a href=\"javascript:document.f6.SLID.value='F40';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">DIV</a>&nbsp;id&nbsp;<a
href=\"javascript:document.f6.SLID.value='F41'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">\"main-div\"</a>\n
<a href=\"javascript:document.f6.SLID.value='F42';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">CSS-WITH-LINK</a>&nbsp;destination&nbsp;<a
href=\"javascript:document.f6.SLID.value='F43'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">TO</a>&nbsp;<a
href=\"javascript:document.f6.SLID.value='F44'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">:index</a>\n
<a href=\"javascript:document.f6.SLID.value='F45';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">IMAGE</a>&nbsp;source&nbsp;<a
href=\"javascript:document.f6.SLID.value='F46'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return
true;\">RENDER</a>&nbsp;image&nbsp;<a
href=\"javascript:document.f6.SLID.value='F47'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">@image</a>\n
max-height&nbsp;<a href=\"javascript:document.f6.SLID.value='F48';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">m-h</a>\n</pre>"


The words I want to insert the pipe before have non-breaking spaces
(&nbsp;) around them. I had it working to the point where I can iterate
through everything and build a new string, but I end up losing all line
breaks and non-breaking spaces using:

new_body = ''
template_body.to_html.split("&nbsp;").each do |el|
  el.split("\n").each do |e|
    unless e.empty? or e =~ /<\/?[^>]*>/
      e = '|' + e
    end
  end
  new_body += el
end


Any ideas?

Thanks,
~Jeremy
--
Posted via http://www.ruby-....

5 Answers

Andrew Timberlake

4/19/2009 5:32:00 AM

On Sun, Apr 19, 2009 at 6:59 AM, Jeremy Woertink
<jeremywoertink@gmail.com> wrote:
> I'm working with Mechanize doing some screen scraping. Because of the
> project, I have to use an older version of Mechanize for now, I'm using
> 0.8.4.
>
> The goal of what I'm trying to do is take a string and insert pipes '|'
> before words that are *not* inside of <a></a>.
>

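Does this work?

# s is the HTML string, e.g. template_body.to_html
in_a = false
result = ""
# Walk the string as a sequence of tags and text runs.
s.scan(/<[^>]+>|[^<]+/).each do |e|
  if e =~ /<a/
    in_a = true
    result << e
  elsif e =~ /<\/a/
    in_a = false
    result << e
  elsif !in_a && e =~ /\A\s*\w/
    # Text outside an anchor that starts with a word character
    result << "|#{e}"
  else
    result << e
  end
end
result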


Andrew Timberlake
http://ramblingso...
http://www.linkedin.com/in/andrew...

"I have never let my schooling interfere with my education" - Mark Twain
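
One thing to watch with the scan loop above: a text run such as "&nbsp;id&nbsp;" starts with "&", so a check like /\A\s*\w/ would not fire for it. Below is a minimal sketch of a variant that puts the pipe in front of each word inside the text run instead of in front of the whole run. The method name pipe_outside_anchors is made up for illustration, and it assumes anchors are not nested and that &nbsp; is the only entity in the text.

```ruby
# Hypothetical helper (not from the thread): walks tags and text runs,
# and prefixes words with "|" only while outside an <a>...</a> element.
def pipe_outside_anchors(html)
  in_a = false
  out = []
  html.scan(/<[^>]+>|[^<]+/).each do |token|
    if token.start_with?("<")
      # Tags are copied through; anchor tags also flip the state.
      in_a = true  if token =~ /\A<a[\s>]/i
      in_a = false if token =~ /\A<\/a\s*>/i
      out << token
    elsif in_a
      out << token
    else
      # Insert "|" at each word start that follows the beginning of the
      # run, whitespace, or an &nbsp; entity. Nothing else is changed,
      # so line breaks and &nbsp; survive.
      out << token.gsub(/(\A|\s|&nbsp;)(?=[^\s&])/) { "#{$1}|" }
    end
  end
  out.join
end

puts pipe_outside_anchors('<a href="#">DIV</a>&nbsp;id&nbsp;<a href="#">"main-div"</a>')
# prints: <a href="#">DIV</a>&nbsp;|id&nbsp;<a href="#">"main-div"</a>
```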

7stud --

4/19/2009 8:00:00 AM

Jeremy Woertink wrote:
> The goal of what I'm trying to do is take a string and insert pipes '|'
> before words that are *not* inside of <a></a>.
>

Yet you were not able to provide an example of the desired result?


> The words I'm trying to insert the pipe before have a non-breaking space
> tags around them.

Based on *my* interpretation of your description, I think this is what
you want:

result = str.gsub(/&nbsp;([^<]*)&nbsp;/m, '&nbsp;|\1&nbsp;')

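
For anyone trying that substitution, here is a quick check on a cut-down sample. The sample string is invented for illustration (shortened href values, two lines only); it is not the poster's real data.

```ruby
# Made-up sample in the same shape as the scraped HTML.
str = "<a href=\"#\">DIV</a>&nbsp;id&nbsp;<a href=\"#\">\"main-div\"</a>\n" \
      "max-height&nbsp;<a href=\"#\">m-h</a>\n"

# A pipe goes in front of any run of text that sits between two &nbsp;
# entities with no "<" in between.
result = str.gsub(/&nbsp;([^<]*)&nbsp;/m, '&nbsp;|\1&nbsp;')
puts result
```

On this sample, "id" becomes "|id" and the line break is kept. "max-height" is left alone, because it only has an &nbsp; after it and none before it.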

Jeremy Woertink

4/19/2009 9:53:00 AM

7stud -- wrote:
> Jeremy Woertink wrote:
>> The goal of what I'm trying to do is take a string and insert pipes '|'
>> before words that are *not* inside of <a></a>.
>>
>
> Yet you were not able to provide an example of the desired result?

I'll try what you have, but here is the desired result:

HEAD
TITLE "test"
BODY
DIV |id "main-div"
CSS-WITH-LINK |destination TO :index
IMAGE |source RENDER |image @image


I'm getting:

HEAD

TITLE
"test"
BODY

DIV
|id
"main-div"
CSS-WITH-LINK
|destination
TO
:index

IMAGE
|source
RENDER
|image
@image

7stud --

4/19/2009 3:58:00 PM

Jeremy Woertink wrote:
> 7stud -- wrote:
>> Jeremy Woertink wrote:
>>> The goal of what I'm trying to do is take a string and insert pipes '|'
>>> before words that are *not* inside of <a></a>.
>>>
>>
>> Yet you were not able to provide an example of the desired result?
>
> I'll try what you have, but here is what the desired result
>
> HEAD
> TITLE "test"
> BODY
> DIV |id "main-div"
> CSS-WITH-LINK |destination TO :index
> IMAGE |source RENDER |image @image
>
>

Ok. Here's the deal. When asking this type of question, you need to
post two things:

1) The starting string.
2) The result string.


Apparently, you want to know what regex will transform this string:

"<pre><a href=\"javascript:document.f6.SLID.value='F36';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">HEAD</a>&nbsp;\n <a
href=\"javascript:document.f6.SLID.value='F37'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">TITLE</a>&nbsp;<a
href=\"javascript:document.f6.SLID.value='F38'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">\"test\"</a>\n<a
href=\"javascript:document.f6.SLID.value='F39'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">BODY</a>&nbsp;\n
<a href=\"javascript:document.f6.SLID.value='F40';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">DIV</a>&nbsp;id&nbsp;<a
href=\"javascript:document.f6.SLID.value='F41'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">\"main-div\"</a>\n
<a href=\"javascript:document.f6.SLID.value='F42';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">CSS-WITH-LINK</a>&nbsp;destination&nbsp;<a
href=\"javascript:document.f6.SLID.value='F43'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">TO</a>&nbsp;<a
href=\"javascript:document.f6.SLID.value='F44'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">:index</a>\n
<a href=\"javascript:document.f6.SLID.value='F45';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">IMAGE</a>&nbsp;source&nbsp;<a
href=\"javascript:document.f6.SLID.value='F46'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return
true;\">RENDER</a>&nbsp;image&nbsp;<a
href=\"javascript:document.f6.SLID.value='F47'; document.f6.submit();\"
onMouseOut=\"window.status='';\" title=\"Select\"
onMouseOver=\"window.status='Select'; return true;\">@image</a>\n
max-height&nbsp;<a href=\"javascript:document.f6.SLID.value='F48';
document.f6.submit();\" onMouseOut=\"window.status='';\"
title=\"Select\" onMouseOver=\"window.status='Select'; return
true;\">m-h</a>\n</pre>"

into this string:

"HEAD
TITLE "test"
BODY
DIV |id "main-div"
CSS-WITH-LINK |destination TO :index
IMAGE |source RENDER |image @image"

Good luck with that.




Jeremy Woertink

4/19/2009 8:43:00 PM

Yeah, sorry. I posted that at 3:00 AM after a few beers and didn't think
about it until this morning. I made a few changes based on your
suggestions. So, let's try this again...


Here is my method
http://rafb.net/p/rVu...

Here is what I am getting...
http://rafb.net/p/O1c...

Here is what I want
http://rafb.net/p/Omo...

My thought was that I would take that HTML string, and use a regexp that
would find the text not inside any anchor tags and just add the pipe to
it, then return that original HTML string. I want to keep all the &nbsp;
and \n so the formatting remains the same.

At this point, if I can figure out how to place the line breaks back in
the right place, then I may just be set.

Thanks for the help guys,

~Jeremy
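
The idea described above (leave the HTML string alone and only add a pipe to text outside the anchor tags) can be sketched with a single gsub. This is only a sketch under some assumptions: anchors are not nested, the text contains no stray "<" or "&", and the names PIPE_PATTERN and add_pipes are made up here rather than taken from the linked pastes.

```ruby
# Hypothetical names; one pass over the HTML with three alternatives.
PIPE_PATTERN = /
    (<a\b[^>]*>.*?<\/a>)   # 1: a whole anchor element, copied through
  | (<[^>]+>|&\w+;)        # 2: any other tag or a named entity, copied through
  | ([^\s<&]+)             # 3: a word outside any anchor, gets the pipe
/mix

def add_pipes(html)
  # Only group 3 is rewritten. Whitespace, line breaks and &nbsp; are
  # either unmatched or copied as is, so the formatting stays the same.
  html.gsub(PIPE_PATTERN) { $3 ? "|#{$3}" : $& }
end

sample = "<pre><a href=\"#\">IMAGE</a>&nbsp;source&nbsp;" \
         "<a href=\"#\">RENDER</a>&nbsp;image&nbsp;" \
         "<a href=\"#\">@image</a>\n</pre>"
puts add_pipes(sample)
```

Because the anchor alternative comes first, everything between an opening <a ...> and its closing </a> is consumed in one match and never reaches the word alternative.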