William James
9/18/2007 3:58:00 PM
On Sep 18, 10:27 am, Peter Bailey <pbai...@bna.com> wrote:
> William James wrote:
> > On Sep 18, 8:28 am, Peter Bailey <pbai...@bna.com> wrote:
>
> >> entry as one that's 300 lines deep into the file. In fact, the results
> >> look exactly the same to me.
>
> > Don't give up yet. A regular expression is a very concentrated
> > piece of code, and it very often requires tweaking.
>
> > Can you show us the first entry in the file that should
> > be matched? That would enable us to test our reg.exps.
>
> > Some tricky points. A . won't match a newline unless
> > the m modifier is at the end of the regexp.
> > .* will often match too much unless you make it
> > non-greedy by appending ? (i.e., .*?).
> > Sometimes it's best to make the regexp case-insensitive
> > by using the i modifier.
> > You may assume that your text will always have
> > <issue code=
> > but perhaps it has
> > <issue code =
>
> > Try this:
>
> > %q{
> > <issueList>
> > <issue code = "BCD" >
> > I'm Issue XIV,
> > who are you?
> > </issue>
>
> > <issueList><issue code="XYZ">
> > I'm Issue XX, are you?
> > </issue>
>
> > }.scan(
> > /<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
> > p $1
> > }
>
> Believe me, I haven't given up. I need this to work! I really appreciate
> your perseverance, though. Here's what I have now:
>
> xmlfile.scan(/<issueList>\s*<issue +code *="[A-Z]{3}
> *">(.*?)<\/issue>\n?/mi) do |match|
> codes = $1
> puts codes
> end
>
> My xml file that I'm testing is 2087 lines deep. The first entry in this
> file is on lines 21-23. Here they are:
>
> <issueList>
> <issue code="TRD">Trade (Domestic & Foreign)</issue>
> </issueList>
>
> So, these words, "Trade (Domestic & Foreign)" should be my first
> entry in my array. But, it continues to come up with the word
> "Immigration" as the first entry in the array, and that's way down on
> line 358.
>
> Thanks,
> Peter
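During the posting process, your regexp was broken into
two lines; when I corrected that, it worked.
Here I've slightly shortened it. The shortened version no longer
allows spaces around the =, so the first sample entry below
is deliberately left unmatched.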
%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>
<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>
<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>
}.scan(
  /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m){
  p $1
}
==== output ====
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
==== end of output ====
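Since you want the results in an array, here is a rough sketch of how the pieces fit together. The sample text and the "IMM" code are made up for illustration (I don't know what your real Immigration entry looks like), and the variable names are just placeholders. It also shows what happens when the m modifier or the non-greedy ? is left out.

```ruby
# Made-up sample text that mirrors the entries discussed in this thread.
# The "IMM" code is an assumption, not taken from the real file.
text = <<~XML
  <issueList>
  <issue code="TRD">Trade (Domestic & Foreign)</issue>
  </issueList>
  <issueList>
  <issue code="IMM">
  Immigration
  </issue>
  </issueList>
XML

pattern = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m

# With one capture group, scan returns one sub-array per match,
# so flatten gives a plain array of the captured strings.
issues = text.scan(pattern).flatten.map(&:strip)

# Without m, "." stops at a newline, so the multi-line entry is missed.
single_line_only = text.scan(/<issue +code="[A-Z]{3}">(.*?)<\/issue>/).flatten

# A greedy .* runs on to the last </issue> and swallows everything between.
greedy = text.scan(/<issue +code="[A-Z]{3}">(.*)<\/issue>/m).flatten

p issues
p single_line_only
p greedy.length
```

If issues.first is not the entry you expect on your real file, the mismatch is in the file's text rather than in the scan call itself.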
If this still won't work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren't?
Perhaps this would be worth a try:
/<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
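To check for stray characters directly, something like the following might help. It is only a sketch: the sample string is hypothetical, with a non-breaking space (U+00A0) planted between "<issue" and "code" to stand in for whatever might be in your file, and it assumes a reasonably recent Ruby with UTF-8 strings. On the real data you would read the file into the string instead.

```ruby
# Hypothetical sample: U+00A0 (non-breaking space) sits where a plain
# space is expected. Replace this with the contents of the real file.
sample = "<issueList>\n<issue\u00A0code=\"TRD\">Trade</issue>\n</issueList>\n"

strict   = /<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m
tolerant = /<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m

# Collect every character that is neither printable ASCII nor
# tab/LF/CR, together with its offset and code point.
suspects = sample.each_char.with_index
                 .reject { |ch, _| ch.match?(/[\x20-\x7E\t\n\r]/) }
                 .map { |ch, i| [i, format("U+%04X", ch.ord)] }

p suspects
p sample.scan(strict).flatten    # the literal-space pattern finds nothing
p sample.scan(tolerant).flatten  # \W+ accepts the odd character
```

If suspects comes back non-empty for your file, the offsets show where to look, and the tolerant pattern should get past those spots.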