Asp Forum - finding string matches, in order, in a file

Peter Bailey

9/18/2007 12:42:00 PM

Hi,
I've got files I want to parse. I'm using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I'm getting an array all right, but, I don't
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that's 300 lines deep into the file. Why isn't the
first instance in the string, the file, the first entry in the array?

xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/) do |match|
codes = $1
puts codes
end

Thanks,
Peter
--
Posted via http://www.ruby-....

18 Answers

William James

9/18/2007 1:12:00 PM

Peter Bailey wrote:
> Hi,
> I've got files I want to parse. I'm using a string scan routine that
> populates an array. I need to pull the entries of that array out, in
> order, eventually. I'm getting an array all right, but, I don't
> understand its order. The first instance in the string, meaning the
> whole file, is way down the list in the array. The first entry in the
> array is an entry that's 300 lines deep into the file. Why isn't the
> first instance in the string, the file, the first entry in the array?

The file may have multiple copies of some entries, and
your regexp may be botched.

>
> xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)

I don't like the looks of that regular expression. Try this one.

/<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m
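
A small sketch of what the two changes in that regexp (the non-greedy .*? and the m modifier) do. The sample strings and variable names here are made up for illustration and are not taken from the real file:

```ruby
# Hypothetical one-line sample with two entries side by side.
one_line = '<issue code="AAA">first</issue><issue code="BBB">second</issue>'

# Greedy (.*) runs to the LAST </issue> on the line and swallows both entries.
greedy = one_line.scan(/<issue code="[A-Z]{3}">(.*)<\/issue>/).flatten

# Non-greedy (.*?) stops at the FIRST </issue>, giving one result per entry.
lazy = one_line.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/).flatten

# Hypothetical entry whose text spans two lines.
multi_line = "<issue code=\"CCC\">spans\ntwo lines</issue>"

# Without /m, "." does not cross the newline, so nothing matches.
without_m = multi_line.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/).flatten

# With /m, "." also matches newlines.
with_m = multi_line.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/m).flatten

p greedy, lazy, without_m, with_m
```

Called without a block, scan returns the captures as an array of arrays, in the order the matches occur in the string; flatten turns that into a plain array when there is a single capture group.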

Peter Bailey

9/18/2007 1:28:00 PM

William James wrote:
> Peter Bailey wrote:
>> Hi,
>> I've got files I want to parse. I'm using a string scan routine that
>> populates an array. I need to pull the entries of that array out, in
>> order, eventually. I'm getting an array all right, but, I don't
>> understand its order. The first instance in the string, meaning the
>> whole file, is way down the list in the array. The first entry in the
>> array is an entry that's 300 lines deep into the file. Why isn't the
>> first instance in the string, the file, the first entry in the array?
>
> The file may have multiple copies of some entries, and
> your regexp may be botched.
>
>>
>> xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)
>
> I don't like the looks of that regular expression. Try this one.
>
> /<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m

Thanks, William. I tried your regex, but, I'm still getting the first
entry as one that's 300 lines deep into the file. In fact, the results
look exactly the same to me.


Robert Klemme

9/18/2007 1:55:00 PM

2007/9/18, Peter Bailey <pbailey@bna.com>:
> William James wrote:
> > Peter Bailey wrote:
> >> Hi,
> >> I've got files I want to parse. I'm using a string scan routine that
> >> populates an array. I need to pull the entries of that array out, in
> >> order, eventually. I'm getting an array all right, but, I don't
> >> understand its order. The first instance in the string, meaning the
> >> whole file, is way down the list in the array. The first entry in the
> >> array is an entry that's 300 lines deep into the file. Why isn't the
> >> first instance in the string, the file, the first entry in the array?
> >
> > The file may have multiple copies of some entries, and
> > your regexp may be botched.
> >
> >>
> >> xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)
> >
> > I don't like the looks of that regular expression. Try this one.
> >
> > /<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m
>
>
> Thanks, William. I tried your regex, but, I'm still getting the first
> entry as one that's 300 lines deep into the file. In fact, the results
> look exactly the same to me.

Still William's regexp is significantly better than the original one.
You seem to be processing XML files. It may be that there is some
white space between <issueList> and <issue> that you are not prepared
for. You can handle that by replacing \n with \s*.

A completely different approach is to use REXML or another XML tool
and use XPath search. This is way less error prone - but usually also
slower. If you just want to extract these codes then a SAX parser
approach might still be pretty fast.

Kind regards

robert
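
For reference, a minimal sketch of the REXML/XPath approach. The sample document is hypothetical (the real file's surrounding structure isn't shown in this thread), and it assumes the ampersand is stored as &amp; in the actual XML. On Ruby 3.0 and later REXML ships as a bundled gem rather than in the core library, so it may need to be installed separately:

```ruby
require 'rexml/document'

# Hypothetical sample; element names other than issueList/issue are made up.
xml = <<~XML
  <filing>
    <issueList>
      <issue code="TRD">Trade (Domestic &amp; Foreign)</issue>
    </issueList>
    <issueList>
      <issue code="IMM">Immigration</issue>
    </issueList>
  </filing>
XML

doc = REXML::Document.new(xml)

# XPath.match returns the matching elements in document order.
issues = REXML::XPath.match(doc, '//issueList/issue').map do |el|
  [el.attributes['code'], el.text]
end

issues.each { |code, text| puts "#{code}: #{text}" }
```

This sidesteps the whitespace and attribute-spacing guesswork of a regexp, at the cost of parsing the whole document first.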

Peter Bailey

9/18/2007 2:31:00 PM

Robert Klemme wrote:
> 2007/9/18, Peter Bailey <pbailey@bna.com>:
>> Thanks, William. I tried your regex, but, I'm still getting the first
>> entry as one that's 300 lines deep into the file. In fact, the results
>> look exactly the same to me.
>
> Still William's regexp is significantly better than the original one.
> You seem to be processing XML files. It may be that there is some
> white space between <issueList> and <issue> that you are not prepared
> for. You can handle that by replacing \n with \s*.
>
> A completely different approach is to use REXML or another XML tool
> and use XPath search. This is way less error prone - but usually also
> slower. If you just want to extract these codes then a SAX parser
> approach might still be pretty fast.
>
> Kind regards
>
> robert

Same old output. I'll look into REXML. I downloaded it. But, it's enough
for me to just learn Ruby. I don't know if I can handle yet another
scripting language. Anyway, thanks a lot.
-Peter


William James

9/18/2007 3:10:00 PM

On Sep 18, 8:28 am, Peter Bailey <pbai...@bna.com> wrote:
> William James wrote:
> > Peter Bailey wrote:
> >> Hi,
> >> I've got files I want to parse. I'm using a string scan routine that
> >> populates an array. I need to pull the entries of that array out, in
> >> order, eventually. I'm getting an array all right, but, I don't
> >> understand its order. The first instance in the string, meaning the
> >> whole file, is way down the list in the array. The first entry in the
> >> array is an entry that's 300 lines deep into the file. Why isn't the
> >> first instance in the string, the file, the first entry in the array?
>
> > The file may have multiple copies of some entries, and
> > your regexp may be botched.
>
> >> xmlfile.scan(/<issueList>\n<issue code=\"[A-Z]{3}\">(.*)<\/issue>\n?/)
>
> > I don't like the looks of that regular expression. Try this one.
>
> > /<issueList>\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m
>
> Thanks, William. I tried your regex, but, I'm still getting the first
> entry as one that's 300 lines deep into the file. In fact, the results
> look exactly the same to me.

Don't give up yet. A regular expression is a very concentrated
piece of code, and it very often requires tweaking.

Can you show us the first entry in the file that should
be matched? That would enable us to test our reg.exps.

Some tricky points. A . won't match a newline unless
the m modifier is at the end of the regexp.
.* will often match too much unless you make it
non-greedy by appending ? (i.e., .*?).
Sometimes it's best to make the regexp case-insensitive
by using the i modifier.
You may assume that your text will always have
<issue code=
but perhaps it has
<issue code =

Try this:

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

}.scan(
/<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
p $1
}

Robert Klemme

9/18/2007 3:22:00 PM

2007/9/18, Peter Bailey <pbailey@bna.com>:
> Robert Klemme wrote:
> > 2007/9/18, Peter Bailey <pbailey@bna.com>:
> >> Thanks, William. I tried your regex, but, I'm still getting the first
> >> entry as one that's 300 lines deep into the file. In fact, the results
> >> look exactly the same to me.
> >
> > Still William's regexp is significantly better than the original one.
> > You seem to be processing XML files. It may be that there is some
> > white space between <issueList> and <issue> that you are not prepared
> > for. You can handle that by replacing \n with \s*.
> >
> > A completely different approach is to use REXML or another XML tool
> > and use XPath search. This is way less error prone - but usually also
> > slower. If you just want to extract these codes then a SAX parser
> > approach might still be pretty fast.
> >
> > Kind regards
> >
> > robert
>
> Same old output. I'll look into REXML. I downloaded it.

It's part of the standard distribution.

> But, it's enough
> for me to just learn Ruby. I don't know if I can handle yet another
> scripting language. Anyway, thanks a lot.

Well, as William said: can you show a piece of the document you are
trying to match?

Kind regards

robert

Peter Bailey

9/18/2007 3:28:00 PM

William James wrote:
> On Sep 18, 8:28 am, Peter Bailey <pbai...@bna.com> wrote:
>>
>> entry as one that's 300 lines deep into the file. In fact, the results
>> look exactly the same to me.
>
> Don't give up yet. A regular expression is a very concentrated
> piece of code, and it very often requires tweaking.
>
> Can you show us the first entry in the file that should
> be matched? That would enable us to test our reg.exps.
>
> Some tricky points. A . won't match a newline unless
> the m modifier is at the end of the regexp.
> .* will often match too much unless you make it
> non-greedy by appending ? (i.e., .*?).
> Sometimes it's best to make the regexp case-insensitive
> by using the i modifier.
> You may assume that your text will always have
> <issue code=
> but perhaps it has
> <issue code =
>
> Try this:
>
> %q{
> <issueList>
> <issue code = "BCD" >
> I'm Issue XIV,
> who are you?
> </issue>
>
> <issueList><issue code="XYZ">
> I'm Issue XX, are you?
> </issue>
>
> }.scan(
> /<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
> p $1
> }

Believe me, I haven't given up. I need this to work! I really appreciate
your perseverance, though. Here's what I have now:

xmlfile.scan(/<issueList>\s*<issue +code *="[A-Z]{3}
*">(.*?)<\/issue>\n?/mi) do |match|
codes = $1
puts codes
end

My xml file that I'm testing is 2087 lines deep. The first entry in this
file is on lines 21-23. Here they are:

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

So, these words, "Trade (Domestic & Foreign)" should be my first
entry in my array. But, it continues to come up with the word
"Immigration" as the first entry in the array, and that's way down on
line 358.

Thanks,
Peter
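
String#scan walks the string from left to right, so the matches themselves always come out in file order. One way to see where each match really sits is to print the line number it starts on. The sketch below uses a short hypothetical stand-in for the 2087-line file, and the names xmlfile, pattern and found are only for illustration:

```ruby
# Hypothetical stand-in for the real file: "Trade" near the top,
# "Immigration" further down.
xmlfile = <<~XML
  <root>
  <issueList>
  <issue code="TRD">Trade</issue>
  </issueList>
  <filler/>
  <issueList>
  <issue code="IMM">Immigration</issue>
  </issueList>
  </root>
XML

pattern = /<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m

found = []
xmlfile.scan(pattern) do
  m = $~                                  # MatchData for the current match
  line_no = m.pre_match.count("\n") + 1   # 1-based line where the match starts
  found << [line_no, m[1]]
end

found.each { |line_no, text| puts "line #{line_no}: #{text}" }
```

If the first line number reported for the real file is 358 rather than 21, the earlier entries are not matching the pattern at all. If line 21 does show up, the matching is fine and the problem lies in how the output is being viewed or collected.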


William James

9/18/2007 3:58:00 PM

On Sep 18, 10:27 am, Peter Bailey <pbai...@bna.com> wrote:
> William James wrote:
> > On Sep 18, 8:28 am, Peter Bailey <pbai...@bna.com> wrote:
>
> >> entry as one that's 300 lines deep into the file. In fact, the results
> >> look exactly the same to me.
>
> > Don't give up yet. A regular expression is a very concentrated
> > piece of code, and it very often requires tweaking.
>
> > Can you show us the first entry in the file that should
> > be matched? That would enable us to test our reg.exps.
>
> > Some tricky points. A . won't match a newline unless
> > the m modifier is at the end of the regexp.
> > .* will often match too much unless you make it
> > non-greedy by appending ? (i.e., .*?).
> > Sometimes it's best to make the regexp case-insensitive
> > by using the i modifier.
> > You may assume that your text will always have
> > <issue code=
> > but perhaps it has
> > <issue code =
>
> > Try this:
>
> > %q{
> > <issueList>
> > <issue code = "BCD" >
> > I'm Issue XIV,
> > who are you?
> > </issue>
>
> > <issueList><issue code="XYZ">
> > I'm Issue XX, are you?
> > </issue>
>
> > }.scan(
> > /<issueList>\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
> > p $1
> > }
>
> Believe me, I haven't given up. I need this to work! I really appreciate
> your perseverance, though. Here's what I have now:
>
> xmlfile.scan(/<issueList>\s*<issue +code *="[A-Z]{3}
> *">(.*?)<\/issue>\n?/mi) do |match|
> codes = $1
> puts codes
> end
>
> My xml file that I'm testing is 2087 lines deep. The first entry in this
> file is on lines 21-23. Here they are:
>
> <issueList>
> <issue code="TRD">Trade (Domestic & Foreign)</issue>
> </issueList>
>
> So, these words, "Trade (Domestic & Foreign)" should be my first
> entry in my array. But, it continues to come up with the word
> "Immigration" as the first entry in the array, and that's way down on
> line 358.
>
> Thanks,
> Peter

During the posting process, your regexp was broken into
2 lines; when I corrected that, it worked.

Here I've slightly shortened it.

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

}.scan(
/<issueList>\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m){
p $1
}

==== output ====
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
==== end of output ====

If this still won't work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren't?
Perhaps this would be worth a try:
/<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
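
A sketch of how such contamination could be detected. It assumes, purely for illustration, that a no-break space (U+00A0) follows <issueList>; whether the real file contains anything like that is unknown. In Ruby, \s matches only ASCII whitespace, while the POSIX bracket [[:space:]] also covers Unicode whitespace:

```ruby
# Hypothetical sample: a no-break space (U+00A0) hides after <issueList>.
sample = "<issueList>\u00A0\n<issue code=\"TRD\">Trade</issue>\n"

strict  = /<issueList>\s*<issue code="[A-Z]{3}">(.*?)<\/issue>/m
lenient = /<issueList>[[:space:]]*<issue code="[A-Z]{3}">(.*?)<\/issue>/m

strict_hits  = sample.scan(strict).flatten    # \s does not match U+00A0
lenient_hits = sample.scan(lenient).flatten   # [[:space:]] does

# List every character outside printable ASCII and newline,
# with its position and code point.
suspects = sample.each_char.with_index
                 .reject { |ch, _| ch.match?(/[\x20-\x7E\n]/) }
                 .map { |ch, i| [i, format('U+%04X', ch.ord)] }

p strict_hits
p lenient_hits
p suspects
```

Running the same suspects check over lines 21-23 of the real file would show whether anything other than plain spaces and newlines is present there (a carriage return from Windows line endings would show up as U+000D).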

Peter Bailey

9/18/2007 6:27:00 PM

> If this still won't work on your file, could the file
> be contaminated with some non-displaying characters
> that appear to be whitespace but aren't?
> Perhaps this would be worth a try:
> /<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m

Still no go, William. I tried your last phrase there, too.

William James

9/18/2007 11:07:00 PM

On Sep 18, 1:26 pm, Peter Bailey <pbai...@bna.com> wrote:
> > If this still won't work on your file, could the file
> > be contaminated with some non-displaying characters
> > that appear to be whitespace but aren't?
> > Perhaps this would be worth a try:
> > /<issueList>[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
>
> Still no go, William. I tried your last phrase there, too.

You've got to track down what's going on.
Copy and paste the code below into a file.
(Don't even think about typing it in.)
Run the file. Is this the output?

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"

If it is, open both the Ruby file and the xml
file with the same editor; copy the first desired
entry from the xml file and paste it at the bottom
of the big string in the Ruby file. Even though
it looks like an exact duplicate of what's already
in the string, maybe it differs somehow.
The output should now be:

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
"Trade (Domestic & Foreign)"

If it isn't, edit the entry that you just pasted
into the big Ruby string; delete spaces and
line-endings and replace them with new spaces and
line-endings. (Perhaps the xml file has some
bizarre invisible characters.)

%q{
<issueList>
<issue code = "BCD" >
I'm Issue XIV,
who are you?
</issue>

<issueList><issue code="XYZ">
I'm Issue XX, are you?
</issue>

<issueList>
<issue code="TRD">Trade (Domestic & Foreign)</issue>
</issueList>

}.scan(
# Using extended-mode regular expression for clarity.
# Whitespace and comments are ignored.
%r{
<issuelist>
\s*
<issue[ \t]+code[ \t]*=[ \t]*"[^"]*"[ \t]*>
(.*?)
</issue>
}xmi
){ p $1 }

comp.lang.ruby
