Asp Forum - Text File Parsing

greg.kujawa

4/12/2006 11:40:00 PM

I am trying to create a routine that will parse a text file and break
down the various fields into an array. Here's the basic layout:

element1 | element2 | element3
element4 | element5 | element6

As you can tell it's pretty straightforward. I can just #split things
using the pipe as the delimiter. But every now and again the last
element on the line is actually thrown down to the next line, like:

element7 | element8 |
element9
element10 | element11 | element12
element13 | element14 |
element15

Can anyone suggest an easy way to parse things so that the "dangling"
elements are brought back to the preceding lines? In the example above
I would need to bring element9 up to the last pipe on the preceding
line. And same with bringing element15 to the last pipe on its
preceding line.

7 Answers

Dave Burt

4/13/2006 1:32:00 AM

gregarican wrote:
> I am trying to create a routine that will parse a text file and break
> down the various fields into an array. Here's the basic layout:
>
> element1 | element2 | element3
> element4 | element5 | element6
>
> As you can tell it's pretty straightforward. I can just #split things
> using the pipe as the delimiter. But every now and again the last
> element on the line is actually thrown down to the next line, like:
>
> element7 | element8 |
> element9
> element10 | element11 | element12
> element13 | element14 |
> element15
>
> Can anyone suggest an easy way to parse things so that the "dangling"
> elements are brought back to the preceding lines? In the example above
> I would need to bring element9 up to the last pipe on the preceding
> line. And same with bringing element15 to the last pipe on its
> preceding line.

@delim = /\s+\|\s+/
@tok = /\w+/
s.scan(/(#@tok)#@delim(#@tok)#@delim(#@tok)/m) do |a,b,c|
  p [a,b,c]
end

Cheers,
Dave

Phil Robyn

4/13/2006 4:16:00 AM

gregarican wrote:

> I am trying to create a routine that will parse a text file and break
> down the various fields into an array. Here's the basic layout:
>
> element1 | element2 | element3
> element4 | element5 | element6
>
> As you can tell it's pretty straightforward. I can just #split things
> using the pipe as the delimiter. But every now and again the last
> element on the line is actually thrown down to the next line, like:
>
> element7 | element8 |
> element9
> element10 | element11 | element12
> element13 | element14 |
> element15
>
> Can anyone suggest an easy way to parse things so that the "dangling"
> elements are brought back to the preceding lines? In the example above
> I would need to bring element9 up to the last pipe on the preceding
> line. And same with bringing element15 to the last pipe on its
> preceding line.
>

c:\cmd>for /f "tokens=1-3 delims=|" %a in (
c:\temp\PipeDelimited.txt
) do @echo %a^|%b^|%c
element1 | element2 | element3
element4 | element5 | element6
element7 | element8 | element9
element10 | element11 | element12
element13 | element14 | element15

--
Phil Robyn
University of California, Berkeley

Phil Robyn

4/13/2006 4:16:00 AM

gregarican wrote:

> I am trying to create a routine that will parse a text file and break
> down the various fields into an array. Here's the basic layout:
>
> element1 | element2 | element3
> element4 | element5 | element6
>
> As you can tell it's pretty straightforward. I can just #split things
> using the pipe as the delimiter. But every now and again the last
> element on the line is actually thrown down to the next line, like:
>
> element7 | element8 |
> element9
> element10 | element11 | element12
> element13 | element14 |
> element15
>
> Can anyone suggest an easy way to parse things so that the "dangling"
> elements are brought back to the preceding lines? In the example above
> I would need to bring element9 up to the last pipe on the preceding
> line. And same with bringing element15 to the last pipe on its
> preceding line.
>

Sorry, wrong NG!

--
Phil Robyn
University of California, Berkeley

William James

4/13/2006 6:03:00 AM

gregarican wrote:
> I am trying to create a routine that will parse a text file and break
> down the various fields into an array. Here's the basic layout:
>
> element1 | element2 | element3
> element4 | element5 | element6
>
> As you can tell it's pretty straightforward. I can just #split things
> using the pipe as the delimiter. But every now and again the last
> element on the line is actually thrown down to the next line, like:
>
> element7 | element8 |
> element9
> element10 | element11 | element12
> element13 | element14 |
> element15
>
> Can anyone suggest an easy way to parse things so that the "dangling"
> elements are brought back to the preceding lines? In the example above
> I would need to bring element9 up to the last pipe on the preceding
> line. And same with bringing element15 to the last pipe on its
> preceding line.

fs = /\s+\|\s+/
rec = ""
IO.foreach("data1"){ |line|
  rec += line
  if line !~ /#{fs}$/
    p rec.chomp.split( fs )
    rec = ""
  end
}

Robert Klemme

4/13/2006 9:55:00 AM

gregarican <greg.kujawa@gmail.com> wrote:
> I am trying to create a routine that will parse a text file and break
> down the various fields into an array. Here's the basic layout:
>
> element1 | element2 | element3
> element4 | element5 | element6
>
> As you can tell it's pretty straightforward. I can just #split things
> using the pipe as the delimiter. But every now and again the last
> element on the line is actually thrown down to the next line, like:
>
> element7 | element8 |
> element9
> element10 | element11 | element12
> element13 | element14 |
> element15
>
> Can anyone suggest an easy way to parse things so that the "dangling"
> elements are brought back to the preceding lines? In the example above
> I would need to bring element9 up to the last pipe on the preceding
> line. And same with bringing element15 to the last pipe on its
> preceding line.

If the file is reasonably small you could do something like this (untested):

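# [^|\n] stops a field at the end of its line, so the third field cannot
# swallow the first field of the next record; \s* after each pipe lets a
# field that wrapped onto the next line rejoin its record.
File.read("foo.txt").scan(%r{[^|\n]+(?:\|\s*[^|\n]+){2}}) do |line|
  items = line.split(/\s*\|\s*/)  # also trims the padding around each pipe
  ...
end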

Kind regards

robert

greg.kujawa

4/13/2006 12:58:00 PM

Dave Burt wrote:

> @delim = /\s+\|\s+/
> @tok = /\w+/
> s.scan(/(#@tok)#@delim(#@tok)#@delim(#@tok)/m) do |a,b,c|
> p [a,b,c]
> end
>
> Cheers,
> Dave

I had to modify what you submitted a bit. Here's the version I have,
where 'infile' represents the source text file:

--------------------------
infile.readlines.collect {|line|
  contents << line
}

contents.scan(/(\w+)\|(\w+)\|(\w+)/m) do |a,b,c|
  p [a,b,c]
end
--------------------------

Where I run into a problem is that the third token I need to get (in
this case the local block variable 'c') can be a sentence composed of
multiple words. I will need to revisit my 'Mastering Regular
Expressions' book, as I am a bit rusty at regexes, which is likely
apparent by the trouble I am running into accomplishing the task at
hand :-/

Dave Burt

4/13/2006 3:22:00 PM

gregarican wrote:
> infile.readlines.collect {|line|
> contents << line
> }
>
> contents.scan(/(\w+)\|(\w+)\|(\w+)/m) do |a,b,c|
> p [a,b,c]
> end
> --------------------------
>
> Where I run into a problem is that the third token I need to get (in
> this case the local block variable 'c') can be a sentence composed of
> multiple words. I will need to revisit my 'Mastering Regular
> Expressions' book, as I am a bit rusty at regexes, which is likely
> apparent by the trouble I am running into accomplishing the task at
> hand :-/

OK, let me help!

First, let's look at your first block of code. It does this:
* infile: assumed to be an open input file handle
* readlines: read the file into an array of lines
* collect: produce another array that holds the accumulated contents
  string once for each line in the file. (each is a little more
  appropriate for this kind of use, where you don't care about the result.)
* contents: add each line successively into a single string
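
To make the collect point concrete, here is a small sketch. The two-element
lines array is a made-up stand-in for infile.readlines, and the variable
names are only for illustration:

```ruby
# Hypothetical stand-in for infile.readlines
lines = ["a | b | c\n", "d | e | f\n"]

contents = String.new   # empty, mutable string
result = lines.collect { |line| contents << line }

# String#<< returns its receiver, so every slot of result refers to the
# very same contents string, one slot per input line.
p result.size                                # number of input lines
p result.all? { |s| s.equal?(contents) }     # same object in every slot
```

Swapping collect for each builds the same contents string without creating
the throwaway array.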

If all you want to do is get the file's data into a string, the
following alternative:
* avoids the need to open and close file handles
* avoids producing 2 extra arrays
* should be slightly quicker
* is shorter

contents = IO.read(filename)

Now, the regexp. If \w isn't broad enough, use . (to match any
character). That will match |, too, so we'll add ^...$ to make sure it
starts at the start of a line and ends at the end of a line. Finally, we
also need to make it non-greedy (Otherwise, for example, "a | b | c\nd |
e | f\n" would be matched as ["a | b | c\nd ", " e ", " f\n"].)

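# \s* after the second pipe skips the line break when the third field has
# wrapped onto the next line; without it that field would match as empty.
contents.scan(/^(.*?)\|(.*?)\|\s*(.*?)$/mx) do |a,b,c|
  p [a,b,c]
end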

Cheers,
Dave
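
For reference, the line-accumulating idea suggested earlier in the thread
can be wrapped into a small method that also copes with a multi-word third
field. This is only a sketch under two assumptions: a wrapped record is
recognisable because its first line ends with a pipe, and fields never
contain a pipe themselves. The name parse_records and the sample data are
hypothetical, not something from the original posts.

```ruby
# Hypothetical helper: joins wrapped lines, then splits each record on pipes.
def parse_records(text)
  records = []
  pending = ""
  text.each_line do |line|
    pending += line
    # A line that ends with a pipe (ignoring trailing whitespace) is
    # incomplete, so keep accumulating until the rest of the record arrives.
    next if line =~ /\|\s*\z/
    fields = pending.split("|").map { |f| f.strip }
    # Skip blank lines, which produce no non-empty fields.
    records << fields unless fields.all? { |f| f.empty? }
    pending = ""
  end
  records
end

# Made-up sample data in the layout described in the question.
sample = "element1 | element2 | element3\n" \
         "element7 | element8 |\n" \
         "a multi-word third field\n" \
         "element13 | element14 |\n" \
         "element15\n"

parse_records(sample).each { |fields| p fields }
```

Because it splits on the pipe alone and strips each field afterwards, the
sketch does not depend on how much whitespace surrounds the delimiter.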

comp.lang.ruby
