Asp Forum - How to efficiently extract information from structured text file

Imaginationworks

2/16/2010 11:48:00 PM

Hi,

I am trying to read object information from a text file (approx.
30,000 lines) in the format shown below; each line there corresponds to
a line in the text file. Currently, I read the whole file into a list
of strings using readlines(), then use a for loop to search for "= {"
and "};" to determine the Object, SubObject, and SubSubObject. My
questions are:

1) Is there an efficient way to search the whole list of strings to
find the locations of the tokens (such as '= {' or '};')?

2) Is there an efficient way to extract the object information that
you could suggest?

Thanks,

- Jeremy

===== Structured text file =================
Object1 = {

....

SubObject1 = {
.....

SubSubObject1 = {
....
};
};

SubObject2 = {
.....

SubSubObject21 = {
....
};
};

SubObjectN = {
.....

SubSubObjectN = {
....
};
};
};

8 Answers

Rhodri James

2/17/2010 12:30:00 AM

On Tue, 16 Feb 2010 23:48:17 -0000, Imaginationworks <xiajunyi@gmail.com>
wrote:

> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file. Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject. My
> questions are
>
> 1) Is there any efficient method that I can search the whole string
> list to find the location of the tokens(such as '= {' or '};'

The usual idiom is to process a line at a time, which avoids the memory
overhead of reading the entire file in, creating the list, and so on.
Assuming your input file is laid out as neatly as you said, that's
straightforward to do:

for line in myfile:
    if "= {" in line:
        start_a_new_object(line)
    elif "};" in line:
        end_current_object(line)
    else:
        add_stuff_to_current_object(line)

You probably want more robust tests than I used there, but that depends on
how well-defined your input file is. If it can be edited by hand, you'll
need to be more defensive!

> 2) Is there any efficient ways to extract the object information you
> may suggest?

That depends on what you mean by "extract the object information". If you
mean "get the object name", just split the line at the "=" and strip off
the whitespace you don't want. If you mean "track how objects are
connected to one another", have each object keep a list of its immediate
sub-objects (which will have lists of their immediate sub-objects, and so
on); it's fairly easy to keep track of which objects are current using a
list as a stack. If you mean something else, sorry but my crystal ball is
cloudy tonight.
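
A minimal sketch of the list-as-a-stack idea described above, assuming every object opens on a line ending in "= {" and closes on a line that is exactly "};". The function name parse_objects and the dict layout (name, lines, children) are hypothetical, chosen only for illustration:

```python
def parse_objects(lines):
    """Build a tree of nested objects from lines like 'Name = {' ... '};'."""
    root = {"name": None, "lines": [], "children": []}
    stack = [root]                      # stack[-1] is always the current object
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                    # ignore blank lines
        if line.endswith("= {"):
            name = line.split("=", 1)[0].strip()
            node = {"name": name, "lines": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)          # the new object becomes current
        elif line == "};":
            if len(stack) == 1:
                raise ValueError("unmatched '};'")
            stack.pop()                 # return to the enclosing object
        else:
            stack[-1]["lines"].append(line)
    if len(stack) != 1:
        raise ValueError("missing '};' for %r" % stack[-1]["name"])
    return root["children"]
```

Because a file object yields its lines one at a time, passing an open file to parse_objects keeps the line-at-a-time behaviour; a hand-edited file would likely need stricter checks than endswith and ==.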

--
Rhodri James *-* Wildebeeste Herder to the Masses

Gary Herron

2/17/2010 1:15:00 AM

Imaginationworks wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file. Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject. My
> questions are
>
> 1) Is there any efficient method that I can search the whole string
> list to find the location of the tokens(such as '= {' or '};'
>

Yes. Read the *whole* file into a single string using the file.read()
method, and then search through the string using string methods (for
simple things) or re, the regular expression module (for more
complex searches).

Note: There is a point where a file becomes large enough that reading
the whole file into memory at once (either as a single string or as a
list of strings) is foolish. However, 30,000 lines doesn't push that
boundary.
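
For the "string methods" route, a rough sketch of locating every token in one big string could look like this. The helper names find_all and line_of and the sample text are made up for illustration; in practice the text would come from reading the file:

```python
def find_all(text, token):
    """Return the offset of every non-overlapping occurrence of token."""
    offsets = []
    pos = text.find(token)
    while pos != -1:
        offsets.append(pos)
        pos = text.find(token, pos + len(token))
    return offsets


def line_of(text, offset):
    """Translate a character offset into a 1-based line number."""
    return text.count("\n", 0, offset) + 1


# Stand-in for the contents of the real file.
sample = "A = {\n x\n B = {\n };\n};\n"
opens = find_all(sample, "= {")
closes = find_all(sample, "};")
```

str.find returns -1 when nothing more is found, which is what ends the loop. The offsets alone say nothing about which "};" closes which "= {"; that pairing still has to be worked out separately.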
> 2) Is there any efficient ways to extract the object information you
> may suggest?
>

Again, the re module has nice ways to find a pattern and parse
out pieces of it. Building a good regular expression takes time,
experience, and a bit of black magic... To do so for this case, we
might need more knowledge of your format. Also regular expressions have
their limits. For instance, if the sub objects can nest to any level,
then in fact, regular expressions alone can't solve the whole problem,
and you'll need a more robust parser.
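
To illustrate that limit: a regular expression can find the individual tokens, but something outside the regex has to remember which objects are still open. A small sketch, assuming object names match \w+ and that "};" never appears inside ordinary content (TOKEN and object_spans are illustrative names, not part of any library):

```python
import re

# Matches either an opener such as 'Name = {' or a closer '};'.
TOKEN = re.compile(r"(\w+)\s*=\s*\{|\};")


def object_spans(text):
    """Return (name, depth, start, end) per object; end is exclusive."""
    spans = []
    open_stack = []                     # (name, start) of objects not yet closed
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:      # an opener
            open_stack.append((m.group(1), m.start()))
        else:                           # a closer
            if not open_stack:
                raise ValueError("unmatched '};' at offset %d" % m.start())
            name, start = open_stack.pop()
            spans.append((name, len(open_stack), start, m.end()))
    if open_stack:
        raise ValueError("unclosed object %r" % open_stack[-1][0])
    return spans
```

Spans come back in the order the objects close, so inner objects appear before the object that contains them; text[start:end] is the full source of one object.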


Imaginationworks

2/17/2010 2:36:00 PM

Gary and Rhodri, Thank you for the suggestions.

Paul McGuire

2/17/2010 7:40:00 PM

On Feb 16, 5:48 pm, Imaginationworks <xiaju...@gmail.com> wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file. Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject.

If you open(filename).read() this file into a variable named data, the
following pyparsing parser will pick out your nested brace
expressions:

from pyparsing import *

# Punctuation is matched but suppressed from the results.
EQ,LBRACE,RBRACE,SEMI = map(Suppress,"={};")
ident = Word(alphas, alphanums)
# contents and defn refer to each other, so declare contents first.
contents = Forward()
defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))

contents << ZeroOrMore(defn | ~(LBRACE|RBRACE) + Word(printables))

results = defn.parseString(data)

print(results)

Prints:

[
 ['Object1',
  ['...',
   ['SubObject1',
    ['....',
     ['SubSubObject1',
      ['...']
     ]
    ]
   ],
   ['SubObject2',
    ['....',
     ['SubSubObject21',
      ['...']
     ]
    ]
   ],
   ['SubObjectN',
    ['....',
     ['SubSubObjectN',
      ['...']
     ]
    ]
   ]
  ]
 ]
]
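
A nested list of that shape can be turned into something easier to navigate. The sketch below assumes the parse results have already been converted to plain Python lists (pyparsing's ParseResults has an asList() method for that) and that each definition is a two-item list of [name, body]. The to_tree name and the dict keys are invented for this example:

```python
def to_tree(defn):
    """Convert [name, [item, ...]] into a dict.

    Each item is either a plain string or another [name, body] definition.
    """
    name, body = defn
    node = {"name": name, "values": [], "children": []}
    for item in body:
        if isinstance(item, str):
            node["values"].append(item)
        else:
            node["children"].append(to_tree(item))
    return node


# Hand-written stand-in for one parsed definition.
nested = ["Object1",
          ["...",
           ["SubObject1", ["....", ["SubSubObject1", ["..."]]]],
           ["SubObject2", ["....", ["SubSubObject21", ["..."]]]]]]
tree = to_tree(nested)
```

With the printed result above, the single top-level definition would be the first element of the outer list.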

-- Paul

Imaginationworks

2/17/2010 11:37:00 PM

Wow, that is great! Thanks

Jonathan Gardner

2/18/2010 1:13:00 AM

On Feb 16, 3:48 pm, Imaginationworks <xiaju...@gmail.com> wrote:
> Hi,
>
> I am trying to read object information from a text file (approx.
> 30,000 lines) with the following format, each line corresponds to a
> line in the text file. Currently, the whole file was read into a
> string list using readlines(), then use for loop to search the "= {"
> and "};" to determine the Object, SubObject,and SubSubObject. My
> questions are
>
> 1) Is there any efficient method that I can search the whole string
> list to find the location of the tokens(such as '= {' or '};'
>
> 2) Is there any efficient ways to extract the object information you
> may suggest?

Parse it!

Go full-bore with a real parser. You may want to consider one of the
many fine Pythonic implementations of modern parsers, or break out
more traditional parsing tools.

This format is nested, meaning that you can't use regexes to parse
what you want out of it. You're going to need a real, full-bore, no-
holds-barred parser for this.
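
For a format this small, a hand-written recursive-descent parser is one possible reading of "a real parser". The sketch below is only that: it uses a regex for tokenizing alone, leaves the nesting to recursion, and assumes tokens are separated by whitespace. All names in it (TOKEN, tokenize, parse_body, parse) are hypothetical:

```python
import re

# An opener 'Name = {', a closer '};', or any other run of non-space text.
TOKEN = re.compile(r"(\w+)\s*=\s*\{|(\};)|(\S+)")


def tokenize(text):
    """Yield ('open', name), ('close', None) or ('word', text) tokens."""
    for m in TOKEN.finditer(text):
        if m.group(1):
            yield ("open", m.group(1))
        elif m.group(2):
            yield ("close", None)
        else:
            yield ("word", m.group(3))


def parse_body(tokens):
    """Consume tokens up to the matching '};' and return the items seen."""
    items = []
    for kind, value in tokens:
        if kind == "open":
            # Recurse: the nested call consumes through its own '};'.
            items.append((value, parse_body(tokens)))
        elif kind == "close":
            return items
        else:
            items.append(value)
    raise ValueError("missing '};'")


def parse(text):
    """Parse a sequence of top-level 'Name = { ... };' definitions."""
    tokens = tokenize(text)             # one shared iterator for all calls
    objects = []
    for kind, value in tokens:
        if kind != "open":
            raise ValueError("expected 'Name = {', got %r" % (value,))
        objects.append((value, parse_body(tokens)))
    return objects
```

Each object comes back as a (name, items) tuple whose items are plain words or further (name, items) tuples, so the depth of nesting is limited only by Python's recursion limit.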

Don't worry, the road is not easy but the destination is absolutely
worth it.

Once you come to appreciate and understand parsing, you have earned
the right to call yourself a red-belt programmer. To get your black-
belt, you'll need to write your own compiler. Having mastered these
two tasks, there is no problem you cannot tackle.

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

With this understanding, you will judge programming language utility
based solely on how hard it is to write a compiler in it, and
complexity based on how hard it is to write a compiler for it. (Notice
there are not a few parsers written in Python, as well as Jython and
PyPy and others written for Python!)

Steven D'Aprano

2/18/2010 1:38:00 AM

On Wed, 17 Feb 2010 17:13:23 -0800, Jonathan Gardner wrote:

> And once you realize that every program is really a compiler, then you
> have truly mastered the Zen of Programming in Any Programming Language
> That Will Ever Exist.

In the same way that every tool is really a screwdriver.

--
Steven

Paul McGuire

2/18/2010 1:39:00 PM

On Feb 17, 7:38 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
> On Wed, 17 Feb 2010 17:13:23 -0800, Jonathan Gardner wrote:
> > And once you realize that every program is really a compiler, then you
> > have truly mastered the Zen of Programming in Any Programming Language
> > That Will Ever Exist.
>
> In the same way that every tool is really a screwdriver.
>
> --
> Steven

The way I learned this was:
- Use the right tool for the right job.
- Every tool is a hammer.

-- Paul

comp.lang.python
