Asp Forum - identifying and parsing string in text file

Bryan.Fodness@gmail.com

3/8/2008 7:50:00 PM

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

Also, there is a possibility that there are multiple lines with the
same tag, but different values. These all need to be recorded.

So far, I have a little bit of code to look at everything that is
available,

import sys

for line in open(sys.argv[1]):
    i_line = line.split()
    if i_line:
        if i_line[0] == "<element":
            a = i_line[1]
            b = i_line[5]
            print("%s | %s" % (a, b))

but do not see a clever way of doing what I would like.

Any help or guidance would be appreciated.

Bryan

4 Answers

Bernard

3/8/2008 8:02:00 PM

Hey Bryan,

It seems the text you are trying to parse is similar to XML/HTML.
So I'd use BeautifulSoup[1] if I were you :)

Here's some sample code for your scraping case:

<python>

from BeautifulSoup import BeautifulSoup

# assume the s variable has your text
s = "whatever xml or html here"
# turn it into a tasty & parsable soup :)
soup = BeautifulSoup(s)
# for every element tag in the soup
for el in soup.findAll("element"):
    # print out its tag & name attribute plus its inner value!
    print el["tag"], el["name"], el.string

</python>

that's it!

[1] http://www.crummy.com/software/Beau...


Nemesis

3/8/2008 8:03:00 PM

Bryan.Fodness@gmail.com wrote:

> I have a large file that has many lines like this,
>
> <element tag="300a,0014" vr="CS" vm="1" len="4"
> name="DoseReferenceStructureType">SITE</element>
>
> I would like to identify the line by the tag (300a,0014) and then grab
> the name (DoseReferenceStructureType) and value (SITE).
>
> I would like to create a file that would have the structure,
>
> DoseReferenceStructureType = Site
> ...
> ...

You should try regular expressions or, if it is something like XML, there
is surely a library you can use to parse it.
Anyway, you can try something simpler like this:

import sys

elem_dic = dict()
for line in open(sys.argv[1]):
    line_splitted = line.split()
    for item in line_splitted:
        item_splitted = item.split("=")
        if len(item_splitted) > 1:
            elem_dic[item_splitted[0]] = item_splitted[1]

... then you have to retrieve the items you need from the dict. For example,
the line you posted splits into these items:

['<element']
['tag', '"300a,0014"']
['vr', '"CS"']
['vm', '"1"']
['len', '"4"']
['name', '"DoseReferenceStructureType">SITE</element>']

and elem_dic will contain the last five, with the keys
'tag', 'vr', 'vm', 'len', 'name' and the values "300a,0014" etc.,
i.e. this:

{'vr': '"CS"', 'tag': '"300a,0014"', 'vm': '"1"', 'len': '"4"', 'name': '"DoseReferenceStructureType">SITE</element>'}
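For the regular-expression route, a rough sketch could look like the one below. It is only an illustration under a few assumptions: the attributes always appear in the order shown in the posted line (tag first, name later), an element may wrap onto a second line, and the names ELEMENT_RE and find_elements are made up for this example. The sample text is modelled on the lines in this thread.

```python
import re

# Assumption: "tag" is the first attribute and "name" comes later, as in the
# posted sample. [^>] also matches newlines, so a wrapped element still matches.
ELEMENT_RE = re.compile(
    r'<element\s+tag="(?P<tag>[^"]*)"[^>]*?\bname="(?P<name>[^"]*)"[^>]*>'
    r'(?P<value>[^<]*)</element>'
)


def find_elements(text, wanted_tag):
    """Return (name, value) pairs for every <element> whose tag matches."""
    return [
        (m.group("name"), m.group("value").strip())
        for m in ELEMENT_RE.finditer(text)
        if m.group("tag") == wanted_tag
    ]


sample = '''<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>'''

pairs = find_elements(sample, "300a,0014")
for name, value in pairs:
    print("%s = %s" % (name, value))
```

Collecting the matches in a list rather than a dict keeps every value when the same tag shows up more than once. If the attribute order or quoting varies in the real file, this pattern will miss elements, and a proper XML parser is the safer choice.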

--
Age is not a particularly interesting subject. Anyone can get old. All
you have to do is live long enough.

Paul McGuire

3/8/2008 10:11:00 PM

On Mar 8, 2:02 pm, Nemesis <neme...@nowhere.invalid> wrote:
> Bryan.Fodn...@gmail.com wrote:
> > I have a large file that has many lines like this,
>
> > <element tag="300a,0014" vr="CS" vm="1" len="4"
> > name="DoseReferenceStructureType">SITE</element>
>
> > I would like to identify the line by the tag (300a,0014) and then grab
> > the name (DoseReferenceStructureType) and value (SITE).
>
> You should try with Regular Expressions or if it is something like xml there
> is for sure a library you can you to parse it ...
<snip>

When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition. You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.

Pyparsing includes a builtin method for constructing tag matching
parsing patterns, which you can then use to scan through the XML or
HTML source:

from pyparsing import makeXMLTags, withAttribute, SkipTo

testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

elementStart,elementEnd = makeXMLTags("element")
elementStart.setParseAction(withAttribute(tag="300a,0014"))
search = elementStart + SkipTo(elementEnd)("body")

for t in search.searchString(testdata):
    print(t.name)
    print(t.body)

Prints:

DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2

In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014". The pattern search adds on the body of the
<element></element> tag, and gives it the name "body" so it is easily
accessed after parsing is completed.

-- Paul
(More about pyparsing at http://pyparsing.wiki....)

bruno.desthuilliers@gmail.com

3/9/2008 1:17:00 PM

On 8 mar, 20:49, "Bryan.Fodn...@gmail.com" <Bryan.Fodn...@gmail.com>
wrote:
> I have a large file that has many lines like this,
>
> <element tag="300a,0014" vr="CS" vm="1" len="4"
> name="DoseReferenceStructureType">SITE</element>
>
> I would like to identify the line by the tag (300a,0014) and then grab
> the name (DoseReferenceStructureType) and value (SITE).

It's obviously an XML file, so use an XML parser - there are SAX and
DOM parsers in the stdlib, as well as the ElementTree module.
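
A minimal sketch of the ElementTree route, assuming the file is well-formed XML with a single root element. The <dataset> wrapper in the sample and the helper names extract_pairs and format_pairs are invented for illustration; the real file's root element is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def extract_pairs(xml_text, wanted_tag):
    """Return (name, value) pairs for every <element> with the given tag attribute."""
    root = ET.fromstring(xml_text)  # for a file: ET.parse(path).getroot()
    pairs = []
    for el in root.iter("element"):
        if el.get("tag") == wanted_tag:
            # el.text is None for an empty element, so fall back to ""
            pairs.append((el.get("name"), (el.text or "").strip()))
    return pairs


def format_pairs(pairs):
    """Render the pairs as 'name = value' lines, one per match."""
    return "".join("%s = %s\n" % pair for pair in pairs)


# Hypothetical sample: the <dataset> root is an assumption.
sample = """<dataset>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
</dataset>"""

pairs = extract_pairs(sample, "300a,0014")
output = format_pairs(pairs)
print(output)
# To write the result to a file instead:
#     with open("output.txt", "w") as fh:
#         fh.write(output)
```

Because the matches are appended to a list, repeated tags with different values are all recorded in file order.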

comp.lang.python
