Asp Forum - regex question - comp.lang.python

mathieu

2/13/2008 12:54:00 PM

I do not understand what is wrong with the following regular expression.
I explicitly require the separator between group 3 and group 4 to
contain at least two whitespace characters, but group 3 actually
captures the text of groups 3 and 4 together.

Thanks
-Mathieu

import re

line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings Auto Window Width SL 1 "
patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
m = patt.match(line)
if m:
    print(m.group(3))
    print(m.group(4))

4 Answers

Wanja Chresta

2/13/2008 1:24:00 PM

Hey Mathieu

Due to word wrap I'm not sure what you want to do. What result do you
expect? I get:
>>> print(m.groups())
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window Width ', ' ', 'SL', '1')
But only when I insert a space in the third character class (I'm not sure
whether your original pattern has a space there or not), so that the third
group is ([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern
does not match the line.

I also can't tell what the format of your line is. If it is like this:
line = "...Siemens: Thorax/Multix FD Lab Settings Auto Window Width..."
where "Auto Window Width" should be the 4th group, you have to mark the
+ in the 3rd group as non-greedy (it's done with a "?"):
http://docs.python.org/lib/re-s...
([A-Za-z0-9./:_ -]+?)
With that I get:
>>> patt.match(line).groups()
('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width ', 'SL', '1')
Which is probably what you want. You can also add the non-greedy marker
to the fourth group to get rid of the trailing spaces.
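
A small sketch of the greedy versus non-greedy behaviour described above. The sample line's column spacing is an assumption (the runs of spaces did not survive in this archive), and the helper name build is made up for the illustration; the character classes are the ones from the question.

```python
import re

# Assumed sample line: the runs of spaces are a guess, not mathieu's data.
line = (" (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings      "
        "Auto Window Width      SL   1 ")

def build(group3, group4):
    """Assemble the thread's pattern around the two quantified classes."""
    return re.compile(
        r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
        + "(" + group3 + r")\s\s+"
        + "(" + group4 + r")\s+"
        + r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

G3 = r"[A-Za-z0-9./:_ -]+"
G4 = r"[A-Za-z0-9 ()._,/#>-]+"

greedy = build(G3, G4).match(line).groups()
lazy3 = build(G3 + "?", G4).match(line).groups()
lazy34 = build(G3 + "?", G4 + "?").match(line).groups()

print(repr(greedy[2]), repr(greedy[3]))   # group 3 swallows both text columns
print(repr(lazy3[2]), repr(lazy3[3]))     # split at the first run of 2+ spaces
print(repr(lazy34[3]))                    # group 4 without trailing spaces
```

The space is part of both character classes, so a greedy group 3 is free to run across the separator and only gives back as much as the rest of the pattern needs.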

HTH
Wanja

mathieu wrote:
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4

Bearophile

2/13/2008 1:35:00 PM

mathieu, stop writing complex REs like obfuscated toys. Use the
re.VERBOSE flag and split that RE into several commented and
*indented* lines (indented just like Python code), using the
indentation level to denote nesting. With that you may be able to
solve the problem by yourself. If not, you can offer us something much
more readable to fix.
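
For illustration, one possible re.VERBOSE layout of the pattern as posted (still greedy, so it only changes readability, not behaviour). The sample line's spacing is an assumption, and the comments simply number the groups because the thread does not say what the fields mean.

```python
import re

# Assumed sample line: the runs of spaces are a guess, not mathieu's data.
line = (" (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings      "
        "Auto Window Width      SL   1 ")

one_line = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

verbose = re.compile(r"""
    ^\s*
    \(                              # opening parenthesis of the tag
        ([0-9A-Z]+)                 # group 1
        ,
        ([0-9A-Zx]+)                # group 2
    \)
    \s+
    ([A-Za-z0-9./:_ -]+)            # group 3 (greedy, as posted)
    \s\s+                           # separator: two or more whitespace
    ([A-Za-z0-9 ()._,/#>-]+)        # group 4
    \s+
    ([A-Z][A-Z]_?O?W?)              # group 5
    \s+
    ([0-9n-]+)                      # group 6
    \s*$
""", re.VERBOSE)

# Spaces and '#' inside [...] stay literal under re.VERBOSE, so both
# spellings should capture exactly the same groups.
print(verbose.match(line).groups() == one_line.match(line).groups())
```

Laid out this way, the greedy quantifier on group 3 sits right next to the separator it is supposed to stop at, which makes the problem easier to spot.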

Bye,
bearophile

Gerard Flanagan

2/13/2008 1:54:00 PM

On Feb 13, 1:53 pm, mathieu <mathieu.malate...@gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings Auto Window Width SL 1 "
> patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
> m = patt.match(line)
> if m:
>     print(m.group(3))
>     print(m.group(4))

I don't know if it solves your problem, but if you want to match a
literal dash (-) inside a character class, it must either be escaped
or be the first or last element of the class.
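
A quick check of how Python's re module treats a hyphen at different positions in a character class. The test strings are made up for the illustration; the last class is the third one from the question.

```python
import re

first = re.findall(r"[-a]", "a-b")        # hyphen first: literal
last = re.findall(r"[a-]", "a-b")         # hyphen last: literal
escaped = re.findall(r"[a\-c]", "a-bc")   # escaped hyphen: literal
ranged = re.findall(r"[a-c]", "a-bc")     # between two characters: a range

print(first, last, escaped, ranged)

# The third class from the question ends with "-", so it accepts a dash.
accepts_dash = bool(re.fullmatch(r"[A-Za-z0-9./:_ -]+", "FD-Lab"))
print(accepts_dash)
```

So the classes in the posted pattern, which all end in "-", already treat the dash as a literal.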

Gerard

Paul McGuire

2/13/2008 2:29:00 PM

On Feb 13, 6:53 am, mathieu <mathieu.malate...@gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings Auto Window Width SL 1 "
> patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
<snip>

I love the smell of regexes in the morning!

For more legible posting (and general maintainability), try breaking
up your quoted strings like this:

line = (" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings "
        "Auto Window Width SL 1 ")

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto".
Change patt to:

patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

or if you prefer:

patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as

(xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin) 80 SL 1
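
To make the fragility concrete, here is a rough check of the non-greedy pattern against both lines. The column spacing in the two sample strings is assumed, since the runs of spaces did not survive in this archive.

```python
import re

patt = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")

# Column spacing in both lines is assumed, not taken from real data.
siemens = (" (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings      "
           "Auto Window Width      SL   1 ")
honeywell = (" (xx42,xx0A)  Honeywell: Inverse Flitznoid (Kelvin)      "
             "80      SL   1 ")

for name, text in [("siemens", siemens), ("honeywell", honeywell)]:
    print(name, "matches" if patt.match(text) else "does not match")
```

The second line is rejected as soon as the first group meets the lowercase "x" in "xx42", because [0-9A-Z] does not allow it. If "(Kelvin)" belongs to the description, the third class would also need to accept parentheses.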

Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:

from pyparsing import (Word, hexnums, delimitedList, printables, White,
                       Regex, nums)

line = (" (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings "
        "Auto Window Width SL 1 ")

# define fields
hexint = Word(hexnums + "x")
# a text field is a run of words separated by exactly one space
text = delimitedList(Word(printables),
                     delim=White(" ", exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums + "n-")

# define line structure - give each field a name
line_defn = ("(" + hexint("x") + "," + hexint("y") + ")"
             + text("desc") + text("window")
             + type_label("type") + int_label("int"))

line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)

Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings

I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
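
For comparison, the standard re module can also name its fields, via named groups. This is only a rough sketch and not part of the pyparsing version above: it reuses the same guessed field names, assumes the column spacing of the sample line, and makes both text groups non-greedy.

```python
import re

# Assumed sample line: the runs of spaces are a guess, not real data.
line = (" (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings      "
        "Auto Window Width      SL   1 ")

patt = re.compile(r"""
    ^\s*
    \( (?P<x>[0-9A-Z]+) , (?P<y>[0-9A-Zx]+) \)
    \s+
    (?P<desc>[A-Za-z0-9./:_ -]+?)          # stops at the first 2+ space run
    \s\s+
    (?P<window>[A-Za-z0-9 ()._,/#>-]+?)    # non-greedy: no trailing spaces
    \s+
    (?P<type>[A-Z][A-Z]_?O?W?)
    \s+
    (?P<int>[0-9n-]+)
    \s*$
""", re.VERBOSE)

fields = patt.match(line).groupdict()
for name in sorted(fields):
    print("- %s: %s" % (name, fields[name]))
```

groupdict() returns a plain dictionary keyed by the group names, so renaming a field only means changing the name inside (?P<...>).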

-- Paul
