Paul McGuire
2/13/2008 2:29:00 PM
On Feb 13, 6:53 am, mathieu <mathieu.malate...@gmail.com> wrote:
> I do not understand what is wrong with the following regex expression.
> I clearly mark that the separator in between group 3 and group 4
> should contain at least 2 white space, but group 3 is actually reading
> 3 +4
>
> Thanks
> -Mathieu
>
> import re
>
> line = " (0021,xx0A) Siemens: Thorax/Multix FD Lab Settings
> Auto Window Width SL 1 "
> patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
> -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
> $")
<snip>
I love the smell of regex'es in the morning!
For more legible posting (and general maintainability), try breaking
up your quoted strings like this:
# Field gaps are two or more spaces; the exact widths shown are illustrative.
line = ("  (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings    "
        "Auto Window Width    SL    1 ")
patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")
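Of course, the problem is that you have a greedy match in the part of
the regex that is supposed to stop between "Settings" and "Auto". The
character class for group 3 includes a space, so the greedy + runs on
through the later columns and only backs off as far as it has to.
Make that quantifier non-greedy (+?) by changing patt to: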
patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")
or if you prefer:
patt = re.compile(r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_ -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$")
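To see the difference side by side, here is a minimal sketch using only
the standard library. The helper name build and the gap widths in the
sample line are assumptions for illustration; the post only says the
separator is "at least 2" spaces wide.

```python
import re

# Sample line; the gap widths (4 spaces here) are an assumption.
line = ("  (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings    "
        "Auto Window Width    SL    1 ")

def build(desc_quantifier):
    """Hypothetical helper: same pattern, with the given quantifier on group 3."""
    return re.compile(
        r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
        r"([A-Za-z0-9./:_ -]" + desc_quantifier + r")\s\s+"
        r"([A-Za-z0-9 ()._,/#>-]+)\s+"
        r"([A-Z][A-Z]_?O?W?)\s+"
        r"([0-9n-]+)\s*$")

greedy = build("+").match(line)    # original quantifier
lazy = build("+?").match(line)     # non-greedy quantifier

print(repr(greedy.group(3)))       # group 3 has swallowed group 4's text
print(repr(lazy.group(3)))         # group 3 stops before the first wide gap
print(repr(lazy.group(4).strip())) # group 4, with trailing padding removed
```

Group 4 is still greedy and its character class contains a space, so it
can pick up trailing padding; that is why the sketch strips it.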
It looks like you wrote this regex to process this specific input
string - it has a fragile feel to it, as if you will have to go back
and tweak it to handle other data that might come along, such as
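(xx42,xx0A)  Honeywell: Inverse Flitznoid (Kelvin)  80  SL  1
Here the lowercase x's in the first field and the parentheses in the
description both fall outside the character classes you allowed.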
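A quick check of that fragility, as a sketch rather than a
recommendation. The variable names and gap widths are assumptions, and
the split-based alternative at the end is a different technique from
the regex above: it only works if every field, not just groups 3 and 4,
is separated by two or more spaces.

```python
import re

# The non-greedy pattern from the reply.
patt = re.compile(
    r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

# Hypothetical second record; the gap widths are an assumption.
other = "  (xx42,xx0A)  Honeywell: Inverse Flitznoid (Kelvin)    80    SL    1 "

# Group 1 only allows [0-9A-Z], so "xx42" cannot match.
print(patt.match(other))

# Alternative under the stated assumption: split on runs of 2+ whitespace.
fields = re.split(r"\s{2,}", other.strip())
print(fields)
```

The split keeps the first field as one string, "(xx42,xx0A)", so the
two hex values would still need to be pulled apart afterwards.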
Just out of curiosity, I wondered what a pyparsing version of this
would look like. See below:
from pyparsing import (Word, hexnums, delimitedList, printables,
                       White, Regex, nums)
# Field gaps are two or more spaces; the exact widths shown are illustrative.
line = ("  (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings    "
        "Auto Window Width    SL    1 ")
# define fields
hexint = Word(hexnums+"x")
# words joined by exactly one space; a wider gap ends the field
text = delimitedList(Word(printables),
                     delim=White(" ",exact=1), combine=True)
type_label = Regex("[A-Z][A-Z]_?O?W?")
int_label = Word(nums+"n-")
# define line structure - give each field a name
line_defn = ("(" + hexint("x") + "," + hexint("y") + ")" +
             text("desc") + text("window") +
             type_label("type") + int_label("int"))
line_parts = line_defn.parseString(line)
print(line_parts.dump())
print(line_parts.desc)
Prints:
['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window Width', 'SL', '1']
- desc: Siemens: Thorax/Multix FD Lab Settings
- int: 1
- type: SL
- window: Auto Window Width
- x: 0021
- y: xx0A
Siemens: Thorax/Multix FD Lab Settings
I was just guessing on the field names, but you can see where they are
defined and change them to the appropriate values.
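For comparison, the same naming idea can be sketched with the standard
re module alone, using named groups and re.VERBOSE. The field names are
the same guesses as above, and the variable names and gap widths in the
sample line are assumptions for illustration.

```python
import re

# Sample line; the gap widths are an assumption.
line = ("  (0021,xx0A)  Siemens: Thorax/Multix FD Lab Settings    "
        "Auto Window Width    SL    1 ")

# re.VERBOSE ignores whitespace and # comments outside character classes,
# so the pattern can be laid out one field per line.
named = re.compile(r"""
    ^\s*
    \( (?P<x>[0-9A-Z]+) , (?P<y>[0-9A-Zx]+) \) \s+
    (?P<desc>[A-Za-z0-9./:_ -]+?)      \s\s+    # non-greedy: take as little as possible
    (?P<window>[A-Za-z0-9 ()._,/#>-]+) \s+
    (?P<type>[A-Z][A-Z]_?O?W?)         \s+
    (?P<int>[0-9n-]+)                  \s* $
    """, re.VERBOSE)

match = named.match(line)
# The window class contains a space, so strip padding from each value.
fields = {name: value.strip() for name, value in match.groupdict().items()}
print(fields["desc"])
print(fields["window"])
```

Renaming a field is then a one-word change inside the (?P<...>) that
defines it.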
-- Paul