Asp Forum - returning regex matches as lists

Jonathan Lukens

2/15/2008 7:07:00 PM

I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

Thank you,
Jonathan

13 Answers

John Machin

2/15/2008 8:25:00 PM

On Feb 16, 6:07 am, Jonathan Lukens <jonathan.luk...@gmail.com> wrote:
> I am in the last phase of building a Django app based on something I
> wrote in Java a while back. Right now I am stuck on how to return the
> matches of a regular expression as a list *at all*, and in particular
> given that the regex has a number of groupings. The only method I've
> seen that returns a list is .findall(string), but then I get back the
> groups as tuples, which is sort of a problem.
>

It would help if you explained what you want the contents of the list
to be, why you want a list as opposed to a tuple or a generator or
whatever ... we can't be expected to imagine why getting groups as
tuples is "sort of a problem".

Use a concrete example, e.g.

>>> import re
>>> regex = re.compile(r'(\w+)\s+(\d+)')
>>> text = 'python 1 junk xyzzy 42 java 666'
>>> r = regex.findall(text)
>>> r
[('python', '1'), ('xyzzy', '42'), ('java', '666')]
>>>

What would you like to see instead?

Jonathan Lukens

2/15/2008 9:26:00 PM

> What would you like to see instead?

I had mostly just expected that there was some method that would
return each entire match as an item on a list. I have this pattern:

>>> import re
>>> corporate_names = re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
>>> terms = corporate_names.findall(sourcetext)

Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:

>>> terms
[u'string one', u'string two', u'string three']

...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion from
someone on the list, I just used list() on all the tuples like so:

>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

I appreciate the help.

Jonathan

Gabriel Genellina

2/16/2008 1:18:00 AM

On Fri, 15 Feb 2008 17:07:21 -0200, Jonathan Lukens
<jonathan.lukens@gmail.com> wrote:

> I am in the last phase of building a Django app based on something I
> wrote in Java a while back. Right now I am stuck on how to return the
> matches of a regular expression as a list *at all*, and in particular
> given that the regex has a number of groupings. The only method I've
> seen that returns a list is .findall(string), but then I get back the
> groups as tuples, which is sort of a problem.

Do you want something like this?

py> re.findall(r"([a-z]+)([0-9]+)", "foo bar3 w000 no abc123")
[('bar', '3'), ('w', '000'), ('abc', '123')]
py> re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> groups = re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
py> groups
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> [group[0] for group in groups]
['bar3', 'w000', 'abc123']
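
A related option, sketched here under the assumption that the inner groups are needed only for structure and not for extraction: when a pattern has no capturing groups, findall() returns each whole match as a plain string. Non-capturing groups, written (?:...), keep the grouping without producing tuples.

```python
import re

text = "foo bar3 w000 no abc123"

# (?:...) groups without capturing, so findall() sees a pattern with
# no capturing groups and returns each entire match as a string.
whole_matches = re.findall(r"(?:[a-z]+)(?:[0-9]+)", text)
print(whole_matches)  # ['bar3', 'w000', 'abc123']
```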

--
Gabriel Genellina

Gabriel Genellina

2/16/2008 1:32:00 AM

On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens
<jonathan.lukens@gmail.com> wrote:

>> What would you like to see instead?
>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list. I have this pattern:
>
>>>> import re
>>>> corporate_names =
>>>> re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
>>>> terms = corporate_names.findall(sourcetext)
>
> Which matches a specific way that Russian company names are
> formatted. I was expecting a method that would return this:
>
>>>> terms
> [u'string one', u'string two', u'string three']
>
> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet. At the suggestion from
> someone on the list, I just used list() on all the tuples like so:

The group() method of match objects does what you want:

terms = [match.group() for match in corporate_names.finditer(sourcetext)]

See http://docs.python.org/lib/match-ob...
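
A minimal runnable sketch of the finditer() approach, assuming Python 3 and reusing the simple pattern and sample text from earlier in the thread instead of the Cyrillic one. The names whole and parts are only illustrative.

```python
import re

# Sample pattern and text borrowed from the earlier example in this thread.
regex = re.compile(r'(\w+)\s+(\d+)')
text = 'python 1 junk xyzzy 42 java 666'

# group() with no argument (same as group(0)) is the entire match,
# regardless of how many capturing groups the pattern contains.
whole = [m.group() for m in regex.finditer(text)]
print(whole)  # ['python 1', 'xyzzy 42', 'java 666']

# groups() returns the per-group tuple, i.e. what findall() would give.
parts = [m.groups() for m in regex.finditer(text)]
print(parts)  # [('python', '1'), ('xyzzy', '42'), ('java', '666')]
```

The match objects still carry the individual groups, so nothing is lost by iterating with finditer() instead of calling findall().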

>>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
>>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]
>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.

That ''.join(...) works equally well on tuples; you don't have to convert
tuples to lists first:

delisted_terms = [''.join(term) for term in terms]
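
One caveat worth checking before relying on the join: when a group is repeated with * or +, findall() keeps only the last repetition of that group, so joining the groups may not reproduce the whole match. The sketch below assumes Python 3 and uses a made-up ASCII pattern with the same shape as the repeated group in the company-name regex.

```python
import re

# Hypothetical pattern: a word followed by zero or more "-word" parts.
# The second group is repeated, like the (\s*-?[...]+)* group above.
pattern = re.compile(r'([a-z]+)(-[a-z]+)*')
text = 'alpha-beta-gamma delta'

# findall() reports only the LAST repetition of the repeated group
# (and '' when the group did not take part in the match).
joined = [''.join(groups) for groups in pattern.findall(text)]
print(joined)  # ['alpha-gamma', 'delta']

# group() on the match object always returns the full matched text.
whole = [m.group() for m in pattern.finditer(text)]
print(whole)   # ['alpha-beta-gamma', 'delta']
```

For names with more than one hyphenated or spaced part, the finditer() version is therefore the safer of the two.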

--
Gabriel Genellina

Jonathan Lukens

2/16/2008 5:14:00 AM

On Feb 15, 8:31 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens
> <jonathan.luk...@gmail.com> wrote:
>
>
>
> >> What would you like to see instead?
>
> > I had mostly just expected that there was some method that would
> > return each entire match as an item on a list. I have this pattern:
>
> >>>> import re
> >>>> corporate_names =
> >>>> re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
> >>>> terms = corporate_names.findall(sourcetext)
>
> > Which matches a specific way that Russian company names are
> > formatted. I was expecting a method that would return this:
>
> >>>> terms
> > [u'string one', u'string two', u'string three']
>
> > ...mostly because I was working it this way in Java and haven't
> > learned to do things the Python way yet. At the suggestion from
> > someone on the list, I just used list() on all the tuples like so:
>
> The group() method of match objects does what you want:
>
> terms = [match.group() for match in corporate_names.finditer(sourcetext)]
>
> See http://docs.python.org/lib/match-ob...
>
> >>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
> >>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]
>
> > which achieves the desired result, but I am not a programmer and so I
> > would still be interested to know if there is a more elegant way of
> > doing this.
>
> That ''.join(...) works equally well on tuples; you don't have to convert
> tuples to lists first:
>
> delisted_terms = [''.join(term) for term in terms]
>
> --
> Gabriel Genellina

Thanks Gabriel,

That is just what I was looking for.

Jonathan

John Machin

2/16/2008 11:44:00 AM

On Feb 16, 8:25 am, Jonathan Lukens <jonathan.luk...@gmail.com> wrote:
> > What would you like to see instead?
>
> I had mostly just expected that there was some method that would
> return each entire match as an item on a list. I have this pattern:
>
> >>> import re
> >>> corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
> >>> terms = corporate_names.findall(sourcetext)
>
> Which matches a specific way that Russian company names are
> formatted. I was expecting a method that would return this:
>
> >>> terms
>
> [u'string one', u'string two', u'string three']

What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?

Other comments:
(1) raw string for improved legibility
ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'
(2) consider not including space at the end of a group
ur'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?
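
Points (1) and (3) can be checked directly. This is a small sketch assuming Python 3; the sample text is made up for the demonstration.

```python
import re

# Hypothetical sample text containing '<' and '"' characters.
text = 'a<b"c<<d'

# Inside [], a repeated character adds nothing: both sets hold '<' and '"',
# and each match is a single character, never the two-character '<<'.
single = re.findall(r'[<<"]', text)
same_set = single == re.findall(r'[<"]', text)
print(single)    # ['<', '"', '<', '<']
print(same_set)  # True

# A raw string spells the same pattern without doubled backslashes.
same_literal = '\\b\\w+\\s*' == r'\b\w+\s*'
print(same_literal)  # True
```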

>
> ...mostly because I was working it this way in Java and haven't
> learned to do things the Python way yet. At the suggestion from
> someone on the list, I just used list() on all the tuples like so:
>
> >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
> >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]
>
> which achieves the desired result, but I am not a programmer and so I
> would still be interested to know if there is a more elegant way of
> doing this.

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness
axis first.

Cheers,
John

Jonathan Lukens

2/16/2008 12:28:00 PM

John,

> (1) raw string for improved legibility
> ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'

This actually escaped my notice after I had posted -- the letters with
diacritics are incorrectly decoded Cyrillic letters -- I suppose I
could use the Unicode escape sequences (the sets [á-ñ] and [Á-Ñá-ñ] are
the Cyrillic equivalents of [a-z] and [A-Za-z]) but then suddenly the
legibility goes out the window again.

> (3) what appears between [] is a set of characters, so [<<"] is the
> same as [<"] and probably isn't doing what you expect; have you tested
> this regex for correctness?

These were angled quotation marks in the original Unicode. Sorry
again. The regex matches everything it is supposed to. The extra
parentheses were because I had somehow missed the .group method and it
had only been returning what was only in the one needed set of
parentheses.
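
One way to use escape sequences without losing too much legibility is to give the ranges names. This is a sketch under several assumptions: Python 3.6+ (for rf-strings), the ranges А-Я and а-я (which exclude ё), angled quotes « and », and invented company names as test data. The names UPPER, LOWER, OPEN_Q and CLOSE_Q are illustrative, and the groups are non-capturing because only the whole match is wanted.

```python
import re

# Named ranges keep the source ASCII-only while staying readable.
UPPER = '\u0410-\u042F'                 # Cyrillic А-Я
LOWER = '\u0430-\u044F'                 # Cyrillic а-я
OPEN_Q, CLOSE_Q = '\u00AB', '\u00BB'    # angled quotes « and »

# In an rf-string, {{2,}} produces the literal quantifier {2,}.
corporate_names = re.compile(
    rf'\b[{UPPER}]{{2,}}\s+'
    rf'[{OPEN_Q}"][{LOWER}{UPPER}]+'
    rf'(?:\s*-?[{LOWER}{UPPER}]+)*'
    rf'[{CLOSE_Q}"]'
)

# Made-up sample text, not taken from the original application.
sourcetext = 'ООО «Ромашка» и ЗАО "Альфа-Банк"'

names = [m.group() for m in corporate_names.finditer(sourcetext)]
print(names)  # ['ООО «Ромашка»', 'ЗАО "Альфа-Банк"']
```

In Python 3, str patterns are Unicode-aware by default, so the (?u) flag from the original pattern is not needed here.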

> I can't imagine how "not a programmer" implies "interested to know if
> there is a more elegant way".

More carefully stated: "I am self-taught, have no real training or
experience as a programmer, and would be interested in seeing how a
programmer with training and experience would go about this."

Thank you,
Jonathan

comp.lang.python
