Asp Forum - regex select multiple words in the middle of a sentence

Raimon Fs

4/7/2009 9:50:00 AM

hello,

Given a sentence with some words, I want only some words but not all of
them ...

REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION
CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P

In this case, I'm interested in the full name:

MANUELA ADORACION CEBOLLA GARCIA

I know that every name is preceded by the pattern EMISOR:

And after the full name, the address is lowercase except the first char.

With this pattern I can find all the uppercase words: \w*[A-Z]{2}\b

But I'm only interested in the full name, so if I use: EMISOR:
\w*[A-Z]{2}\b

I only get the first name MANUELA

How can I get from there to the end of the name?

Any help ?

thanks ...

r.
--
Posted via http://www.ruby-....

11 Answers

Robert Dober

4/7/2009 10:24:00 AM

On Tue, Apr 7, 2009 at 11:50 AM, Raimon Fs <coder@montx.com> wrote:
> hello,
>
> Given a sentence with some words, I want only some words but not all of
> them ...
>
>
> REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION
> CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P
>
> In this case, I'm interested in the full name:
>
> MANUELA ADORACION CEBOLLA GARCIA
>
> I now that all names precede with the pattern EMISOR:
>
> And after the full name, the address is lowercase except the first char.
>
>
> With this patter I can find all the uppercase words: \w*[A-Z]{2}\b
>
> But I'm only interested in the full name, so if I use: EMISOR:
> \w*[A-Z]{2}\b
>
> I only get the first name MANUELA
>
> How I can get from there to the end of the name ?
>
> Any help ?

With 1.9's Oniguruma (is it available for 1.8?) it's quite easy

scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten

you might want to use Unicode strings though and POSIX Character Classes

/EMISOR:\s*((?:[:upper:].... [:upper:][:lower:])/

HTH
Robert

P.S.
If you need a 1.8 version, tell me; I will switch to 1.8 when I find some time.


--
There are some people who begin the Zoo at the beginning, called
WAYIN, and walk as quickly as they can past every cage until they get
to the one called WAYOUT, but the nicest people go straight to the
animal they love the most, and stay there. ~ A.A. Milne (from
Winnie-the-Pooh)

Robert Dober

4/7/2009 10:38:00 AM

On Tue, Apr 7, 2009 at 12:24 PM, Robert Dober <robert.dober@gmail.com> wrote:
> scan( /EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/ ).flatten
Forgive YLHS (L for lazy)

scan( /EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/ ).flatten

I was too greedy and got you trailing spaces.
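
To see the difference between the two patterns above, here is a small sketch run against the sample line from the original post. The variable names (line, greedy, lazy) are illustrative only and do not come from the thread.

```ruby
line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

# First attempt: \s* inside the group is greedy, so the space before
# "Padre" ends up inside the capture.
greedy = line.scan(/EMISOR:\s*((?:[A-Z]+\s*)+)(?=[A-Z][a-z])/).flatten

# Corrected version: \s*? is lazy, and the lookahead allows for the
# space before "Padre" without capturing it.
lazy = line.scan(/EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/).flatten

p greedy  # => ["MANUELA ADORACION CEBOLLA GARCIA "]
p lazy    # => ["MANUELA ADORACION CEBOLLA GARCIA"]
```

Calling strip on the first result would also work, but the lazy quantifier keeps the fix inside the pattern.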
R.

Matthias Reitinger

4/7/2009 11:45:00 AM

Robert Dober wrote:
> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy

Yes[1].

-Matthias

[1]: http://oniguruma.ruby...

Robert Dober

4/7/2009 1:23:00 PM

On Tue, Apr 7, 2009 at 1:49 PM, Matthias Reitinger
<reitinge+usenet@in.tum.de> wrote:
> Robert Dober wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
> Yes[1].
>
> -Matthias
>
> [1]: http://oniguruma.ruby...
>
>
Matthias, thanks a lot for the link!
Just another thing for the OP: I just filed a bug report about the POSIX
character classes, so I would discourage you from using [:lower:] and
[:upper:]. However, it seems that \p{Lower} and \p{Upper} work nicely
if you need Unicode support.
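
As a rough illustration of the \p{Upper} / \p{Lower} suggestion, the sketch below assumes Ruby 1.9 or later and UTF-8 strings. The sample line with accented capitals is hypothetical and not taken from the original post.

```ruby
# Hypothetical sample line with accented capitals (not from the thread).
line = "EMISOR: JOSÉ MUÑOZ Calle Mayor, 3"

# ASCII-only classes: [A-Z] does not match É or Ñ, so nothing is found.
ascii = line.scan(/EMISOR:\s*((?:[A-Z]+\s*?)+)(?=\s*[A-Z][a-z])/).flatten

# Unicode property classes match uppercase and lowercase letters
# outside the ASCII range as well.
unicode = line.scan(/EMISOR:\s*((?:\p{Upper}+\s*?)+)(?=\s*\p{Upper}\p{Lower})/).flatten

p ascii
p unicode
```

With this input, ascii comes back empty while unicode holds the full name.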
Cheers
Robert


Mark Thomas

4/7/2009 1:58:00 PM

> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy

This shorter one works in 1.8

scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten

I'm curious as to what Oniguruma-specific feature you used in yours.

-- Mark.

Robert Dober

4/7/2009 4:48:00 PM

On Tue, Apr 7, 2009 at 3:59 PM, Mark Thomas <mark@thomaszone.com> wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
>
> This shorter one works in 1.8
>
> scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten
>
>
> I'm curious as to what Oniguruma-specific feature you used in yours.
None, apparently ;)
I thought, however, that "(?=" was; I never realized it was already there in 1.8.
Do [:lower:] and [:upper:] work in 1.8?
R.

Raimon Fs

4/7/2009 6:24:00 PM

Mark Thomas wrote:
>> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
>
> This shorter one works in 1.8
>
> scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten
>
>
> I'm curious as to what Oniguruma-specific feature you used in yours.
>
> -- Mark.

Thanks to all. At the moment Ruby 1.8.7 is enough for me, so I'm going with
this one, which works perfectly.

Can you explain why this works ?

:-)

/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/

EMISOR:\s is clear to me, but why doesn't it appear later in the array?
Because it isn't inside ()?

The * is also clear

([\w\s]+?) means select all uppercase words/letters ?

(?=\s*[A-Z][a-z]) until you reach a space between uppercase and
uppercase with lowercase later?

thanks for your help ...

regards,

r.

--
Posted via http://www.ruby-....

Mark Thomas

4/7/2009 7:43:00 PM

On Apr 7, 2:23 pm, Raimon Fs <co...@montx.com> wrote:
> Mark Thomas wrote:
> >> With 1.9's Oniguruma (is it available for 1.8?) it's quite easy
>
> > This shorter one works in 1.8
>
> > scan(/EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/).flatten
>
> > I'm curious as to what Oniguruma-specific feature you used in yours.
>
> > -- Mark.
>
> thanks to all, at this moment I have enough with Ruby 1.8.7, so I'm with
> this one, that works perfectly.
>
> Can you explain why this works ?
>
> :-)
>
> /EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/
>
> EMISOR:\s is clear to me, but why it doesn't appear later in the array,
> because it hasn't () ?
>
> The * is also clear
>
> ([\w\s]+?) means select all uppercase words/letters ?

[\w\s] is a character class that matches "word characters" or spaces.
The + makes it one or more. The ? means make it non-greedy (only match
the minimum to make it true).

> (?=\s*[A-Z][a-z]) until you reach a space between uppercase and
> uppercase with lowercase later?

The (?= ) is a lookahead assertion. It checks for a match ahead
without consuming or capturing it. So as soon as the text ahead is
optional whitespace followed by an uppercase and then a lowercase
letter, the preceding group stops matching.
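
The points above, and the earlier question about why EMISOR: is missing from the array, can be checked with a short sketch against the sample line from the original post. The variable names are illustrative only.

```ruby
line = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
       "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"
regex = /EMISOR:\s*([\w\s]+?)(?=\s*[A-Z][a-z])/

# When a pattern contains groups, scan returns only the captured groups,
# which is why "EMISOR:" (outside the parentheses) is not in the result.
with_group    = line.scan(regex).flatten
without_group = line.scan(/EMISOR:\s*[\w\s]+?(?=\s*[A-Z][a-z])/)

# The lookahead is checked but not consumed: the address is still
# available in post_match once the match has ended.
rest = line.match(regex).post_match

# Without the ?, + is greedy and runs on into the address.
greedy = line[/EMISOR:\s*([\w\s]+)(?=\s*[A-Z][a-z])/, 1]

p with_group     # => ["MANUELA ADORACION CEBOLLA GARCIA"]
p without_group  # => ["EMISOR: MANUELA ADORACION CEBOLLA GARCIA"]
p rest           # => " Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"
p greedy         # => "MANUELA ADORACION CEBOLLA GARCIA Padre "
```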

-- Mark.

Kyle Schmitt

4/7/2009 8:09:00 PM

You've gotten a lot of good suggestions here, but I figured I'd toss in my own.

Instead of scanning, you can use the "[]" operator on a string, and
pass a second argument after a "," to pull out just the captured
portion of the regex.

"i like APPLES AND Bananas"[/[A-Z]+/]
=>"APPLES"

#now lets make it bigger
"i like APPLES AND Bananas"[/[A-Z ]+ [A-Z][a-z]/]
=> " APPLES AND Ba"

#Too much stuff so just select what we want.. using ()
#AND, use ,1 to just get the saved portion of the match
"i like APPLES AND Bananas"[/([A-Z ]+) [A-Z][a-z]/,1]
=> " APPLES AND"

#Now applying this to your situation...

regex=/EMISOR: ([A-Z ]+) [A-Z][a-z]/
#this is assuming there are no \n \r or \t chars in the middle, easy
#enough to fix if there are

#the string is split across two source lines with "\" so that it contains no newline
line="REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION " \
"CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

line[regex,1]
=> "MANUELA ADORACION CEBOLLA GARCIA"
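
Following up on the note above about \n, \r or \t characters: one possible way to tolerate them is to use \s in place of the literal spaces and then normalise the whitespace in the result. This is only a sketch; the wrapped string and the variable names are assumptions for illustration.

```ruby
# Same sample text, but with a hard line break inside the name, as
# happens when the line is wrapped.
wrapped = "REGISTRO DE LA PROPIEDAD DE ALBACETE X EMISOR: MANUELA ADORACION\n" \
          "CEBOLLA GARCIA Padre Romano, 12 2005 ALBACETE ALBACETE NIF: 44444444P"

# \s matches spaces, tabs and line breaks alike; the lazy +? stops at the
# first point where whitespace plus a capitalised word follows.
regex = /EMISOR:\s*([A-Z\s]+?)\s+[A-Z][a-z]/

raw  = wrapped[regex, 1]
name = raw.split.join(" ")  # collapse each run of whitespace to one space

p raw   # => "MANUELA ADORACION\nCEBOLLA GARCIA"
p name  # => "MANUELA ADORACION CEBOLLA GARCIA"
```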

--Kyle

Raimon Fs

4/7/2009 9:13:00 PM

thanks to all, I've learned a lot today ...

:-)

regards and many, many, many thanks !!!

r.
--
Posted via http://www.ruby-....

comp.lang.ruby
