comp.lang.ruby

[newbie] upper to lower first letter of a word

yvon.thoravallist

9/23/2003 4:30:00 PM

Recently I received a vintage list (more than 500 items) with poor
capitalization. For example, I have:

Côte de beaune-villages

instead of :

Côte de Beaune-Villages

Crémant d'alsace

instead of :

Crémant d'Alsace

I am wondering how to change lower case to upper case, and whether
a regex could do the trick.

Something like:

every letter following a " ", "-" or "'" should be upper-cased, unless the
word belongs to a black list:

black_list = %w{d de du la le sec sur entre etc...}

--
Yvon
21 Answers

Mark J. Reed

9/23/2003 4:38:00 PM

On Tue, Sep 23, 2003 at 06:29:58PM +0200, Yvon Thoraval wrote:
> Recently, i get a vintage list (more than 500 items) with poor typo, for
> example, i've :
>
> Côte de beaune-villages
>
> instead of :
>
> Côte de Beaune-Villages
>
> Crémant d'alsace
>
> instead of :
>
> Crémant d'Alsace
>
> i wonder of the way to change lower to upper case and also of
>
> a regex able to do the trick.
>
> something like :
>
> every letter following a " ", "-" or "'" should be upper if not
> belonging to a black list of words :
>
> black_list = %w{d de du la le sec sur entre etc...}

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

-Mark
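
For reference, the one-liner above can be tried as a small runnable sketch. The wrapper name `capitalize_words!` and the ASCII-only sample string are assumptions made for illustration, and the black list leaves out the elided "etc..." entries.

```ruby
# Words that must stay lower case (the thread's list, without the elided entries).
BLACK_LIST = %w{d de du la le sec sur entre}.freeze

# capitalize_words! is a hypothetical wrapper name around the posted one-liner.
def capitalize_words!(string)
  # \b anchors at the start of each word; [a-z]+ takes the lower-case ASCII run.
  string.gsub!(/\b[a-z]+/) { |w| BLACK_LIST.include?(w) ? w : w.capitalize }
  string # gsub! itself returns nil when nothing matched, so return the receiver
end

# .dup gives a mutable copy, since gsub! modifies its receiver in place.
puts capitalize_words!("vin de pays d'oc".dup)
```

Returning the receiver explicitly avoids surprises when a line contains no lower-case word at all.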

yvon.thoravallist

9/23/2003 5:04:00 PM

Mark J. Reed <markjreed@mail.com> wrote:

> string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)
--
Yvon

yvon.thoravallist

9/23/2003 5:24:00 PM

Yvon Thoraval <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote:

>
> > string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }
>
> Thanks a lot °;)

It seems it's a little trickier, because accented characters are
treated as word boundaries (\b). For example:

Vosne-romanée
becomes:
Vosne-RomanéE

So instead of \b I would have to exclude a list of chars:
[à|ä|â|é|è|ê|î|ö|ô|ü|ù]
--
Yvon

Mark J. Reed

9/23/2003 5:49:00 PM

On Tue, Sep 23, 2003 at 07:23:52PM +0200, Yvon Thoraval wrote:
> it seems, it's a little bit trickier because accentuated characters are
> taken as \b

Really? That's arguably a bug. What character encoding are you using?

Accented letters should be in \w, not \W, and therefore the
position between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they're probably
Latin-1-only. So, for instance, they wouldn't work on UTF-8 strings.

> Vosne-romanée
> becomes :
> Vosne-RomanéE
>
> then instead of \b i would have to exclude a list of chars :
> [à|ä|â|é|è|ê|î|ö|ô|ü|ù]

First, you don't need the pipes (|'s) there. Pipes are for
alternation outside the [...]; basically, [abc] is short for
(a|b|c), and a pipe inside brackets is just a literal pipe character.
The pipe form is most useful when the alternatives are
not all single characters, for instance (alfa|bravo|charlie).
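
The difference between the two notations can be checked directly. This is a minimal sketch; the constant names are made up for the example.

```ruby
# Inside brackets every character stands for itself, so "|" is matched literally.
WITH_PIPE    = /[a|b]/
WITHOUT_PIPE = /[ab]/

p "|".match?(WITH_PIPE)     # the pipe itself is a member of this class
p "|".match?(WITHOUT_PIPE)  # here it is not
p "b".match?(WITHOUT_PIPE)  # both classes still match "a" and "b"

# Outside brackets, "|" separates whole alternatives of any length.
ALTERNATION = /\A(alfa|bravo|charlie)\z/
p "bravo".match?(ALTERNATION)
p "br".match?(ALTERNATION)
```

So a class written with pipes still works for the letters, but it also silently accepts a literal "|" in the input.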

I'm not sure whether the exclude-list or the include-list would
be shorter. You could do (^|[- ']) to match "beginning of string or
dash or space or apostrophe", but then that character would be included
in the resulting string. Which means that it would be, for instance,
" d" or "-d" or "'d" instead of "d", and therefore won't be in the
blacklist and won't capitalize properly (since String#capitalize operates
on the first character, which will be the space or dash or apostrophe).
The block has to compensate for that. Something like this:

string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won't match accented characters, so it's more like this:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren't limited to French, then even more special characters
creep in . . .

-Mark

Mark J. Reed

9/23/2003 5:54:00 PM

On Tue, Sep 23, 2003 at 05:49:24PM +0000, Mark J. Reed wrote:
> string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

Left off the blacklist check, which should be applied to $2:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) {
  black_list.include?($2) ? $1 + $2 : $1 + $2.capitalize
}

-Mark
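
On a current Ruby the same idea can be sketched without listing accented letters by hand. The version below assumes Ruby 2.4 or later and UTF-8 strings, swaps the hand-written character list for the Unicode property `\p{Ll}` (any lower-case letter), and uses a hypothetical helper name `fix_case`; the black list omits the elided "etc..." entries.

```ruby
# Words that stay lower case (the thread's list, without the elided entries).
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# fix_case is a hypothetical helper name, not something from the thread.
def fix_case(name)
  # Group 1: start of line or a separator; group 2: a run of lower-case letters.
  # \p{Ll} also covers accented lower-case letters, unlike [a-z].
  name.gsub(/(^|[- '])(\p{Ll}+)/) do
    sep, word = $1, $2
    BLACK_LIST.include?(word) ? sep + word : sep + word.capitalize
  end
end

puts fix_case("Côte de beaune-villages")
puts fix_case("Crémant d'alsace")
puts fix_case("Vosne-romanée")
```

Because the separator is captured in group 1 and put back unchanged, the black list is tested against the bare word, as the correction above requires.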

yvon.thoravallist

9/23/2003 6:21:00 PM

Mark J. Reed <markjreed@mail.com> wrote:

> Really? That's arguably a bug. What character encoding are you using?

I'm (more or less) sure about that, because even if I use:

l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I get:
MâCon SupéRieur
when the input was:
Mâcon supérieur

> Accented letters should be in \w, not \W, and therefore the
> space between one and an adjacent letter should not match \b.
> But Ruby regexes may be ASCII-only, and even if not, they're probably
> Latin-1-only. So, for instance, they wouldn't work on UTF-8 strings.

Precisely, I'm using UTF-8 °;)
However, I can try with ISO-8859-1: my text editor (Pepper on Mac OS X)
can transcode from UTF to ISO within two clicks plus one cut-and-paste...
That sounds strange to me, because Ruby comes from Japan, where "special"
chars are everyday chars???
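
A result of this shape is consistent with the UTF-8 text being matched byte by byte. The sketch below simulates that on a current Ruby by switching the string to binary encoding; it is an illustration under that assumption, not a run of the 2003 interpreter. The helper name `bytewise_capitalize` is hypothetical, and it uses the plain [a-z] pattern on an all-lower-case input.

```ruby
# bytewise_capitalize is a hypothetical helper that imitates byte-oriented matching.
def bytewise_capitalize(text)
  # String#b returns a binary (ASCII-8BIT) copy, so each byte is one "character".
  raw = text.b
  # In binary encoding only ASCII letters, digits and "_" are word characters,
  # so both bytes of a UTF-8 "é" are non-word bytes and \b matches right after them.
  raw.gsub(/\b[a-z]+/) { |w| w.capitalize }.force_encoding("UTF-8")
end

p "é".bytes                                   # two bytes in UTF-8
puts bytewise_capitalize("mâcon supérieur")
```

Each accented letter splits its word in two, and the ASCII run after it gets capitalized as if it were a new word.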

[snip]

> The block has to compensate for that. Something like this:
>
> string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }
>
> Except that [a-z] won't match accented characters, so it's more like this:
>
> string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }
>
> And if the names aren't limited to French, then even more special characters
> creep in . . .

Yes, right. For the time being I only have to deal with French and German
accented chars...

However, because the vintages are classified by area, I might have to change
the regex per region...
--
Yvon

Robert Klemme

9/24/2003 9:53:00 AM

"Yvon Thoraval" <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote in message
news:1g1r8u8.1hzv3mvupjeizN%yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid...
> Mark J. Reed <markjreed@mail.com> wrote:
>
> > Really? That's arguably a bug. What character encoding are you using?
>
> I'm (more-or-less) sure about that because even if i put :
> l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I'd omit the "\b" at the beginning, since with it "é" is still followed by a
word boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

Regards

robert
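
One thing to watch with the second pattern on this data: the apostrophe is not among the excluded characters, so "d'alsace" is seen as a single word. The sketch below compares the pattern as posted with a variant that also splits on "'"; the variant, the helper name and the constant names are assumptions for illustration.

```ruby
# Words that stay lower case (the thread's list, without the elided entries).
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# capitalize_runs is a hypothetical helper applying the posted block to any pattern.
def capitalize_runs(name, pattern)
  name.gsub(pattern) { |w| BLACK_LIST.include?(w) ? w : w.capitalize }
end

AS_POSTED       = /[^\s!?.;:-]+/   # the pattern from the post
WITH_APOSTROPHE = /[^\s!?.;:'-]+/  # assumption: also treat "'" as a separator

puts capitalize_runs("crémant d'alsace", AS_POSTED)
puts capitalize_runs("crémant d'alsace", WITH_APOSTROPHE)
```

With the apostrophe excluded, "d" is looked up in the black list on its own and "alsace" is capitalized separately.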

yvon.thoravallist

9/24/2003 1:41:00 PM

Robert Klemme <bob.news@gmx.net> wrote:

> I'd omit the "\b" at the beginning since "é" then still matches a word
> boundary:
>
> l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Yes, fine. I also discovered that capitalization doesn't work on
accented chars (such as é),

so I've added another step for those "special" chars when they are the first
letter of a word.

> Alternatively:
>
> l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

OK; however, my list has no punctuation such as ?!;:... only " " and "-"
--
Yvon

Carlos

9/26/2003 4:28:00 PM

> yes, fine, i discovered also that capitalization don't work on
> accentuated chars (as é)

You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?na...
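
For readers on Ruby 2.4 or later (an assumption about the interpreter in use), the built-in case methods apply Unicode case mapping themselves, so the same check can be sketched without an extension:

```ruby
# Since Ruby 2.4, String#capitalize and String#upcase handle non-ASCII letters.
puts "àëíôů".capitalize
puts "é".upcase
```

On older interpreters these methods only change ASCII letters, which matches the behaviour described earlier in the thread.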


yvon.thoravallist

9/26/2003 5:04:00 PM

Carlos <angus@quovadis.com.ar> wrote:

>
> You can use an old library named unicode:
>
> irb(main):001:0> $KCODE="u"
> => "u"
> irb(main):002:0> require "unicode"
> => true
> irb(main):003:0> Unicode.capitalize("àëíôů")
> => "Àëíôů"
>
> http://raa.ruby-lang.org/list.rhtml?na...

Thanks for everything!
--
Yvon