Asp Forum - regexp with accent insensitive ??

Davi Barbosa

10/12/2008 8:00:00 PM

Hello,
Is there any way to make the regexp accent-insensitive? (/a/ match with ã
and Á)

If not, can anyone give a solution to my problem:
I'm making a search web page with mod_ruby, so I made an
accent/case-insensitive SQL query and this works fine (with
latin1_swedish_ci). Now I want to highlight what the user searched for.
To achieve this I'm doing something *like*:
string.gsub(/search/i,'<span class="highlight">\0</span>')
This works fine if search and the relevant part of string don't have
accents, but if there are any accents it doesn't match, so the entry is
not highlighted.

I know that with
Iconv.conv("ascii//translit","UTF-8",str)
I can remove all the accents from str, so I can remove the accents from
'search' without any problem, but if I remove accents from string
to do the highlighting, I need to put them back later to display it to the
user.
Does anyone have any idea?

Thank you
--
Posted via http://www.ruby-....

3 Answers

Ken Bloom

10/13/2008 3:30:00 AM

On Sun, 12 Oct 2008 15:00:29 -0500, Davi Barbosa wrote:

> Hello,
> Is there any way to make the regexp accent-insensitive? (/a/ match with ã
> and Á)
>
> If not, can anyone give a solution to my problem: I'm making a search
> web page with mod_ruby, so I made an accent/case-insensitive SQL query
> and this works fine (with latin1_swedish_ci). Now I want to highlight
> what the user searched for. To achieve this I'm doing something *like*:
> string.gsub(/search/i,'<span class="highlight">\0</span>')
> This works fine if search and the relevant part of string don't have
> accents, but if there are any accents it doesn't match, so the entry is
> not highlighted.
>
> I know that with
> Iconv.conv("ascii//translit","UTF-8",str) I can remove all the accents
> from str, so I can remove the accents from 'search' without any problem,
> but if I remove accents from string to do the highlighting, I need
> to put them back later to display it to the user.
> Does anyone have any idea?
>
> Thank you

In that case, I would try replacing the letters that may carry an accent
with periods (which match any single character) when searching. This will
give some false positives, so I would use gsub with a block to do a more
specific conditional test.

Suppose the search is for ole (typed without the accent, while the real
hits have an accent on the e), and the text is in a language that allows
accents only on the letter e.

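require 'iconv'

query = 'ole'
# every letter that may carry an accent becomes '.', which matches any character
pattern = Regexp.compile(query.gsub(/[e]/, '.')) #=> /ol./

translit = Iconv.conv("ascii//translit", "UTF-8", query) #=> "ole"

# string is the text to highlight, as in the question
highlighted = string.gsub(pattern) do |match|
  # use the regular expression to get close enough, and to get
  # the actual text we're concerned about
  if Iconv.conv("ascii//translit", "UTF-8", match) == translit
    # the if test does the actual exact comparison
    "<span class=\"highlight\">#{match}</span>"
  else
    match
  end
end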

Of course, there may be some locale tricks that I'm missing that would
make this much easier.

--
Chanoch (Ken) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu...

Davi Barbosa

10/13/2008 8:23:00 PM

Thank you for your answer, but I'm working with a lot of languages, so I
don't know where someone can put an accent.

For the moment, I just discovered that I can't remove the accents with
Iconv like I said before. Here, it works only under irb. I described
this problem here: http://www.ruby-...topic/70...
Another problem with UTF-8 under Ruby is that Ruby can't index the string
correctly. For example: 'áb'[1..1] gives the second half of 'á'. I
discovered how to work around this using the Unicode version of regexp:
$KCODE = 'u'
'áb'.split(//m) == ["á", "b"]

Without these problems, I think that I know how to do it without false
matches, with an ugly loop.
If str and regexp are the versions without accents, str =~ regexp gives
the position of the match and str[regexp].length the length. With these
two numbers, it's possible to do the highlighting in the original string.
It's something like:
ascii_string = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',string)
ascii_search = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',search)
regexp = Regexp.new(Regexp.escape(ascii_search),true)
position = (ascii_string =~ regexp)   # nil when there is no match
size = ascii_string[regexp].length
# 0...position is empty when the match starts at 0 (0..(position-1) would return the whole string)
highlighted = ascii_string[0...position]+'<span class="highlight">'+ascii_string[position,size]+'</span>'+ascii_string[(position+size)..-1]

Of course, it needs some modifications to put this in a loop (and I need
to use the vector version of the string to index the string correctly).
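
The position-mapping idea described above can be sketched as follows. This is only an illustration under assumptions that go beyond the thread: it targets a much later Ruby (2.5 or newer) and uses String#unicode_normalize and String#grapheme_clusters instead of Iconv and the $KCODE split trick. The helper names strip_accents and highlight, and the span markup, are hypothetical.

```ruby
# Hypothetical helper: decompose to NFD, then drop combining marks (accents).
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Hypothetical helper: wrap every accent- and case-insensitive match of
# `search` in `string` with the given tags, keeping the original accents.
def highlight(string, search, open_tag = '<span class="highlight">', close_tag = '</span>')
  clusters = string.grapheme_clusters  # the "vector version" of the string
  plain = +''   # accent-free copy, used only for searching
  owner = []    # owner[i] = index in clusters that produced plain[i]
  clusters.each_with_index do |cluster, i|
    stripped = strip_accents(cluster)
    plain << stripped
    stripped.length.times { owner << i }
  end

  needle = strip_accents(search)
  return string if needle.empty?
  pattern = Regexp.new(Regexp.escape(needle), Regexp::IGNORECASE)

  result = +''
  cursor = 0    # next cluster of the original not yet copied to result
  pos = 0       # where the next search starts in plain
  while (m = pattern.match(plain, pos))
    first = [owner[m.begin(0)], cursor].max
    last = owner[m.end(0) - 1]
    result << clusters[cursor...first].join << open_tag
    result << clusters[first..last].join << close_tag
    cursor = last + 1
    pos = m.end(0)
  end
  result << clusters[cursor..-1].join
end

puts highlight("Ol\u00E1, Jos\u00E9", "jose")
```

The search runs on the accent-free copy, but the output is assembled only from pieces of the original string, so nothing has to be "put back" afterwards. Note that NFD stripping does not transliterate letters such as ß the way Iconv's //TRANSLIT may.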

Ken Bloom

10/16/2008 12:39:00 AM

On Mon, 13 Oct 2008 15:23:23 -0500, Davi Barbosa wrote:

> Thank you for your answer, but I'm working with a lot of languages, so I
> don't know where someone can put an accent.
>
> For the moment, I just discovered that I can't remove the accents with
> Iconv like I said before. Here, it works only under irb. I described
> this problem here: http://www.ruby-forum.com/topic/70...
> Another problem with UTF-8 under Ruby is that Ruby can't index the string
> correctly. For example: 'áb'[1..1] gives the second half of 'á'. I
> discovered how to work around this using the Unicode version of regexp:
> $KCODE = 'u'
> 'áb'.split(//m) == ["á", "b"]
>
> Without these problems, I think that I know how to do it without false
> matches, with an ugly loop.
> If str and regexp are the versions without accents, str =~ regexp gives
> the position of the match and str[regexp].length the length. With these
> two numbers, it's possible to do the highlighting in the original string.
> It's something like:
> ascii_string = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',string)
> ascii_search = Iconv.conv('US-ASCII//TRANSLIT','UTF-8',search)
> regexp = Regexp.new(Regexp.escape(ascii_search),true)
> position = (ascii_string =~ regexp)
> size = ascii_string[regexp].length
> highlighted = ascii_string[0...position]+'<span class="highlight">'+ascii_string[position,size]+'</span>'+ascii_string[(position+size)..-1]
>
> Of course, it needs some modifications to put this in a loop (and I need
> to use the vector version of the string to index the string correctly).

You can use a StringScanner (require 'strscan') to properly do this in a
loop, because StringScanner#pos will tell you the starting position of
the match, where String#scan will not.
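
A minimal sketch of such a loop is shown here. The helper name match_positions is hypothetical, and the sketch assumes the scanned text is the accent-stripped, ASCII-only copy, so that byte offsets and character offsets coincide. After a successful scan_until, StringScanner#pos points just past the match, so the start is pos minus the length of the match.

```ruby
require 'strscan'

# Hypothetical helper: returns [start, length] pairs for every match of
# pattern in text. StringScanner#pos counts bytes, which equals characters
# only for ASCII-only text (the assumption made here).
def match_positions(text, pattern)
  scanner = StringScanner.new(text)
  positions = []
  while scanner.scan_until(pattern)
    length = scanner.matched.bytesize
    break if length.zero?  # an empty match would never advance the scanner
    positions << [scanner.pos - length, length]
  end
  positions
end

p match_positions("Sao Paulo e sao luis", /sao/i)
```

Each pair can then be used to cut the corresponding slice out of the original, accented string.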

Consider whether Ruby 1.9.0 is stable enough for your purposes because it
handles Unicode natively and should save you from needing to have a
vector version of the string.
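
For illustration, this is how character indexing behaves on Ruby 1.9 and later with a UTF-8 string (the escape \u00E1 is the letter á; the variable name s is arbitrary):

```ruby
s = "\u00E1b"        # the two-character string from the example above

p s.length           # number of characters
p s.bytesize         # number of bytes in UTF-8
p s[0]               # first character, both of its bytes together
p s[1]               # second character
p s.chars            # array of characters, no $KCODE or split(//) needed
p s.bytes            # the raw bytes, if they are ever needed
```

Indexing works per character here, so s[1] is "b" and the accented letter is never cut in half.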


comp.lang.ruby
