Asp Forum - separate Chinese and English! with Ruby

Nanyang Zhan

5/7/2007 7:39:00 AM

Don't get me wrong: I just want to know how to separate the English
words from a string with Ruby.
There are UTF-8 encoded strings that record people's names,
like:

摩根·弗里曼 Morgan Freeman
布鲁斯·威利斯 Bruce Willis
李小明 Lee xiao ming
Each of these strings contains a Chinese name (with no spaces between the
characters), then a space, then an English name,

or
Frank Darabont
just an English name.

Could you give me an idea of how to separate out the Chinese characters (if
any)?

--
Posted via http://www.ruby-....

29 Answers

akbarhome

5/7/2007 9:13:00 AM

On May 7, 2:39 pm, Nanyang Zhan <s...@hotmail.com> wrote:
> Don't get me wrong, because I just want to know how to separate English
> words from a string with ruby.
> There are strings (UTF-8 encoded) to record people's name,
> like:
>
> 摩根·弗里曼 Morgan Freeman
> 布鲁斯·威利斯 Bruce Willis
> 李小明 Lee xiao ming
> these strings containing Chinese name(without space between characters),
> separated by a space, following an English name
>
> or
> Frank Darabont
> Just an English name.
>
> Would you give me an idea how to separate these Chinese characters(if
> any)?
>
> --
> Posted via http://www.ruby-....

a = File.open('a.txt')
a.each {|x| puts x.split(' ', 2) }

Output:
摩根·弗里曼
Morgan Freeman
布鲁斯·威利斯
Bruce Willis
李小明
Lee xiao ming

akbarhome

5/7/2007 9:19:00 AM

On May 7, 4:12 pm, akbarhome <akbarh...@gmail.com> wrote:
> On May 7, 2:39 pm, Nanyang Zhan <s...@hotmail.com> wrote:
>
>
>
> > Don't get me wrong, because I just want to know how to separate English
> > words from a string with ruby.
> > There are strings (UTF-8 encoded) to record people's name,
> > like:
>
> > 摩根·弗里曼 Morgan Freeman
> > 布鲁斯·威利斯 Bruce Willis
> > 李小明 Lee xiao ming
> > these strings containing Chinese name(without space between characters),
> > separated by a space, following an English name
>
> > or
> > Frank Darabont
> > Just an English name.
>
> > Would you give me an idea how to separate these Chinese characters(if
> > any)?
>
> > --
> > Posted via http://www.ruby-....
>
> a = File.open('a.txt')
> a.each {|x| puts x.split(' ', 2) }
> Output:
> 摩根·弗里曼
> Morgan Freeman
> 布鲁斯·威利斯
> Bruce Willis
> 李小明
> Lee xiao ming

Sorry. Fixed version:
a.each {|x|
  # In Ruby 1.8, x[0] is the first byte of the line; > 128 means non-ASCII
  if x[0].to_i > 128 then
    puts x.split(' ', 2)
  else
    puts x
  end
}

This code is quick and dirty.
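
A note for readers on Ruby 1.9 or later: there, `x[0]` returns a one-character String rather than a byte value, and calling `to_i` on a Chinese character gives 0, so the test above never fires. A minimal sketch of the same first-byte idea, assuming a modern Ruby and UTF-8 input (`starts_with_non_ascii?` and `split_name` are hypothetical helper names, not from the thread):

```ruby
# Returns true when the first byte of the line is outside the ASCII range.
def starts_with_non_ascii?(line)
  first = line.getbyte(0)       # nil for an empty string
  !first.nil? && first >= 0x80  # every byte of a non-ASCII UTF-8 character is >= 0x80
end

# Splits at the first space only when the line begins with a non-ASCII character.
def split_name(line)
  line = line.chomp
  starts_with_non_ascii?(line) ? line.split(' ', 2) : [line]
end

split_name("摩根·弗里曼 Morgan Freeman")  # => ["摩根·弗里曼", "Morgan Freeman"]
split_name("Frank Darabont")             # => ["Frank Darabont"]
```

Like the original, this is still quick and dirty: any non-ASCII first character, not only a Chinese one, triggers the split.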

Mariusz Pekala

5/7/2007 10:04:00 AM

On 2007-05-07 16:39:12 +0900 (Mon, May), Nanyang Zhan wrote:
> Don't get me wrong, because I just want to know how to separate English
> words from a string with ruby.
> There are strings (UTF-8 encoded) to record people's name,
> like:
>
> 摩根·弗里曼 Morgan Freeman
> 布鲁斯·威利斯 Bruce Willis
> 李小明 Lee xiao ming
> these strings containing Chinese name(without space between characters),
> separated by a space, following an English name
>
> or
> Frank Darabont
> Just an English name.
>
> Would you give me an idea how to separate these Chinese characters(if
> any)?

Maybe a regexp similar to
/^([^qazwsxedcrfvtgbyhnujmikolpQAZWSXEDCRFVTGBYHNUJMIKOLP ]+)/
would help?

Does [a-zA-Z] include Chinese characters? In a Polish locale it includes
Polish non-ASCII characters, so I guess it might include Chinese ones.

I guess you want to split a given string into words (separated by spaces)
and then check whether the first word starts with, or includes, at least one
Chinese character.

--
No virus found in this outgoing message.
Checked by 'grep -i virus $MESSAGE'
Trust me.
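
The "split into words, then check the first word" idea can be sketched as follows. This assumes Ruby 1.9 or later, where the regexp engine supports the `\p{Han}` script property; `split_chinese_name` is a hypothetical helper name. In Ruby, `[a-zA-Z]` matches only the 52 ASCII letters, so it does not include Chinese characters.

```ruby
# Returns [chinese_part, rest]; chinese_part is nil when the first word
# contains no Han (Chinese) character.
def split_chinese_name(line)
  text = line.strip
  first, rest = text.split(' ', 2)
  return [nil, text] if first.nil? || first !~ /\p{Han}/
  [first, rest.to_s]
end

split_chinese_name("李小明 Lee xiao ming")  # => ["李小明", "Lee xiao ming"]
split_chinese_name("Frank Darabont")       # => [nil, "Frank Darabont"]
```

Because the check looks for a Han character rather than for "anything that is not a-z", a first word such as an accented Latin name is left alone.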

Nanyang Zhan

5/7/2007 10:17:00 AM

Akbar Home wrote:
> On May 7, 4:12 pm, akbarhome <akbarh...@gmail.com> wrote:
>> > 布鲁斯·威利斯 Bruce Willis
>>
>> 李小明
>> Lee xiao ming
>
> Sorry. Fixed version:
> a.each {|x|
> if x[0].to_i > 128 then
> puts x.split(' ', 2)
> else
> puts x
> end
> }
>
> This code is quick and dirty.
Thanks.
But I was wrong. The strings contain more than just Chinese and English
characters. Now I see characters like Ô, é, á... If x starts with one of
these, x[0] > 128 just as it is for Chinese, but I only want to separate
the Chinese.

So do you know exactly what range of values Chinese characters will
return? Or can you tell me where I can find this kind of information?

--
Posted via http://www.ruby-....
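
As a rough answer to the range question: the basic CJK Unified Ideographs block of Unicode runs from U+4E00 to U+9FFF, while Ô, é and á sit at U+00D4, U+00E9 and U+00E1. A minimal sketch of a range test, assuming Ruby 1.9 or later and UTF-8 strings (`cjk_ideograph?` is a hypothetical helper name, and the extension blocks such as U+3400..U+4DBF are deliberately left out):

```ruby
# True when the character's code point lies in the basic
# CJK Unified Ideographs block (U+4E00..U+9FFF).
def cjk_ideograph?(char)
  (0x4E00..0x9FFF).cover?(char.ord)
end

cjk_ideograph?("摩")  # => true  (U+6469)
cjk_ideograph?("Ô")   # => false (U+00D4)
cjk_ideograph?("é")   # => false (U+00E9)
cjk_ideograph?("M")   # => false (U+004D)
```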

Harry Kakueki

5/7/2007 10:20:00 AM

On 5/7/07, Nanyang Zhan <sxain@hotmail.com> wrote:
> Don't get me wrong, because I just want to know how to separate English
> words from a string with ruby.
> There are strings (UTF-8 encoded) to record people's name,
> like:
>
> 摩根·弗里曼 Morgan Freeman
> 布鲁斯·威利斯 Bruce Willis
> 李小明 Lee xiao ming
> these strings containing Chinese name(without space between characters),
> separated by a space, following an English name
>
> or
> Frank Darabont
> Just an English name.
>
> Would you give me an idea how to separate these Chinese characters(if
> any)?
>
> --
> Posted via http://www.ruby-forum.com/.&...

Try something like this.

t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
p t[0].join
p t[1].join

Harry

--
http://www.kakueki.com/ruby/...
Look into Japanese Ruby List in English

akbarhome

5/7/2007 11:30:00 AM

On May 7, 5:17 pm, Nanyang Zhan <s...@hotmail.com> wrote:
> Akbar Home wrote:
> > On May 7, 4:12 pm, akbarhome <akbarh...@gmail.com> wrote:
> >> > 布鲁斯·威利斯 Bruce Willis
>
> >> 李小明
> >> Lee xiao ming
>
> > Sorry. Fixed version:
> > a.each {|x|
> > if x[0].to_i > 128 then
> > puts x.split(' ', 2)
> > else
> > puts x
> > end
> > }
>
> > This code is quick and dirty.
>
> Thanks.
> But I was wrong. There are more Characters than Chinese and English that
> compose the strings. Now I see characters like Ô, é, á... if x is one of
> these, x[0]> 128 as Chinese does, but I only want to separate Chinese.
>
> so do you know what exactly range of the value Chinese Characters will
> return? or you can tell me where I can find this kind of information.
>
> --
> Posted via http://www.ruby-....

These:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
http://www.khngai.com/chinese/charmap/...

should get you done.

ustr
=> +"摩根·弗里曼"
irb(main):027:0> ustr[0]
=> U+6469 <CJK Ideograph>
irb(main):028:0> format "%X", ustr[0].to_i.to_s
=> "6469"
irb(main):029:0>

Nanyang Zhan

5/7/2007 12:23:00 PM

Harry Kakueki wrote:
> On 5/7/07, Nanyang Zhan <sxain@hotmail.com> wrote:
>>
>>
> Try something like this.
>
> t = str.split(//).partition {|x| x=~/[a-z]|[A-Z]/ }
> p t[0].join
> p t[1].join
>
> Harry
Thanks, KaKuEKi, but:
(The code below was tested in the Ruby on Rails console.)
>> str1 = "中文 English Words"
=> "中文 English Words"
>> str2 = "Ôkami: chi"
=> "Ôkami: chi"
>> t = str2.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["k", "a", "m", "i", "c", "h", "i"], ["Ô", ":", " "]]
>> p t[0].join
"kamichi" ########## I want all non-Chinese characters to remain.
=> nil
>> t = str1.split(//).partition { |x| x=~/[a-z]|[A-Z]/}
=> [["E", "n", "g", "l", "i", "s", "h", "W", "o", "r", "d", "s"], ["中",
"文", " ", " "]]
>> p t[0].join
"EnglishWords" ####### no space
=> nil
>>

Harry Kakueki wrote:

> Or this
>
> str.split(//).partition {|x| x.length == 1 }
>
> Harry

This time the spaces are kept:
>> t = str1.split(//).partition {|x| x.length == 1 }
=> [[" ", "E", "n", "g", "l", "i", "s", "h", " ", "W", "o", "r", "d",
"s"], ["中", "文"]]
>> t[0].join
=> " English Words"
>> t = str2.split(//).partition {|x| x.length == 1 }
=> [["k", "a", "m", "i", ":", " ", "c", "h", "i"], ["Ô"]]
>> t[0].join
=> "kami: chi"

I think "Ô" may behave just like the Chinese characters, so it is hard to
separate it out.

--
Posted via http://www.ruby-....
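
Both partition attempts above sort characters by a property that "Ô" shares with either the wrong group (not in a-z) or with Chinese (more than one byte). A sketch that partitions on the Han script property instead, assuming Ruby 1.9 or later (`separate_han` is a hypothetical helper name):

```ruby
# Returns [han_characters, everything_else], each joined back into a String.
def separate_han(str)
  han, other = str.chars.partition { |ch| ch =~ /\p{Han}/ }
  [han.join, other.join]
end

separate_han("中文 English Words")  # => ["中文", " English Words"]
separate_han("Ôkami: chi")         # => ["", "Ôkami: chi"]
```

Spaces, punctuation and accented letters all stay in the second string; a trailing `strip` would remove the leading space left behind by the Chinese part.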

John Joyce

5/7/2007 12:32:00 PM

On May 7, 2007, at 8:35 PM, akbarhome wrote:

> On May 7, 5:17 pm, Nanyang Zhan <s...@hotmail.com> wrote:
>> Akbar Home wrote:
>>> On May 7, 4:12 pm, akbarhome <akbarh...@gmail.com> wrote:
>>>>> 布鲁斯·威利斯 Bruce Willis
>>
>>>> 李小明
>>>> Lee xiao ming
>>
>>> Sorry. Fixed version:
>>> a.each {|x|
>>> if x[0].to_i > 128 then
>>> puts x.split(' ', 2)
>>> else
>>> puts x
>>> end
>>> }
>>
>>> This code is quick and dirty.
>>
>> Thanks.
>> But I was wrong. There are more Characters than Chinese and
>> English that
>> compose the strings. Now I see characters like Ô, é, á.. if x
>> is one of
>> these, x[0]> 128 as Chinese does, but I only want to separate
>> Chinese.
>>
>> so do you know what exactly range of the value Chinese Characters
>> will
>> return? or you can tell me where I can find this kind of information.
>>
>> --
>> Posted via http://www.ruby-....
>
> These:
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
> http://www.khngai.com/chinese/charmap/...
>
> should get you done.
>
> ustr
> => +"摩根·弗里曼"
> irb(main):027:0> ustr[0]
> => U+6469 <CJK Ideograph>
> irb(main):028:0> format "%X", ustr[0].to_i.to_s
> => "6469"
> irb(main):029:0>
>
>
You could identify the encoding, or just convert the string to Unicode,
and then check whether each character falls into a given Unicode range;
that will identify them.
One shortcut is checking for leading zeros in the Unicode character's
code.
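
One possible reading of the "leading zeros" shortcut, offered as an assumption rather than as what was meant: when code points are written with four hex digits, ASCII and Latin-1 characters start with "00", and Chinese characters never do. A small sketch for Ruby 1.9 or later (`code_point_label` and `latin1_or_ascii?` are hypothetical helper names):

```ruby
# Formats a character's code point with at least four hex digits.
def code_point_label(char)
  format("U+%04X", char.ord)
end

# Under this reading, a label starting with "U+00" rules out Chinese.
def latin1_or_ascii?(char)
  char.ord < 0x100
end

code_point_label("摩")  # => "U+6469"
code_point_label("Ô")   # => "U+00D4"
code_point_label("E")   # => "U+0045"
```

The converse does not hold: many characters at U+0100 and above (Greek, Cyrillic, and so on) are not Chinese either, so a range check is still needed for a positive identification.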

Nanyang Zhan

5/7/2007 12:34:00 PM

Akbar Home wrote:

> These:
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
> http://www.khngai.com/chinese/charmap/...
>
> should get you done.
>> str1 = "中文 English Words"
=> "中文 English Words"
>> str1[0]
=> 228
>> str2 = "Ôkami: chi"
=> "Ôkami: chi"
>> str2[0]
=> 195
>> str3 = "English Words"
=> "English Words"
>> str3[0]
=> 69

If only I knew at which numbers the Chinese characters start and end...

--
Posted via http://www.ruby-....
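
The numbers 228, 195 and 69 above are the first bytes of the UTF-8 encoding, not character codes. Characters in U+4E00..U+9FFF encode as three bytes whose first byte is between 0xE4 and 0xE9 (228 to 233), while "Ô" encodes as two bytes starting with 0xC3 (195). A sketch showing both views, assuming Ruby 2.0 or later (`first_char_info` is a hypothetical helper name):

```ruby
# Reports the first byte of the string, all bytes of its first character,
# and that character's Unicode code point.
def first_char_info(str)
  ch = str[0]
  { first_byte: str.getbyte(0), bytes: ch.bytes, code_point: ch.ord }
end

first_char_info("中文 English Words")
# first_byte 228, bytes [228, 184, 173], code_point 20013 (U+4E2D)
first_char_info("Ôkami: chi")
# first_byte 195, bytes [195, 148], code_point 212 (U+00D4)
first_char_info("English Words")
# first_byte 69, bytes [69], code_point 69 (U+0045)
```

A first-byte test is only approximate, since 0xE4 also begins the encodings of U+4000..U+4DFF, which lie outside the basic block; comparing code points is the more reliable check.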

Nanyang Zhan

5/7/2007 12:44:00 PM

John Joyce wrote:
> On May 7, 2007, at 8:35 PM, akbarhome wrote:
>
>>>> if x[0].to_i > 128 then
>>> English that
>>> Posted via http://www.ruby-....
>> => U+6469 <CJK Ideograph>
>> irb(main):028:0> format "%X", ustr[0].to_i.to_s
>> => "6469"
>> irb(main):029:0>
>>
>>
> You could identify the encoding or just make it unicode, then check
> if the characters fall into a range in unicode, that will identify them.
> One shortcut is checking for leading zeros in the unicode character's
> code.

John Joyce, thank you for your explanation.
Now I get akbarhome's idea. So I need to download the unicode lib here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-t...
Then convert the strings into Unicode, and then compare the characters
with the CJK Unicode table from here:
http://www.khngai.com/chinese/charmap/tbluni....
Yes, it must work!

But look at this:
>> str1 = "中文 English Words"
=> "中文 English Words"
>> str1[0]
=> 228
>> str2 = "Ôkami: chi"
=> "Ôkami: chi"
>> str2[0]
=> 195
>> str3 = "English Words"
=> "English Words"
>> str3[0]
=> 69

Maybe there are numbers that are right for Chinese.
If only I knew at which numbers the Chinese characters start and end,
there would be a much simpler solution.

--
Posted via http://www.ruby-....
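
The plan described above (convert to Unicode code points, then compare against the CJK table) may not need an extra library: core Ruby's `String#unpack` with the `U` directive decodes a UTF-8 string into code points, and `Array#pack` reverses it. A minimal sketch under that assumption, using only the basic block U+4E00..U+9FFF plus the middle dot found in names like 摩根·弗里曼 (`split_leading_chinese` is a hypothetical helper name):

```ruby
# Splits a string into its leading run of Chinese characters and the rest.
def split_leading_chinese(str)
  code_points = str.unpack("U*")  # UTF-8 string -> array of code points
  count = code_points.take_while do |cp|
    (0x4E00..0x9FFF).cover?(cp) || cp == 0x00B7  # ideograph or middle dot
  end.size
  chinese = code_points.first(count).pack("U*")
  rest = code_points.drop(count).pack("U*").strip
  [chinese, rest]
end

split_leading_chinese("摩根·弗里曼 Morgan Freeman")  # => ["摩根·弗里曼", "Morgan Freeman"]
split_leading_chinese("Frank Darabont")             # => ["", "Frank Darabont"]
```

Treating the middle dot as part of the name is a choice made for this data; other separators would need to be added to the condition.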

comp.lang.ruby
