
comp.lang.ruby

Parsing Japanese Language and Some Ruby Trivia

Michael Sullivan

1/11/2006 3:10:00 PM

All this talk about Unicode support and HTML parsing got me to
wondering about how to parse Japanese text. There are no spaces to
separate words, and though there are some modifiers, or particles,
in the Japanese language, they are sometimes used inconsistently. I
could quote examples, but if you can't read Kanji, Hiragana, and
Katakana they would most likely be meaningless.

So, knowing what little I do of Japanese (been studying for a while
and living in Japan for close to four years), I was wondering how
search engines like Google and Yahoo parse Japanese text, much less
web pages. There are numerous filters to extract text from web
pages, but parsing Japanese text is another matter altogether.

So, I have found one Open Source project which seems to be addressing
this, but I was wondering if there is a solution for Ruby?

Now for the trivia... I've been reading some Japanese text:
"Hiragana Times", a magazine which prints its articles in Japanese
and English as a learning tool; my newspaper, "The Japan Times",
which has a weekly section devoted to bilingual education; and my
class textbooks. I've also read some Manga. They generally present
the Kanji with tiny Hiragana characters above them, which are the
phonetic equivalent of the Kanji.

Guess what these tiny Hiragana helpers are called... you guessed it:
"Ruby Annotation". Check out what I found at the W3C:
http://www.w3.or...

Coincidence?

Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"Any sufficiently advanced technology is indistinguishable from
magic..."
- A. C. Clarke


7 Answers

David Vallner

1/11/2006 4:17:00 PM


Michael Sullivan wrote:

> [snip]
>
> Coincidence?

Absolutely. I suspect the term and concept of "Ruby annotation" are a
lot older than the programming language, and AFAIK, the name of the Ruby
programming language is a reference to its roots in Perl. The fact that
the gemstone used as the language's name is the ruby might, but doesn't
have to be, intentionally, subconsciously, or coinkydinkally related to
Ruby annotation. If you want to know for certain, submit Matz to
regression hypnosis and take him back to the time he was deciding on a
name for the language.

David Vallner


John Fry

1/11/2006 4:33:00 PM


Michael Sullivan <unixwzrd@mac.com> writes:

> I was wondering how search engines like Google and Yahoo parse
> Japanese text, much less web pages. There are numerous filters to
> extract text from web pages, but parsing Japanese text is another
> matter altogether.

I'm not sure what you mean by "parsing", but if you mean segmentation
and morphological analysis of Japanese, then two popular packages for
doing this are ChaSen and MeCab.
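
If you just want to see what the segmentation looks like, you can pipe
a sentence through the mecab command-line tool; it prints one token per
line with its comma-separated features and marks the end of each
sentence with an EOS line. A rough Ruby sketch (the sample sentence is
arbitrary, and I'm assuming an EUC-JP dictionary, which was the common
default; adjust the conversions for your install):

require 'iconv'

# A sample sentence, just for illustration: "I am studying Japanese."
sentence = "私は日本語を勉強しています。"

# Older mecab dictionaries expect EUC-JP, so convert on the way in and out.
eucjp = Iconv.iconv("eucjp", "utf-8", sentence).first

IO.popen("mecab", "r+") do |io|
  io.puts eucjp
  io.close_write
  io.each_line do |line|
    line = Iconv.iconv("utf-8", "eucjp", line).first
    break if line.chomp == "EOS"          # mecab ends each sentence with EOS
    surface, features = line.chomp.split("\t", 2)
    puts "#{surface}  =>  #{features}"    # token and its comma-separated features
  end
end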

Best,

John



Gene Tani

1/11/2006 4:41:00 PM



David Vallner wrote:
> [snip]

[going OT] I've noted that Google has gotten much better at separating
hits for Sam Ruby's pages from pages that refer to the Ruby language.
Must be all that Python programming they're doing ;-p

Michael Sullivan

1/12/2006 1:23:00 AM


Hi,

I posted this last night and probably didn't hit the correct
audience. I got one relevant answer and need to go check the
recommended packages. But to those in Asian time zones, I'll ask
again about parsing Japanese text.

And to clarify, I am looking for a way to extract "words" from the
text for cataloging in a database.
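
Roughly, once I have the words, the cataloging side would be something
like this (sqlite3 here is only a stand-in for whatever database I end
up using, and the table layout is just for illustration):

require 'sqlite3'   # stand-in backend, just for illustration

db = SQLite3::Database.new("catalog.db")
db.execute("CREATE TABLE IF NOT EXISTS words (word TEXT, source TEXT)")

# 'words' would come from whatever segmenter ends up doing the extraction
words = ["日本語", "勉強"]   # placeholder values
words.each do |w|
  db.execute("INSERT INTO words (word, source) VALUES (?, ?)", [w, "example.html"])
end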

Cheers,
Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"The two most common elements in the universe are hydrogen... and
stupidity."
- Harlan Ellison


Begin forwarded message:

> From: Michael Sullivan <unixwzrd@mac.com>
> Date: January 12, 2006 12:09:40 AM JST
> To: ruby-talk@ruby-lang.org (ruby-talk ML)
> Subject: Parsing Japanese Language and Some Ruby Trivia
> Reply-To: ruby-talk@ruby-lang.org
>
> [snip]

Mauricio Fernández

1/12/2006 1:52:00 AM


On Thu, Jan 12, 2006 at 10:23:11AM +0900, Michael Sullivan wrote:
> Hi,
>
> I posted this last night and probably didn't hit the correct
> audience. I got one relevant answer and need to go check the
> packages recommended. But to those in Asian Time zones, I'll ask
> again about parsing Japanese text.
>
> And to clarify, I am looking for a way to extract "words" from the
> text for cataloging in a database.

http://raa.ruby-lang.org/cache/ru...
-> It looks old and the source code is not very enticing, though.

This is probably better:
http://chasen.org/~taku/software/mecab/bin...


I wrote something similar to what you want long ago; for some reason I
ended up parsing the output of mecab instead of using the bindings (I
can't remember why at the moment, nor whether it still applies).

The following (old, ugly) code collects some words (nouns, verbs, and
"i adjectives") from a UTF-8 string held in 'text':

require 'iconv'
require 'tempfile'

# mecab (like ChaSen) wanted EUC-JP input here, so convert on the way in.
text = Iconv.iconv("eucjp", "utf-8", text).first
temp = Tempfile.new("jphints")
temp.puts text
temp.close

# Run the mecab command-line tool over the temp file and convert the
# output back to UTF-8.
analysis = `mecab #{temp.path}`
output = Iconv.iconv("utf-8", "eucjp", analysis).first
temp.close!

hints = []
output.each_line do |line|
  break if /\AEOS\s$/u.match(line)  # mecab ends each sentence with an EOS line
  # Each line looks like "surface<TAB>POS,...,canonical-form,..."; pull out
  # the surface form, the part of speech and the canonical form.
  md = /\A(.+)\s+([^,]+),[^,]+,[^,]+,[^,]+,[^,]+,[^,]+,([^,]+),/u.match(line) # UGH
  next unless md  # skip lines that don't have the expected number of fields
  hint, nature, canonical = md.captures
  case nature
  when %w[e5 90 8d e8 a9 9e].map{|x| x.to_i(16)}.pack("c*"),      # noun
       %w[e5 8b 95 e8 a9 9e].map{|x| x.to_i(16)}.pack("c*"),      # verb
       %w[e5 bd a2 e5 ae b9 e8 a9 9e].map{|x| x.to_i(16)}.pack("c*") # i-adjective
    puts "REG HINT #{hint} -> #{canonical}\t #{nature}"
    hints << canonical
  else
    # puts "IGNORED #{hint}"
  end
end

# now the words are in hints, as UTF-8 strings
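
If you then want a quick frequency count of what was collected (plain
Ruby, nothing mecab-specific):

freq = Hash.new(0)
hints.each { |w| freq[w] += 1 }
freq.sort_by { |word, count| -count }.each { |word, count| puts "#{count}\t#{word}" }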



Hope this helps.

--
Mauricio Fernandez


Michael Sullivan

1/12/2006 6:48:00 AM


Thanks, this looks like what I'm looking for.

Cheers,

Mike

--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055

"Haggis... uh, I was briefed on haggis.... No!"
G W Bush (dubya) - Japan Times, 12 July 2005


On Jan 12, 2006, at 10:51 , Mauricio Fernandez wrote:

> [snip]



Horacio Sanson

1/12/2006 7:35:00 AM


To delimit words in Japanese text you can use MeCab and/or Kakasi (google
them).

Kakasi has ruby bindings
http://raa.ruby-lang.org/project/ru...

MeCab also has bindings for several scripting languages (Ruby included):
http://chasen.org/~taku/software/mecab/bin...
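
With the MeCab bindings everything stays in Ruby. If I remember the SWIG
interface correctly it looks roughly like this (treat the class and
method names as my recollection and check them against the documentation
at the link above; the input also has to be in the same encoding as your
dictionary):

require 'MeCab'   # the SWIG-generated binding

tagger = MeCab::Tagger.new("")

# Tagger#parse returns the same token-per-line text the command-line tool prints.
puts tagger.parse("私は日本語を勉強しています。")   # sample sentence only

# Or walk the parsed nodes directly to pick out surface forms and features.
node = tagger.parseToNode("私は日本語を勉強しています。")
while node
  # BOS/EOS pseudo-nodes have an empty surface form; skip them
  puts "#{node.surface}\t#{node.feature}" unless node.surface.empty?
  node = node.next
end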

If you want database full-text search support for Japanese you can use
Tsearch2 with Teramoto's SQL function, which uses Kakasi to index
Japanese words.

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsea...

Finally, MSSQL's full-text search works with Japanese (with the Japanese
version of MSSQL, of course).


Hope this helps....

Horacio

On Thursday 12 January 2006 15:48, Michael Sullivan wrote:
> [snip]