Mark Tolonen
3/4/2008 7:44:00 AM
"Marc 'BlackJack' Rintsch" <bj_666@gmx.net> wrote in message
news:6349rmF23qmbmU1@mid.uni-berlin.de...
> On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:
>
>> I have a file which contains chinese characters. I just want to find out
>> all the places that these chinese characters occur.
>>
>> The following script doesn't seem to work :(
>>
>> **********************************************************************
>> class RemCh(object):
>>     def __init__(self, fName):
>>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>>         fp = open(fName, 'r')
>>         content = fp.read()
>>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>>         if s:
>>             print s.group(0)
>> if __name__ == '__main__':
>>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
>> **********************************************************************
>>
>> the php file content is something like the following:
>>
>> **********************************************************************
>> // Check if the folder still has subscribed blogs
>> $subCount = function1($param1, $param2);
>> if ($subCount > 0) {
>> $errors['summary'] = 'Ã?¦Ã?ÂÃ?¯Ã?½Ã? Ã?¦Ã?½Ã?Â¥Ã?¤æ¤Ã?Â¥Ã?¯Ã?«Ã?Â¥Ã?©Ã?©Ã?§Ã?§Ã?²Ã?¨';
>> $errorMessage = 'Ã?¦Ã?ÂÃ?¯Ã?½Ã? Ã?¦Ã?½Ã?Â¥Ã?¤æ¤Ã?Â¥Ã?¯Ã?«Ã?Â¥Ã?©Ã?©Ã?§Ã?§Ã?²Ã?¨';
>> }
>
> Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
> decode `content` to unicode before searching for the chinese characters.
>
I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. When you read an encoded text file this
way, it comes in as just a bunch of bytes:
>>> print open('chinese.txt','r').read()
æË?â??æË?¯ç¾Žåâ?ºÂ½Ã¤ÂºÂºÃ£â?¬â?? WÃ?â?? shÃ?¬ MÃ?â?ºiguÃ?³rÃ?©n. I am an American.
Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:
>>> print open('chinese.txt','r').read().decode('utf8')
我是美国人。 Wǒ shì Měiguórén. I am an American.
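
The difference between the two outputs can be reproduced without a file. The
following is only a sketch, written in Python 3 syntax rather than the
Python 2 used in this session; the sample text is taken from the example
file, and latin-1 is an assumed stand-in for whatever wrong encoding a
viewer or terminal might apply.

```python
# Sample text: the Chinese sentence from the example file, written with escapes.
text = u"\u6211\u662f\u7f8e\u56fd\u4eba\u3002"

# What is stored on disk: UTF-8 encoded bytes (three bytes per character here).
raw = text.encode("utf-8")

# Decoding with the wrong codec never fails for latin-1, it just yields garbage:
# every byte becomes one unrelated character.
wrong = raw.decode("latin-1")

# Decoding with the right codec recovers the original characters.
right = raw.decode("utf-8")

# As long as no bytes were lost, the garbage can be turned back into bytes
# and decoded properly.
repaired = wrong.encode("latin-1").decode("utf-8")

print(len(raw), len(wrong), len(right))
```

The round trip in the last step only works when the garbled text still holds
every original byte, which would explain why data pasted through several
programs may no longer decode.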
Here's the Unicode string itself. Note the 'u' before the quotes to
indicate Unicode. The leading \ufeff is a byte order mark that was at the
start of the file.
>>> s=open('chinese.txt','r').read().decode('utf8')
>>> s
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
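
If the leading \ufeff gets in the way, Python's 'utf-8-sig' codec drops it
while decoding. A small sketch, again in Python 3 syntax, that assumes a
throwaway temporary file (the path and sample text are made up for the
illustration):

```python
import io
import os
import tempfile

sample = u"\u6211\u662f\u7f8e\u56fd\u4eba\u3002 I am an American."

# Hypothetical scratch file standing in for chinese.txt.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "chinese.txt")

# Writing with 'utf-8-sig' puts a byte order mark at the start of the file.
with io.open(path, "w", encoding="utf-8-sig") as f:
    f.write(sample)

# Plain 'utf-8' keeps the mark as the character U+FEFF.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' removes the mark if it is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

# Clean up the scratch file.
os.remove(path)
os.rmdir(tmp_dir)

print(repr(with_bom[0]), without_bom == sample)
```

io.open takes the encoding directly, so no separate decode step is needed.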
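When working with Unicode strings, give the re module a Unicode pattern as
well (note the ur prefix, which lets the \uXXXX escapes be interpreted).
The range U+2F00-U+2FDF in the original script is the Kangxi Radicals
block; ordinary Chinese characters start at U+4E00: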
>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
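
Since the original question was about finding all the places where the
characters occur, re.finditer can report positions as well as matches. This
is a rough sketch in Python 3 syntax; the function name find_chinese and the
sample text are invented for the example, and the range is the same
U+4E00-U+9FA5 used above, which does not cover every CJK block.

```python
import re

# Runs of characters in the range used above (an assumption: other CJK
# blocks and punctuation such as U+3002 are deliberately not included).
CJK_RUN = re.compile(u"[\u4e00-\u9fa5]+")


def find_chinese(text):
    """Return (line number, column, matched run) for each run of Chinese text.

    Line numbers start at 1, columns at 0. `text` must already be decoded.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), 1):
        for match in CJK_RUN.finditer(line):
            hits.append((line_no, match.start(), match.group(0)))
    return hits


# Made-up sample shaped like the PHP snippet in the question.
sample = u"$subCount = 1;\n$errors['summary'] = '\u6211\u662f\u7f8e\u56fd\u4eba';\n"

for line_no, column, run in find_chinese(sample):
    print(line_no, column, run.encode("unicode_escape").decode("ascii"))
```

Working line by line keeps the reported positions easy to look up in an
editor; for whole-file offsets, call CJK_RUN.finditer on the full string
instead.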
Hope that helps you.
--Mark