Asp Forum - help needed with regex and unicode

Pradnyesh Sawant

3/4/2008 5:20:00 AM

Hi all,
I have a file which contains Chinese characters. I just want to find
all the places where these Chinese characters occur.

The following script doesn't seem to work :(

**********************************************************************
class RemCh(object):
    def __init__(self, fName):
        self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
        fp = open(fName, 'r')
        content = fp.read()
        s = re.search('[\u2F00-\u2fdf]', content, re.U)
        if s:
            print s.group(0)
if __name__ == '__main__':
    rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
**********************************************************************

the php file content is something like the following:

**********************************************************************
// Check if the folder still has subscribed blogs
$subCount = function1($param1, $param2);
if ($subCount > 0) {
$errors['summary'] = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
$errorMessage = 'Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¥Ã¯Â«Ã¥Ã©Ã©Â§Ã§Â²Ã¨';
}

if (empty($errors)) {
$ret = function2($blog_res, $yuid, $fid);
if ($ret >= 0) {
$saveFalg = TRUE;
} else {
error_log("ERROR:: ret: $ret, function1($param1, $param2)");
$errors['summary'] = "Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¨Â±Ã¥Ã£
$errorMessage = "Ã¦ÂÃ¯Â½Â Ã¦Â½Ã¥Â¤æ¤Ã¨Â±Ã¥Ã£
}
}
**********************************************************************

--
warm regards,
Pradnyesh Sawant
--
Luck is the residue of good design. --Anon

2 Answers

Marc 'BlackJack' Rintsch

3/4/2008 6:52:00 AM

On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:

> I have a file which contains chinese characters. I just want to find out
> all the places that these chinese characters occur.
>
> The following script doesn't seem to work :(
>
> **********************************************************************
> class RemCh(object):
> def __init__(self, fName):
> self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
> fp = open(fName, 'r')
> content = fp.read()
> s = re.search('[\u2F00-\u2fdf]', content, re.U)
> if s:
> print s.group(0)
> if __name__ == '__main__':
> rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
> **********************************************************************
>
> the php file content is something like the following:
>
> **********************************************************************
> // Check if the folder still has subscribed blogs
> $subCount = function1($param1, $param2);
> if ($subCount > 0) {
> $errors['summary'] = 'Ã?Â¦Ã?ÂÃ?Â¯Ã?Â½Ã?Â Ã?Â¦Ã?Â½Ã?Â¥Ã?Â¤Ã¦ÂÂ¤Ã?Â¥Ã?Â¯Ã?Â«Ã?Â¥Ã?Â©Ã?Â©Ã?Â§Ã?Â§Ã?Â²Ã?Â¨';
> $errorMessage = 'Ã?Â¦Ã?ÂÃ?Â¯Ã?Â½Ã?Â Ã?Â¦Ã?Â½Ã?Â¥Ã?Â¤Ã¦ÂÂ¤Ã?Â¥Ã?Â¯Ã?Â«Ã?Â¥Ã?Â©Ã?Â©Ã?Â§Ã?Â§Ã?Â²Ã?Â¨';
> }

Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
decode `content` to unicode before searching for the Chinese characters.
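
To illustrate the diagnosis, here is a minimal sketch of how UTF-8 bytes turn into that kind of garbage when each byte is read as an ISO-8859-1 character, and how decoding them as UTF-8 recovers the text. The sample string is an assumption chosen for illustration, not the contents of the PHP file, and the code is written so that it also runs on Python 3.

```python
# Sample text chosen for illustration only (not taken from the PHP file).
text = u'\u4e2d\u6587'          # two Chinese characters

raw = text.encode('utf-8')      # what is actually stored on disk: 6 bytes

# Interpreting each byte as one ISO-8859-1 character gives mojibake:
wrong = raw.decode('iso-8859-1')

# Decoding with the encoding the file was written in recovers the text:
right = raw.decode('utf-8')

print(len(raw), len(wrong), len(right))
print(right == text)
```

Each Chinese character here takes three bytes in UTF-8, which is why the mis-decoded version shows several accented Latin characters per original character.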

Ciao,
Marc 'BlackJack' Rintsch

Mark Tolonen

3/4/2008 7:44:00 AM

"Marc 'BlackJack' Rintsch" <bj_666@gmx.net> wrote in message
news:6349rmF23qmbmU1@mid.uni-berlin.de...
> Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
> decode `content` to unicode before searching for the Chinese characters.
>

I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. When you read an encoded text file, it
comes in as just a bunch of bytes:

>>> print open('chinese.txt','r').read()
ï»¿æˆ‘æ˜¯ç¾Žå›½äººã€‚ WÇ’ shÃ¬ MÄ›iguÃ³rÃ©n. I am an American.

Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:

>>> print open('chinese.txt','r').read().decode('utf8')
我是美国人。 Wǒ shì Měiguórén. I am an American.

Here's the Unicode string. Note the 'u' before the quotes to indicate
Unicode.

>>> s=open('chinese.txt','r').read().decode('utf8')
>>> s
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
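
The `\ufeff` at the start of that string is the byte order mark that some editors write at the beginning of UTF-8 files. As a hedged sketch, assuming Python 2.6 or later (where `io.open` is available), the `utf-8-sig` codec decodes while reading and drops the mark. The temporary file below is only a stand-in for `chinese.txt`.

```python
import io
import os
import tempfile

# Build a stand-in for chinese.txt: UTF-8 with a leading BOM (EF BB BF).
sample = u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002 I am an American.'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + sample.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as the first character of the text.
with io.open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' strips a leading BOM if one is present.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom[0]))
print(without_bom == sample)
```

Opening the file with an explicit encoding also avoids the separate `.decode('utf8')` call, since `read()` then returns Unicode text directly.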

If working with Unicode strings, the re module should be provided Unicode
strings also:

>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
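
Since the original question asked for all the places where the characters occur, `re.finditer` may be a better fit than `findall`, because each match object carries its offset. This is a minimal sketch under a few assumptions: the helper name `find_cjk` is made up for this example, the input is already decoded Unicode text, and "Chinese" is approximated by the same U+4E00 to U+9FA5 range used above, which lies inside the CJK Unified Ideographs block. The range U+2F00 to U+2FDF in the original script is the Kangxi Radicals block, so it would not match ordinary Chinese text even on decoded input. The pattern is written as a plain `u'...'` literal so that it also runs on Python 3, where the `ur` prefix is not accepted.

```python
import re

# Same range as in the session above; an approximation of "Chinese".
CJK_RUN = re.compile(u'[\u4e00-\u9fa5]+')

def find_cjk(text):
    """Return (line_number, column, run) for each run of CJK characters.

    `text` must already be decoded (unicode in Python 2, str in Python 3).
    Line numbers start at 1, columns at 0.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for m in CJK_RUN.finditer(line):
            hits.append((lineno, m.start(), m.group(0)))
    return hits

# Hypothetical two-line input standing in for a decoded PHP source file.
sample = u"$a = 1;\n$msg = '\u6211\u662f\u7f8e\u56fd\u4eba';\n"
for lineno, col, run in find_cjk(sample):
    print(lineno, col, repr(run))
```

Reporting line and column rather than just the matched text makes it easier to locate each string in the source file before editing or removing it.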

Hope that helps you.

--Mark

comp.lang.python
