comp.lang.python

help needed with regex and unicode

Pradnyesh Sawant

3/4/2008 5:20:00 AM

Hi all,
I have a file that contains Chinese characters. I just want to find
all the places where these characters occur.

The following script doesn't seem to work :(

**********************************************************************
import re

class RemCh(object):
    def __init__(self, fName):
        self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
        fp = open(fName, 'r')
        content = fp.read()
        s = re.search('[\u2F00-\u2fdf]', content, re.U)
        if s:
            print s.group(0)

if __name__ == '__main__':
    rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
**********************************************************************

The PHP file content is something like the following:

**********************************************************************
// Check if the folder still has subscribed blogs
$subCount = function1($param1, $param2);
if ($subCount > 0) {
    $errors['summary'] = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
    $errorMessage = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
}

if (empty($errors)) {
    $ret = function2($blog_res, $yuid, $fid);
    if ($ret >= 0) {
        $saveFalg = TRUE;
    } else {
        error_log("ERROR:: ret: $ret, function1($param1, $param2)");
        $errors['summary'] = "æ­ï½ æ½å¤此è±åã
        $errorMessage = "æ­ï½ æ½å¤此è±åã
    }
}
**********************************************************************

--
warm regards,
Pradnyesh Sawant
--
Luck is the residue of good design. --Anon
2 Answers

Marc 'BlackJack' Rintsch

3/4/2008 6:52:00 AM

On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:

> I have a file which contains chinese characters. I just want to find out
> all the places that these chinese characters occur.
>
> The following script doesn't seem to work :(
>
> **********************************************************************
> class RemCh(object):
>     def __init__(self, fName):
>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>         fp = open(fName, 'r')
>         content = fp.read()
>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>         if s:
>             print s.group(0)
> if __name__ == '__main__':
>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
> **********************************************************************
>
> the php file content is something like the following:
>
> **********************************************************************
> // Check if the folder still has subscribed blogs
> $subCount = function1($param1, $param2);
> if ($subCount > 0) {
> $errors['summary'] = '�¦�­�¯�½� �¦�½�¥�¤æ­¤�¥�¯�«�¥�©�©�§�§�²�¨';
> $errorMessage = '�¦�­�¯�½� �¦�½�¥�¤æ­¤�¥�¯�«�¥�©�©�§�§�²�¨';
> }

Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
decode `content` to Unicode before searching for the Chinese characters.
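
A minimal sketch of that diagnosis, written in Python 3 syntax rather than the Python 2 used in this thread. The sample text is an assumption (it is not the poster's data); the point is only that the same bytes give one character per byte under ISO-8859-1 and the intended characters under UTF-8:

```python
# Python 3 sketch; `text` is a made-up sample, not the poster's file content.
text = "\u6211\u662f\u7f8e\u56fd\u4eba"   # five Chinese characters

raw = text.encode("utf-8")           # bytes as stored on disk (3 per character)
wrong = raw.decode("iso-8859-1")     # one character per byte: mojibake
right = raw.decode("utf-8")          # the intended five characters

# Lengths make the difference visible without printing non-ASCII text.
print(len(raw), len(wrong), len(right))
print(right == text)
```

Because ISO-8859-1 maps every byte to exactly one character, the wrong decoding loses nothing here: `wrong.encode("iso-8859-1").decode("utf-8")` gives the original text back.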

Ciao,
Marc 'BlackJack' Rintsch

Mark Tolonen

3/4/2008 7:44:00 AM

"Marc 'BlackJack' Rintsch" <bj_666@gmx.net> wrote in message
news:6349rmF23qmbmU1@mid.uni-berlin.de...
> Looks like a UTF-8 encoded file viewed as ISO-8859-1. So you should
> decode `content` to Unicode before searching for the Chinese characters.
>

I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example. When you read an encoded text file, it
comes in as just a bunch of bytes:

>>> print open('chinese.txt','r').read()
æË?â??æË?¯ç¾Žåâ?ºÂ½Ã¤ÂºÂºÃ£â?¬â?? WÃ?â?? shÃ?¬ MÃ?â?ºiguÃ?³rÃ?©n. I am an American.

Garbage, because the encoding isn't known. Provide the correct encoding and
decode it to Unicode:

>>> print open('chinese.txt','r').read().decode('utf8')
我是美国人。 Wǒ shì Měiguórén. I am an American.

Here's the Unicode string. Note the 'u' before the quotes, which marks it
as Unicode.

>>> s=open('chinese.txt','r').read().decode('utf8')
>>> s
u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002 W\u01d2 sh\xec M\u011bigu\xf3r\xe9n. I am an American.'
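
The leading \ufeff in that repr is a byte order mark (BOM) that was saved at the start of the file. A small sketch, in Python 3 syntax and with made-up sample bytes, of how the standard 'utf-8-sig' codec drops it while plain 'utf-8' keeps it:

```python
import codecs

# Assumed sample: a UTF-8 BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + "\u6211\u662f\u7f8e\u56fd\u4eba\u3002".encode("utf-8")

with_bom = raw.decode("utf-8")         # keeps U+FEFF as the first character
without_bom = raw.decode("utf-8-sig")  # strips a leading BOM if present

# ascii() keeps the output readable on any terminal encoding.
print(ascii(with_bom[0]))
print(len(with_bom), len(without_bom))
```

For searching Chinese characters the BOM is harmless, since it falls outside the range being matched; it only matters if the first character of the file is significant.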

When working with Unicode strings, give the re module Unicode patterns
as well:

>>> import re
>>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
我
>>> print re.findall(ur'[\u4E00-\u9FA5]',s)
[u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
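
Since the original question asked for all the places where the characters occur, here is a hedged sketch in Python 3 syntax that reports a line number and column for each run. It assumes "Chinese characters" means the U+4E00 to U+9FA5 range used above; the helper name `find_han_runs` and the sample string are made up for illustration. The range in the original script, U+2F00 to U+2FDF, is the Kangxi Radicals block, which ordinary Chinese text rarely uses.

```python
import re

# Assumption: same range as in the reply above. CJK extension blocks and
# CJK punctuation (such as U+3002) are deliberately not covered.
HAN = re.compile(r"[\u4E00-\u9FA5]+")

def find_han_runs(text):
    """Return (line_number, column, run) for each run of matching characters.

    Line numbers start at 1, columns at 0. `text` must already be decoded.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in HAN.finditer(line):
            hits.append((lineno, m.start(), m.group(0)))
    return hits

# Hypothetical sample resembling the PHP file in the question.
sample = "$a = 1;\n$errorMessage = '\u6211\u662f\u7f8e\u56fd\u4eba';\n"
for lineno, col, run in find_han_runs(sample):
    print(lineno, col, ascii(run))
```

To run this on a real file, the text would first have to be read with the right encoding, for example `open(path, encoding="utf-8").read()` if the file really is UTF-8.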

Hope that helps you.

--Mark