comp.lang.python

Replace stop words (remove words from a string)

Berlin Brown

1/17/2008 8:25:00 AM

If I have an array of "stop" words and I want to replace those values
with something else in a string, how would I go about doing this? I
have this code that splits the string and then does a set difference, but
I think there is an easier approach:

E.g.

mystr =
kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;

if I have an array stop_list = [ "[BAD]", "[BAD2]" ]

I want to replace the values in that list with a zero length string.

I had this before, but I don't want to use this approach; I don't want
to use the split.

line_list = line.lower().split()
res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))


6 Answers

Karthik

1/17/2008 8:42:00 AM

How about -

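import string  # Python 2 only; these module-level functions were removed in Python 3
for s in stoplist:
    # replace() returns a new string, so assign the result back
    mystr = string.replace(mystr, s, "")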

Hope this should work.

Gary Herron

1/17/2008 8:45:00 AM

Strings have a replace method that produces a new string with all
occurrences of one substring replaced by another. You'd have to loop
through your stop_list one word at a time.

>>> s = 'abcxyzabc'
>>> s.replace('xyz','')
'abcabc'


If either the string or the stop_list grows particularly large, this
approach won't scale very well, since the whole string is re-created
for each stop_list entry. In that case, I'd look into the regular
expression (re) module. You may be able to finagle a way to find and
replace all stop_list entries in one pass. (Finding them all is easy --
not so sure you could replace them all at once, though.)
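
A minimal sketch of the one-pass idea, assuming each stop word should map to its own replacement text. re.sub accepts a function as the replacement argument and calls it with each match, so a dictionary lookup is enough. The function name replace_stop_words and the sample mapping are illustrative, not from the thread.

```python
import re

def replace_stop_words(text, replacements):
    """Replace every key of `replacements` found in `text` in a single pass."""
    if not replacements:
        return text
    # Longest keys first, so a key that is a prefix of another cannot shadow it.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The callback receives a match object and returns the replacement text.
    return pattern.sub(lambda m: replacements[m.group(0)], text)

replacements = {"[BAD]": "", "[BAD2]": "<removed>"}  # hypothetical mapping
print(replace_stop_words("ab[BAD]cd[BAD2]ef", replacements))  # abcd<removed>ef
```

The string is scanned once no matter how many stop words there are, although the pattern itself grows with the list.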


Gary Herron


Gary Herron

1/17/2008 8:47:00 AM

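Karthik wrote:
> How about -
>
> for s in stoplist:
>     string.replace(mystr, s, "")
>
That will work as long as the result is assigned back to mystr (strings
are immutable, so replace returns a new string), but the string module's
functions are long outdated. Better to use string methods:

for s in stoplist:
    mystr = mystr.replace(s, "")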

Gary Herron

Raymond Hettinger

1/17/2008 8:50:00 AM

On Jan 17, 12:25 am, BerlinBrown <berlin.br...@gmail.com> wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else;
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]
> I want to replace the values in that list with a zero length string.

Regular expressions should do the trick.

Try this:

>>> mystr = 'kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;'
>>> stoplist = ["[BAD]", "[BAD2]"]
>>> import re
>>> stoppattern = '|'.join(map(re.escape, stoplist))
>>> re.sub(stoppattern, '', mystr)
'kljsldkfjksjdfjsdjflkdjslkfKkjkkkkjkkjkLSKJFKSFJKSJF;Lkjsldfsd;'

Raymond
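
One caveat worth checking with a joined pattern like the one above: Python's alternation tries branches left to right and takes the first that matches, so a stop word that is a prefix of a later one can win and leave the tail behind. The sketch below shows the effect and one possible fix (sorting longest first). The helper name build_stop_pattern and the sample words are illustrative assumptions, not from the thread.

```python
import re

def build_stop_pattern(stop_words):
    """Join escaped stop words into one alternation, longest first."""
    ordered = sorted(stop_words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))

stop_words = ["foo", "foobar"]  # hypothetical list with a shared prefix

# List order: "foo" matches first and "bar" is left behind.
naive = re.compile("|".join(map(re.escape, stop_words)))
print(naive.sub("", "xfoobary"))                           # xbary

# Longest first: "foobar" is tried before "foo".
print(build_stop_pattern(stop_words).sub("", "xfoobary"))  # xy
```

The original example is not affected, because "[BAD]" is not a prefix of "[BAD2]".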

Bruno Desthuilliers

1/17/2008 8:59:00 AM

BerlinBrown wrote:
> if I have an array of "stop" words, and I want to replace those values
> with something else; in a string, how would I go about doing this. I
> have this code that splits the string and then does a difference but I
> think there is an easier approach:
>
> E.g.
>
> mystr =
> kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;
>

<ot>you forgot the quotes</ot>

> if I have an array stop_list = [ "[BAD]", "[BAD2]" ]

s/array/list/

> I want to replace the values in that list with a zero length string.
>
> I had this before, but I don't want to use this approach; I don't want
> to use the split.
>
> line_list = line.lower().split()
> res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))

res = mystr
for stop_word in stop_list:
    res = res.replace(stop_word, '')
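
The loop and a single-pass regular expression agree on the example string, but they are not equivalent in general: each replace call sees the result of the previous one, so removing one stop word can create a new occurrence of another. A small sketch of the difference, with hypothetical function names and sample data:

```python
import re

def remove_sequential(text, stop_words):
    """Remove stop words one at a time; each pass sees the previous result."""
    for word in stop_words:
        text = text.replace(word, "")
    return text

def remove_single_pass(text, stop_words):
    """Remove stop words in one left-to-right scan of the original text."""
    return re.sub("|".join(map(re.escape, stop_words)), "", text)

stop_words = ["xy", "ab"]  # hypothetical stop words

# Removing "xy" from "axyb" produces "ab", which the next iteration removes too.
print(repr(remove_sequential("axyb", stop_words)))   # ''

# The single scan never revisits text it has already passed, so "ab" remains.
print(repr(remove_single_pass("axyb", stop_words)))  # 'ab'
```

Which behaviour is wanted depends on the data; for marker-style tokens such as "[BAD]" the two usually coincide.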


Bearophile

1/17/2008 2:37:00 PM

Raymond Hettinger:
> Regular expressions should do the trick.
> >>> stoppattern = '|'.join(map(re.escape, stoplist))
> >>> re.sub(stoppattern, '', mystr)

If the stop words are many (and similar) then that RE can be optimized
with a trie-based strategy, like this one called "List":
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Rege...

"List" is used by something more complex called "Optimizer" that's
overkill for the OP problem:
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Op...

I don't know if a Python module similar to "List" is available; I may
write it :-)
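
A rough sketch of what a trie-based pattern builder might look like in Python. This is not the algorithm of the Perl module linked above, only an illustration of the idea: words that share a prefix share one branch of the pattern, so the regex engine does not re-test the same prefix for every alternative. The function name trie_regex is hypothetical.

```python
import re

def trie_regex(words):
    """Build a regex string matching any of `words`, sharing common prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # the empty key marks the end of a word

    def build(node):
        ends_here = "" in node
        branches = [re.escape(ch) + build(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]
        group = "(?:" + "|".join(branches) + ")"
        # A word may end here, so the rest is optional (greedy: longest match wins).
        return group + "?" if ends_here else group

    return build(trie)

stoplist = ["[BAD]", "[BAD2]"]
mystr = "kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;"
print(re.sub(trie_regex(stoplist), "", mystr))
# kljsldkfjksjdfjsdjflkdjslkfKkjkkkkjkkjkLSKJFKSFJKSJF;Lkjsldfsd;
```

For two stop words this is overkill, as noted above; it only starts to matter when the list is long and the words overlap heavily.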

Bye,
bearophile