comp.lang.python

Parsing for email addresses

galileo228

2/15/2010 11:35:00 PM

Hey all,

I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if galileo228@gmail.com was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

Matt
6 Answers

Jonathan Gardner

2/15/2010 11:50:00 PM

0

On Feb 15, 3:34 pm, galileo228 <mattbar...@gmail.com> wrote:
>
> I'm trying to write python code that will open a textfile and find the
> email addresses inside it. I then want the code to take just the
> characters to the left of the "@" symbol, and place them in a list.
> (So if galileo...@gmail.com was in the file, 'galileo228' would be
> added to the list.)
>
> Any suggestions would be much appreciated!
>

You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.
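
A minimal sketch of that suggestion, written for Python 3 (the thread's own snippets are Python 2). The pattern `EMAIL_RE`, the helper `local_parts`, and the sample text are assumptions made up for illustration, and the pattern is deliberately loose rather than RFC-compliant:

```python
import re

# Loose, illustrative pattern (an assumption, not from the thread):
# some word-like characters, an '@', then at least two dot-separated labels.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def local_parts(text):
    """Return the part before '@' for every address-like match in text."""
    return [addr.split('@')[0] for addr in EMAIL_RE.findall(text)]

sample = "Contact galileo228@gmail.com or jim@sub.example.com for details."
print(local_parts(sample))
```

Because the pattern has no capturing groups, `findall` returns each whole match, which is then split on '@'.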

Tim Chase

2/16/2010 12:35:00 AM

0

Jonathan Gardner wrote:
> On Feb 15, 3:34 pm, galileo228 <mattbar...@gmail.com> wrote:
>> I'm trying to write python code that will open a textfile and find the
>> email addresses inside it. I then want the code to take just the
>> characters to the left of the "@" symbol, and place them in a list.
>> (So if galileo...@gmail.com was in the file, 'galileo228' would be
>> added to the list.)
>>
>> Any suggestions would be much appreciated!
>>
>
> You may want to use regexes for this. For every match, split on '@'
> and take the first bit.
>
> Note that the actual specification for email addresses is far more
> than a single regex can handle. However, for almost every single case
> out there nowadays, a regex will get what you need.

You can even capture that part while the regex is matching. As
Jonathan mentions, finding RFC-compliant email addresses can be a
hairy/intractable problem. But you can get a pretty close
approximation:

   import re

   r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
   #                                        ^
   # if you want to allow local domains like
   # user@localhost
   # then change the "+" marked with the "^"
   # to a "*" and the "{2,5}" to "+" to unlimit
   # the TLD. This will change the outcome
   # of the last test "jim@com" to True

   for test, expected in (
       ('jim@example.com', True),
       ('jim@sub.example.com', True),
       ('@example.com', False),
       ('@sub.example.com', False),
       ('@com', False),
       ('jim@com', False),
       ):
     m = r.match(test)
     if bool(m) ^ expected:
       print "Failed: %r should be %s" % (test, expected)

   emails = set()
   for line in file('test.txt'):
     for match in r.finditer(line):
       emails.add(match.group(1))
   print "All the emails:",
   print ', '.join(emails)
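
The variant described in the comment above can be checked directly. This is a sketch in Python 3 syntax; the names `strict` and `relaxed` are made up for the illustration, and the second pattern is simply the posted one with the two changes the comment describes:

```python
import re

# Pattern as posted: needs at least one dot-separated label before the TLD.
strict = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
# Relaxed per the comment: "+" -> "*" after the label group, "{2,5}" -> "+".
relaxed = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)*(?:\w+)', re.I)

# Print whether each pattern matches at the start of each test address.
for address in ('jim@example.com', 'jim@com', 'user@localhost', '@com'):
    print(address, bool(strict.match(address)), bool(relaxed.match(address)))
```

With the relaxed pattern, 'jim@com' and 'user@localhost' match, while an address with an empty local part such as '@com' is still rejected.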

-tkc






Ben Finney

2/16/2010 1:01:00 AM

0

galileo228 <mattbarkan@gmail.com> writes:

> I'm trying to write python code that will open a textfile and find the
> email addresses inside it. I then want the code to take just the
> characters to the left of the "@" symbol, and place them in a list.

Email addresses can have more than one ‘@’ character. In fact, the
quoting rules allow the local-part to contain *any ASCII character* and
remain valid.

> Any suggestions would be much appreciated!

For a brief but thorough treatment of parsing email addresses, see RFC
3696, “Application Techniques for Checking and Transformation of Names”
<URL:http://www.ietf.org/rfc/rfc36..., specifically section 3.
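
To make the first point concrete, here is a small Python 3 sketch. The sample address is invented for the illustration and uses a quoted local-part that itself contains an "@". Splitting on the first "@" cuts the local-part short, while splitting on the last one keeps it whole. This is only a string-level illustration, not an RFC-compliant parser:

```python
# Hypothetical address with a quoted local-part containing '@'.
address = '"john@home"@example.com'

first_split = address.split('@')[0]      # stops at the first '@'
last_split = address.rsplit('@', 1)[0]   # splits only at the last '@'

print(first_split)
print(last_split)
```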

--
 \          “What I have to do is see, at any rate, that I do not lend |
  `\      myself to the wrong which I condemn.” —Henry Thoreau, _Civil |
_o__)                                                    Disobedience_ |
Ben Finney

galileo228

2/16/2010 6:58:00 PM

0

Hey all, thanks as always for the quick responses.

I actually found a very simple way to do what I needed to do. In
short, I needed to take an email which had a large number of addresses
in the 'to' field, and place just the identifiers (everything to the
left of @domain.com), in a python list.

I simply highlighted all the addresses and placed them in a text file
called emails.txt. Then I had the following code which placed each
line in the file into the list 'names':

[code]
fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()
[/code]

Now, the 'names' list has values looking like this: ['aaa12@domain.com\n',
'bbb34@domain.com\n', etc]. So I ran the following code:

[code]
st_list = []
for x in names:
    st_list.append(x.replace('@domain.com\n',''))
[/code]

And that did the trick! 'st_list' now has ['aaa12', 'bbb34', etc].

Obviously this only worked because all of the domain names were the
same. If they were not, then based on your comments and my own
research, I would've had to use regex and split(), which looked
massively complicated to learn.

Thanks all.

Matt


Tim Chase

2/16/2010 8:16:00 PM

0

galileo228 wrote:
> [code]
> fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
> names = fileHandle.readlines()
> [/code]
>
> Now, the 'names' list has values looking like this: ['aaa12@domain.com\n',
> 'bbb34@domain.com\n', etc]. So I ran the following code:
>
> [code]
> st_list = []
> for x in names:
>     st_list.append(x.replace('@domain.com\n',''))
> [/code]
>
> And that did the trick! 'st_list' now has ['aaa12', 'bbb34', etc].
>
> Obviously this only worked because all of the domain names were the
> same. If they were not then based on your comments and my own
> research, I would've had to use regex and the split(), which looked
> massively complicated to learn.

The complexities stemmed from several factors that, with more
details, could have made the solutions less daunting:

(a) you mentioned "finding" the email addresses -- this makes
it sound like there's other junk in the file that has to be
sifted through to find "things that look like an email address".
If the sole content of the file is lines containing only email
addresses, then "find the email address" is a bit like [1]

(b) you omitted the detail that the domains are all the same.
Even if they're not the same, (a) reduces the problem to a much
easier task:

   s = set()
   for line in file('results.txt'):
     s.add(line.rsplit('@', 1)[0].lower())
   print s
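
For readers on Python 3, where `file()` and the `print` statement are gone, roughly the same idea might look like the sketch below. The helper name `local_parts_from_lines` and the in-memory sample lines are assumptions so that the example runs without a file on disk:

```python
def local_parts_from_lines(lines):
    """Collect the lower-cased text before the last '@' on each line."""
    parts = set()
    for line in lines:
        line = line.strip()
        if '@' not in line:
            continue  # skip blank lines and lines without an address
        parts.add(line.rsplit('@', 1)[0].lower())
    return parts

# Invented sample lines with two different domains and one blank line.
sample_lines = ['aaa12@domain.com\n', 'BBB34@other.org\n', '\n']
print(sorted(local_parts_from_lines(sample_lines)))
```

With a real file, the lines could come from `with open('results.txt') as f:` and `local_parts_from_lines(f)`. Skipping lines without an "@" is a choice made for this sketch; the snippet above would add such lines to the set unchanged.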

If it was previously a CSV or tab-delimited file, Python offers
batteries-included processing to make it easy:

   import csv
   f = file('results.txt', 'rb')
   r = csv.DictReader(f)  # CSV
   # r = csv.DictReader(f, delimiter='\t') # tab delim
   s = set()
   for row in r:
     s.add(row['Email'].lower())
   f.close()

or even

   f = file(...)
   r = csv.DictReader(...)
   s = set(row['Email'].lower() for row in r)
   f.close()
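
In Python 3 the `csv` module expects a text-mode file opened with `newline=''` rather than `'rb'`. The sketch below uses an in-memory `io.StringIO` with made-up rows so that it runs as-is; the `Email` column name follows the snippet above, and the data are invented for the illustration:

```python
import csv
import io

# Invented sample data standing in for a real CSV file with an "Email" column.
sample = io.StringIO("Name,Email\nMatt,AAA12@domain.com\nTim,bbb34@other.org\n")

reader = csv.DictReader(sample)
# Collect each address, lower-cased, into a set to drop duplicates.
emails = {row['Email'].lower() for row in reader}
print(sorted(emails))
```

For a file on disk, `with open('results.txt', newline='') as f:` would take the place of the `StringIO` object and also close the file automatically.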

Hope this gives you more ideas to work with.

-tkc

[1]
http://jacksmix.files.wordpress.com/2007/05...



galileo228

2/17/2010 12:08:00 AM

0

Tim -

Thanks for this. I actually did intend to have to sift through other
junk in the file, but then figured I could just cut and paste emails
directly from the 'to' field, thus making life easier.

Also, in this particular instance, the domain names were the same, and
thus I was able to figure out my solution, but I do need to know how
to handle the same situation when the domain names are different, so
your response was most helpful.

Apologies for leaving out some details.

Matt
