Asp Forum - Evaluate my first python script, please

Pete Emerson

3/4/2010 6:40:00 PM

I've written my first python program, and would love suggestions for
improvement.

I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"

This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.

Thanks in advance.

########################################################
#!/usr/bin/python

import sys, fileinput, re, os

filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
if match is None or re.search('^(?:float|localhost)\.', line):
continue
hostname = match.group(1)
count = 0
for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
count = count + 1
break
if count == len(sys.argv) - 1:
hosts.append(hostname)

if len(hosts) == 1:
os.system("ssh -A " + hosts[0])
else:
print '\n'.join(hosts)

15 Answers

Tim Wintle

3/4/2010 7:07:00 PM

On Thu, 2010-03-04 at 10:39 -0800, Pete Emerson wrote:
> I am looking for advice along the lines of "an easier way to do this"
> or "a more python way" (I'm sure that's asking for trouble!) or
> "people commonly do this instead" or "here's a slick trick" or "oh,
> interesting, here's my version to do the same thing".

(1) I would wrap it all in a function

def main():
# your code here

if __name__ == "__main__":
main()

(2) PEP8 (python style guidelines) suggests one import per line

(3) I'd use four spaces as tab width

(4)
I'd change this:

> for arg in sys.argv[1:]:
> for section in hostname.split('.'):
> if section == arg:
> count = count + 1
> break

to something more like:

for section in hostname.split("."):
if section in sys.argv[1:]:
count += 1

(although as you suggested I'd only calculate sys.argv[1:] once)

.... or you could replace whole section between the for loop and
hosts.append with:

if sorted(hostname.split(".")) == sorted(sys.argv[1:]):
host.append(hostname)

, at a slight loss of clarity - but I think I'd stick with the more
verbose version personally.

Tim

MRAB

3/4/2010 7:31:00 PM

Pete Emerson wrote:
> I've written my first python program, and would love suggestions for
> improvement.
>
> I'm a perl programmer and used a perl version of this program to guide
> me. So in that sense, the python is "perlesque"
>
> This script parses /etc/hosts for hostnames, and based on terms given
> on the command line (argv), either prints the list of hostnames that
> match all the criteria, or uses ssh to connect to the host if the
> number of matches is unique.
>
> I am looking for advice along the lines of "an easier way to do this"
> or "a more python way" (I'm sure that's asking for trouble!) or
> "people commonly do this instead" or "here's a slick trick" or "oh,
> interesting, here's my version to do the same thing".
>
> I am aware that there are performance improvements and error checking
> that could be made, such as making sure the file exists and is
> readable and precompiling the regular expressions and not calculating
> how many sys.argv arguments there are more than once. I'm not hyper
> concerned with performance or idiot proofing for this particular
> script.
>
> Thanks in advance.
>
> ########################################################
> #!/usr/bin/python
>
> import sys, fileinput, re, os
>
> filename = '/etc/hosts'
>
> hosts = []
>
> for line in open(filename, 'r'):
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

Use 'raw' strings for regular expressions.

'Normal' Python string literals use backslashes in escape sequences (eg,
'\n'), but so do regular expressions, and regular expressions are passed
to the 're' module in strings, so that can lead to a profusion of
backslashes!

You could either escape the backslashes:

match = re.search('\\d+\\.\\d+\\.\\d+\\.\\d+\s+(\\S+)', line)

or use a raw string:

match = re.search(r'\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

A raw string literal is just like a normal string literal, except that
backslashes aren't special.

> if match is None or re.search('^(?:float|localhost)\.', line):
> continue

It would be more 'Pythonic' to say "not match".

if not match or re.search(r'^(?:float|localhost)\.', line):

> hostname = match.group(1)
> count = 0
> for arg in sys.argv[1:]:
> for section in hostname.split('.'):
> if section == arg:
> count = count + 1

Shorter is:

count += 1

> break
> if count == len(sys.argv) - 1:
> hosts.append(hostname)
>
> if len(hosts) == 1:
> os.system("ssh -A " + hosts[0])
> else:
> print '\n'.join(hosts)

You're splitting the hostname repeatedly. You could split it just once,
outside the outer loop, and maybe make it into a set. You could then
have:

host_parts = set(hostname.split('.'))
count = 0
for arg in sys.argv[1:]:
if arg in host_parts:
count += 1

Incidentally, the re module contains a cache, so the regular expressions
won't be recompiled every time (unless you have a lot of them and the re
module flushes the cache).

nn

3/4/2010 7:55:00 PM

On Mar 4, 2:30 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Pete Emerson wrote:
> > I've written my first python program, and would love suggestions for
> > improvement.
>
> > I'm a perl programmer and used a perl version of this program to guide
> > me. So in that sense, the python is "perlesque"
>
> > This script parses /etc/hosts for hostnames, and based on terms given
> > on the command line (argv), either prints the list of hostnames that
> > match all the criteria, or uses ssh to connect to the host if the
> > number of matches is unique.
>
> > I am looking for advice along the lines of "an easier way to do this"
> > or "a more python way" (I'm sure that's asking for trouble!) or
> > "people commonly do this instead" or "here's a slick trick" or "oh,
> > interesting, here's my version to do the same thing".
>
> > I am aware that there are performance improvements and error checking
> > that could be made, such as making sure the file exists and is
> > readable and precompiling the regular expressions and not calculating
> > how many sys.argv arguments there are more than once. I'm not hyper
> > concerned with performance or idiot proofing for this particular
> > script.
>
> > Thanks in advance.
>
> > ########################################################
> > #!/usr/bin/python
>
> > import sys, fileinput, re, os
>
> > filename = '/etc/hosts'
>
> > hosts = []
>
> > for line in open(filename, 'r'):
> > match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
>
> Use 'raw' strings for regular expressions.
>
> 'Normal' Python string literals use backslashes in escape sequences (eg,
> '\n'), but so do regular expressions, and regular expressions are passed
> to the 're' module in strings, so that can lead to a profusion of
> backslashes!
>
> You could either escape the backslashes:
>
> match = re.search('\\d+\\.\\d+\\.\\d+\\.\\d+\s+(\\S+)', line)
>
> or use a raw string:
>
> match = re.search(r'\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
>
> A raw string literal is just like a normal string literal, except that
> backslashes aren't special.
>
> > if match is None or re.search('^(?:float|localhost)\.', line):
> > continue
>
> It would be more 'Pythonic' to say "not match".
>
> if not match or re.search(r'^(?:float|localhost)\.', line):
>
> > hostname = match.group(1)
> > count = 0
> > for arg in sys.argv[1:]:
> > for section in hostname.split('.'):
> > if section == arg:
> > count = count + 1
>
> Shorter is:
>
> count += 1
>
> > break
> > if count == len(sys.argv) - 1:
> > hosts.append(hostname)
>
> > if len(hosts) == 1:
> > os.system("ssh -A " + hosts[0])
> > else:
> > print '\n'.join(hosts)
>
> You're splitting the hostname repeatedly. You could split it just once,
> outside the outer loop, and maybe make it into a set. You could then
> have:
>
> host_parts = set(hostname.split('.'))
> count = 0
> for arg in sys.argv[1:]:
> if arg in host_parts:
> count += 1
>
> Incidentally, the re module contains a cache, so the regular expressions
> won't be recompiled every time (unless you have a lot of them and the re
> module flushes the cache).

I think that you really just need a set intersection for the count
part and this should work:

host_parts = set(hostname.split('.'))
count = len(host_parts.intersection(sys.argv[1:]))

Jonathan Gardner

3/4/2010 9:55:00 PM

On Thu, Mar 4, 2010 at 10:39 AM, Pete Emerson <pemerson@gmail.com> wrote:
>
> #!/usr/bin/python
>

More common:

#!/usr/bin/env python

> import sys, fileinput, re, os
>
> filename = '/etc/hosts'
>
> hosts = []
>
> for line in open(filename, 'r'):
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
> if match is None or re.search('^(?:float|localhost)\.', line):
> continue
> hostname = match.group(1)

In Python, we use REs as a last resort. See "dir("")" or "help("")"
for why we don't use REs if we can avoid them.

The above is better represented with:

try:
ipaddr, hostnames = line.split(None, 1)
except IndexError:
continue

if line.find('float') >=0 or line.find('localhost') >= 0:
continue

> count = 0
> for arg in sys.argv[1:]:
> for section in hostname.split('.'):
> if section == arg:
> count = count + 1
> break
> if count == len(sys.argv) - 1:
> hosts.append(hostname)
>

You can use the "else" in a "for". as well.

for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
break
else: # Run only if for completes naturally
hosts.append(hostname)

It may be clearer to do set arithmetic as well.

> if len(hosts) == 1:
> os.system("ssh -A " + hosts[0])

You probably want one of the os.exec* methods instead, since you
aren't going to do anything after this.

> else:
> print '\n'.join(hosts)

Rather, the idiom is usually:

for host in hosts:
print host

--
Jonathan Gardner
jgardner@jonathangardner.net

sjdevnull@yahoo.com

3/4/2010 10:55:00 PM

On Mar 4, 1:39 pm, Pete Emerson <pemer...@gmail.com> wrote:
> I've written my first python program, and would love suggestions for
> improvement.
>
> I'm a perl programmer and used a perl version of this program to guide
> me. So in that sense, the python is "perlesque"
>
> This script parses /etc/hosts for hostnames, and based on terms given
> on the command line (argv), either prints the list of hostnames that
> match all the criteria, or uses ssh to connect to the host if the
> number of matches is unique.
>
> I am looking for advice along the lines of "an easier way to do this"
> or "a more python way" (I'm sure that's asking for trouble!) or
> "people commonly do this instead" or "here's a slick trick" or "oh,
> interesting, here's my version to do the same thing".
>
> I am aware that there are performance improvements and error checking
> that could be made, such as making sure the file exists and is
> readable and precompiling the regular expressions and not calculating
> how many sys.argv arguments there are more than once. I'm not hyper
> concerned with performance or idiot proofing for this particular
> script.
>
> Thanks in advance.
>
> ########################################################
> #!/usr/bin/python
>
> import sys, fileinput, re, os

'Some people, when confronted with a problem, think "I know, I’ll use
regular expressions." Now they have two problems.' — Jamie Zawinski

Seriously, regexes can be very useful but there's no need for them
here. Simpler is usually better, and easier to understand.

> filename = '/etc/hosts'
>
> hosts = []
>
> for line in open(filename, 'r'):
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
> if match is None or re.search('^(?:float|localhost)\.', line):
> continue
> hostname = match.group(1)

I find this much clearer without regexes:

try:
ip, hostname = line.strip().split(None, 1)
except IndexError:
continue
# I think this is equivalent to your re, but I'm not sure it's what
# you actually meant...
#if line.startswith("float.") or line.startswith("localhost."):
# continue
# I'm going with:
if hostname.startswith("float.") or hostname.startswith("localhost"):
continue

> count = 0
> for arg in sys.argv[1:]:
> for section in hostname.split('.'):
> if section == arg:
> count = count + 1
> break
> if count == len(sys.argv) - 1:
> hosts.append(hostname)

A perfect application of sets.

#initialize at program outset
args = set(sys.argv[1:])
....
hostparts = set(hostname.split("."))
if hostparts & args:
hosts.append(hostname)

Full program:

import sys
import os
filename = '/etc/hosts'

hosts = []
args = set(sys.argv[1:])
for line in open(filename, 'r'):
# Parse line into ip address and hostname, skipping bogus lines
try:
ipaddr, hostname = line.strip().split(None, 1)
except ValueError:
continue
if hostname.startswith("float.") or
hostname.startswith("localhost."):
continue

# Add to hosts if it matches at least one argument
hostparts = set(hostname.split("."))
if hostparts & args:
hosts.append(hostname)

# If there's only one match, ssh to it--otherwise print out the
matches
if len(hosts) == 1:
os.system("ssh -A %s"%hosts[0])
else:
for host in hosts:
print host

Pete Emerson

3/5/2010

Great responses, thank you all very much. I read Jonathan Gardner's
solution first and investigated sets. It's clearly superior to my
first cut.

I love the comment about regular expressions. In perl, I've reached
for regexes WAY too much. That's a big lesson learned too, and from my
point of view not because of performance (although that's most likely
a bonus) but because of code readability.

Here's the version I have now. It looks quite similar to sjdevnull's.
Further comments appreciated, and thanks again.

#!/usr/bin/env python

import sys, fileinput, os

filename = '/etc/hosts'

hosts = []

search_terms = set(sys.argv[1:])

for line in open(filename, 'r'):
if line.startswith('#'): continue
try:
hostname = line.strip().split('\t')[2] # The host I want is always
at the 3rd tab.
except IndexError:
continue
if hostname.startswith('float.') or
hostname.startswith('localhost.'):
continue
if search_terms <= set(hostname.split('.')): # same as if
search_terms.issubset(hostname.split('.')):
hosts.append(hostname)

if len(hosts) == 1:
os.execl("/usr/bin/ssh", '-A', hosts[0])
else:
for host in hosts:
print host

Jean-Michel Pichavant

3/5/2010 10:45:00 AM

Pete Emerson wrote:
> I've written my first python program, and would love suggestions for
> improvement.
>
> I'm a perl programmer and used a perl version of this program to guide
> me. So in that sense, the python is "perlesque"
>
> This script parses /etc/hosts for hostnames, and based on terms given
> on the command line (argv), either prints the list of hostnames that
> match all the criteria, or uses ssh to connect to the host if the
> number of matches is unique.
>
> I am looking for advice along the lines of "an easier way to do this"
> or "a more python way" (I'm sure that's asking for trouble!) or
> "people commonly do this instead" or "here's a slick trick" or "oh,
> interesting, here's my version to do the same thing".
>
> I am aware that there are performance improvements and error checking
> that could be made, such as making sure the file exists and is
> readable and precompiling the regular expressions and not calculating
> how many sys.argv arguments there are more than once. I'm not hyper
> concerned with performance or idiot proofing for this particular
> script.
>
> Thanks in advance.
>
> ########################################################
> #!/usr/bin/python
>
> import sys, fileinput, re, os
>
> filename = '/etc/hosts'
>
> hosts = []
>
> for line in open(filename, 'r'):
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
> if match is None or re.search('^(?:float|localhost)\.', line):
> continue
> hostname = match.group(1)
> count = 0
> for arg in sys.argv[1:]:
> for section in hostname.split('.'):
> if section == arg:
> count = count + 1
> break
> if count == len(sys.argv) - 1:
> hosts.append(hostname)
>
> if len(hosts) == 1:
> os.system("ssh -A " + hosts[0])
> else:
> print '\n'.join(hosts)
>
You've already been given good advices.
Unlike some of those, I don't think using regexp an issue in any way.
Yes they are not readable, for sure, I 100% agree to that statement, but
you can make them readable very easily.

# not readable
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

# readable
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line) # match
'192.168.200.1 (foo123)'

I gladly understand ppl that are not willing to use regexp (or as last
resort), but for a perl guy, that should not be an issue.

JM

Duncan Booth

3/5/2010 1:39:00 PM

Jean-Michel Pichavant <jeanmichel@sequans.com> wrote:

> You've already been given good advices.
> Unlike some of those, I don't think using regexp an issue in any way.
> Yes they are not readable, for sure, I 100% agree to that statement, but
> you can make them readable very easily.
>
> # not readable
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
>
>
> # readable
> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line) # match
> '192.168.200.1 (foo123)'
>
> I gladly understand ppl that are not willing to use regexp (or as last
> resort), but for a perl guy, that should not be an issue.

The problem with your comment is that just by existing it means people will
be misled into thinking the regex is being used to match strings like
'192.168.200.1 (foo123)'
when in fact the OP is expecting to match strings like
'192.168.200.1 foo123'

i.e. you may mislead some people into thinking the parentheses are part of
what is being matched when they're actually grouping within the regex.

Another problem with either regular expression line is that it makes it
less easy to see that it doesn't do what the OP wanted. /etc/hosts contains
lines with an IP address followed by multiple hostnames. It may also
contain comments either as complete lines or at the end of lines. The code
given will miss hostnames after the first and will include hostnames in
commented out lines.

--
Duncan Booth http://kupuguy.bl...

Jean-Michel Pichavant

3/5/2010 2:29:00 PM

Duncan Booth wrote:
> Jean-Michel Pichavant <jeanmichel@sequans.com> wrote:
>
>
>> You've already been given good advices.
>> Unlike some of those, I don't think using regexp an issue in any way.
>> Yes they are not readable, for sure, I 100% agree to that statement, but
>> you can make them readable very easily.
>>
>> # not readable
>> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
>>
>>
>> # readable
>> match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line) # match
>> '192.168.200.1 (foo123)'
>>
>> I gladly understand ppl that are not willing to use regexp (or as last
>> resort), but for a perl guy, that should not be an issue.
>>
>
> The problem with your comment is that just by existing it means people will
> be misled into thinking the regex is being used to match strings like
> '192.168.200.1 (foo123)'
> when in fact the OP is expecting to match strings like
> '192.168.200.1 foo123'
>
> i.e. you may mislead some people into thinking the parentheses are part of
> what is being matched when they're actually grouping within the regex.
>
> Another problem with either regular expression line is that it makes it
> less easy to see that it doesn't do what the OP wanted. /etc/hosts contains
> lines with an IP address followed by multiple hostnames. It may also
> contain comments either as complete lines or at the end of lines. The code
> given will miss hostnames after the first and will include hostnames in
> commented out lines.
>
>
And tell me how not using regexp will ensure the /etc/hosts processing
is correct ? The non regexp solutions provided in this thread did not
handled what you rightfully pointed out about host list and commented lines.

And FYI, the OP pattern does match '192.168.200.1 (foo123)'
....
Ok that's totally unfair :D You're right I made a mistake. Still the
comment is absolutely required (provided it's correct).

JM

Duncan Booth

3/5/2010 3:00:00 PM

Jean-Michel Pichavant <jeanmichel@sequans.com> wrote:

> And tell me how not using regexp will ensure the /etc/hosts processing
> is correct ? The non regexp solutions provided in this thread did not
> handled what you rightfully pointed out about host list and commented
> lines.

It won't make is automatically correct, but I'd guess that written without
being so dependent on regexes might have made someone point out those
deficiencies sooner. The point being that casual readers of the code won't
take the time to decode the regex, they'll glance over it and assume it
does something or other sensible.

If I was writing that code, I'd read each line, strip off comments and
leading whitespace (so you can use re.match instead of re.search), split on
whitespace and take all but the first field. I might check that the field
I'm ignoring it something like a numeric ip address, but if I did want to
do then I'd include range checking for valid octets so still no regex.

The whole of that I'd wrap in a generator so what you get back is a
sequence of host names.

However that's just me. I'm not averse to regular expressions, I've written
some real mammoths from time to time, but I do avoid them when there are
simpler clearer alternatives.

> And FYI, the OP pattern does match '192.168.200.1 (foo123)'
> ...
> Ok that's totally unfair :D You're right I made a mistake. Still the
> comment is absolutely required (provided it's correct).
>
Yes, the comment would have been good had it been correct. I'd also go for
a named group as that provides additional context within the regex.

Also if there are several similar regular expressions in the code, or if
they get too complex I'd build them up in parts. e.g.

OCTET = r'(?:\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])'
ADDRESS = (OCTET + r'\.') * 3 + OCTET
HOSTNAME = r'[-a-zA-Z0-9]+(?:\.[-a-zA-Z0-9]+)*'
# could use \S+ but my Linux manual says
# alphanumeric, dash and dots only
.... and so on ...

which provides another way of documenting the intentions of the regex.

BTW, I'm not advocating that here, the above patterns would be overkill,
but in more complex situations thats what I'd do.

--
Duncan Booth http://kupuguy.bl...

comp.lang.python

Evaluate my first python script, please

Pete Emerson

Tim Wintle

MRAB

nn

Jonathan Gardner

sjdevnull@yahoo.com

Pete Emerson

Jean-Michel Pichavant

Duncan Booth

Jean-Michel Pichavant

Duncan Booth

x Login to ForumsZone