Asp Forum - Re: speed question, reading csv using takewhile() and dropwhile

MRAB

2/19/2010 8:02:00 PM

Vincent Davis wrote:
> I have some (~50) text files that have about 250,000 rows each. I
> am reading them in using the following, which gets me what I want, but it
> is not fast. Is there something I am missing that would help? This is
> mostly a question to help me learn more about Python. It takes about 4
> min right now.
>
> def read_data_file(filename):
>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>     read = list(reader)
>     data_rows = takewhile(lambda trow: '[MASKS]' not in trow,
>                           [x for x in read])

'takewhile' accepts an iterable, so "[x for x in read]" can be
simplified to "read".
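
As a small illustration of that point (written in Python 3 syntax, with made-up sample rows standing in for the real file), passing the list itself gives the same rows as passing a copy built with a list comprehension:

```python
from itertools import takewhile

# Hypothetical sample rows, standing in for list(csv.reader(...)).
read = [['id', 'value'], ['1', '0.5'], ['2', '0.7'], ['[MASKS]'], ['m1']]

def not_masks(row):
    return '[MASKS]' not in row

# takewhile() only needs something it can iterate over,
# so the extra list comprehension just makes a copy of the list.
with_copy = list(takewhile(not_masks, [x for x in read]))
without_copy = list(takewhile(not_masks, read))

print(with_copy == without_copy)
```

The copy costs one extra pass over all 250,000 rows per file without changing the result.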

>     data = [x for x in data_rows][1:]
>
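data = list(data_rows)[1:]

('takewhile' returns an iterator, which can't be sliced directly, so it
still needs one conversion to a list.)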

>     mask_rows = takewhile(lambda trow: '[OUTLIERS]' not in trow,
>         list(dropwhile(lambda drow: '[MASKS]' not in drow, read)))
>     mask = [row for row in mask_rows if row][3:]
>
No need to convert the result of 'dropwhile' to list.
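
A rough sketch of what that looks like, in Python 3 syntax: the iterator returned by 'dropwhile' is fed straight into 'takewhile'. The sample rows are invented for illustration, and the assumption that the three skipped rows are the marker line plus two header lines is mine, not something stated in the thread.

```python
from itertools import dropwhile, takewhile

# Hypothetical rows: a data part, a [MASKS] part and an [OUTLIERS] part.
read = [['h'], ['1'], ['[MASKS]'], ['a'], ['b'], ['m1'], [],
        ['[OUTLIERS]'], ['o1']]

# dropwhile() returns a lazy iterator; takewhile() can consume it as is.
from_masks = dropwhile(lambda row: '[MASKS]' not in row, read)
mask_rows = takewhile(lambda row: '[OUTLIERS]' not in row, from_masks)

# Drop empty rows, then skip the first three rows of the section.
mask = [row for row in mask_rows if row][3:]
print(mask)  # [['m1']]
```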

>     outlier_rows = dropwhile(lambda drows: '[OUTLIERS]' not in drows, read)
>     outlier = [row for row in outlier_rows if row][3:]
>
The problem, as I see it, is that you're scanning the rows more than
once.

Is this any better?

import csv

def read_data_file(filename):
    reader = csv.reader(open(filename, "U"),delimiter='\t')
    # Data rows: everything before the [MASKS] marker, minus the first row.
    data = []
    for row in reader:
        if '[MASKS]' in row:
            break
        data.append(row)
    data = data[1:]
    # Mask rows: from the [MASKS] marker up to the [OUTLIERS] marker.
    mask = []
    if '[MASKS]' in row:
        mask.append(row)
        for row in reader:
            if '[OUTLIERS]' in row:
                break
            if row:
                mask.append(row)
    mask = mask[3:]
    # Outlier rows: whatever is left in the reader after [OUTLIERS].
    outlier = []
    if '[OUTLIERS]' in row:
        outlier.append(row)
        outlier.extend(row for row in reader if row)
    outlier = outlier[3:]
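
For anyone who wants to try the single-pass idea without the original files, here is a self-contained sketch in Python 3 syntax. The helper name split_sections, the return value, and the sample text are my own assumptions for illustration; only the section markers and the [1:] / [3:] slicing come from the code above.

```python
import csv
import io

def split_sections(rows):
    """Split rows into (data, mask, outlier), reading each row only once."""
    rows = iter(rows)              # one shared iterator for all three loops
    data, mask, outlier = [], [], []
    row = None                     # stays None if there are no rows at all
    for row in rows:
        if '[MASKS]' in row:
            break
        data.append(row)
    if row is not None and '[MASKS]' in row:
        mask.append(row)
        for row in rows:           # continues where the first loop stopped
            if '[OUTLIERS]' in row:
                break
            if row:                # skip empty rows
                mask.append(row)
    if row is not None and '[OUTLIERS]' in row:
        outlier.append(row)
        outlier.extend(r for r in rows if r)
    return data[1:], mask[3:], outlier[3:]

# Hypothetical tab-separated sample in the assumed layout.
sample = (
    "x\ty\n"
    "1\t2\n"
    "3\t4\n"
    "[MASKS]\n"
    "mask header 1\n"
    "mask header 2\n"
    "m1\t0\n"
    "\n"
    "[OUTLIERS]\n"
    "outlier header 1\n"
    "outlier header 2\n"
    "o1\t9\n"
)
reader = csv.reader(io.StringIO(sample), delimiter='\t')
data, mask, outlier = split_sections(reader)
print(data)     # [['1', '2'], ['3', '4']]
print(mask)     # [['m1', '0']]
print(outlier)  # [['o1', '9']]
```

Because the three loops share one iterator, no row is examined twice and the full file never has to be held in a separate list first.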

1 Answer

John Posner

2/19/2010 9:01:00 PM

On 2/19/2010 3:02 PM, MRAB wrote:
> Is this any better?
>
> def read_data_file(filename):
>     reader = csv.reader(open(filename, "U"),delimiter='\t')
>     data = []
>     for row in reader:
>         if '[MASKS]' in row:
>             break
>         data.append(row)

As noted in another thread recently, you can save time by *not* looking
up the "append" method of the list object each time through the FOR loop:

    data = []
    app_method = data.append
    for row in reader:
        if '[MASKS]' in row:
            break
        app_method(row)

The same applies to the rest of the code. This technique improved
performance by about 31% in this test:

#--------------------
import timeit
tt = timeit.repeat("for i in xrange(1000000): mylist.append(i)",
                   "mylist=[]",
                   number=25)
print "look up append() method each time:", min(tt)

tt = timeit.repeat("for i in xrange(1000000): app(i)",
                   "mylist=[]; app = mylist.append",
                   number=25)
print "look up append() method just once:", min(tt)
#--------------------

output:

look up append() method each time: 8.45481741783
look up append() method just once: 5.84429637887
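
The test above is Python 2 code. A rough Python 3 port might look like the sketch below; the smaller loop count and repeat count are my own choices to keep the run short, and no timings are shown because they depend on the machine and interpreter. Newer interpreters have optimized method calls, so the gap may well be smaller than the figure above.

```python
import timeit

N = 100000  # assumed size, smaller than the 1,000,000 used above

lookup_each_time = timeit.repeat(
    "for i in range(N): mylist.append(i)",
    setup="mylist = []",
    number=5,
    globals={"N": N},
)
lookup_once = timeit.repeat(
    "for i in range(N): app(i)",
    setup="mylist = []; app = mylist.append",
    number=5,
    globals={"N": N},
)

# As in the original test, take the best (minimum) of the repeated runs.
print("look up append() method each time:", min(lookup_each_time))
print("look up append() method just once:", min(lookup_once))
```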

-John

comp.lang.python
