comp.lang.python

Looking for suggestions on improving numpy code

David Lees

2/23/2008 6:38:00 AM

I am starting to use numpy and have written a hack for reading in a
large data set that has 8 columns and millions of rows. I want to read
and process a single column. I have written the very ugly hack below,
but I am sure there is a more efficient and pythonic way to do this. The
file is too big to read in one shot and then select a column, so it is
read in chunks and the column selected. Things I don't like in the code:
1. Performing a transpose on a large array
2. Uncertainty about the efficiency of numpy's append

Is there a way to read every n'th element from the file directly into an
array?

david


from numpy import *
from scipy.io.numpyio import fread

fd = open('testcase.bin', 'rb')
datatype = 'h'            # int16 records
byteswap = 0
M = 1000000               # rows per chunk
N = 8                     # columns
size = M * N              # values per chunk
shape = (M, N)
colNum = 2
sf = 1.645278e-04 * 10    # scale factor
z = array([])
for i in xrange(50):
    data = fread(fd, size, datatype, datatype, byteswap)
    data = data.reshape(shape)
    data = data.transpose()
    z = append(z, data[colNum] * sf)    # copies all of z every iteration

print z.mean()

fd.close()
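
As a sketch, the same chunked read could also use numpy.fromfile and take
every N'th element of each flat chunk, so no reshape or transpose is needed
(this assumes the same int16 layout as above):

import numpy

fd = open('testcase.bin', 'rb')
M, N = 1000000, 8
colNum = 2
sf = 1.645278e-04 * 10

chunks = []
for i in xrange(50):
    # Read one chunk as a flat array of M*N int16 values.
    data = numpy.fromfile(fd, dtype=numpy.int16, count=M * N)
    # Column colNum of a row-major M x N block is every N'th flat element,
    # starting at offset colNum.
    chunks.append(data[colNum::N] * sf)
fd.close()

z = numpy.concatenate(chunks)
print z.mean()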
2 Answers

7stud --

2/23/2008 11:39:00 AM


On Feb 22, 11:37 pm, David Lees <debl2NoS...@verizon.net> wrote:
> I want to read
> and process a single column.

Then why won't a list suffice?
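
For instance, a minimal list-based sketch with the struct module, assuming
the 8-column int16 layout from the original post:

import struct

rec = struct.Struct('<8h')    # one row: 8 little-endian int16s (assumed layout)
colNum = 2
sf = 1.645278e-04 * 10

values = []
fd = open('testcase.bin', 'rb')
row = fd.read(rec.size)
while len(row) == rec.size:
    values.append(rec.unpack(row)[colNum] * sf)
    row = fd.read(rec.size)
fd.close()

print sum(values) / len(values)

Per-row reads are slow for millions of rows, but a plain list is enough if
all you need is the one column.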

Robert Kern

2/25/2008 5:12:00 PM


David Lees wrote:
> I am starting to use numpy and have written a hack for reading in a
> large data set that has 8 columns and millions of rows. I want to read
> and process a single column. I have written the very ugly hack below,
> but am sure there is a more efficient and pythonic way to do this. The
> file is too big to read by brute force and select a column, so it is
> read in chunks and the column selected. Things I don't like in the code:
> 1. Performing a transpose on a large array

Transposition is trivially fast in numpy. It does not copy any memory.
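
For example, a quick check that transpose returns a view onto the same memory:

import numpy

a = numpy.zeros((3, 4))
t = a.transpose()      # constant time: only shape/strides metadata change
print t.base is a      # True: t shares a's memory, nothing is copied
a[0, 1] = 99.0
print t[1, 0]          # 99.0: the view sees the change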

> 2. Uncertainty about numpy append efficiency

Rest assured that it's slow. Appending to lists is fast since lists preallocate
memory according to a scheme such that the amortized cost of appending elements
is O(1). We don't quite have that luxury in numpy.
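
The usual workaround is to collect the chunks in a Python list and
concatenate once at the end; a minimal sketch:

import numpy

chunks = []
for i in xrange(50):
    chunk = numpy.arange(10.0)     # stand-in for one chunk's column data
    chunks.append(chunk)           # amortized O(1) list append
z = numpy.concatenate(chunks)      # one allocation, one pass over the data

numpy.append, by contrast, reallocates and copies the whole of z on every call.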

> Is there a way to directly read every n'th element from the file into an
> array?

Since this is a regular binary file, you can memory map the file.


import numpy

M = 1000000               # rows per chunk in the original code
N = 8                     # columns
column = 2
sf = 1.645278e-04 * 10    # scale factor

# Map the file as an M x N int16 array without reading it all into memory.
m = numpy.memmap('testcase.bin', dtype=numpy.int16, shape=(M, N))
z = m[:, column] * sf
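
One usage note: numpy.memmap defaults to mode='r+', so passing mode='r' keeps
the file read-only, and only the pages actually touched are read from disk.
If the file holds 50 such chunks, as in the original loop, shape=(50 * M, N)
maps the whole file:

m = numpy.memmap('testcase.bin', dtype=numpy.int16, mode='r', shape=(50 * M, N))
print (m[:, column] * sf).mean()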


You may want to ask future numpy questions on the numpy mailing list.

http://www.scipy.org/Mai...

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco