comp.lang.python

should writing Unicode files be so slow

djc

3/18/2010 10:30:00 PM

I have a simple program to read a text (.csv) file and split it into
several smaller files. Tonight I decided to write a unicode variant and was
surprised at the difference in performance. Is there a better way?

> from __future__ import with_statement
> import codecs
>
> def _rowreader(filename, separator='\t'):
>     """Generator for iteration over potentially large file."""
>     with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>         for row in tabfile:
>             yield [v.strip() for v in row.split(separator)]
>
> def generator_of_output(source_of_lines):
>     # (Not used below; some_function is not defined in this excerpt.)
>     for line in source_of_lines:
>         for result in some_function(line):
>             yield result
>
> def coroutine(outfile_prefix, outfile_suffix, sep='\t'):
>     # Consumer coroutine: writes each line it receives to one output file.
>     outfile = '%s_%s.txt' % (outfile_prefix, outfile_suffix)
>     with codecs.open(outfile, 'w', 'utf-8') as out_part:
>         while True:
>             line = (yield)
>             out_part.write(sep.join(line) + '\n')
>
> def _file_to_files(infile, outfile_prefix, column, sep):
>     # Route each row to a per-value writer, keyed by the split column.
>     column_values = dict()
>     for line in _rowreader(infile, sep):
>         outfile_suffix = line[column].strip('\'\"')
>         if outfile_suffix in column_values:
>             column_values[outfile_suffix].send(line)
>         else:
>             file_writer = coroutine(outfile_prefix, outfile_suffix, sep)
>             file_writer.next()   # prime the coroutine to the first yield
>             file_writer.send(line)
>             column_values[outfile_suffix] = file_writer
>     for file_writer in column_values.itervalues():
>         file_writer.close()

The plain version is the same except for:
> with open(filename, 'rU') as tabfile:
> with open(outfile, 'wt') as out_part:


The difference:
> "uid","timestamp","taskid","inputid","value"
> "15473178739336026589","2010-02-18T20:50:15+0000","11696870405","73827093507","83523277829"
> "15473178739336026589","2010-02-18T20:50:15+0000","11696870405","11800677379","12192844803"
> "15473178739336026589","2010-02-18T20:50:15+0000","11696870405","31231839235","52725552133"
>
> sysweb@Bembo:~/UCLC/bbc/wb2$ wc -l wb.csv
> 9293271 wb.csv
>
> normal version
> sysweb@Bembo:~/UCLC$ time ~/UCL/toolkit/file_splitter.py -o tt --separator comma -k 2 wb.csv
>
> real 0m43.714s
> user 0m37.370s
> sys 0m2.732s
>
> unicode version
> sysweb@Bembo:~/UCLC$ time ./file_splitter.py -o t --separator comma -k 2 wb.csv
>
> real 4m8.695s
> user 3m19.236s
> sys 0m39.262s



--
David Clark, MSc, PhD. UCL Centre for Publishing
Gower St, London WC1E 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbeh...
9 Answers

Ben Finney

3/19/2010 1:50:00 AM


djc <slais-www@ucl.ac.uk> writes:

> I have a simple program to read a text (.csv) file

Could you please:

* simplify it further: make a minimal version that demonstrates the
difference you're seeing, without any extraneous stuff that doesn't
appear to affect the result.

* make it complete: the code you've shown doesn't do anything except
define some functions.

In other words: please reduce it to a complete, minimal example that we
can run to see the same behaviour you're seeing.

--
 \        "If we ruin the Earth, there is no place else to go. This is |
  `\      not a disposable world, and we are not yet able to re-engineer |
  _o__)   other planets." --Carl Sagan, _Cosmos_, 1980 |
Ben Finney

djc

3/19/2010 12:08:00 PM


Ben Finney wrote:
> djc <slais-www@ucl.ac.uk> writes:
>
>> I have a simple program to read a text (.csv) file
>
> Could you please:
>
> * simplify it further: make a minimal version that demonstrates the
> difference you're seeing, without any extraneous stuff that doesn't
> appear to affect the result.
>
> * make it complete: the code you've shown doesn't do anything except
> define some functions.
>
> In other words: please reduce it to a complete, minimal example that we
> can run to see the same behaviour you're seeing.
>


It is the minimal example. The only thing omitted is the optparse code that
calls _file_to_files(infile, outfile_prefix, column, sep):
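
A driver along these lines would make the example complete (a sketch
only: the option names are inferred from the command lines quoted in the
first post, not taken from the actual code):

import optparse

def main():
    # Hypothetical reconstruction of the omitted optparse driver.
    parser = optparse.OptionParser(usage='%prog [options] infile')
    parser.add_option('-o', dest='outfile_prefix',
                      help='prefix for the output file names')
    parser.add_option('--separator', dest='separator', default='tab',
                      help="field separator: 'tab' or 'comma'")
    parser.add_option('-k', dest='column', type='int',
                      help='index of the column to split on')
    options, args = parser.parse_args()
    sep = ',' if options.separator == 'comma' else '\t'
    _file_to_files(args[0], options.outfile_prefix, options.column, sep)

if __name__ == '__main__':
    main()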


--
David Clark, MSc, PhD. UCL Centre for Publishing
Gower St, London WC1E 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbeh...

Ben Finney

3/19/2010 12:57:00 PM


djc <slais-www@ucl.ac.uk> writes:

> Ben Finney wrote:
> > Could you please:
> >
> > * simplify it further: make a minimal version that demonstrates the
> > difference you're seeing, without any extraneous stuff that doesn't
> > appear to affect the result.
> >
> > * make it complete: the code you've shown doesn't do anything except
> > define some functions.
> >
> > In other words: please reduce it to a complete, minimal example that we
> > can run to see the same behaviour you're seeing.
>
> It is the minimal example. The only thing omitted is the optparse code
> that calls _file_to_files(infile, outfile_prefix, column, sep):

What happens, then, when you make a smaller program that deals with only
one file?

What happens when you make a smaller program that only reads the file,
and doesn't write any? Or a different program that only writes a file,
and doesn't read any?

It's this sort of reduction that will help narrow down exactly what
the problem is. Do make sure that each example is also complete (i.e.
can be run as is by someone who uses only that code with no additions).
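
For instance, a complete write-only reduction might look like this (a
sketch: the row content is made up, and only the two encodings come from
the original program):

import time
import codecs

ROW = u'\t'.join([u'15473178739336026589'] * 5) + u'\n'

def timed_write(opener, name, *args):
    # Write a million identical rows, timing only the write path.
    start = time.time()
    with opener(name, *args) as out:
        for _ in xrange(1000000):
            out.write(ROW)
    print '%-12s %.2fs' % (name, time.time() - start)

timed_write(open, 'plain.txt', 'wt')
timed_write(codecs.open, 'unicode.txt', 'w', 'utf-8')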

--
 \        "To have the choice between proprietary software packages, is |
  `\      being able to choose your master. Freedom means not having a |
  _o__)   master." --Richard M. Stallman, 2007-05-16 |
Ben Finney

djc

3/19/2010 5:18:00 PM


Ben Finney wrote:

> What happens, then, when you make a smaller program that deals with only
> one file?
>
> What happens when you make a smaller program that only reads the file,
> and doesn't write any? Or a different program that only writes a file,
> and doesn't read any?
>
> It's these sort of reductions that will help narrow down exactly what
> the problem is. Do make sure that each example is also complete (i.e.
> can be run as is by someone who uses only that code with no additions).
>


The program reads one csv file of 9,293,271 lines:
869M wb.csv
It creates a set of files containing the same lines, but where each
output file in the set contains only those lines in which the value of a
particular column is the same; the number of output files depends on the
number of distinct values in that column. In the example that results in
19 files:

74M tt_11696870405.txt
94M tt_18762175493.txt
15M tt_28668070915.txt
12M tt_28673313795.txt
15M tt_28678556675.txt
11M tt_28683799555.txt
12M tt_28689042435.txt
15M tt_28694285315.txt
7.3M tt_28835845125.txt
6.8M tt_28842136581.txt
12M tt_28848428037.txt
11M tt_28853670917.txt
12M tt_28858913797.txt
15M tt_28864156677.txt
11M tt_28869399557.txt
11M tt_28874642437.txt
283M tt_31002203141.txt
259M tt_33335282691.txt
45 2010-03-19 17:00 tt_taskid.txt

changing
with open(filename, 'rU') as tabfile:
to
with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:

and
with open(outfile, 'wt') as out_part:
to
with codecs.open(outfile, 'w', 'utf-8') as out_part:

causes a program that runs in 43 seconds to take 4 minutes to process the
same data. In this particular case that is not very important: any unicode
strings in the data are not worth troubling over, and I have already spent
more time satisfying curiosity than will ever be required to process the
dataset in future. But I have another project in hand where not only is
the unicode significant but the files are very much larger. Scale up the
problem and the difference between 4 hours and 24 becomes a matter worth
some attention.



--
David Clark, MSc, PhD. UCL Centre for Publishing
Gower St, London WC1E 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbeh...

Gabriel Genellina

3/19/2010 10:24:00 PM


On Fri, 19 Mar 2010 14:18:17 -0300, djc <slais-www@ucl.ac.uk> wrote:
> Ben Finney wrote:
>
>> What happens, then, when you make a smaller program that deals with only
>> one file?
>>
>> What happens when you make a smaller program that only reads the file,
>> and doesn't write any? Or a different program that only writes a file,
>> and doesn't read any?
>>
>> It's these sort of reductions that will help narrow down exactly what
>> the problem is. Do make sure that each example is also complete (i.e.
>> can be run as is by someone who uses only that code with no additions).
>>
>
>
> The program reads one csv file of 9,293,271 lines:
> 869M wb.csv
> It creates a set of files containing the same lines, but where each
> output file in the set contains only those lines in which the value of a
> particular column is the same; the number of output files depends on the
> number of distinct values in that column. In the example that results in
> 19 files.
>
> changing
> with open(filename, 'rU') as tabfile:
> to
> with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>
> and
> with open(outfile, 'wt') as out_part:
> to
> with codecs.open(outfile, 'w', 'utf-8') as out_part:
>
> causes a program that runs in 43 seconds to take 4 minutes to process the
> same data. In this particular case that is not very important: any unicode
> strings in the data are not worth troubling over, and I have already spent
> more time satisfying curiosity than will ever be required to process the
> dataset in future. But I have another project in hand where not only is
> the unicode significant but the files are very much larger. Scale up the
> problem and the difference between 4 hours and 24 becomes a matter worth
> some attention.

Ok. Your test program is too large to determine what's going on. Try to
determine first *which* part is slow:

- reading: measure the time it takes only to read a file, with open() and
codecs.open(). The density of non-ASCII characters, and their code points,
may matter (utf-8 is much more efficient for ASCII data than, say, Hanzi).
- processing: measure the time the processing part takes, fed with str
vs. unicode data.
- writing: measure the time it takes only to write a file, with open()
and codecs.open().

Only then can one focus on optimizing the bottleneck; for the first point,
a small timing script like the sketch below is enough.
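
A minimal read-only sketch for that first point (the file name and error
handler come from the original post; the rest is an assumption):

import time
import codecs

def timed_read(label, opener, *args):
    # Iterate over every line without processing, timing only the read path.
    start = time.time()
    with opener(*args) as f:
        for line in f:
            pass
    print '%-12s %.2fs' % (label, time.time() - start)

timed_read('plain', open, 'wb.csv', 'rU')
timed_read('codecs', codecs.open, 'wb.csv', 'rU', 'utf-8', 'backslashreplace')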

--
Gabriel Genellina

Ben Finney

3/19/2010 11:02:00 PM


"Gabriel Genellina" <gagsl-py2@yahoo.com.ar> writes:

> Ok. Your test program is too large to determine what's going on. Try
> to determine first *which* part is slow:

Right. This is done by the diagnostic technique of writing *new*,
minimal, complete programs that exercise each piece of the functionality
separately.

You're not tinkering with the existing program that's misbehaving;
you're trying to *recreate* the misbehaviour under a simpler
environment.

Hope that helps. Feel free to continue posting complete minimal programs
that exercise one thing and show behaviour you're unsure about.

--
 \        "We now have access to so much information that we can find |
  `\      support for any prejudice or opinion." --David Suzuki, 2008-06-27 |
  _o__)   |
Ben Finney

Antoine Pitrou

3/20/2010 12:19:00 AM


On Fri, 19 Mar 2010 17:18:17 +0000, djc wrote:
>
> changing
> with open(filename, 'rU') as tabfile:
> to
> with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>
> and
> with open(outfile, 'wt') as out_part:
> to
> with codecs.open(outfile, 'w', 'utf-8') as out_part:
>
> causes a program that runs in 43 seconds to take 4 minutes to process
> the same data.

codecs.open() (and the object it returns) is slow as it is written in
pure Python.

Accelerated reading and writing of unicode files is available in Python
2.7 and 3.1, using the new `io` module.
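
For example (a sketch; `io.open` also exists in 2.6, but only 2.7 and 3.1
ship the C-accelerated implementation):

import io

# Reading: newline=None gives universal-newline behaviour, like mode 'rU'.
with io.open('wb.csv', 'r', encoding='utf-8',
             errors='backslashreplace', newline=None) as tabfile:
    for row in tabfile:
        pass  # each row is already a decoded unicode string

# Writing: with an encoding set, io.open expects unicode strings.
with io.open('out.txt', 'w', encoding='utf-8') as out_part:
    out_part.write(u'some text\n')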

Regards

Antoine.


djc

3/21/2010 2:30:00 PM


Antoine Pitrou wrote:
> On Fri, 19 Mar 2010 17:18:17 +0000, djc wrote:
>> changing
>> with open(filename, 'rU') as tabfile:
>> to
>> with codecs.open(filename, 'rU', 'utf-8', 'backslashreplace') as tabfile:
>>
>> and
>> with open(outfile, 'wt') as out_part:
>> to
>> with codecs.open(outfile, 'w', 'utf-8') as out_part:
>>
>> causes a program that runs in 43 seconds to take 4 minutes to process
>> the same data.
>
> codecs.open() (and the object it returns) is slow as it is written in
> pure Python.
>
> Accelerated reading and writing of unicode files is available in Python
> 2.7 and 3.1, using the new `io` module.

Thank you for a clear and to-the-point explanation. I shall concentrate on
finding an optimal time to upgrade from Python 2.6.


--
David Clark, MSc, PhD. UCL Centre for Publishing
Gower St, London WC1E 6BT
What sort of web animal are you?
<https://www.bbc.co.uk/labuk/experiments/webbeh...

Ben Finney

3/21/2010 10:51:00 PM


djc <slais-www@ucl.ac.uk> writes:

> I shall concentrate on finding an optimal time to upgrade from Python
> 2.6.

Note that Python 2.7, though nearly ready, is not yet released
<URL:http://www.python.org/download/rel....

--
 \        "... Nature ... is seen to do all things Herself and through |
  `\      herself of own accord, rid of all gods." --Titus Lucretius |
  _o__)   Carus, c. 40 BCE |
Ben Finney