Asp Forum - Transpose a large file(>2GB

Raghu Go

4/27/2008 10:33:00 AM

Hi -

What is the fast and ruby way to transpose a large file(>2GB)??

I can't read the whole file into memory due to the file sizes..
--
Posted via http://www.ruby-....

17 Answers

Xavier Noria

4/27/2008 10:59:00 AM

On Apr 27, 2008, at 12:32 , Ams Lo wrote:

> What is the fast and ruby way to transpose a large file(>2GB)??
>
> I can't read the whole file into memory due to the file sizes..

Interesting.

If the file is

12
3
456

You want

134
2 5
6

?

-- fxn

Xavier Noria

4/27/2008 11:03:00 AM

On Apr 27, 2008, at 12:59 , Xavier Noria wrote:

> On Apr 27, 2008, at 12:32 , Ams Lo wrote:
>
>> What is the fast and ruby way to transpose a large file(>2GB)??
>>
>> I can't read the whole file into memory due to the file sizes..
>
> Interesting.
>
> If the file is
>
> 12
> 3
> 456
>
> You want
>
> 134
> 2 5
> 6

Sorry, I don't know why Mail.app deletes a single leading whitespace
when there's one, last line should be

space space six

-- fxn

Raghu Go

4/27/2008 11:15:00 AM

Exactly.. Sorry, I should have explained it better..

Xavier Noria wrote:
> On Apr 27, 2008, at 12:59 , Xavier Noria wrote:
>
>> 12
>> 3
>> 456
>>
>> You want
>>
>> 134
>> 2 5
>> 6
>
> Sorry, I don't know why Mail.app deletes a single leading whitespace
> when there's one, last line should be
>
> space space six
>
> -- fxn

--
Posted via http://www.ruby-....

Pascal J. Bourguignon

4/27/2008 11:35:00 AM

Ams Lo <rgowka1@gmail.com> writes:

> Exactly.. Sorry, I should have explained it better..

Wrong. It should be below.

> Xavier Noria wrote:
>> On Apr 27, 2008, at 12:59 , Xavier Noria wrote:
>>
>>> 12
>>> 3
>>> 456
>>>
>>> You want
>>>
>>> 134
>>> 2 5
>>> 6

Like this:

> Exactly.. Sorry, I should have explained it better..

Either have a 64-bit machine, and do it naively,

or have a vector of file positions pointing to the beginning of each
line O(file_size), and then write the character pointed to by these
file position (or a space if they reached the end of line) on a single
output line, and advance the file positions (unless they reach the end
of line). Repeat until all have reached end of line.

Note: this may be bad on the disk, if the input file has a lot of
lines. In this case, you may want to implement a recursive
transposition algorithm:

to transpose a matrix of dimension (c r) do
if (c=1) and (r=1)
then return the matrix itself
else
return the matrix made of:
| transpose of submatrix [0..(c/2)[,[0..(r/2)[ transpose of submatrix [0..(c/2)[,[(r/2)..r[ |
| transpose of submatrix [(c/2)..c[,[0..(r/2)[ transpose of submatrix [(c/2)..c[,[(r/2)..r[ |

that is, you cut the matrix in four, transpose each part
independently, and combine the four parts, exchanging the parts on the
anti-diagonal.

With this algorithm disk accesses should be more localized, so you may
get better speed on very big files with very big lines.

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
----------> http://www.netmeister.org/news/learn2... <-----------
---> http://homepage.ntlworld.com/g.mccaughan/g/remarks/u... <---

__Pascal Bourguignon__ http://www.informat...

Roger Pack

4/28/2008 4:03:00 PM

Save pieces of it to disk.

On Sun, Apr 27, 2008 at 4:32 AM, Ams Lo <rgowka1@gmail.com> wrote:
> Hi -
>
> What is the fast and ruby way to transpose a large file(>2GB)??
>
> I can't read the whole file into memory due to the file sizes..
> --
> Posted via http://www.ruby-....
>
>

Xavier Noria

4/28/2008 4:11:00 PM

On Apr 28, 2008, at 18:02 , Roger Pack wrote:

> Save pieces of it to disk.

Since seeking and inserting naively in text files is expensive I
wondered whether it could pay to use an intermediate specialized
software like DBM or whatever to build some structure from a line-
oriented reading of the input file, and then use it to output the
transposition in a line-oriented way.

Just off the top of my head without measuring, perhaps it is just much
worse.

Avdi Grimm

4/28/2008 4:48:00 PM

On Sun, Apr 27, 2008 at 6:32 AM, Ams Lo <rgowka1@gmail.com> wrote:
> Hi -
>
> What is the fast and ruby way to transpose a large file(>2GB)??

fseek to the end of the file. fseek backwards by a nice round number
like 4k. Read in that exact number of bytes. Transpose it. Write
the transposed block to your output file. Then fseek back 2*4k.
Read, transpose, write. fseek back 3*4k. Etc. Rinse and repeat to
the beginning of the file.

--
Avdi

Home: http:...
Developer Blog: http:.../devblog/
Twitter: http://twitte...
Journal: http://avdi.livej...

Avdi Grimm

4/28/2008 4:57:00 PM

On Mon, Apr 28, 2008 at 12:47 PM, Avdi Grimm <avdi@avdi.org> wrote:
> fseek to the end of the file. fseek backwards by a nice round number
> like 4k. Read in that exact number of bytes. Transpose it. Write
> the transposed block to your output file. Then fseek back 2*4k.
> Read, transpose, write. fseek back 3*4k. Etc. Rinse and repeat to
> the beginning of the file.

Of course, if you're transposing lines, that's trickier.

--
Avdi

Home: http:...
Developer Blog: http:.../devblog/
Twitter: http://twitte...
Journal: http://avdi.livej...

Lionel Bouton

4/28/2008 5:17:00 PM

Pascal Bourguignon wrote:

> [...] to transpose a matrix of dimension (c r) do
> if (c=1) and (r=1)
> then return the matrix itself
> else
> return the matrix made of:
> | transpose of submatrix [0..(c/2)[,[0..(r/2)[ transpose of submatrix [0..(c/2)[,[(r/2)..r[ |
> | transpose of submatrix [(c/2)..c[,[0..(r/2)[ transpose of submatrix [(c/2)..c[,[(r/2)..r[ |
>
> that is, you cut the matrix in four, transpose each part
> independently, and combine the four parts, exchanging the parts on the
> anti-diagonal.
>
> With this algorithm disk accesses should be more localized, so you may
> get better speed on very big files with very big lines.
>

Here's an algorithm that should minimize the amount of seeks.

Given that the file doesn't fit in memory, you'll have to fight against
the disk cache as it won't be effective (caching data that won't ever be
reused) so you want to load as much useful data in memory at once and
only useful data.

If your program can allocate an amount <M> of memory, compute the number
<n> of element that fit in <M> / matrix_line_count. You probably want to
test that with actual Ruby code as it's difficult to guess correctly.

Be creative on how you store data in memory to gain space, you shoud try
to store it pre-transposed in memory and already compacted, but it might
be difficult/slow to write (although disk seeks are slow enough that you
might be able to benefit from relatively heavy computations).

On each line, read the <n> first elements (effectively reading the <n>
first columns), then write the <n> first lines in the transposed matrix.
Continue <n> by <n> columns.

With this method you'll always read/write sequential data on disk when
it's possible with as large data chunks as possible without hitting the
swap.

You can test the opposite method (reading <n> lines, writing <n>
columns, it may be faster on your hardware but my gutt feeling is that
you should avoid seeking on writes).

Lionel

Xavier Noria

4/28/2008 5:26:00 PM

On Apr 28, 2008, at 19:20 , Lionel Bouton wrote:

> With this method you'll always read/write sequential data on disk
> when it's possible with as large data chunks as possible without
> hitting the swap.

In what sense is it sequential? These approeaches require prepending/
appending *rectangular* chunks of text in the ouput file?

comp.lang.ruby

Transpose a large file(>2GB

Raghu Go

Xavier Noria

Xavier Noria

Raghu Go

Pascal J. Bourguignon

Roger Pack

Xavier Noria

Avdi Grimm

Avdi Grimm

Lionel Bouton

Xavier Noria

x Login to ForumsZone