comp.lang.python

Transforming ascii file (pseudo database) into proper database

p.

1/21/2008 8:52:00 PM

I need to take a series of ascii files and transform the data
contained therein so that it can be inserted into an existing
database. The ascii files are just a series of lines, each line
containing fields separated by '|' character. Relations amongst the
data in the various files are denoted through an integer identifier, a
pseudo key if you will. Unfortunately, the relations in the ascii file
do not match up with those in the database in which I need to insert
the data, i.e., I need to transform the data from the files before
inserting into the database. Now, this would all be relatively simple
if not for the following fact: The ascii files are each around 800MB,
so pulling everything into memory and matching up the relations before
inserting the data into the database is impossible.

My questions are:
1. Has anyone done anything like this before, and if so, do you have
any advice?
2. In the abstract, can anyone think of a way of amassing all the
related data for a specific identifier from all the individual files
without pulling all of the files into memory and without having to
repeatedly open, search, and close the files over and over again?
10 Answers

Tim Chase

1/21/2008 9:11:00 PM


> I need to take a series of ascii files and transform the data
> contained therein so that it can be inserted into an existing
> database.
[snip]
> I need to transform the data from the files before inserting
> into the database. Now, this would all be relatively simple if
> not for the following fact: The ascii files are each around
> 800MB,
[snip]
> My questions are:
> 1. Has anyone done anything like this before, and if so, do
> you have any advice?

Yes, I regularly do ETL on files from cellular providers to
transform hundreds of megs worth (some approach a gig) of data
into our internal system.

> 2. In the abstract, can anyone think of a way of amassing all
> the related data for a specific identifier from all the
> individual files without pulling all of the files into memory
> and without having to repeatedly open, search, and close the
> files over and over again?

If the file is sorted by something you can use, you can iterate
over it and just deal with one grouping at a time. In my case,
iterating over gobs of call-detail, the file happens to be sorted
by the phone number on the account. So I iterate over the file,
maintaining a list of calls for the given phone number, and when
the phone number changes, I deal with the previous cache of data,
then re-initialize with the new phone's data.
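
A stripped-down sketch of that pattern, assuming a '|'-delimited
file already sorted on an ID in the first field (the file name and
field position are just placeholders):

def process(rows):
    # placeholder for the real transform-and-insert step
    print(len(rows), "records for", rows[0][0])

current_id = None
batch = []
with open('calls.txt') as f:          # assumed input, sorted by field 0
    for line in f:
        fields = line.rstrip('\n').split('|')
        if fields[0] != current_id:
            if batch:
                process(batch)        # flush the previous ID's cache
            current_id = fields[0]
            batch = []
        batch.append(fields)
if batch:
    process(batch)                    # don't forget the last group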

Other ideas:

1) create a temp DB (such as sqlite), skim through the file
inserting all your data into a table in this DB, then use DB
functionality on it

2) in a light-weight way, assuming there's lots of data per row,
and that you have multiple rows associated with a given ID (in my
case, a phone number), you can create a dictionary mapping an ID
to a list of file offsets at which that ID is used. You can then
skim through the file once, gathering all the offsets with calls
to tell(), and when you want to process an item, you can seek to
that particular offset and read in the line. Not greatly
efficient, but hackable.
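
Idea 2 might look roughly like this (assuming, purely for
illustration, that the ID is the first '|'-separated field):

offsets = {}                          # ID -> list of byte offsets

with open('file1.txt') as f:          # assumed input file
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        rec_id = line.split('|', 1)[0]
        offsets.setdefault(rec_id, []).append(pos)

# later: pull back just the lines for one ID without rescanning the file
with open('file1.txt') as f:
    for pos in offsets.get('12345', []):
        f.seek(pos)
        print(f.readline().rstrip('\n'))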

But mostly, it helps if you have a sorted field that's useful to
you :)

-tkc


John Machin

1/21/2008 9:40:00 PM


On Jan 22, 7:51 am, "p." <ppetr...@gmail.com> wrote:
> I need to take a series of ascii files and transform the data
> contained therein so that it can be inserted into an existing
> database. The ascii files are just a series of lines, each line
> containing fields separated by '|' character. Relations amongst the
> data in the various files are denoted through an integer identifier, a
> pseudo key if you will. Unfortunately, the relations in the ascii file
> do not match up with those in the database in which i need to insert
> the data, i.e., I need to transform the data from the files before
> inserting into the database. Now, this would all be relatively simple

I'm glad you think so ... we don't have a clear understanding of the
problem.

> if not for the following fact: The ascii files are each around 800MB,
> so pulling everything into memory and matching up the relations before
> inserting the data into the database is impossible.
>
> My questions are:
> 1. Has anyone done anything like this before, and if so, do you have
> any advice?
> 2. In the abstract, can anyone think of a way of amassing all the
> related data for a specific identifier from all the individual files
> without pulling all of the files into memory and without having to
> repeatedly open, search, and close the files over and over again?

It would help enormously if you clarified whether the "identifier"
identifies entities or relationships between entities. Is there a
different series of "identifiers" for each (say) type of relationship?

1. Sort each of the N ascii files into "identifier" order. Do an N-way
merge to get all the data for each "identifier" in the same place at
the same time, rearrange it, and insert it into the final database.

OR

2. Load each of the N ascii files into a table in a project-temporary
database. Create a suitably-structured index on the "identifier"
column in each table. Do an N-way merge using SQL inner/outer
joins ...
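
For option 2, once each file is loaded into its own staging table
and the "identifier" column is indexed, the reassembly is a single
query. A sketch with sqlite3 (the table and column names are
invented):

import sqlite3

conn = sqlite3.connect('staging.db')   # the project-temporary database
query = """
    SELECT f1.id, f1.val1, f1.val2, f2.val1
    FROM file1 AS f1
    LEFT OUTER JOIN file2 AS f2 ON f2.id = f1.id
"""
for row in conn.execute(query):
    # transform the combined record and insert it into the real database
    print(row)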


How big is N anyway? How many unique "identifiers"?

Paul Rubin

1/21/2008 9:46:00 PM


"p." <ppetrick@gmail.com> writes:
> 1. Has anyone done anything like this before, and if so, do you have
> any advice?

Sort all the files with an external sort utility (e.g. unix sort), so
that records with the same key are all brought together. Then you can
process the files sequentially.
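
Something along these lines, for instance (assuming GNU sort is
available; the key field number and file names are only
illustrative):

import subprocess

# -t sets the field separator, -k 2,2 sorts on the second field,
# -o writes the result; sort spills to temp files as needed.
subprocess.check_call(
    ['sort', '-t', '|', '-k', '2,2', 'file1.txt', '-o', 'file1.sorted'])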

Bruno Desthuilliers

1/21/2008 10:16:00 PM


p. wrote:
> I need to take a series of ascii files and transform the data
> contained therein so that it can be inserted into an existing
> database. The ascii files are just a series of lines, each line
> containing fields separated by '|' character. Relations amongst the
> data in the various files are denoted through an integer identifier, a
> pseudo key if you will. Unfortunately, the relations in the ascii file
> do not match up with those in the database in which i need to insert
> the data, i.e., I need to transform the data from the files before
> inserting into the database. Now, this would all be relatively simple
> if not for the following fact: The ascii files are each around 800MB,
> so pulling everything into memory and matching up the relations before
> inserting the data into the database is impossible.
>
> My questions are:
> 1. Has anyone done anything like this before,

More than once, yes.

> and if so, do you have
> any advice?

1/ use the csv module to parse your text files

2/ use a temporary database (whose schema will mimic the one in the flat
files), so you can work with the appropriate tools - i.e., the RDBMS will
take care of disk/memory management, and you'll have a specialized,
high-level language (namely, SQL) to reassemble your data the right way.
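
Combining the two points, the load step could look roughly like
this (file, table and column names are invented; row[:3] just keeps
the sketch to three columns):

import csv
import sqlite3

conn = sqlite3.connect('staging.db')
conn.execute('CREATE TABLE file1 (val1 TEXT, id INTEGER, val2 TEXT)')

with open('file1.txt', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    conn.executemany('INSERT INTO file1 VALUES (?, ?, ?)',
                     (row[:3] for row in reader))
conn.commit()

# index the pseudo key so the later joins don't crawl
conn.execute('CREATE INDEX idx_file1_id ON file1 (id)')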


> 2. In the abstract, can anyone think of a way of amassing all the
> related data for a specific identifier from all the individual files
> without pulling all of the files into memory and without having to
> repeatedly open, search, and close the files over and over again?

Answer above.

George Sakkis

1/21/2008 10:34:00 PM


On Jan 21, 4:45 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> "p." <ppetr...@gmail.com> writes:
> > 1. Has anyone done anything like this before, and if so, do you have
> > any advice?
>
> Sort all the files with an external sort utility (e.g. unix sort), so
> that records with the same key are all brought together. Then you can
> process the files sequentially.

Seconded. Unix sort can do external sorting [1] so your program will
work even if the files don't fit in memory. Once they are sorted,
itertools (especially groupby) is your friend.
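
Roughly (assuming the file has already been sorted on the ID, taken
here to be the first field):

from itertools import groupby

def records(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n').split('|')

for rec_id, rows in groupby(records('file1.sorted'), key=lambda r: r[0]):
    rows = list(rows)       # every record for this ID, and only this ID
    # transform and insert...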

George


[1] http://en.wikipedia.org/wiki/Ext...

p.

1/21/2008 10:41:00 PM


So in answer to some of the questions:
- There are about 15 files, each roughly representing a table.
- Within the files, each line represents a record.
- The formatting for the lines is like so:

File1:
somval1|ID|someval2|someval3|etc.

File2:
ID|someval1|someval2|someval3|etc.

Where ID is the one and only value linking "records" from one file to
"records" in another file - moreover, as far as I can tell, the
relationships are all 1:1 (or 1:0) (I don't have the full dataset yet,
just a sampling, so I'm flying a bit in the dark).
- I believe that individual "records" within each of the files are
unique with respect to the identifier (again, not certain because I'm
only working with sample data).
- As the example shows, the position of the ID is not the same for all
files.
- I don't know how big N is since I only have a sample to work with,
and probably won't get the full dataset anytime soon. (Let's just take
it as a given that I won't get that information until AFTER a first
implementation...politics.)
- I don't know how many identifiers there are either, although it has
to be at least as many as the number of lines in the largest file
(again, I don't have the actual data yet).

So as an exercise, let's assume an 800MB file, with each line of data
taking up roughly 150B (a guesstimate, based on examination of the
sample data)...so roughly 5.3 million unique IDs.

With that size, I'll have to load them into a temp DB. I just can't see
holding that much data in memory...

Paul Rubin

1/21/2008 11:09:00 PM


"p." <ppetrick@gmail.com> writes:
> So as an exercise, lets assume 800MB file, each line of data taking up
> roughly 150B (guesstimate - based on examination of sample data)...so
> roughly 5.3 million unique IDs.

I still don't understand what the problem is. Are you familiar with
the concept of external sorting? What OS are you using? If you're
using a Un*x-like system, the built-in sort command should do what you
need. "Internal" sorting means reading a file into memory and sorting
it in memory with something like the .sort() function. External
sorting is what you do when the file won't fit in memory. Basically
you read sequential chunks of the file where each chunk fits in
memory, sort each chunk internally and write it to a temporary disk
file, then merge all the disk files. You can sort inputs of basically
unlimited size this way. The unix sort command knows how to do this.
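
If you ever have to do it inside Python instead of shelling out to
sort, the same chunk-sort-merge idea is only a few lines (a rough
sketch, not tuned at all):

import heapq
import tempfile
from itertools import islice

def external_sort(path, keyfunc, chunk_lines=500000):
    """Sort a huge text file: sort chunks in RAM, spill each sorted
    chunk to a temp file, then lazily merge the sorted runs."""
    runs = []
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=keyfunc)
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    return heapq.merge(*runs, key=keyfunc)

# e.g. sort on the second '|'-separated field:
for line in external_sort('file1.txt', lambda s: s.split('|')[1]):
    pass        # write out, or feed straight into the transform step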

It's often a good exercise with this type of problem, to ask yourself
how an old-time mainframe programmer would have done it. A "big"
computer of the 1960's might have had 128 kbytes of memory and a few
MB of disk, but a bunch of magtape drives that held a few dozen MB
each. With computers like that, they managed to process the phone
bills for millions of people. The methods that they used are still
relevant with today's much bigger and faster computers.

If you watch old movies that tried to get a high tech look by showing
computer machine rooms with pulsating tape drives, external sorting is
what those computers spent most of their time doing.

Finally, 800MB isn't all that big a file by today's standards. Memory
for desktop computers costs around 25 dollars per gigabyte, so having
8GB of RAM on your desk to crunch those 800MB files with is not at all
unreasonable.

p.

1/22/2008 1:44:00 AM


Thanks to all for the ideas. I am familiar with external sorting.
Hadn't considered it though. Will definitely be giving that a go, and
then merging. Again, thanks all.

Dennis Lee Bieber

1/22/2008 7:11:00 AM


On Mon, 21 Jan 2008 12:51:59 -0800 (PST), "p." <ppetrick@gmail.com>
declaimed the following in comp.lang.python:

> I need to take a series of ascii files and transform the data
> contained therein so that it can be inserted into an existing
> database. The ascii files are just a series of lines, each line
> containing fields separated by '|' character. Relations amongst the
> data in the various files are denoted through an integer identifier, a
> pseudo key if you will. Unfortunately, the relations in the ascii file
> do not match up with those in the database in which i need to insert
> the data, i.e., I need to transform the data from the files before
> inserting into the database. Now, this would all be relatively simple
> if not for the following fact: The ascii files are each around 800MB,
> so pulling everything into memory and matching up the relations before
> inserting the data into the database is impossible.
>
Unclear: Do you mean you currently have multiple files and need to
combine records from each to create a single record for insertion into a
single table? Or do you perhaps mean the ID field cannot be used as-is,
due to conflicts with data already existing in the database?

For the former, a classical sort-merge algorithm would apply {let me
guess, in this world of Java and Web applications, colleges don't teach
how to handle sequential processing of a transaction file against a
master file}

For the latter, I'd probably do a prepass building a structure that
just contained record IDs to be updated during regular processing.
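
For that latter case, the prepass could be as small as this sketch
(the key field position and the renumbering rule are invented; the
real mapping would be whatever resolves your conflicts):

id_map = {}                 # old pseudo key -> new database key
next_id = 1000000           # e.g. start above the keys already in use

with open('file1.txt') as f:
    for line in f:
        old_id = line.split('|', 1)[0]
        if old_id not in id_map:
            id_map[old_id] = next_id
            next_id += 1
# second pass: rewrite each record's ID via id_map before inserting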
--
Wulfraed Dennis Lee Bieber KD6MOG
wlfraed@ix.netcom.com wulfraed@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: web-asst@bestiaria.com)
HTTP://www.bestiaria.com/

Albert van der Horst

1/28/2008 9:51:00 AM


In article <9b6a9a56-2ea6-4dd6-9420-afe9a2fdc8d8@e32g2000prn.googlegroups.com>,
p. <ppetrick@gmail.com> wrote:
>I need to take a series of ascii files and transform the data
>contained therein so that it can be inserted into an existing
>database. The ascii files are just a series of lines, each line
>containing fields separated by '|' character. Relations amongst the
>data in the various files are denoted through an integer identifier, a
>pseudo key if you will. Unfortunately, the relations in the ascii file
>do not match up with those in the database in which i need to insert
>the data, i.e., I need to transform the data from the files before
>inserting into the database. Now, this would all be relatively simple
>if not for the following fact: The ascii files are each around 800MB,
>so pulling everything into memory and matching up the relations before
>inserting the data into the database is impossible.

In this case, good old-fashioned batch processing (line by line)
may be appropriate.
Read up on tools like sort and join.

These tools are present on all Unix-like systems, and on windows
in open-source toolkits.
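
For instance, once two files are sorted on their key fields, the
Un*x join tool lines them up in one pass (invented file names;
-1/-2 give the key field number in each file):

import subprocess

with open('joined.txt', 'w') as out:
    subprocess.check_call(
        ['join', '-t', '|', '-1', '2', '-2', '1',
         'file1.sorted', 'file2.sorted'],
        stdout=out)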

>
>My questions are:
>1. Has anyone done anything like this before, and if so, do you have
>any advice?

Puzzling question. Computers weren't invented for GUIs. They were
invented for precisely this kind of thing. So, yes, it is a sure bet.

>2. In the abstract, can anyone think of a way of amassing all the
>related data for a specific identifier from all the individual files
>without pulling all of the files into memory and without having to
>repeatedly open, search, and close the files over and over again?

As long as you don't use Excel (it is not up to it) ;-)

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van...