Clifford Heath
2/19/2008 4:43:00 AM
Trans wrote:
>> A zip archive compresses
>> each file individually and then adds them to the archive, which makes it
>> easy to extract individual files later [1]. A tar.gz archive adds all
>> the files to the tar and then gzips everything at once, which takes
>> advantage of cross-file redundancy for a better overall compression
>> ratio [2].
>
> That's interesting. I created a utility a while back called rtar
> (recursive tar).
Microsoft's CAB file format has a mixed approach, using a proprietary
LZ compressor. Basically they compress small files together in groups,
which gives you most of the advantages of zip and tar in one.
Another approach that could be taken is to flush the compressor after
every 32kB of input (deflate can't reference input from further back
than that anyway), then append a manifest recording the output byte
offset of the flush point that precedes each file. To grab a file in the
middle, seek to the flush point *two* blocks before that file, decompress
the intervening 32kB to use as history, then continue decompressing
until you reach your file.
The nice thing is that flushing the compressor leaves the output as a
valid deflate() stream, even for decompressors that don't know why you've
flushed. If the manifest looks like a normal file, a standard tar utility
could still extract the whole archive.
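A minimal sketch of the idea in Python's zlib, under a couple of
assumptions: Z_SYNC_FLUSH byte-aligns the output after each file while
keeping the compressor's history, and the manifest here is just a list
of (offset, length) pairs, not any standard format. For brevity the
extractor is handed the preceding plaintext directly (via zlib's zdict
dictionary mechanism) rather than rebuilding it by decompressing from
two blocks back as described above; the function names are illustrative.

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance

def compress_with_flush_points(files):
    """Compress a list of byte strings into one raw deflate stream,
    flushing after each file so its blocks start at a byte boundary."""
    co = zlib.compressobj(level=9, wbits=-15)   # raw deflate, no zlib header
    out = bytearray()
    manifest = []                                # (start_offset, length) per file
    for data in files:
        start = len(out)                         # byte-aligned after prior flush
        out += co.compress(data)
        out += co.flush(zlib.Z_SYNC_FLUSH)       # byte-align, keep history
        manifest.append((start, len(data)))
    out += co.flush(zlib.Z_FINISH)
    return bytes(out), manifest

def extract_file(stream, manifest, index, history):
    """Decompress one file, priming the inflater with the plaintext
    that preceded it (only the last 32kB can ever be referenced)."""
    start, length = manifest[index]
    zd = history[-WINDOW:]
    do = (zlib.decompressobj(wbits=-15, zdict=zd) if zd
          else zlib.decompressobj(wbits=-15))
    return do.decompress(stream[start:], length)
```

Because every flush leaves a valid deflate stream, a plain one-shot
`zlib.decompress(stream, wbits=-15)` still yields the whole archive,
which is the property that lets an unsuspecting decompressor extract
everything normally.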
I've played these sorts of games with the zlib compressor - it's easier
than you might think. With a bit of cunning, you can make the resultant
file rsync-able even in the face of localized changes, a fact I discussed
with Andrew Tridgell some years back. I also wanted to add a long-range
predictor to zlib so deflate could repeat blocks from far back... you can
use an rsync-style approach to do the prediction rather than an LZ suffix
tree.
Clifford Heath.