Clifford Heath
2/19/2008 4:43:00 AM
Trans wrote:
>> A zip archive compresses
>> each file individually and then adds them to the archive, which makes it
>> easy to extract individual files later [1]. A tar.gz archive adds all
>> the files to the tar and then gzips everything at once, which takes
>> advantage of cross-file redundancy for a better overall compression
>> ratio [2].
>
> That's interesting. I created a utility a while back called rtar
> (recursive tar).
Microsoft's CAB file format has a mixed approach, using a proprietary
LZ compressor. Basically they compress small files together in groups,
which gives you most of the advantages of zip and tar in one.
Another approach that could be taken is to flush the compressor after
every 32kB of input (deflate can't reference input from further back
than that anyway), then append a manifest recording the output byte
offset of the flush point that precedes each file. To grab a file in the
middle, seek to the flush point *two* blocks before that file, decompress
the intervening 32kB to use as history, then continue decompressing
until you reach your file.
The nice thing is that flushing the compressor leaves the output as a
valid deflate() stream, even for decompressors that don't know why you've
flushed. If the manifest looks like a normal file, a standard tar utility
could still extract the whole archive.
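A minimal sketch of the idea in Python's zlib, under a couple of
assumptions: Z_SYNC_FLUSH byte-aligns the output after each file while
keeping the compressor's history, and the manifest here is just a list
of (offset, length) pairs, not any standard format. For brevity the
extractor is handed the preceding plaintext directly (via zlib's zdict
dictionary mechanism) rather than rebuilding it by decompressing from
two blocks back as described above; the function names are illustrative.

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance

def compress_with_flush_points(files):
    """Compress a list of byte strings into one raw deflate stream,
    flushing after each file so its blocks start at a byte boundary."""
    co = zlib.compressobj(level=9, wbits=-15)   # raw deflate, no zlib header
    out = bytearray()
    manifest = []                                # (start_offset, length) per file
    for data in files:
        start = len(out)                         # byte-aligned after prior flush
        out += co.compress(data)
        out += co.flush(zlib.Z_SYNC_FLUSH)       # byte-align, keep history
        manifest.append((start, len(data)))
    out += co.flush(zlib.Z_FINISH)
    return bytes(out), manifest

def extract_file(stream, manifest, index, history):
    """Decompress one file, priming the inflater with the plaintext
    that preceded it (only the last 32kB can ever be referenced)."""
    start, length = manifest[index]
    zd = history[-WINDOW:]
    do = (zlib.decompressobj(wbits=-15, zdict=zd) if zd
          else zlib.decompressobj(wbits=-15))
    return do.decompress(stream[start:], length)
```

Because every flush leaves a valid deflate stream, a plain one-shot
`zlib.decompress(stream, wbits=-15)` still yields the whole archive,
which is the property that lets an unsuspecting decompressor extract
everything normally.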
I've played these sorts of games with the zlib compressor - it's easier
than you might think. With a bit of cunning, you can make the resultant
file rsync-able even in the face of localized changes, a fact I discussed
with Andrew Tridgell some years back. I also wanted to add a long-range
predictor to zlib so deflate could repeat blocks from far back... you can
use an rsync-style approach to do the prediction rather than an LZ suffix
tree.
Clifford Heath.