notes-computer-xz gzip compression tests

Some tests on how to compress for archival storage. As a test I used 4 large .mat files from my academic data analysis work. Each file contains a variety of matrices. There is some between-file duplication, possibly some fuzzy duplication of content between matrices within each file (since each file contains a number of matrices that are analyses of the same dataset, a 3-D raster of gene expression data over the cerebral cortex), and possibly some fuzzy duplication of content within each matrix, but at large distances within the file (since the expression patterns of many genes 'look similar' to each other when you display the data as an image).

$ ls -l t
total 2480440
-rw-rw-r-- 1 bshanks bshanks 634949776 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
-rw-r--r-- 1 bshanks bshanks 635031136 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
-rw-rw-r-- 1 bshanks bshanks 634949776 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
-rw-r--r-- 1 bshanks bshanks 635031136 Apr 22 11:48 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat
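The sizes come in two identical pairs, so the between-file duplication might even be exact; a quick check (not part of the original test run) would be:

$ md5sum t/*.mat

If two files hash the same, a long-range compressor (or deduplication before compression) could exploit that, while a small-dictionary compressor cannot.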

$ xz --version
xz (XZ Utils) 5.1.0alpha
liblzma 5.1.0alpha

$ gzip --version
gzip 1.4
Copyright (C) 2007 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jean-loup Gailly.

  1. I have a 2 core (4 virtual core) Intel Core i5 CPU, model U560:

$ cat /proc/cpuinfo | grep name
model name : Intel(R) Core(TM) i5 CPU U 560 @ 1.33GHz
model name : Intel(R) Core(TM) i5 CPU U 560 @ 1.33GHz
model name : Intel(R) Core(TM) i5 CPU U 560 @ 1.33GHz
model name : Intel(R) Core(TM) i5 CPU U 560 @ 1.33GHz

  2. with a cache size (L3, I think) of about 3 MB:

$ cat /proc/cpuinfo | grep 'cache size' | head -n 1
cache size : 3072 KB

according to http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5%20Mobile%20I5-560UM%20CN80617005190AH.html

the L1 cache is 32k icache and 32k dcache, and the L2 cache is 256k.
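If lscpu (from util-linux) is available, it summarizes the whole cache hierarchy in one go; not something I ran here, just a convenience:

$ lscpu | grep -i cache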

$ time tar -cvf foo.tar t/
t/
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat

user 0m0.240s sys 0m9.121s

$ ls -l foo.tar
-rw-rw-r-- 1 bshanks bshanks 2539970560 Apr 21 23:00 foo.tar

$ ls -lh foo.tar
-rw-rw-r-- 1 bshanks bshanks 2.4G Apr 21 23:00 foo.tar

$ mv foo.tar foo_whole.tar
$ time gzip foo_whole.tar

real 4m4.779s user 2m31.601s sys 0m13.409s

$ ls -l foo_whole.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1569160284 Apr 21 23:00 foo_whole.tar.gz

$ ls -lh foo_whole.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 21 23:00 foo_whole.tar.gz
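So whole-archive gzip lands at roughly a 0.62 compression ratio; the arithmetic, for reference:

$ awk 'BEGIN { printf "%.3f\n", 1569160284/2539970560 }'
0.618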

  1. now with tar's built-in gzip (-z) (per-file?):

$ time tar -czvf foo.tar.gz t/
t/
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat

real 3m25.622s user 2m35.318s sys 0m14.645s

$ ls -l foo.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1569160270 Apr 21 23:01 foo.tar.gz

$ ls -lh foo.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 21 23:01 foo.tar.gz
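Note that tar's -z just pipes the whole archive stream through a single gzip process rather than compressing per-file, so it should be essentially equivalent to:

$ tar -cf - t/ | gzip > foo.tar.gz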

  1. so essentially no difference, which makes sense: tar -z compresses the archive as a single gzip stream (as in the pipeline above), so it isn't really per-file compression at all
  2. now with xz tar:

$ time tar -cJvf foo.tar.xz t/

real 34m3.550s user 33m8.496s sys 0m28.282s

$ ls -lh foo.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 21 23:29 foo.tar.xz

$ ls -l foo.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1279025948 Apr 21 23:29 foo.tar.xz

  1. let's double-check whether that was effectively one file at a time, by compressing each file separately:

$ cp -r t t2
$ xz t2/*
$ du -hs t2
1.2G t2

$ du -s t2
1249060 t2
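du reports 1 KiB blocks here; to compare exact byte totals against the .tar.xz sizes, something like this would do it (not part of the original run):

$ ls -l t2/*.xz | awk '{ total += $5 } END { print total }'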

  1. and with xz of the whole .tar at once:

$ tar -cvf foo.tar t/
$ mv foo.tar foo_whole.tar
$ time xz foo_whole.tar

real 39m1.588s user 35m44.770s sys 0m47.055s

$ ls -lh foo_whole.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 21 23:12 foo_whole.tar.xz

$ ls -l foo_whole.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1279025948 Apr 21 23:12 foo_whole.tar.xz

$ ls t2/* -lh
-rw-rw-r-- 1 bshanks bshanks 339M Apr 22 00:02 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat.xz
-rw-r--r-- 1 bshanks bshanks 339M Apr 22 00:03 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat.xz
-rw-rw-r-- 1 bshanks bshanks 339M Apr 22 00:02 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat.xz
-rw-r--r-- 1 bshanks bshanks 205M Apr 22 00:03 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat.xz
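Interesting that start3.mat compresses to 205M while the other three stay at 339M; a sketch for computing per-file compression ratios (the loop itself wasn't part of the original run):

$ for f in t/*.mat; do
>   orig=$(stat -c %s "$f")
>   comp=$(stat -c %s "t2/$(basename "$f").xz")
>   awk -v o="$orig" -v c="$comp" -v n="$(basename "$f")" 'BEGIN { printf "%.3f  %s\n", c/o, n }'
> done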

  1. how much worse is -0?

$ time xz -0 foo-3.tar

real 14m30.803s user 13m19.734s sys 0m15.293s

$ ls -lh foo-3.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.3G Apr 21 23:34 foo-3.tar.xz

$ ls -l foo-3.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1374903676 Apr 21 23:34 foo-3.tar.xz

  1. about 4x slower than gzip, and 15% better compression
  2. ok, how about a bigger dictionary size?

$ time xz -vv --lzma2=preset=0,dict=384MiB foo-0big.tar
xz: Filter chain: --lzma2=dict=384MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 2177 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 385 MiB of memory.
foo-0big.tar (1/1)
  100 %      1098.4 MiB / 2422.3 MiB = 0.453   1.7 MiB/s      23:42

real 23m43.710s user 22m8.259s sys 0m26.786s

$ ls -lh foo-0big.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.1G Apr 22 01:03 foo-0big.tar.xz

  1. now we're at 6x gzip time, and 25% better compression
  2. what if we turn up the compression on gzip?

$ time gzip -9 foo-9.tar

real 3m45.709s user 3m31.901s sys 0m4.184s

$ ls -lh foo-9.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 22 01:37 foo-9.tar.gz

$ ls -l foo-9.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1566444690 Apr 22 01:37 foo-9.tar.gz

  1. not much different from ordinary gzip
  2. what about xz -9?

$ time xz -9 foo-p9.tar

real 45m58.665s user 44m16.414s sys 0m26.974s

$ ls -lh foo-p9.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 22 01:49 foo-p9.tar.xz

$ ls -l foo-p9.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1260502980 Apr 22 01:49 foo-p9.tar.xz

  1. does a delta filter help?

$ tar cvf foo_delta.tar t/
$ time xz --delta=dist=3 foo_delta.tar
$ time xz --delta --lzma2=preset=6 foo_delta.tar

real 32m3.797s user 31m1.332s sys 0m24.658s

$ ls -lh foo_delta.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.4G Apr 22 11:48 foo_delta.tar.xz

$ ls -l foo_delta.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1444302428 Apr 22 11:48 foo_delta.tar.xz

  1. it's worse than plain xz -6 (1444302428 vs 1279025948 bytes)
  2. what about delta with a higher 'dist'?

$ time xz --delta=dist=256 --lzma2=preset=6 foo_delta_256.tar

real 33m16.342s user 32m23.657s sys 0m17.637s

$ ls -l foo_delta_256.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1578127300 Apr 22 13:00 foo_delta_256.tar.xz

$ ls -lh foo_delta_256.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 22 13:00 foo_delta_256.tar.xz

  1. that's even worse (bigger than plain gzip, even)
  2. ok, so plain xz looks fine for compression; let's test settings in between -0 and -6:

notes on what the presets are:

-0: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
-1: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
-2: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
-3: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
-4: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=normal,nice=16,mf=bt4,depth=0
-5: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=32,mf=bt4,depth=0
-6: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-7: --lzma2=dict=16MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-8: --lzma2=dict=32MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-9: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
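These are the 'Filter chain' lines that xz -vv reports when compressing; one way to dump them for each preset without waiting on a big file (small_test is just a placeholder file name):

$ for p in 0 1 2 3 4 5 6 7 8 9; do
>   printf -- "-%s: " "$p"
>   xz -vv -k "-$p" small_test 2>&1 | grep 'Filter chain'
>   rm -f small_test.xz
> done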

after the first 2 mins, -0, -1, -2, -3 show different speeds and compression levels, and curiously, -3 is not showing the best compression, suggesting that an adaptive dictionary-size algorithm could do a lot better than lzma2 (though note these snapshots are at different points in the file; see the fairer comparison below):

$ xz -vv -k -0 foo_1.tar
xz: Filter chain: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 3 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 1 MiB of memory.
foo_1.tar (1/1)
  27.5 %      309.2 MiB / 667.8 MiB = 0.463   5.3 MiB/s      2:06   5 min 40 s

$ xz -vv -k -1 foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 9 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1)
  16.3 %      145.7 MiB / 394.4 MiB = 0.369   3.1 MiB/s      2:05   11 min

$ xz -vv -k -2 foo_1.tar
xz: Filter chain: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
xz: 17 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 3 MiB of memory.
foo_1.tar (1/1)
  12.4 %      109.0 MiB / 300.6 MiB = 0.363   2.3 MiB/s      2:09   16 min

$ xz -vv -k -3 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
xz: 32 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1)
  6.2 %      87.4 MiB / 149.1 MiB = 0.586   1.3 MiB/s      1:59   31 min

To get a fairer picture, we need to get to the same place in the original file, so let's go for the first 400MB:

$ xz -vv -k -0 foo_1.tar
xz: Filter chain: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 3 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 1 MiB of memory.
foo_1.tar (1/1)
  16.7 %      165.7 MiB / 405.5 MiB = 0.409   6.0 MiB/s      1:08   5 min 40 s

$ xz -vv -k -1 foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 9 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1)
  16.6 %      148.0 MiB / 401.4 MiB = 0.369   3.1 MiB/s      2:07   11 min

$ xz -vv -k -2 foo_1.tar
xz: Filter chain: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
xz: 17 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 3 MiB of memory.
foo_1.tar (1/1)
  16.6 %      145.4 MiB / 402.2 MiB = 0.361   2.1 MiB/s      3:12   17 min

$ xz -vv -k -3 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
xz: 32 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1)
  16.5 %      143.5 MiB / 400.4 MiB = 0.358   1.4 MiB/s      4:52   25 min

$ xz -vv -k -4 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=normal,nice=16,mf=bt4,depth=0
xz: 48 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1)
  16.5 %      140.0 MiB / 399.6 MiB = 0.350   1.7 MiB/s      4:01   21 min

$ xz -vv -k -6 foo_1.tar
xz: Filter chain: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
xz: 94 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 9 MiB of memory.
foo_1.tar (1/1)
  16.5 %      138.7 MiB / 400.8 MiB = 0.346   1.4 MiB/s      4:55   25 min

$ xz -vv -k -9 foo_1.tar
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
xz: 674 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 65 MiB of memory.
foo_1.tar (1/1)
  16.5 %      135.5 MiB / 400.4 MiB = 0.338   1.0 MiB/s      6:41   34 min
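A more reproducible way to run this comparison would be to actually cut a 400 MiB prefix and compress it to completion with each preset; a sketch (the foo_400M* names are made up, and GNU head/time are assumed):

$ head -c 400M foo_1.tar > foo_400M.tar
$ for p in 0 1 2 3 4 6 9; do
>   cp foo_400M.tar "foo_400M_$p.tar"
>   /usr/bin/time -f "-$p: %e s" xz "-$p" "foo_400M_$p.tar"
> done
$ ls -l foo_400M_*.tar.xz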

  1. to see how much of this comes from dictionary size and how much comes from other stuff, let's try the 'e' presets, which have the same dictionary size as the non-e versions but crank up the other parameters to get better but slower compression:

$ xz -vv -k -1e foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: 13 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1)
  16.5 %      143.2 MiB / 401.0 MiB = 0.357   1.0 MiB/s      6:50   35 min

  1. and let's also try increasing the dictionary size while using the -1 preset (we saw earlier that this seemed to help when applied to the entire archive); again on the first 400MB:

$ time xz -vv --lzma2=preset=1,dict=64MiB foo_1.tar
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 418 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 65 MiB of memory.
foo_1.tar (1/1)
  16.7 %      146.6 MiB / 404.0 MiB = 0.363   2.0 MiB/s      3:23   17 min

  1. plot the available data points in Octave:

$ octave --eval 'x = [409 369 361 358 350 346 338 357 363]; y = [6 3.1 2.1 1.4 1.7 1.4 1.0 1.0 2.0]; scatter(y,x); pause()'
$ octave --eval 'x = [409 369 361 358 350 346 338 357 363]; y = [6 3.1 2.1 1.4 1.7 1.4 1.0 1.0 2.0]; scatter(1./y,-log(x./1000)); pause()'

  2. so, it looks like the xz presets are well-chosen for this dataset, and that -0, -4, -6, -9 all make a lot of sense
  3. let's include the gzip results from earlier, as well as the xz -6 and -0 results from the whole 2.4GB test:

$ octave --eval 'x = [409 369 361 358 350 346 338 357 363 650 580 500]; y = [6 3.1 2.1 1.4 1.7 1.4 1.0 1.0 2.0 16.3 3.03 1.2]; scatter(1./y,-log(x./1000)); pause()'

  1. we put time on the horizontal axis because that is what you have to give, and compression on the vertical axis because that is what you want to get. The compression ratio is plotted on a log scale because it varies between 1 and 0, and halving it (e.g. going from .5 to .25) makes you twice as happy. The horizontal axis is s/MiB rather than MiB/s because you can push MiB/s to infinity just by not compressing at all (zero time taken), which isn't impressive; also because you have a fixed number of MiB that you need to compress (for the same reason that gallons per mile would be more informative than miles per gallon as a measure of vehicle fuel efficiency).
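For example, the xz -9 data point (ratio 0.338 at 1.0 MiB/s) lands at these coordinates; just the arithmetic behind the plot:

$ awk 'BEGIN { ratio = 0.338; speed = 1.0; printf "x = %.2f s/MiB, y = %.3f\n", 1/speed, -log(ratio) }'
x = 1.00 s/MiB, y = 1.085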


  1. summary: the best compression for this use-case is probably -9 with about a 384MiB dictionary size
  2. however, it's not fault-tolerant to put everything in one compressed .tar, and it's also a pain to work with
  3. xz -6, the default, is less than 2% worse (1260502980 vs 1279025948) at 36 mins vs 44 mins (18% faster)
  4. xz -0 is 9% worse (1260502980 vs 1374903676) at 13 mins vs 44 mins (70% faster)
  5. gzip -9 is 24% worse (1260502980 vs 1566444690) at 3.5 mins vs 44 mins (92% faster)
  6. gzip is also 24% worse (1260502980 vs 1569160284) at 2.5 mins vs 44 mins (94% faster)

for a 200GB dataset, we can expect a maximal compression ratio of about .34, or 68 GB, with a time of about 56 hours (at 1 MiB/s, that's (200*1024)/(60*60) hours). If we used xz -6 we'd expect 70 GB, with a time of about 40 hours; if we used xz -0, 81 GB, with a time of about 10.5 hours.

however, based on actually compressing the whole 2.4GB test instead of just the first 400MB: using xz -6 we'd expect a .5 compression ratio at a speed of 1.2 MiB/s; using xz -0, a .58 compression ratio at 3.03 MiB/s; using gzip, a .65 compression ratio at 16.3 MiB/s.

meaning:

xz -6: 100 GB in 47 hrs
xz -0: 116 GB in 18.5 hrs
gzip: 130 GB in 3.5 hrs
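Those figures follow directly from the ratios and speeds above; the arithmetic, for reference (small differences from the numbers above are just rounding):

$ awk 'BEGIN {
>   gb = 200; mib = gb * 1024;
>   split("xz-6 xz-0 gzip", name); split("0.5 0.58 0.65", ratio); split("1.2 3.03 16.3", speed);
>   for (i = 1; i <= 3; i++)
>     printf "%s: %.0f GB in %.1f hrs\n", name[i], gb * ratio[i], mib / speed[i] / 3600;
> }'
xz-6: 100 GB in 47.4 hrs
xz-0: 116 GB in 18.8 hrs
gzip: 130 GB in 3.5 hrs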