Some tests on how to compress for archival storage. As a test I used 4 large .mat files from my academic data analysis work. Each file contains a variety of matrices. There is some between-file duplication, possibly some fuzzy duplication of content between matrices within each file (since each file contains a number of matrices that are analyses of the same dataset, a 3-D raster of gene expression data over the cerebral cortex), and possibly some fuzzy duplication of content within each matrix, but at large distances within the file (since the expression of many genes 'looks similar' when you display the data as an image).
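One rough way to check whether a compressor can actually exploit that between-file duplication is to compare compressing a single file against compressing two files concatenated, using a dictionary large enough to span a whole file (a sketch, reusing the big-dictionary settings tried further below; FILE_A.mat and FILE_B.mat are hypothetical stand-ins for two of the .mat files):

# one file alone
$ xz --lzma2=preset=0,dict=384MiB -c FILE_A.mat | wc -c
# two files concatenated; if this is much smaller than twice the single-file
# result, xz is finding duplication across the files
$ cat FILE_A.mat FILE_B.mat | xz --lzma2=preset=0,dict=384MiB -c | wc -c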
$ ls -l t
total 2480440
-rw-rw-r-- 1 bshanks bshanks 634949776 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
-rw-r--r-- 1 bshanks bshanks 635031136 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
-rw-rw-r-- 1 bshanks bshanks 634949776 Apr 22 11:47 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
-rw-r--r-- 1 bshanks bshanks 635031136 Apr 22 11:48 data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat
$ xz --version
xz (XZ Utils) 5.1.0alpha
liblzma 5.1.0alpha

$ gzip --version
gzip 1.4
Copyright (C) 2007 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jean-loup Gailly.
According to http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5%20Mobile%20I5-560UM%20CN80617005190AH.html (the Core i5-560UM), the L1 cache is 32k icache plus 32k dcache, and the L2 cache is 256k.
$ time tar -cvf foo.tar t/
t/
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat
user 0m0.240s sys 0m9.121s
$ ls -l foo.tar
-rw-rw-r-- 1 bshanks bshanks 2539970560 Apr 21 23:00 foo.tar
$ ls -lh foo.tar
-rw-rw-r-- 1 bshanks bshanks 2.4G Apr 21 23:00 foo.tar
$ mv foo.tar foo_whole.tar
$ time gzip foo_whole.tar
real 4m4.779s user 2m31.601s sys 0m13.409s
$ ls -l foo_whole.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1569160284 Apr 21 23:00 foo_whole.tar.gz
$ ls -lh foo_whole.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 21 23:00 foo_whole.tar.gz
$ time tar -czvf foo.tar.gz t/
t/
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat
t/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat
real 3m25.622s user 2m35.318s sys 0m14.645s
$ ls -l foo.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1569160270 Apr 21 23:01 foo.tar.gz
$ ls -lh foo.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 21 23:01 foo.tar.gz
$ time tar -cJvf foo.tar.xz t/
real 34m3.550s user 33m8.496s sys 0m28.282s
$ ls -lh foo.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 21 23:29 foo.tar.xz
$ ls -l foo.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1279025948 Apr 21 23:29 foo.tar.xz
$ cp -r t t2
$ xz t2/*
$ du -hs t2
1.2G    t2
$ du -s t2
1249060 t2
$ tar -cvf foo.tar t/
$ mv foo.tar foo_whole.tar
$ time xz foo_whole.tar
real 39m1.588s user 35m44.770s sys 0m47.055s
$ ls -lh foo_whole.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 21 23:12 foo_whole.tar.xz
$ ls -l foo_whole.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1279025948 Apr 21 23:12 foo_whole.tar.xz
$ ls t2/* -lh
-rw-rw-r-- 1 bshanks bshanks 339M Apr 22 00:02 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_130829_start3_divmean_particip0.90.mat.xz
-rw-r--r-- 1 bshanks bshanks 339M Apr 22 00:03 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean.mat.xz
-rw-rw-r-- 1 bshanks bshanks 339M Apr 22 00:02 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3_divmean_particip0.90.mat.xz
-rw-r--r-- 1 bshanks bshanks 205M Apr 22 00:03 t2/data_try8.7_scaled_left_aligned_gs_2.2_10_.1_1.1_.5_.5_start3.mat.xz
$ time xz -0 foo-3.tar
real 14m30.803s user 13m19.734s sys 0m15.293s
$ ls -lh foo-3.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.3G Apr 21 23:34 foo-3.tar.xz
$ ls -l foo-3.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1374903676 Apr 21 23:34 foo-3.tar.xz
$ time xz -vv --lzma2=preset=0,dict=384MiB foo-0big.tar
xz: Filter chain: --lzma2=dict=384MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 2177 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 385 MiB of memory.
foo-0big.tar (1/1) 100 % 1098.4 MiB / 2422.3 MiB = 0.453 1.7 MiB/s 23:42
real 23m43.710s user 22m8.259s sys 0m26.786s
$ ls -lh foo-0big.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.1G Apr 22 01:03 foo-0big.tar.xz
$ time gzip -9 foo-9.tar
real 3m45.709s user 3m31.901s sys 0m4.184s
$ ls -lh foo-9.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 22 01:37 foo-9.tar.gz
$ ls -l foo-9.tar.gz
-rw-rw-r-- 1 bshanks bshanks 1566444690 Apr 22 01:37 foo-9.tar.gz
$ time xz -9 foo-p9.tar
real 45m58.665s user 44m16.414s sys 0m26.974s
$ ls -lh foo-p9.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.2G Apr 22 01:49 foo-p9.tar.xz
$ ls -l foo-p9.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1260502980 Apr 22 01:49 foo-p9.tar.xz
$ tar cvf foo_delta.tar t/
$ time xz --delta=dist=3 foo_delta.tar
$ time xz --delta --lzma2=preset=6 foo_delta.tar
real 32m3.797s user 31m1.332s sys 0m24.658s
$ ls -lh foo_delta.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.4G Apr 22 11:48 foo_delta.tar.xz
$ ls -l foo_delta.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1444302428 Apr 22 11:48 foo_delta.tar.xz
$ time xz --delta=dist=256 --lzma2=preset=6 foo_delta_256.tar
real 33m16.342s user 32m23.657s sys 0m17.637s
$ ls -l foo_delta_256.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1578127300 Apr 22 13:00 foo_delta_256.tar.xz
$ ls -lh foo_delta_256.tar.xz
-rw-rw-r-- 1 bshanks bshanks 1.5G Apr 22 13:00 foo_delta_256.tar.xz
notes on what the presets are:
-0: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
-1: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
-2: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
-3: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
-4: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=normal,nice=16,mf=bt4,depth=0
-5: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=32,mf=bt4,depth=0
-6: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-7: --lzma2=dict=16MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-8: --lzma2=dict=32MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
-9: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
After the first 2 minutes, -0, -1, -2, and -3 show different speeds and compression ratios, and curiously -3 does not give the best compression at that point, suggesting that an adaptive dictionary-size algorithm could do a lot better than lzma2:
$ xz -vv -k -0 foo_1.tar
xz: Filter chain: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 3 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 1 MiB of memory.
foo_1.tar (1/1) 27.5 % 309.2 MiB / 667.8 MiB = 0.463 5.3 MiB/s 2:06 5 min 40 s

$ xz -vv -k -1 foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 9 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1) 16.3 % 145.7 MiB / 394.4 MiB = 0.369 3.1 MiB/s 2:05 11 min

$ xz -vv -k -2 foo_1.tar
xz: Filter chain: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
xz: 17 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 3 MiB of memory.
foo_1.tar (1/1) 12.4 % 109.0 MiB / 300.6 MiB = 0.363 2.3 MiB/s 2:09 16 min

$ xz -vv -k -3 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
xz: 32 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1) 6.2 % 87.4 MiB / 149.1 MiB = 0.586 1.3 MiB/s 1:59 31 min
To get a fairer picture, we need to get to the same place in the original file, so let's go for the first 400MB:
$ xz -vv -k -0 foo_1.tar
xz: Filter chain: --lzma2=dict=256KiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc3,depth=4
xz: 3 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 1 MiB of memory.
foo_1.tar (1/1) 16.7 % 165.7 MiB / 405.5 MiB = 0.409 6.0 MiB/s 1:08 5 min 40 s

$ xz -vv -k -1 foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 9 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1) 16.6 % 148.0 MiB / 401.4 MiB = 0.369 3.1 MiB/s 2:07 11 min

$ xz -vv -k -2 foo_1.tar
xz: Filter chain: --lzma2=dict=2MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=24
xz: 17 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 3 MiB of memory.
foo_1.tar (1/1) 16.6 % 145.4 MiB / 402.2 MiB = 0.361 2.1 MiB/s 3:12 17 min

$ xz -vv -k -3 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=fast,nice=273,mf=hc4,depth=48
xz: 32 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1) 16.5 % 143.5 MiB / 400.4 MiB = 0.358 1.4 MiB/s 4:52 25 min

$ xz -vv -k -4 foo_1.tar
xz: Filter chain: --lzma2=dict=4MiB,lc=3,lp=0,pb=2,mode=normal,nice=16,mf=bt4,depth=0
xz: 48 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 5 MiB of memory.
foo_1.tar (1/1) 16.5 % 140.0 MiB / 399.6 MiB = 0.350 1.7 MiB/s 4:01 21 min

$ xz -vv -k -6 foo_1.tar
xz: Filter chain: --lzma2=dict=8MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
xz: 94 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 9 MiB of memory.
foo_1.tar (1/1) 16.5 % 138.7 MiB / 400.8 MiB = 0.346 1.4 MiB/s 4:55 25 min

$ xz -vv -k -9 foo_1.tar
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=normal,nice=64,mf=bt4,depth=0
xz: 674 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 65 MiB of memory.
foo_1.tar (1/1) 16.5 % 135.5 MiB / 400.4 MiB = 0.338 1.0 MiB/s 6:41 34 min

$ xz -vv -k -1e foo_1.tar
xz: Filter chain: --lzma2=dict=1MiB,lc=3,lp=0,pb=2,mode=normal,nice=273,mf=bt4,depth=512
xz: 13 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 2 MiB of memory.
foo_1.tar (1/1) 16.5 % 143.2 MiB / 401.0 MiB = 0.357 1.0 MiB/s 6:50 35 min

$ time xz -vv --lzma2=preset=1,dict=64MiB foo_1.tar
xz: Filter chain: --lzma2=dict=64MiB,lc=3,lp=0,pb=2,mode=fast,nice=128,mf=hc4,depth=8
xz: 418 MiB of memory is required. The limit is 17592186044416 MiB.
xz: Decompression will need 65 MiB of memory.
foo_1.tar (1/1) 16.7 % 146.6 MiB / 404.0 MiB = 0.363 2.0 MiB/s 3:23 17 min
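These figures are verbose-progress snapshots taken once each run had read roughly 400 MiB of input, rather than complete runs. An alternative would be to truncate the input up front so that every preset runs to completion on exactly the same data; a sketch (foo_1_400M.tar is a hypothetical truncated copy):

$ head -c 400M foo_1.tar > foo_1_400M.tar
$ time xz -k -0 foo_1_400M.tar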
$ octave --eval 'x = [409 369 361 358 350 346 338 357 363 650 580 500]; y = [6 3.1 2.1 1.4 1.7 1.4 1.0 1.0 2.0 16.3 3.03 1.2]; scatter(1./y,-log(x./1000)); pause()'
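The x vector here appears to be the compression ratios (x1000) from the runs above, with the last three entries being gzip, xz -0, and xz -6 on the whole 2.4GB archive, and y the corresponding speeds in MiB/s. The same plot with labelled axes (a sketch):

$ octave --eval 'ratio = [409 369 361 358 350 346 338 357 363 650 580 500] ./ 1000;  % compression ratios
  speed = [6 3.1 2.1 1.4 1.7 1.4 1.0 1.0 2.0 16.3 3.03 1.2];  % MiB/s
  scatter(1 ./ speed, -log(ratio));
  xlabel("seconds per MiB"); ylabel("-log(compression ratio)"); pause()'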
For a 200GB dataset, extrapolating from these 400MB runs:
  at maximal compression we can expect a ratio of about .34, i.e. 68 GB, taking about 56 hours (at 1 MiB/s, (200*1024)/(60*60) hours)
  with xz -6 we'd expect about 70 GB, taking about 40 hours
  with xz -0 we'd expect about 81 GB, taking about 10.5 hours
However, based on actually compressing the whole 2.4GB test rather than just the first 400MB:
  with xz -6 we'd expect a .5 compression ratio, at a speed of 1.2 MiB/s
  with xz -0 we'd expect a .58 compression ratio, at a speed of 3.03 MiB/s
  with gzip we'd expect a .65 compression ratio, at a speed of 16.3 MiB/s
meaning:
  xz -6: 100 GB in 47 hrs
  xz -0: 116 GB in 18.5 hrs
  gzip: 130 GB in 3.5 hrs
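These follow directly from the ratios and speeds above (size = 200 GB x ratio; time = 200*1024 MiB / speed). A quick octave check of the arithmetic, in the same style as the scatter-plot command above:

$ octave --eval 'ratios = [0.5 0.58 0.65]; speeds = [1.2 3.03 16.3];  % xz -6, xz -0, gzip
  sizes_GB = 200 .* ratios                % -> 100  116  130
  hours = (200*1024) ./ speeds ./ 3600    % -> ~47  ~19  ~3.5'

(the 18.5 hr figure above comes out as ~18.8 here; the others match.)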