|
I must be doing something wrong. I am writing out Avro files with three options:
a. no codec
b. deflate codec
c. snappy codec
I am measuring the size of the final Avro file. In my observation, the snappy file is larger than the original (no codec) Avro file. Huh?
Code snippet:
File fs = new File("$DATA/log_snappy.avro");
DatumWriter
|
|
How big is the original data you are trying to compress?
|
|
The original data file (a text file) is 40GB, the Avro file is about 12GB, and the Avro snappy file is 13GB!
Thanks, Nikhil |
|
Hello All,
I think I figured out where I goofed up. I was flushing on every record, so this was effectively compression per record, and every record carried its own metadata. That added more data to the output than plain uncompressed Avro.
So now I have better figures. They at least look realistic; I still need to find out if the output is map-reduceable.
Avro = 12G
Avro+Deflate = 4.5G
Avro+Snappy = 5.5G
Have others tried Avro + LZO?
Thanks, Nikhil |
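The effect described above can be reproduced without Avro. This is a minimal sketch, assuming `java.util.zip.Deflater` as a stand-in for the Avro codec and a made-up log line as the record (the class name, method names, and record contents are all hypothetical, not taken from the thread). It compares compressing each record on its own against compressing all records as one block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class FlushOverheadDemo {
    // Compress one buffer with deflate and return the compressed size in bytes.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Hypothetical log record; real data will compress differently.
    static byte[] record(int i) {
        return ("2012-03-30 12:08:00 INFO request id=" + i
                + " status=200 path=/index.html\n").getBytes(StandardCharsets.UTF_8);
    }

    // Roughly what "flush on every record" does: each record is compressed alone.
    static int perRecordSize(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += deflatedSize(record(i), Deflater.DEFAULT_COMPRESSION);
        }
        return total;
    }

    // All records are buffered first and compressed together as one block.
    static int singleBlockSize(int n) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (int i = 0; i < n; i++) {
            byte[] r = record(i);
            block.write(r, 0, r.length);
        }
        return deflatedSize(block.toByteArray(), Deflater.DEFAULT_COMPRESSION);
    }

    public static void main(String[] args) {
        int n = 10_000;
        System.out.println("per record:   " + perRecordSize(n) + " bytes");
        System.out.println("single block: " + singleBlockSize(n) + " bytes");
    }
}
```

In an Avro container file the per-record case is worse than this sketch suggests, since every block also stores an object count, a byte size, and a 16-byte sync marker. With the Java API the likely fix is simply to stop calling `flush()` on the `DataFileWriter` after each `append()` and let the writer fill its blocks.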
|
On Fri, Mar 30, 2012 at 12:08 PM, Shirahatti, Nikhil <[hidden email]> wrote:
> Have others tried Avro + LZO?
Have you checked out the jvm-compressor-benchmark page (https://github.com/ning/jvm-compressor-benchmark)? It has a comparison of quite a few native open source compression codecs. While the test data does not include Avro, I would not expect the results to differ all that much.
LZO isn't a particularly compelling codec in any of the combinations tested. Snappy, LZF and LZ4 (not yet included in the public results, but there's code, and preliminary results are very good) are the fastest Java codecs. Gzip (deflate) produces more compact results, and is the fastest of the "high compression" codecs (although significantly slower than LZF/Snappy/LZ4).
-+ Tatu +-
ps. If anyone has a publicly available set of Avro data, it would be quite easy to add an Avro-data test to the jvm compressor benchmark. |
|
In reply to this post by snikhil0
On 3/30/12 12:08 PM, "Shirahatti, Nikhil" <[hidden email]> wrote:
>Avro= 12G
>Avro+Deflate= 4.5G
Deflate is affected quite a bit by the compression level selected (1 to 9), in both speed and compression ratio. However, in my experience anything past level 6 is only very slightly smaller and much slower, while the difference across levels 1 to 3 is large on both fronts.
>Avro+Snappy = 5.5G
>
>Have others tried Avro + LZO?
I have not heard of anyone doing this. LZO is not Apache license compatible, and there are now several alternatives available in the same class of compression algorithm, including Snappy. |
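To get a feel for the level trade-off on your own data, something like the following can help. It is only a sketch: it uses `java.util.zip.Deflater` directly rather than Avro, the sample log line is invented, and the class and method names are hypothetical. The timings it prints are rough single-run numbers, not a proper benchmark.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class DeflateLevelDemo {
    // Build a buffer of hypothetical log lines; substitute real data for a meaningful comparison.
    static byte[] sampleData(int records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records; i++) {
            byte[] line = ("2012-03-30 12:08:00 INFO request id=" + i
                    + " status=200 path=/index.html\n").getBytes(StandardCharsets.UTF_8);
            out.write(line, 0, line.length);
        }
        return out.toByteArray();
    }

    // Compress the whole buffer at the given deflate level (1 to 9) and return the size in bytes.
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(100_000);
        for (int level : new int[] {1, 3, 6, 9}) {
            long start = System.nanoTime();
            int size = deflatedSize(data, level);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("level %d: %d -> %d bytes in %d ms%n",
                    level, data.length, size, ms);
        }
    }
}
```

In Avro's Java API the level is chosen when the codec is created, for example `CodecFactory.deflateCodec(6)` passed to `DataFileWriter.setCodec(...)` before the file is created.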
