Avro + Snappy changing blocksize of snappy compression

snikhil0
I am experimenting with Avro and Snappy and want to plot the size of the compressed Avro data file as a function of the compression block size. I am doing this by setting the configuration value "io.compression.codec.snappy.buffersize". Unfortunately, this is not working; more precisely, for buffer sizes from 256K to 2MB I get an output (Snappy-compressed) Avro data file of the same size every time. What am I missing? Has anyone had success with this?

Thanks,
Nikhil

Re: Avro + Snappy changing blocksize of snappy compression

Harsh J-2
Hey Nikhil,

When using Avro data files, you probably need to adjust the writer's
sync interval, which determines the size of the chunks that get compressed:
http://avro.apache.org/docs/1.6.3/api/java/org/apache/avro/file/DataFileWriter.html#setSyncInterval(int)

On Wed, Apr 18, 2012 at 10:53 PM, snikhil0 <[hidden email]> wrote:

> I am experimenting with Avro and snappy and want to plot the size of the
> compressed avro datafile as a function of varying compression block size. I
> am doing this by setting the configuration value for
> "io.compression.codec.snappy.buffersize". Unfortunately, this is not
> working: or more precisely for buffer sizes 256K to 2MB I get the same size
> output avro (snappyfied) data file. What am I missing? Someone had success
> with this?
>
> Thanks,
> Nikhil



--
Harsh J

Re: Avro + Snappy changing blocksize of snappy compression

Tatu Saloranta
In reply to this post by snikhil0
On Wed, Apr 18, 2012 at 10:23 AM, snikhil0 <[hidden email]> wrote:
> I am experimenting with Avro and snappy and want to plot the size of the
> compressed avro datafile as a function of varying compression block size. I
> am doing this by setting the configuration value for
> "io.compression.codec.snappy.buffersize". Unfortunately, this is not
> working: or more precisely for buffer sizes 256K to 2MB I get the same size
> output avro (snappyfied) data file. What am I missing? Someone had success
> with this?

Snappy uses blocks of 64k (like most LZ compressors), so there should
be little benefit from block sizes larger than this; blocks are
compressed independently of each other (back references only reach
about 8k or so anyway). There are some compressors that can use larger
buffers, like bzip2 (I think), but those are the exception rather than
the rule.

-+ Tatu +-

Re: Avro + Snappy changing blocksize of snappy compression

snikhil0
In reply to this post by Harsh J-2
I tried changing the sync interval as well and got the same result: no change in the size of the final Avro data file.

Nikhil

Re: Avro + Snappy changing blocksize of snappy compression

Scott Carey-2
Try a range from smaller block sizes (4k) and up.  256K is a larger block
size than many compression codecs are sensitive to.

Also for reference, try it with the deflate codec at a couple different
compression levels -- 1, 3, 5, and 7 should show a trend with respect to
block size.  As the compression level increases, the compressor can take
advantage of larger blocks.

In the deflate/gzip case that I have explored heavily, the effectiveness
of the block size also varies significantly depending on the
characteristics of the data being compressed.


(note: gzip uses deflate compression)
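
The experiment suggested above can be tried outside Avro first. The following is only a rough sketch, not Avro's own code path: it uses the JDK's java.util.zip.Deflater as a stand-in for the deflate codec, compresses each chunk independently (roughly what happens to each block of a data file), and sums the compressed sizes. The class name, the helper methods, and the synthetic word-based test data are all made up for illustration; real data may behave quite differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

public class BlockSizeExperiment {

    // Hypothetical helper: compress `data` in independent chunks of `blockSize`
    // bytes at the given deflate level and return the total compressed size.
    static long compressedSize(byte[] data, int blockSize, int level) {
        byte[] out = new byte[8192];
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // A fresh Deflater per chunk, so no history is shared across chunks.
            Deflater deflater = new Deflater(level);
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(out);
            }
            deflater.end();
        }
        return total;
    }

    // Hypothetical helper: synthetic, moderately repetitive text
    // (an assumption made purely to have something compressible).
    static byte[] sampleData(int size, long seed) {
        String[] words = {"avro", "snappy", "deflate", "block", "sync", "codec", "record", "schema"};
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(size + 16);
        while (sb.length() < size) {
            sb.append(words[rnd.nextInt(words.length)])
              .append(rnd.nextInt(1000))
              .append('\n');
        }
        return sb.substring(0, size).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(2 * 1024 * 1024, 42L);
        int[] levels = {1, 3, 5, 7};
        int[] blockSizes = {4 << 10, 16 << 10, 64 << 10, 256 << 10, 1 << 20};
        for (int level : levels) {
            for (int blockSize : blockSizes) {
                System.out.printf("level=%d block=%dK compressed=%d bytes%n",
                        level, blockSize >> 10, compressedSize(data, blockSize, level));
            }
        }
    }
}
```

Plotting the printed sizes per level against block size should show whether the curve flattens out and at what block size; the same range of sizes can then be tried as the sync interval on the actual data file writer.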

On 4/18/12 1:33 PM, "snikhil0" <[hidden email]> wrote:

>I had tried the sync Interval as well and I get the same results: meaning
>no
>change in final avro data file.
>
>Nikhil



Re: Avro + Snappy changing blocksize of snappy compression

Tatu Saloranta
On Wed, Apr 18, 2012 at 2:18 PM, Scott Carey <[hidden email]> wrote:
> Try a range from smaller block sizes (4k) and up.  256K is a larger block
> size than many compression codecs are sensitive to.

Agreed: most codecs only go up to 32k or 64k (in fact, Snappy may use
just 32k, not 64k).
Deflate doesn't benefit from block sizes above 64k either, nor does lzf.
The only codecs that I think use larger buffers are bzip2 and lzma,
both of which are typically way too slow to be used for streaming data
processing anyway.

So testing up to 64k is usually enough.

-+ Tatu +-