Compressed Avro vs. compressed Sequence - unexpected results?

nir_zamir
Hi,

We're examining the storage of our data in Snappy-compressed files. Since we want the data's structure to be self-contained, we compared Avro against Sequence files (both are splittable, which should best utilize our cluster).

We tested performance on a 12 GB CSV data file, on a 4-node cluster in our production environment (very strong machines).

Compression

What we did here (for test simplicity) is create two Hive tables: Avro-based and Sequence-based. Then we enabled Snappy compression and INSERTed the data from the RAW table (consisting of the 12GB file).

In terms of compression rate, Avro was better: 72% vs. 57%.
In both cases there were 45 mappers, and CPU/Mem were very far from their limit on all machines.
Since there was no reduce operator, this created 45 files.

Compression took longer with Avro: 1.75 minutes vs. 1.2 minutes for sequence files.

Decompression

What we did here was this Hive query:
SELECT COUNT(1) FROM table-name;

Here was the real difference: the query took about six times as long on Avro (3 minutes vs. 0.5 minutes).
This was very surprising: with machines this strong we would expect I/O to be the bottleneck, and since the Avro files are smaller, we expected them to be faster to decompress.
The number of mappers was similar in both cases (14 vs. 17) and, again, CPU/Mem didn't seem to be exhausted.
Since read time is the most critical for us, this issue makes it hard for us to use Avro.
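
As a sanity check on the timings quoted above, the relative slowdown can be computed directly from the two pairs of numbers. This is only arithmetic on the figures in this post; the function name `slowdown` is made up for illustration.

```python
def slowdown(slow_minutes, fast_minutes):
    """Return (ratio, percent_longer) for two wall-clock timings."""
    ratio = slow_minutes / fast_minutes
    percent_longer = (ratio - 1.0) * 100.0
    return ratio, percent_longer


# Timings taken from the post: Avro vs. sequence files.
read_ratio, read_pct = slowdown(3.0, 0.5)      # SELECT COUNT(1)
write_ratio, write_pct = slowdown(1.75, 1.2)   # INSERT with Snappy enabled

print(f"read:  {read_ratio:.1f}x as long, {read_pct:.0f}% longer")
print(f"write: {write_ratio:.1f}x as long, {write_pct:.0f}% longer")
```

With these inputs the read comes out at 6 times as long (500% longer), while the write is roughly 46% longer.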

Since the Avro files (~65 MB) were smaller than the block size (128 MB), we thought we should try forcing a reduce (using ORDER BY) so that we would deal with one compressed file instead. This didn't change much...

Maybe we're doing something wrong - your input would be much appreciated!

Thanks,
Nir

Re: Compressed Avro vs. compressed Sequence - unexpected results?

Scott Carey-2
For your Avro files, double-check that Snappy is actually used (use avro-tools
to peek at the metadata in the file, or simply view the head of the file in a
text editor; the compression codec used will be in the header).
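
If avro-tools is not at hand, the header check can also be sketched in a few lines of Python. This is a minimal sketch, assuming the standard Avro object container layout (the magic bytes `Obj\x01` followed by a metadata map whose `avro.codec` entry names the codec). The function names are made up for illustration and are not part of any Avro library.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' / 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of header")
    return data


def read_header_metadata(stream):
    """Return the file-level metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:          # a zero count terminates the map
            break
        if count < 0:           # negative count: a byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def avro_codec(stream):
    """Name of the codec recorded in the header ('null' if absent)."""
    meta = read_header_metadata(stream)
    return meta.get("avro.codec", b"null").decode("utf-8")
```

Calling `avro_codec` on one of the table's output files, opened in binary mode, should return `snappy` if compression was really applied. A result of `null` or `deflate` would mean the Hive setting did not take effect for the Avro table.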

Snappy is very fast, so most likely the time to read is dominated by
deserialization.  Avro will be slower than a trivial deserializer (but
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro SerDe works; it is possible there is a
performance issue there.  If you were able to get a handful of stack
traces (kill -3 or jstack) from the mapper tasks (or profiler output),
it would be very insightful.
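
Once a few thread dumps have been collected, a rough way to see where the mappers spend their time is to count which method sits at the top of each thread's stack. This is a minimal sketch, assuming the usual jstack text layout (a quoted thread name, then indented `at ...` frames, with a blank line between threads). The sample dump and its `org.example` class names are invented for illustration.

```python
from collections import Counter


def top_frames(dump_text):
    """Count the topmost stack frame of every thread in a thread dump."""
    counts = Counter()
    waiting_for_frame = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # A quoted thread name starts a new thread section.
            waiting_for_frame = True
        elif waiting_for_frame and stripped.startswith("at "):
            # Keep the method name, drop the "(File.java:123)" suffix.
            counts[stripped[3:].split("(", 1)[0]] += 1
            waiting_for_frame = False
        elif not stripped:
            # A blank line ends the section (thread had no frames).
            waiting_for_frame = False
    return counts


# Invented sample in jstack-like layout; not output from a real job.
SAMPLE_DUMP = '''
"main" #1 prio=5 tid=0x01 nid=0x02 runnable
   java.lang.Thread.State: RUNNABLE
    at org.example.Deserializer.readField(Deserializer.java:42)
    at org.example.Mapper.map(Mapper.java:10)

"worker-1" #2 prio=5 tid=0x03 nid=0x04 runnable
   java.lang.Thread.State: RUNNABLE
    at org.example.Deserializer.readField(Deserializer.java:57)

"GC task thread#0" prio=5 tid=0x05 nid=0x06 runnable

"idle" #3 prio=5 tid=0x07 nid=0x08 waiting
   java.lang.Thread.State: WAITING
    at java.lang.Object.wait(Native Method)
'''

for frame, hits in top_frames(SAMPLE_DUMP).most_common():
    print(hits, frame)
```

If the same deserialization method dominates across several dumps taken a few seconds apart, that points at the SerDe rather than at Snappy or I/O.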


On 5/23/13 12:42 AM, "nir_zamir" <[hidden email]> wrote:

>Hi,
>
>We're examining the storage of our data in Snappy-compressed files. Since
>we want the data's structure to be self-contained, we compared Avro with
>Sequence files (both are splittable, which should best utilize our
>cluster).
>
>We tested performance on a 12 GB CSV data file, on a 4-node cluster in
>our production environment (very strong machines).
>
>Compression
>
>What we did here (for test simplicity) is create two Hive tables:
>Avro-based and Sequence-based. Then we enabled Snappy compression and
>INSERTed the data from the RAW table (consisting of the 12GB file).
>
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from their
>limit on all machines.
>Since there was no reduce operator, this created 45 files.
>
>Compression took longer with Avro: 1.75 minutes vs. 1.2 minutes for
>sequence files.
>
>Decompression
>
>What we did here was this Hive query:
>SELECT COUNT(1) FROM table-name;
>
>Here was the real difference: the query took about *six times as long*
>on Avro (3 minutes vs. 0.5 minutes).
>This was very surprising: with machines this strong we would expect I/O
>to be the bottleneck, and since the Avro files are smaller, we expected
>them to be faster to decompress.
>The number of mappers was similar in both cases (14 vs. 17) and, again,
>CPU/Mem didn't seem to be exhausted.
>Since read time is the most critical for us, this issue makes it hard for
>us to use Avro.
>
>Maybe we're doing something wrong - your input would be much appreciated!
>
>Thanks,
>Nir
>
>
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
>Sent from the Avro - Users mailing list archive at Nabble.com.
