
map/reduce of compressed Avro


map/reduce of compressed Avro

nir_zamir
Hi,

Does anyone know if/how it's possible to get compressed Avro files as an input to a M/R job?
If so, which codecs are supported?

Thanks,
Nir

Re: map/reduce of compressed Avro

Martin Kleppmann
You don't need to do anything special to accept compressed Avro files as input: the compression is detected automatically and the data is decompressed transparently. M/R jobs support all codecs that the Java implementation supports; at the moment I think that's deflate, snappy and bzip2.

If you want to generate compressed output, use FileOutputFormat.setCompressOutput(job, true);
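
For illustration, a rough, untested sketch of an avro-mapred job that reads compressed Avro input as-is and writes snappy-compressed output (the schema is made up, and AvroJob.setOutputCodec is from memory, so check it against the Avro version you use):

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CompressedAvroPassThrough {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompressedAvroPassThrough.class);
    conf.setJobName("compressed-avro-passthrough");

    // Hypothetical record schema, purely for illustration.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\","
        + "\"fields\":[{\"name\":\"f\",\"type\":\"string\"}]}");

    // Input: nothing codec-specific to configure. The Avro input format reads
    // the codec from each data file's header and decompresses blocks itself.
    AvroJob.setInputSchema(conf, schema);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    // Output: enable compressed output and pick an Avro codec by name.
    AvroJob.setOutputSchema(conf, schema);
    FileOutputFormat.setCompressOutput(conf, true);
    AvroJob.setOutputCodec(conf, "snappy"); // assumed helper in avro-mapred
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // runs as an identity job unless an AvroMapper/AvroReducer is set
  }
}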

Martin



Re: map/reduce of compressed Avro

nir_zamir
Thanks Martin.

What will happen if I try to use an indexed LZO-compressed Avro file? Will it work and utilize the index to allow multiple mappers?

I think that with Snappy, for example, the file is splittable and can use multiple mappers, but I haven't tested it yet - I'd be glad to hear from anyone who has experience with that.

Thanks!
Nir.

Re: map/reduce of compressed Avro

Martin Kleppmann
To my knowledge, LZO is not a supported codec for Avro data files. It's possible that you have an LZO-compressed Hadoop sequence file containing Avro records, but that would be a format you defined yourself, and not the same as an Avro data file.

Avro data files are designed to be splittable regardless of the codec they use, so you can have multiple mappers that each consume a portion of the input file. The format achieves that by breaking the data into blocks, and compressing each block separately; hence it can be split at block boundaries.
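
For illustration, a rough, untested sketch of writing such a file with the plain Java API - each block is compressed independently and followed by a sync marker, which is what lets a mapper start reading at any block boundary (schema and file name are made up):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteCompressedAvroFile {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Rec\","
        + "\"fields\":[{\"name\":\"f\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.setCodec(CodecFactory.deflateCodec(1)); // compression is applied per block
    writer.create(schema, new File("records.avro"));

    for (int i = 0; i < 100000; i++) {
      GenericRecord r = new GenericData.Record(schema);
      r.put("f", "value-" + i);
      writer.append(r); // records accumulate in a block; the block is compressed when flushed
    }
    writer.close();
  }
}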

Best,
Martin




Re: map/reduce of compressed Avro

Scott Carey
Martin said it already, but I will emphasize:

Avro data files are splittable and can support multiple mappers no matter what codec is used for compression. This is because Avro files are block based and apply the compression within each block. I recommend starting with deflate (gzip) compression, and moving to snappy only if deflate compression level '1' is not fast enough.

For more information on Avro data files, see:
http://avro.apache.org/docs/current/spec.html#Object+Container+Files
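
Roughly, the two output configurations I mean (helper names from memory - verify AvroJob.setOutputCodec and AvroOutputFormat.setDeflateLevel against the Avro release you use):

import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroOutputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class OutputCodecChoices {
  // Start here: deflate at its fastest level is usually fast enough and still shrinks output well.
  static void useDeflateLevel1(JobConf conf) {
    FileOutputFormat.setCompressOutput(conf, true);
    AvroJob.setOutputCodec(conf, "deflate");   // assumed helper for the "avro.output.codec" setting
    AvroOutputFormat.setDeflateLevel(conf, 1); // assumed helper; fastest deflate level
  }

  // Move to this only if deflate level 1 is measurably too slow.
  static void useSnappy(JobConf conf) {
    FileOutputFormat.setCompressOutput(conf, true);
    AvroJob.setOutputCodec(conf, "snappy");    // assumed helper
  }
}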




Re: map/reduce of compressed Avro

Enns, Steven
Out of curiosity, are there any other file formats that provide splittable gzip compression like Avro object containers? I can only think of Sequence Files.



Re: map/reduce of compressed Avro

nir_zamir
Thanks.

If the compression codec doesn't matter, what does it mean that Avro added support for the Snappy codec?
If I need the files to be used as input for an M/R job, I guess the Avro module should be able to decompress each block and extract the objects. Does that make sense?

So are you saying that in this case I can use a non-splittable codec (like deflate)?

Thanks
Nir