
Multiple input schemas in MapReduce?

7 messages

Multiple input schemas in MapReduce?

Markus Weimer-4
Hi,

I'd like to write a mapreduce job that uses avro throughout, but the map phase would need to read files with two different schemas, similar to what the MultipleInputFormat does in stock hadoop. Is this a supported use case?

A work-around would be to create a union schema that has both fields as optional and to convert all data into it, but that seems clumsy.

Has anyone done this before?

Thanks for any suggestion you can give,

Markus


Re: Multiple input schemas in MapReduce?

Matt Pouttu-Clarke
Hi Markus,

You could use Cascading.  The Cascading.Avro extension automatically
transforms the Avro data into a TupleEntry (a generic object similar to
java.util.Map).  Then you can combine and process the data however you
wish downstream.

Please check this entry for more info:
http://mpouttuclarke.wordpress.com/2011/01/13/cascading-avro/

Cheers,
Matt






Re: Multiple input schemas in MapReduce?

Jacob R Rideout
In reply to this post by Markus Weimer-4
We do take the union schema approach, but create the unions
programmatically in Java:

Something like:

import java.util.ArrayList;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;

// Build a union of the two input schemas and set it as the job's input schema.
ArrayList<Schema> schemas = new ArrayList<Schema>();
schemas.add(schema1);
schemas.add(schema2);
Schema unionSchema = Schema.createUnion(schemas);
AvroJob.setInputSchema(job, unionSchema);



Re: Multiple input schemas in MapReduce?

Markus Weimer-4
Hi,

This sounds interesting! What data type would my input in the mapper have? Or: how would I distinguish between the different inputs in the mapper?

Thanks,

Markus

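A minimal sketch of the mapper-side dispatch this question is asking about, not taken from the thread: with a union input schema, Avro hands the mapper each datum as a GenericRecord, and record.getSchema().getFullName() identifies which branch of the union it came from. The schema names below are hypothetical, and plain strings stand in for the records so the sketch is self-contained:

```java
public class UnionDispatch {

    // In a real mapper this would be called as
    //   dispatch(record.getSchema().getFullName())
    // and each branch would hand the record to type-specific logic.
    static String dispatch(String schemaFullName) {
        switch (schemaFullName) {
            case "com.example.TypeA":
                return "handleA";
            case "com.example.TypeB":
                return "handleB";
            default:
                throw new IllegalArgumentException(
                    "unexpected schema: " + schemaFullName);
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch("com.example.TypeA")); // prints handleA
    }
}
```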


Re: Multiple input schemas in MapReduce?

Scott Carey
In reply to this post by Markus Weimer-4
Are the multiple schemas a series of schema evolutions?

That is, is there an obvious 'reader' schema, or are they disjoint?  If
this represents schema evolution, it should be possible (though there may
be a current bug or limitation) to set the reader schema to the most
recent schema and resolve all files as that schema.

I currently run M/R jobs (though not using Avro's mapreduce package -- it's
a custom Pig reader) over sets of Avro data files containing a schema that
has evolved over time -- at least two dozen variants.  The reader uses the
most recent version, and we have been careful to make sure that our schema
has evolved over time in a way that maintains compatibility.



Re: Multiple input schemas in MapReduce?

Markus Weimer-4
Hi,

> Are the multiple schemas a series of schema evolutions?

No, they are multiple distinct schemas. The job I am writing essentially joins data of two different types to form a third type.

Thanks,

Markus
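The join described here can be sketched as a standard reduce-side join over the union input: the map side emits (joinKey, record), and one reduce() call then sees all co-grouped values for a key and separates them by type before pairing them into the third type. This sketch is not from the thread; plain tagged strings stand in for the two Avro record types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceSideJoin {

    // One reduce() call sees every value sharing a key. Each value is
    // tagged {type, payload}; split by type, then cross-pair to form
    // the joined output records.
    static List<String> reduce(String key, List<String[]> taggedValues) {
        List<String> lefts = new ArrayList<>();
        List<String> rights = new ArrayList<>();
        for (String[] tv : taggedValues) {
            if (tv[0].equals("A")) lefts.add(tv[1]);
            else rights.add(tv[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String l : lefts)
            for (String r : rights)
                joined.add(key + ":" + l + "+" + r);
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> vals = Arrays.asList(
            new String[]{"A", "a1"},
            new String[]{"B", "b1"},
            new String[]{"B", "b2"});
        System.out.println(reduce("k", vals)); // prints [k:a1+b1, k:a1+b2]
    }
}
```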

Re: Multiple input schemas in MapReduce?

Markus Weimer-4
In reply to this post by Jacob R Rideout
Hi,

just an update: Jacob's union-schema solution does, indeed, work as expected. Thanks!

Markus
