schema resolution rules issue

Torche Guillaume

Hi all,

I am trying to understand the schema resolution rules: https://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution.

Especially the following rule for union types:

if both are unions:

The first schema in the reader's union that matches the selected writer's union schema is recursively resolved against it. if none match, an error is signaled.



Let's say I have the following schema as my writer schema: 

{
  "schema": "{\"type\":\"record\",\"name\":\"RtbEvent\",\"namespace\":\"com.gumgum.avro.rtb\",\"doc\":\"Schema defining an RTB event.\",\"fields\":[{\"name\":\"eventMetadata\",\"type\":[\"null\",{\"type\":\"record\",\"name\":\"EventMetadata\",\"doc\":\"Event metadata.\",\"fields\":[{\"name\":\"metroCode\",\"type\":[\"null\",\"int\"],\"doc\":\"Visitor's Metro code location.\"}]}],\"doc\":\"Event metadata.\"}]}"
}


And my reader schema is:

{
  "schema": "{\"type\":\"record\",\"name\":\"RtbEvent\",\"namespace\":\"com.gumgum.avro.rtb\",\"doc\":\"Schema defining an RTB event.\",\"fields\":[{\"name\":\"eventMetadata\",\"type\":[\"null\",{\"type\":\"record\",\"name\":\"EventMetadata\",\"doc\":\"Event metadata.\",\"fields\":[{\"name\":\"metroCode\",\"type\":[\"null\",\"string\"],\"doc\":\"Visitor's Metro code location.\"}]}],\"doc\":\"Event metadata.\"}]}"
}

The only difference between these two schemas is in the metroCode field: the writer type is the union ["null","int"] and the reader type is the union ["null","string"].

As far as I understand the union rule, these two schemas match because null matches null. However, deserializing an Avro event into a Java class generated from the reader schema fails whenever metroCode was written as an int, since the reader expects a string. If that's the case, why would these two schemas be considered matching?

Here is the link to one of my posts in the Confluent Platform user group:

We are trying to understand how this should be handled in terms of schema compatibility when using a schema registry. 

Thanks!


--
Guillaume Torche
Big Data Engineer - GumGum - Online advertising
Professional email: [hidden email]
Personal email: [hidden email]
310 254 8151

Re: schema resolution rules issue

Doug Cutting-2
On Mon, Feb 13, 2017 at 10:21 PM, Torche Guillaume <[hidden email]> wrote:

> if both are unions:
>
> The first schema in the reader's union that matches the selected writer's
> union schema is recursively resolved against it. if none match, an error is
> signaled.
> [...]
> The only difference between these two schemas are on the metroCode field
> where the writer type is the following union: [\"null\",\"int\"] and the
> reader type is the following union: [\"null\",\"string\"].
>
> As far as I understand the union rule, these two schemas match because null
> match with null.

That is not correct.  Resolution is described here not as a static
analysis, but as a dynamic process while reading data.  The "selected"
schema refers to the branch of the writer's schema that was actually
written.  So, if you'd written selecting a "null" branch of a union,
then this is resolved against any null branch in the reader's union,
successfully in your example.  If however you wrote selecting the
"int" branch, then resolution would fail, as there is no matching
"int" branch in the reader's union.

A static analysis of whether two schemas are compatible could detect
three cases:
  1. All data written by one can be read by the other.
  2. No data written by one can be read by the other.
  3. Some but not all data written by one can be read by the other.

Your example is an instance of (3).  Rejecting these when checking
static compatibility would be the safest strategy, grouping cases (2)
and (3) together as incompatible.
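
To make the distinction between the dynamic rule and a static check concrete, here is a minimal sketch in Python. It is a simplified model over primitive type names only, not the Avro library API: the helper names (matches, resolve_branch, classify) are hypothetical, and the promotion table reflects my reading of the spec's promotion rules.

```python
# Simplified model of Avro union resolution over primitive type names.
# Illustrative only: these helpers are hypothetical, not the Avro library API.

# Writer type -> reader types it may be promoted to (assumed from the spec).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def matches(writer_type, reader_type):
    """True if a value written as writer_type can be read as reader_type."""
    return (writer_type == reader_type
            or reader_type in PROMOTIONS.get(writer_type, set()))


def resolve_branch(written_branch, reader_union):
    """Dynamic resolution: match the branch that was actually written."""
    for reader_type in reader_union:
        if matches(written_branch, reader_type):
            return reader_type
    raise ValueError(
        f"no reader branch matches written branch {written_branch!r}")


def classify(writer_union, reader_union):
    """Static check: how many writer branches could the reader resolve?"""
    readable = [w for w in writer_union
                if any(matches(w, r) for r in reader_union)]
    if len(readable) == len(writer_union):
        return "all"   # case 1: all data can be read
    if not readable:
        return "none"  # case 2: no data can be read
    return "some"      # case 3: some but not all data can be read


writer = ["null", "int"]
reader = ["null", "string"]

print(resolve_branch("null", reader))  # prints "null": this branch resolves
print(classify(writer, reader))        # prints "some": case 3
```

Calling resolve_branch("int", reader) raises ValueError, which corresponds to the read-time failure described above. A conservative compatibility checker would accept only the "all" result.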

Doug

Re: schema resolution rules issue

Torche Guillaume
Hi Doug,

Thanks for clarifying! It now makes more sense to me why and how these rules were designed. As you mentioned, they need to be adapted when checking for static schema compatibility. I will let the Confluent people know about this, and hopefully we can adapt the schema registry logic for checking schema compatibility.





On Wed, Feb 15, 2017 at 9:30 AM, Doug Cutting <[hidden email]> wrote:
> [...]


