Why Schema Registry?

21. 11. 2022

Overview

A six-part blog series about the Apicurio schema registry. This first part looks at why to use a schema registry.

We started a new project in which we used the Apicurio schema registry alongside Kafka. The choice of schema registry was made only after we had compared several existing ones. This post opens a series: five more posts on the Apicurio schema registry will follow, one every two weeks.

What is Schema Registry?

Kafka receives and sends data as bytes, so it is oblivious to what kind of data moves in and out of the cluster. Producers and consumers do not communicate with each other directly; instead, they exchange data via Kafka topics. That is why the producer must know how to serialize data and the consumer must know how to deserialize it. This is where Schema Registry comes into play.
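To make the point about bytes concrete, here is a minimal sketch in Python. It does not use Kafka or any registry; the record layout and the function names are made up for illustration. It only shows that the same bytes are read correctly when both sides agree on the layout, and silently misread when they do not.

```python
import struct

# Hypothetical layout both sides must agree on:
# a big-endian 32-bit integer id followed by a 64-bit float amount.
ORDER_LAYOUT = ">id"


def serialize_order(order_id: int, amount: float) -> bytes:
    """Producer side: turn a record into raw bytes."""
    return struct.pack(ORDER_LAYOUT, order_id, amount)


def deserialize_order(payload: bytes, layout: str = ORDER_LAYOUT) -> tuple:
    """Consumer side: turn raw bytes back into a record."""
    return struct.unpack(layout, payload)


payload = serialize_order(42, 19.99)

# A consumer that knows the layout gets the original values back.
right = deserialize_order(payload)

# A consumer that assumes a different layout (float first, then int)
# gets no error at all, just meaningless values.
wrong = deserialize_order(payload, ">di")

print(right)
print(wrong)
```

The broker in the middle would happily transport `payload` in both cases, which is why the agreement on the schema has to live somewhere outside Kafka itself.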

Schema Registry is an independent process that runs separately from the Kafka brokers. It stores and retrieves schemas in formats such as Avro, JSON and Protobuf. Its job is to maintain a collection of schemas that are used to verify data during serialization and deserialization.

A quick rundown of how it works is as follows. Before sending data to the Kafka brokers, the producer first communicates with the schema registry, which checks whether the given schema exists in its collection and whether it is valid (matching data types, null constraints and so on). If the schema does not exist in the registry, it can be registered automatically, depending on the producer configuration. If the schema given by the producer is not valid, the produce request fails in a way that the producer code can detect. Similarly, the consumer communicates with the schema registry when it deserializes a received message. The consumer looks up the message’s schema in the registry, and if there is any problem with it, deserialization fails and the consumer has to handle the failure.
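The flow described above can be sketched with a toy in-memory registry. This is not the Apicurio or Confluent client API; every class and function name below is hypothetical, the schema format is a made-up dictionary, and "valid" is interpreted as the record matching the schema's types and null constraints.

```python
import json


class SerializationError(Exception):
    """Raised when a record cannot be serialized or deserialized."""


class InMemoryRegistry:
    """Toy stand-in for a schema registry: maps schema ids to schemas."""

    def __init__(self):
        self._schemas = {}  # schema id -> schema
        self._ids = {}      # canonical schema text -> schema id

    def lookup_id(self, schema):
        return self._ids.get(json.dumps(schema, sort_keys=True))

    def register(self, schema):
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._schemas) + 1
            self._ids[key] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[key]

    def get(self, schema_id):
        return self._schemas.get(schema_id)


TYPES = {"int": int, "string": str, "double": float}


def validate(record, schema):
    """Check data types and null constraints of a record against a schema."""
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            if not spec["nullable"]:
                raise SerializationError(f"field {name!r} must not be null")
        elif not isinstance(value, TYPES[spec["type"]]):
            raise SerializationError(f"field {name!r} is not a {spec['type']}")


def produce(registry, schema, record, auto_register=True):
    """Producer side: resolve the schema id, validate, then serialize."""
    schema_id = registry.lookup_id(schema)
    if schema_id is None:
        if not auto_register:
            raise SerializationError("schema is not registered")
        schema_id = registry.register(schema)
    validate(record, schema)
    return {"schema_id": schema_id, "payload": json.dumps(record).encode("utf-8")}


def consume(registry, message):
    """Consumer side: fetch the schema by id, deserialize, then validate."""
    schema = registry.get(message["schema_id"])
    if schema is None:
        raise SerializationError("unknown schema id")
    record = json.loads(message["payload"].decode("utf-8"))
    validate(record, schema)
    return record


registry = InMemoryRegistry()
user_schema = {
    "id": {"type": "int", "nullable": False},
    "nickname": {"type": "string", "nullable": True},
}

message = produce(registry, user_schema, {"id": 1, "nickname": None})
received = consume(registry, message)

try:
    produce(registry, user_schema, {"id": "not-a-number", "nickname": "x"})
    failed = False
except SerializationError:
    failed = True  # the producer code can detect the failure
```

In a real setup the SerDe library does this work on behalf of the producer and consumer; the sketch only makes the order of the steps visible.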

Schemas inevitably evolve over time: new fields are added, and existing ones are updated or deleted. The good news is that the schema registry supports schema versioning. That does not fully solve the problem of schema evolution, which is a challenge in any system, but it makes it less of a problem by preventing runtime failures as far as possible.
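One common way versioning helps is a compatibility check at registration time. The sketch below is a simplified illustration of a backward-compatibility rule in the spirit of Avro (a newly added field needs a default, a removed field is fine). It is not the rule set of any particular registry: the names are hypothetical, and any type change is rejected even where a real registry would allow type promotion.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a new schema version would break existing data."""


def is_backward_compatible(old_fields, new_fields):
    """Return True if a reader using new_fields can read data written with old_fields."""
    for name, spec in new_fields.items():
        old_spec = old_fields.get(name)
        if old_spec is None:
            # A field the old data never contained needs a default value.
            if "default" not in spec:
                return False
        elif old_spec["type"] != spec["type"]:
            # Simplification: any type change is rejected (no type promotion).
            return False
    # Fields that exist only in the old schema are ignored by the reader.
    return True


class VersionedSubject:
    """Keeps an ordered list of schema versions for one subject."""

    def __init__(self):
        self.versions = []

    def register(self, fields):
        if self.versions and not is_backward_compatible(self.versions[-1], fields):
            raise IncompatibleSchemaError("new schema cannot read existing data")
        self.versions.append(fields)
        return len(self.versions)  # version numbers start at 1


subject = VersionedSubject()
v1 = subject.register({"id": {"type": "int"}})
v2 = subject.register(
    {"id": {"type": "int"}, "email": {"type": "string", "default": ""}}
)

try:
    # Changing the type of an existing field is rejected before it can
    # break consumers at runtime.
    subject.register({"id": {"type": "string"}})
    rejected = False
except IncompatibleSchemaError:
    rejected = True
```

Rejecting the third schema at registration time is the "preventing failures" part: the problem surfaces when the schema changes, not when a consumer chokes on a message.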

Different Schema Registries Comparison

There are currently several schema registries to choose from, among them Confluent, Apicurio, Azure, GCP and Janitor. The two we are going to look at more closely are Confluent’s and Apicurio’s, which are among the more mature and feature-rich ones.

Both come with licenses that are essentially free to use, and both are based on open-source code. The exception is that the Confluent Community License contains a clause that forbids making available any software that uses some of Confluent’s products and competes with Confluent products.

Both platforms support schema registration, evolution and validation. On both, schemas can be managed from the command line and through a GUI, though Confluent’s GUI is available only in the paid version, through the web-based Confluent Control Center.

In terms of maturity, Confluent’s schema registry is the de facto standard, since Confluent invented the concept. Apicurio’s schema registry is less mature and is experimenting with some new ideas, but it is generally production ready as well.

Confluent’s schema registry is built into the Confluent platform, and its feature set follows the needs of that platform. Apicurio’s schema registry maintains a Confluent-compatible API, which has some bugs and lacks some Confluent features, but the Apicurio team is also evolving the component to support a larger number of artifact types.

As for the tech stack, both platforms are Java based. Confluent’s stack is built on Confluent’s own common library set, while Apicurio’s is based on Quarkus, which suggests heavier use of well-known frameworks.

Both platforms support the Avro, JSON and Protobuf schema formats. Apicurio supports many more, such as OpenAPI, AsyncAPI, GraphQL, WSDL, XML and Kafka Connect schemas.

Apicurio Registry supports three persistence implementations (in-memory, Kafka and database storage), while Confluent’s registry supports only Kafka.

Confluent’s security is based on RBAC and is integrated into Confluent Cloud and the on-prem Confluent Platform, neither of which is free to use. Apicurio’s security is based on Keycloak, is OIDC compliant and does not require any purchase.

For CI/CD integration, both offer a Maven plugin.

Both platforms are documented. Confluent’s documentation is of better quality than Apicurio’s, but when reading it you need to pay attention to what is free to use and what is not.

Apicurio Challenges, Pros and Cons

Apicurio Registry is still a relatively young technology, and using it brings some challenges. It still has some bugs, which are being resolved over time. There are also problems with using references in schemas, which we will cover in future posts.

Apicurio’s big advantage is that it is a completely free, open-source platform. The wide variety of supported schema formats cannot be ignored either. If we need security on our schema registry, Apicurio Registry is a great choice because security can be set up for free. Apicurio Registry also provides an API that is compatible (though not fully) with the Confluent platform, which can be a big plus when deciding which schema registry to use. Another plus is that direct communication with the developers is possible, and they are willing to help with problems concerning their platform.

The API’s incomplete compatibility with Confluent, mentioned above, can be frustrating at times, for instance when using Confluent’s SerDe library. Another disadvantage is that the documentation contains some incorrect information.

Conclusion: Why Was Apicurio Registry Our Choice?

Using Apicurio brought us some challenges, but we managed to overcome them. We had a lot to learn and do, such as adjusting Avro schemas in a particular way to make things work properly, finding the proper way to register schemas, adapting SerDe libraries, and so on.

Our project required a fully open-source platform, so we chose Apicurio as our schema registry. Apicurio was also our choice because it provides the greatest flexibility, and its compatibility with Confluent SerDes adds to that.

If you are interested in this topic, don’t miss the upcoming posts, where we will discuss ways of transporting schema ids, a comparison of SerDe libraries, problems we encountered while using Apicurio Registry, testing, and the Apicurio operator and schema management.

This was the first of six posts on the Apicurio schema registry. In the next one, we will compare two ways of transporting a schema’s global id: in message headers, and in the message payload using a magic byte. Stay tuned.
