Headers vs Magic Byte
Schemas can be retrieved from Apicurio Schema Registry in several ways. An individual schema can be retrieved by either its global ID or its content ID. The global ID is the unique ID of an artifact version, while the content ID is the unique ID of the artifact content. Multiple artifact versions can share a content ID if their content is exactly the same, but their global IDs will still differ.
By default, schemas are retrieved from Apicurio Registry using a global ID that travels with the message. Depending on the producer configuration, the global ID is located either in the message headers or in the payload.
Our Decision Between Using Headers vs Magic Byte
Originally, Kafka messages consisted of a key and a value; record headers were not introduced until Kafka 0.11. Without headers, the only place to store a schema ID was the message payload. Headers are meant for message metadata, so they seem like the logical place for a schema ID, and we tried to use them that way in our project, but it didn’t quite work out. Although headers look like the better fit, they are not universally supported across open-source components. For example, one of our reasons for ditching headers was that Kafka Connect didn’t fully support them: it could read headers, but it couldn’t set them. Because event processing requires the schema ID to be transported uniformly through every processing step, that gap ruled headers out for us.
It’s also worth mentioning the CloudEvents specification. Event producers tend to describe events differently, and the lack of a common way of describing events means developers must constantly relearn how to consume them; it also limits the potential for libraries, tooling, and portability. CloudEvents is a specification for describing event metadata in common formats to provide interoperability across services, platforms, and systems.
Much of the Kafka ecosystem still relies on the magic byte in the payload, but we believe headers will become more prevalent in the future. Given the current state of tooling and our current needs, we deemed the magic byte the better option for our projects.
When the global ID is configured to be located in the message payload, the data begins with a magic byte, followed by the global ID, and then the message data. Consumers check for the magic byte at the start of the payload to determine where the global ID is located: if the magic byte is found, the global ID is read from the payload; otherwise it is read from the message headers.
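The payload layout and the consumer-side check can be illustrated with a minimal Python sketch. This is not the Apicurio SerDes implementation. The magic byte value of 0, the big-endian byte order, and the header name "apicurio.value.globalId" are assumptions made for illustration, and the function names are hypothetical.

```python
import struct
from typing import Dict, Tuple

MAGIC_BYTE = 0x0                              # assumed magic byte value
GLOBAL_ID_HEADER = "apicurio.value.globalId"  # assumed header name
PREFIX_LEN = 1 + 8                            # magic byte + 8-byte global ID


def encode_payload(global_id: int, data: bytes) -> bytes:
    """Build a payload: magic byte, 8-byte big-endian global ID, then the data."""
    return struct.pack(">bq", MAGIC_BYTE, global_id) + data


def resolve_global_id(payload: bytes, headers: Dict[str, bytes]) -> Tuple[int, bytes]:
    """Return (global_id, message_data), preferring the payload over headers."""
    if len(payload) >= PREFIX_LEN and payload[0] == MAGIC_BYTE:
        # Magic byte found: the next 8 bytes hold the global ID.
        (global_id,) = struct.unpack_from(">q", payload, 1)
        return global_id, payload[PREFIX_LEN:]
    # No magic byte: fall back to the message headers.
    (global_id,) = struct.unpack(">q", headers[GLOBAL_ID_HEADER])
    return global_id, payload


if __name__ == "__main__":
    message = encode_payload(42, b"avro-bytes")
    print(resolve_global_id(message, {}))
```

One consequence of this detection scheme is that a header-based message whose data happens to start with the magic byte value would be misread as carrying an ID in the payload, so a producer and its consumers need to agree on one location.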
Pros and Cons
Using a magic byte is a nice, compact way of transporting the global ID in the message payload, but it comes at a price. Other components that use Avro and its Serde classes cannot use the message payload directly unless they are aware of the magic byte at the beginning of the message. To Serde classes that are not aware of it, the payload looks corrupted. Headers do not have this problem.
Another disadvantage of carrying the global ID in the payload is that Apicurio uses 8 bytes to store the identifier in the Kafka message body, while Confluent SerDes use 4 bytes. This can cause compatibility issues if, for instance, one service uses Apicurio SerDes and another uses Confluent SerDes.
The big advantage of headers is that clients that are not interested in the header content can simply ignore it and take whatever they need from the message.
Putting the global ID in headers makes the message somewhat bigger than putting it in the payload. Headers also aren’t universally supported, so they might cause problems with some components.
In this blog post we covered transporting the schema ID through headers and through the payload with a magic byte, and why we opted for the magic byte. In the next post we will dive deeper into Serde libraries by comparing them across multiple platforms.
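To make the 8-byte versus 4-byte mismatch concrete, here is a small Python sketch of what happens when a payload written with an 8-byte ID is parsed by a reader that expects a 4-byte ID. The function names are hypothetical, and the magic byte value of 0 and big-endian byte order are assumptions; the real SerDes classes are not used here.

```python
import struct
from typing import Tuple


def write_with_8_byte_id(schema_id: int, data: bytes) -> bytes:
    """Magic byte + 8-byte ID + data (the Apicurio-style layout described above)."""
    return struct.pack(">bq", 0, schema_id) + data


def read_with_4_byte_id(payload: bytes) -> Tuple[int, bytes]:
    """Parse as magic byte + 4-byte ID + data (the Confluent-style layout)."""
    _magic, schema_id = struct.unpack_from(">bi", payload, 0)
    return schema_id, payload[5:]


if __name__ == "__main__":
    payload = write_with_8_byte_id(42, b"abc")
    schema_id, data = read_with_4_byte_id(payload)
    # The reader sees the high 4 bytes of the 8-byte ID as the schema ID,
    # and the low 4 bytes leak into the front of the message data.
    print(schema_id, data)
```

In this sketch the reader ends up with the wrong ID and four stray bytes in front of the data, which is the kind of silent incompatibility to watch for when mixing the two libraries.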
Schema registry blog series (2 of 6):