Why Apache Kafka
I was working on a project recently where we needed to process raw data from hundreds of devices every few seconds and save the result in database. We decided to stream data and process them in a service which could be scaled horizontally.
We had an on-premises server for our non-production environment and an unclear specification for production. As I had previous experience with AWS Kinesis, we chose Apache Kafka which is a similar streaming platform but can be deployed on-premises. We developed the producer and consumer libraries in Microsoft .NET on Windows as they were a natural choice given our team’s skillset, but this took time to get up and running as Kafka was originally developed in the Java ecosystem, so useful samples and libraries were difficult to come by.
I tried to put the basics together in this blog post as a quick guide for developers who already know what streaming is and want to use it in a similar environment. I also chose to use Confluent as I like its visual Control Center. But you can use the core Apache Kafka too. Confluent is a platform founded by the original Kafka developers including Apache Kafka with a few more features and improvements.
- If you need to learn Kafka
basics first, this link could be useful: https://www.confluent.io/what-is-apache-kafka/ - To compare Confluent Kafka
and Apache Kafka including the licensing information, read this: https://www.linkedin.com/pulse/comparing-confluent-kafka-apache-maria-hatfield-phd/
Creating a test topic
The next thing you will need is to create a topic to which you can publish test messages. Confluent Platform has a Control Centre page like this:
Go to Topics list and create a new topic named “testTopic” with more than one partition.
Partitions increase parallelism of Kafka topic:
- each partition is consumed by exactly one consumer in the group.
- one consumer in the group can consume more than one partition.
- the number of consumer processes in a group must be <= number of partitions
Note that I faced a problem when trying to use the default option - the “Create with defaults” button was disabled. I resolved this by clicking on “Customize settings” and then clicking “Back” without changing anything.
Topics use ksqlDB, a SQL database in the backend to save streamed data. Data is kept in database until it is expired, not when it is read by a consumer as consumers from different groups can process the same message independently of each other.
Installation
There are two options to install Apache Kafka using Confluent Platform :
- Local: https://docs.confluent.io/platform/current/quickstart/ce-quickstart.html
- Docker: https://docs.confluent.io/platform/current/quickstart/ce-docker-quickstart.html
As we were developing in a Windows environment, our decision was to use the Docker installation option with Linux containers for better support.
It only took a few steps:
- Install Docker Desktop
- Change Docker memory to a minimum of 6GB from Preferences > Resources > Advanced. It is only 2GB by default.
- Download the Confluent Platform all-in-one Docker Compose file:
curl --silent --output docker-compose.yml https://raw.githubusercontent.com/confluentinc/cp-all-in-one/6.1.1-post/cp-all-in-one/docker-compose.yml
- Start the Confluent containers:
docker-compose up -d
- Verify the services are up and running:
docker-compose ps
NOTE: If the "State" is not "Up" repeat the previous command again. It takes a few minutes for all the services to start and get ready to use.
- Open Docker Desktop > Containers List, and open the control-center in browser which is usually localhost:9021
Developing using .NET Core
We found that the Confluent Kafka library in GitHub has the best examples of using .NET with Apache Kafka. Start by cloning and browsing the “examples” folder as it has two very basic sample projects for creating a Producer and Consumer:
https://github.com/confluentinc/confluent-kafka-dotnet
NuGet package
Confluent Kafka client libraries can be installed in Visual Studio using NuGet either through the Package Manager UI by searching for “Confluent.Kafka”, or by running the following command in the Package Manager Console:
Install-Package Confluent.Kafka
Creating a producer
The code for creating a very basic producer can be seen below:
Note that the ProducerConfig only requires the BootstrapServers property which takes the URL value of the broker container which was started by Docker Compose:
You can run multiple instances of the producer to see how they all post messages to the same topic.
Regarding how partitions are selected when a message is produced:
- If a valid partition number is specified, that partition will be used when sending the record.
- If no partition is specified but a key is present, a partition will be chosen using a hash of the key
- If neither key nor partition is present a partition will be assigned in a round-robin fashion.
Creating a consumer
Consumers read and process data from a topic from the position of the last offset that their group previously read from.
This means that a topic can have many independent consumers that read data based on their offset index without knowing about other consumers.
The code for creating a very basic consumer can be seen below:
One of the important configuration items of a consumer is GroupId. If two consumers have the same GroupId, they work together in parallel to process messages of a topic. A message is processed only once by consumers of the same topic. This is used to scale the consumers but if two consumers have different GroupIds, then they process messages independently. i.e. a message is processed by both.
More Details
Schema-registry
Kafka streams decouple producers and consumers, so a producer does not know who is going to consume the message it posted to a topic, however both producer and consumer usually share the data schema. For example, if the message is data of type “Order”,
then the producer knows how to serialize the “Order” class to JSON, and the consumer reads the message and deserializes it to an “Order” type again.
The Schema Registry is included in Confluent Platform to avoid this dependency . It is built around the Apache
Avro™ serialization format. It is the recommended format for use with Confluent Platform and it is very flexible.
This is an example of working with Avro:
var record = new GenericRecord(logMessageSchema);
record.Add("IP", "127.0.0.1");
record.Add("Message", "a test log message");
record.Add("Severity", new GenericEnum(logLevelSchema, "Error"));
Read more here: https://www.confluent.io/blog/decoupling-systems-with-apache-kafka-schema-registry-and-avro/
Summary
If you are a .NET developer and want to stream data on Windows, you can simply setup Confluent Platform on Docker and use the Confluent Kafka library to develop your producers and consumers.
References
- https://docs.confluent.io/platform/current/quickstart/ce-docker-quickstart.html
- https://github.com/confluentinc/confluent-kafka-dotnet
- https://www.confluent.io/blog/decoupling-systems-with-apache-kafka-schema-registry-and-avro/
- https://www.confluent.io/what-is-apache-kafka/
- https://www.linkedin.com/pulse/comparing-confluent-kafka-apache-maria-hatfield-phd/