Demystifying Apache Kafka: A Comprehensive Guide
What is Apache Kafka?
Apache Kafka is a distributed streaming platform built on the principles of a messaging system. It started as a messaging system for building robust data pipelines. Over time, Kafka has evolved into a full-fledged streaming platform that offers the core capabilities needed to implement stream processing applications over real-time data pipelines. The latest version of Apache Kafka comes in three flavors.
- Kafka Client APIs
- Kafka Connect
- Kafka Streams
In this article, I will briefly introduce you to each of these three flavors of Apache Kafka.
Apache Kafka Client APIs
Like any messaging system, Apache Kafka has three main components.
- Kafka broker
- Kafka producers
- Kafka consumers
These three components are at the core of Apache Kafka. Everything else is built on top of that core infrastructure. Together, they let you create a highly scalable messaging infrastructure and real-time streaming data pipelines. Like any other messaging system, Apache Kafka works asynchronously. The following diagram explains the core functionality of Apache Kafka.
The Kafka broker is the core infrastructure of Kafka. A Kafka cluster is a group of computers, each running the Kafka broker service. In a typical case, one machine runs one instance of the broker. A broker stores Kafka messages in local disk-based storage. Brokers also support replication: a message received by one broker is copied to some of the other brokers. Replication makes the cluster fault tolerant. If a broker in the cluster goes down, the other brokers can serve the message from their copies.
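The failover behavior described above can be sketched with a toy in-memory model. This is only an illustration, not the Kafka protocol: the `ToyReplica` class, the `replicate` and `read_all` helpers, and the message text are all hypothetical names made up for this example. In real Kafka, replication is configured per topic and handled per partition by leader and follower replicas, which this sketch skips.

```python
# Toy in-memory model of broker replication; not the real Kafka protocol.
class ToyReplica:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.alive = True
        self.log = []  # messages stored on this broker's "disk"


def replicate(brokers, message, replication_factor):
    """Store the message on the first `replication_factor` live brokers."""
    targets = [b for b in brokers if b.alive][:replication_factor]
    for broker in targets:
        broker.log.append(message)
    return [b.broker_id for b in targets]


def read_all(brokers):
    """Serve messages from the first live broker that holds a copy."""
    for broker in brokers:
        if broker.alive and broker.log:
            return list(broker.log)
    return []


cluster = [ToyReplica(0), ToyReplica(1), ToyReplica(2)]
stored_on = replicate(cluster, "order-created", replication_factor=2)

cluster[0].alive = False             # simulate a broker failure
surviving_copy = read_all(cluster)   # the copy on broker 1 is still readable
```

With a replication factor of 2, the message lands on two brokers, so losing one of them still leaves a readable copy.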
Kafka producers are the applications that send data to Kafka brokers using the Kafka client APIs. Apache Kafka provides a set of producer APIs that allow applications to send continuous streams of data to the cluster of Kafka brokers. You can run multiple instances of a producer application, and all of them can transmit data to the brokers simultaneously. This ability to have many producers sending data at once is the core of Kafka’s scalability on the producer side of the data pipeline. Apache Kafka also provides the notion of topics. Kafka producers always send data to a named topic, which lets each application group its own data and keep it separate from other applications’ data.
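To make the idea of topics concrete, here is a minimal sketch in which two producers write to separate topics on the same broker. It is a toy model and does not use the real Kafka producer API: `ToyBroker`, `ToyProducer`, the application names, and the topic names are assumptions invented for this illustration.

```python
from collections import defaultdict


class ToyBroker:
    """Keeps one append-only list of records per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, record):
        self.topics[topic].append(record)


class ToyProducer:
    """Sends records to a named topic on the broker."""

    def __init__(self, broker, name):
        self.broker = broker
        self.name = name

    def send(self, topic, value):
        # Tag each record with the producer name so its origin stays visible.
        self.broker.append(topic, (self.name, value))


broker = ToyBroker()
billing = ToyProducer(broker, "billing-app")
shipping = ToyProducer(broker, "shipping-app")

billing.send("invoices", "invoice-1")
shipping.send("shipments", "parcel-7")
billing.send("invoices", "invoice-2")
```

Both producers share the same broker, yet each topic holds only the records that were sent to it, in the order they arrived.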
Kafka consumers are the applications that request and consume data from the Kafka brokers using the Kafka client APIs. Apache Kafka provides a set of consumer APIs that allow applications to receive continuous data streams from the cluster of Kafka brokers. You can run multiple instances of a consumer application, and all of them can read data from the brokers simultaneously. The consumer APIs also offer the notion of consumer groups. You can group your consumers to share the workload of reading and processing data. Each consumer in the group receives a portion of the data, and the Kafka broker ensures that the same record is not sent to two consumers in the same consumer group. Consumer groups are the core of Kafka’s scalability on the consumer side of the data pipeline.
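Kafka shares work within a consumer group by splitting a topic into partitions and assigning each partition to exactly one consumer in the group. The sketch below shows that idea with a simple round-robin assignment. Treat it as a simplified stand-in: Kafka ships its own assignment strategies, and the `assign_partitions` function and consumer names here are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-a", "consumer-b"]
assignment = assign_partitions([0, 1, 2, 3, 4], group)

# Every partition has one owner, so no record is delivered twice within the group.
owned = [p for parts in assignment.values() for p in parts]
```

Adding a consumer to the group spreads the same partitions across more readers, which is where the consumer-side scalability comes from.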
Apache Kafka Connect
Kafka Connect is built on top of Kafka’s core components. It offers a reliable and scalable way to move data between the Kafka broker and other data systems. Kafka Connect gives you two things to achieve this data movement.
- Off-the-shelf Kafka connectors
- Kafka Connect APIs and a framework
Off-the-shelf Kafka connectors are ready to use: they move data between the Kafka broker and other applications. To use them, you do not need to write code or make changes to your applications. Kafka connectors are driven purely by configuration. You can classify these connectors into two groups.
- Source connectors
- Sink connectors
Source connectors are built on the foundation of Kafka producers. You can use a source connector to pull data from a source system (for example, an RDBMS) and send it to the Kafka broker.
Sink connectors are the counterpart of source connectors, built on the foundation of Kafka consumers. You can use a sink connector to pull data from the Kafka broker and send it to a target system (for example, HDFS).
The Kafka community has developed many off-the-shelf source and sink connectors for various systems. You can get an extensive list of Kafka connectors at Kafka Connect Hub.
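Because connectors are driven by configuration, setting one up mostly means writing a small set of properties. The sketch below builds the JSON bodies you could submit to the Kafka Connect REST API (`POST /connectors`) for one source and one sink connector. It assumes the FileStream example connectors that come with Apache Kafka are available on your Connect worker. The connector names, file paths, and topic name are hypothetical, and `connector_payload` is a helper written for this example.

```python
import json


def connector_payload(name, config):
    """Build the JSON body that Kafka Connect accepts on POST /connectors."""
    return json.dumps({"name": name, "config": config}, indent=2)


# Source connector: read lines from a file and publish them to a topic.
source_payload = connector_payload("demo-file-source", {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/demo-input.txt",   # hypothetical path
    "topic": "demo-lines",           # hypothetical topic name
})

# Sink connector: read records from the topic and append them to a file.
sink_payload = connector_payload("demo-file-sink", {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "demo-lines",          # sink connectors take a `topics` setting
    "file": "/tmp/demo-output.txt",  # hypothetical path
})

print(source_payload)
```

You would send each payload to a running Connect worker, which listens on port 8083 by default. No application code changes are involved.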
Kafka Connect framework
The second part of Kafka Connect is a robust and easy-to-use development framework. The Kafka Connect framework lets you quickly develop your own custom source and sink connectors. If no ready-to-use connector exists for your system, you can use the framework to develop one. The framework makes it simpler to write high-quality, reliable, and high-performance custom connectors. Using the Kafka Connect framework, you can shorten the development, testing, and small-scale production deployment lifecycle.
Why do we need Kafka Connect?
When you create a data pipeline using Kafka, you can implement a producer or consumer, or you can use an off-the-shelf connector. You might also consider developing a custom connector using the Kafka Connect framework. The question is: when should you use which?
Kafka producers and consumers are embedded in your application and become an integral part of it. Suppose your application persists data in a storage system such as a database or a log file, and you also want to send that data to a Kafka broker for consumption by other applications. You would modify your application code and use the Kafka producer APIs to send the data to the broker. This approach works well when you have access to the application code and can modify it.
When you cannot access the application code, or you do not want to embed a Kafka producer or consumer in your application for the sake of modularity and simpler management, you should prefer a Kafka connector. If your application persists data in a storage system and you have access to that storage system, use a Kafka connector to build your data pipeline. A Kafka connector should cut down your development effort, and non-developers can use and manage it.
Suppose no connector exists for your storage system, and you must choose between embedding Kafka client code in your application and developing a new connector. In that case, prefer developing a new connector, because the framework provides out-of-the-box features such as configuration management, offset storage, parallelization, error handling, support for different data types, and standard management REST APIs.
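Offset storage is one of those features that is easy to overlook, so here is a toy sketch of why it matters. It models a source task that resumes from a stored position after a restart. This is not the Kafka Connect API, which is Java-based: in the real framework, a source task attaches offsets to the records it returns, and the framework persists them. `ToySourceTask`, the `offset_store` dictionary, and the row values are assumptions made for illustration.

```python
class ToySourceTask:
    """Reads rows from a list-backed 'source system', resuming from a stored offset."""

    def __init__(self, source_rows, offset_store):
        self.source_rows = source_rows
        self.offset_store = offset_store  # stand-in for framework-managed offset storage

    def poll(self, max_records=2):
        # Start where the previous poll (possibly by an earlier instance) stopped.
        start = self.offset_store.get("position", 0)
        batch = self.source_rows[start:start + max_records]
        self.offset_store["position"] = start + len(batch)
        return batch


rows = ["row-1", "row-2", "row-3", "row-4", "row-5"]
offsets = {}

first_batch = ToySourceTask(rows, offsets).poll()

# Simulate a restart: a brand-new task instance sharing the same offset store.
second_batch = ToySourceTask(rows, offsets).poll()
```

The second task instance picks up at `row-3` instead of re-reading from the beginning. Without shared offset storage, you would have to build that bookkeeping yourself.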
Apache Kafka Streams
Apache Kafka client APIs, Kafka Connect, and the Kafka brokers provide a reliable and highly scalable backbone infrastructure for delivering data streams among applications. You can use the stream processing system of your choice to develop a real-time streaming application. Apache Spark, Apache Storm, and Apache Flink are among the most popular stream processing frameworks. However, starting with the Kafka 0.10 release, Kafka includes a powerful stream processing library called Kafka Streams. The Kafka Streams library allows Kafka developers to extend their standard applications to consume, process, and produce new data streams. Unlike cluster-based stream processing systems such as Apache Spark, Kafka Streams runs inside your application, so you avoid the cost and overhead of maintaining an additional cluster for your real-time stream processing requirements. Apache Spark might make more sense when you have a distributed machine-learning algorithm and know that you will need the capabilities of a Spark cluster. However, Kafka Streams is an excellent alternative for simple real-time applications that do not justify building a Spark cluster.
Kafka Streams allows you to process your data in real time on a per-record basis. You do not need to group data into small batches and work on micro-batches, as some other stream processing frameworks do. The ability to work on each record as it arrives is critical for millisecond response times. A typical Kafka Streams application reads data from a Kafka topic in real time, performs the necessary action on the data, and possibly sends the result back to the Kafka broker on a different topic. You can still use Kafka producers, Kafka consumers, and Kafka connectors to handle the rest of your data integration needs within the same cluster. A typical implementation might use all three flavors of Apache Kafka to solve the bigger problem and create a robust application.
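The read, process, and write loop of a per-record application can be sketched with plain Python generators. This is a toy model and not the Kafka Streams API, which is a Java library. The function names, the sample records, and the `events` list used to trace ordering are all hypothetical.

```python
events = []  # records the order in which reads and writes happen


def read_topic(records):
    """Stand-in for consuming an input topic one record at a time."""
    for record in records:
        events.append(("read", record))
        yield record


def process_per_record(records, transform):
    """Transform each record and emit the result before reading the next one."""
    for record in records:
        result = transform(record)
        events.append(("write", result))
        yield result


input_records = ["click:home", "click:cart"]
output_topic = list(process_per_record(read_topic(input_records), str.upper))
```

The `events` trace alternates between read and write: each result is emitted before the next record is read, rather than waiting for a batch to fill up.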