Kafka
Kafka is a distributed event streaming platform; it is more than a messaging system and often serves as the central hub of an integration architecture.
Apache Kafka is an open-source distributed streaming platform known for its high-throughput, fault-tolerant, and scalable data streaming capabilities.


Let's explore the key components of Kafka's architecture in a nutshell:
- Producers: Producers are responsible for publishing data records to Kafka topics. They can be applications or systems that generate data and send it to Kafka for processing (a minimal producer sketch follows this list).
- Topics: Topics are categories or feeds to which data records are published by producers. Topics are organized into partitions, allowing data to be distributed and processed in parallel.
- Partitions: Each topic is divided into multiple partitions. Partitions enable horizontal scalability and parallel processing within a Kafka cluster. They ensure that data within a topic is spread across multiple brokers for increased throughput.
- Brokers: Brokers are the Kafka servers in the cluster that handle the storage and replication of data. Each broker manages one or more partitions and communicates with producers and consumers.
- Consumers: Consumers are applications or systems that subscribe to Kafka topics and consume the data records. They read data from the partitions and process it based on their specific requirements.
- Consumer Groups: Consumer groups are a way to scale out consumption and achieve load balancing. Consumers within a group coordinate to consume different partitions of a topic, ensuring parallel processing and fault tolerance.
- ZooKeeper: In older versions of Kafka, ZooKeeper was used for managing and maintaining cluster metadata, including broker and consumer group information. Starting from Apache Kafka 2.8, however, ZooKeeper is optional: Kafka can manage its metadata internally via the KRaft consensus protocol.
- Kafka Cluster: A Kafka cluster consists of multiple brokers working together to handle data streams. The cluster provides fault tolerance and scalability, allowing data to be replicated across brokers for high availability.
- Connectors and Streams: Apache Kafka offers additional components like Kafka Connect and Kafka Streams. Kafka Connect simplifies integration with external data systems, enabling efficient data ingestion and extraction. Kafka Streams provides a high-level API for building stream processing applications on top of Kafka.
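To make the producer/topic relationship concrete, here is a minimal Java producer sketch. The broker address `localhost:9092`, the topic name `orders`, and the key/value contents are illustrative assumptions, not part of the original notes:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition; records with the same key land on the same partition.
            producer.send(new ProducerRecord<>("orders", "order-42", "ORDER_CREATED"));
            producer.flush(); // block until the record has been handed to the broker
        }
    }
}
```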
Kafka
- is an event ledger, keeping track of all the messages that come in
- is distributed in nature
- is a redundant system
- uses messaging-system semantics (from a client's point of view, it behaves much like a traditional messaging system)
- treats clustering as a core principle, employing multiple nodes to distribute the load
- provides durability and ordering guarantees
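As a sketch of how clustering and durability play out in practice, the Java AdminClient below creates a topic with 3 partitions and a replication factor of 3. The broker address and topic name are placeholder assumptions, and a replication factor of 3 requires a cluster with at least three brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism; replication factor 3 so each partition
            // survives the loss of up to two brokers.
            NewTopic topic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get(); // wait for completion
        }
    }
}
```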
Kafka can be employed for use cases such as:
- Real-time Data Analytics
- Fraud Detection
- Internet of Things (IoT)
- Log Aggregation and Monitoring
- Event Sourcing
- Microservices Communication
- Messaging and Notifications
- Data Pipeline and ETL
- Clickstream Analysis
- Machine Learning and Model Training
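As an illustration of the real-time analytics and clickstream use cases above, here is a minimal Kafka Streams sketch that counts clicks per page. The application id, broker address, and topic names (`clicks`, `click-counts`) are assumptions for this example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickstreamCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clickstream-counts"); // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Click events keyed by page URL; count clicks per page continuously.
        KStream<String, String> clicks = builder.stream("clicks"); // assumed input topic
        KTable<String, Long> counts = clicks.groupByKey().count();
        counts.toStream().to("click-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```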
Apache Kafka in Industry Use Cases
- Uber: Uber uses Kafka to handle real-time data streams from millions of drivers and riders, enabling efficient tracking, dispatching, and real-time analytics for dynamic pricing and trip optimization.
- LinkedIn: LinkedIn utilizes Kafka as a central data pipeline for ingesting and processing massive amounts of user-generated data, enabling real-time updates, personalized recommendations, and targeted advertisements.
- Netflix: Netflix employs Kafka to power its real-time event streaming infrastructure, allowing real-time monitoring, analytics, and decision-making for content delivery, user recommendations, and system health monitoring.
- Airbnb: Airbnb relies on Kafka for handling real-time booking data, user interactions, and operational metrics, enabling real-time insights, fraud detection, and personalized experiences for their guests and hosts.
- Twitter: Twitter uses Kafka to handle high-volume data streams of tweets, user interactions, and trending topics. Kafka enables real-time processing, analytics, and delivery of tweets to users' timelines.
- PayPal: PayPal utilizes Kafka for real-time transaction processing, fraud detection, and risk management. Kafka enables fast and reliable data streaming for monitoring and analyzing financial transactions.
- Walmart: Walmart leverages Kafka to process and analyze real-time data from point-of-sale systems, supply chain operations, and customer interactions. Kafka enables real-time inventory management, demand forecasting, and personalized promotions.
- Cisco: Cisco integrates Kafka into its IoT infrastructure, enabling real-time data ingestion and processing from IoT devices. Kafka facilitates real-time analytics, security monitoring, and device management in their IoT ecosystem.
- Financial Institutions: Many financial institutions rely on Kafka for handling high-speed, real-time data feeds from stock exchanges, market data providers, and trading systems. Kafka enables real-time market analysis, algorithmic trading, and risk management.
- Gaming Industry: Gaming companies use Kafka to handle real-time player interactions, in-game events, and telemetry data. Kafka enables real-time analytics, game optimization, and personalized gaming experiences.
By leveraging Apache Kafka, Netflix achieved significant improvements in their data streaming, processing, and analytics capabilities. They were able to process and analyze real-time data streams at scale, enabling data-driven decisions and personalized recommendations. Kafka's scalability, fault-tolerance, and event-driven architecture helped them handle massive volumes of data efficiently, ensuring uninterrupted streaming services. Integration with various systems and frameworks streamlined their data pipelines, improving operational efficiency and cost optimization. Overall, Kafka empowered Netflix to deliver a seamless streaming experience and enhance their operational capabilities.
Kafka is also a good fit for the following scenarios:
- Asynchronous processing (where synchronization is hard)
- Scaling ETL Jobs / Data Pipelines / Big Data Ingest
- Processing is error-prone (ex: parsing logic might throw exceptions due to invalid payload data)
- Event Store (the retained log can be replayed to retry or re-run certain operations; see the sketch after this list)
- Distributed Processing
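As a sketch of the event-store/replay idea, the consumer below rewinds a partition to the beginning and re-reads it. The topic name, partition number, group id, and broker address are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-job");              // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0); // assumed topic/partition
            consumer.assign(Collections.singleton(partition));
            consumer.seekToBeginning(Collections.singleton(partition)); // rewind to the oldest offset
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```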
For many workflows it is important that messages are delivered in sequential order; for example, an order must be created before it is updated, not the other way around. Kafka guarantees ordering within a partition, so related events should share the same partition key.
Kafka operates on records. A record is:
- a Key, Value, and Timestamp
- Immutable
- Append Only
- Persisted
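A minimal sketch of the record structure, assuming a topic named `orders` and illustrative key/value contents; the constructor used here lets you set the partition and timestamp explicitly:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordStructure {
    public static void main(String[] args) {
        // A record is an immutable (key, value, timestamp) tuple appended to a partition.
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "orders",                    // topic (assumed name)
                null,                        // partition: null lets Kafka derive it from the key
                System.currentTimeMillis(),  // timestamp
                "order-42",                  // key: same key -> same partition -> ordering preserved
                "ORDER_UPDATED");            // value
        System.out.println(record);
    }
}
```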
3 Components
- Broker: Node in the cluster
- Producer: Writes the records to a broker
- Consumer: Reads records from a broker
Kafka does not push records to consumers; instead, consumers connect to brokers and ask (poll) for records.
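A minimal sketch of this pull model, assuming a topic named `orders`, a consumer group `order-processors`, and a broker at `localhost:9092`; each `poll()` call asks the broker for the next batch of records:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // assumed consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders")); // assumed topic
            while (true) {
                // The consumer pulls: nothing arrives until it asks.
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```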