Microsoft HDInsight is the cloud service that deploys and provisions Hadoop clusters on the Azure cloud. HDInsight is a completely managed and open source analytics service to support enterprise needs. HDInsight supports a wide variety of scenarios with the help of open source frameworks like Hadoop, Storm, Spark, R Server and Hive, not limited to define a pipeline for E-extract, T-transform, L-load, IoT, machine learning, and many more.
With enterprise feedback on non-availability of Kafka as a managed service and guaranteeing Kafka bits are up and running, Microsoft included Kafka as service in the HDInsight family. This inclusion is important for the HDInsight user community to manage a stream of records.
What is Kafka?
Kafka is the framework that helps in handling millions of events/transactions per second, with high through put. Kafka works well in distributing streaming data too. This is a centralized messaging center, which receives messages from disparate data sources and makes them available to their subscriber. It is an open source framework with highly scalable and high-tolerant architecture.
Kafka plays a broker role and runs as a cluster, stores streams of data received from the publisher and makes this data stream available to the subscriber for processing. Kafka is like a message queue or enterprise messaging system with high fault tolerance. Kafka not only helps in maintaining messages received, but also persists the message as configured and deletes the message after reaching threshold time.
Kafka Key Terminologies
Kafka works as a centralized distribution center (broker) with a combination of sender (producer) and subscriber (consumer). Below are the logical components of Kafka:
Kafka stores streams of records (messages) in categories known as Topic and each message in topic contains three key information; 1) a key, 2) a value, 3) a timestamp.
Kafka come with a set of APIs to manage various workflows and all required activities. Some of the key APIs are as follows:
- Producer API – Producers are data sources, which produce and send streams of data to Kafka. Kafka has a set of APIs for producers to communicate with a Kafka cluster. Some of the key methods are send, sendoffsetsToTransaction, beginTransaction, commitTransaction, etc.
- Consumer API – Consumers are subscribers, which subscribe to Kafka Topics and process the stream of records produced to them. The subscribers can be an analytical system, a data warehouse, or any other application to use streams of records for further processing. Some of the key methods are subscribe, unsubscribe, listTopics, poll, pause, subscription, etc.
- Stream API – Kafka supports stream processing inside the Kafka cluster using Stream APIs. Stream APIs allows you to build streaming solutions without Storm or Spark. Some of the key methods are start, close, setUncaughtExceptionHandler, etc.
- Connector API – Kafka provides a way to integrate Kafka with other systems using connector APIs. These connector APIs allow building and running a reusable source connector for producer and sink connector to export Kafka topic to consumer. Some of the key methods are start, stop, version, validate, etc.
Benefits of Kafka Integration with HDInsight
Kafka is an open source framework that has challenges with installing, managing, and maintaining a stream of records processing pipeline, as do other open source technologies. Also, to keep open source bits up and running, it’s required that in-house resources have an expert level of knowledge with technology to ensure high availability. Microsoft understood these challenges from enterprises and added Kafka as a managed service in the HDInsight family.
Microsoft not only integrated Kafka with HDInsight, they also provide an enterprise level streaming ingestion service that is cost effective and easy to configure. There are lot of benefits enterprises can leverage with Kafka integration with HDInsight. Some of them are as follows:
- High SLA (99.9%) – Microsoft is offering 99.9% SLA on the Kafka uptime
- Native Integration with Azure Managed Disks – Azure Managed Disks is an innovation to keep the service cost effective. Azure Managed Disk provides higher scale and throughput for the disks used by Kafka on HDInsight.
- Rack Awareness for Azure – Kafka Vs Managed Kafka, Kafka designed with a single dimension view of a rack, which works well on some environments. However, with Managed Kafka on HDInsight, a rack is separated out into two dimensions – Update Domains (UDs) and Fault Domains (FDs). Microsoft provides tools that can rebalance Kafka partitions and replicas across UDs and FDs. This helps in achieving fault-tolerance.
- Alerting, Monitoring and Automated Maintenance – Kafka on HDInsight leverages Azure Log Analytics to monitor. Azure Log Analytics captures and provides Azure VM information, like Disk and Network Interface Card (NIC) metrics, and Java Management Extensions (JMX) metrics from Kafka.
- Secure deployment with Azure Virtual Network – Azure provides secure Virtual Network boundaries, which can be locked down. Also, Kafka clusters on HDInsight are required to be deployed within a Virtual Network. This ensures the security for financial, healthcare, and other systems.
Summary
Kafka integration with HDInsight is the key to meeting the increasing needs of enterprises to build real time pipelines of a stream of records with low latency and high through put. With this integration, HDInsight service provides all key open source frameworks in one place to consume and process a stream of records at a very high rate (1 million events per second). The managed Kafka consumes IoT events or any other stream of records at a very high rate and Storm/Spark is used to process these events for real time computation, real time aggregation, and pipe out this data to long term storage.