July 5, 2018

Hello!

I am starting with the Kafka Series. The article in this series are meant for aspiring data scientist (like Me), who wish to learn full stack Big Data & Machine Learning pipeline.

I will be referring to various sources of knowledge, and list them under Reference section of the article.

Introduction

Kafka is a distributed streaming platform. It is used to simplify data pipelines that are made up of vast number of producers and consumers. Kafka gives you a stream and you can plugin a processing framework. There are no plug and play components. To customize, You need to learn the basics of kafka.

Components

Producers

Produces is an application that sends data to kafka
It is a small to medium size data
For kafka, it is simple array of bites.
If you want to send a file to kafka, Create a producer application and send each line of file as a message.
A message is one line of text.
If want to send a record, each row will be a message.
If you want to send the result of a query. create a producer application, run the query, fetch result and send each row as message.

Connectors

Database, they import data from database to kafka and export as well.

Stream Processors

Cluster

It is a group of computer acting together for a common purpose.
Cluster, Each executing one instance of kafka brokers.

Broker

Is a Kafka server.
It is agent to exchange messages.

Consumers

It recieves data.
Producer don’t send data to receipeint address. producer send the message to kafka server.
Anyone who is interested in that data, can come forward and request the information, provided they have permission to read it.
If you want to read a file, create a consumer application and request kafka for the data.
Client application will recieve a lines of message.

Topic

Unique name for stream of data or Kafka Stream.

Partitions

Data can be larger than storage capacity of single computer.
Obvious solution, is to distribute the data on different systems.
Break the data into partitions, and store.
When we create a topic, we give the argument for partition.
Every partition sits on a single system.

Offset

This is a sequence number assigned to message arrived partition.
Immutable
It starts from 0.
There is no global offset across partition.
To locate a message, Topic , Partition and offset.

Consumer Group

It is a group of consumers dividing the task among themselves.
eg. help in writing the data to data center
Partition and consumer groups are tool for scalability.
Maximum number of the consumers in a group, is total number of partitions on the topic.
Kafka doesn’t allow more than 2 consumer from the same partition, simultaneously.

To learn more about Apache Kafka, Stay Tuned.

Hope this helps! Keep tuned for more blogs from ML series.

Happy Learning!

Rajiv Jha :)

Rajiv Jha

My name is Rajiv Jha. I am Senior Engineering student at Guru Gobind Singh Indraprastha University.

Kafka Tutorial - Part I

Introduction

Components

Rajiv Jha

Recent post

Kafka Tutorial - Part I

Introduction

Components

Rajiv Jha

Recent post

Newsletter