Kafka Tutorial - Part I
July 5, 2018
I am starting a series on Kafka. The articles in this series are meant for aspiring data scientists (like me) who wish to learn the full-stack Big Data and Machine Learning pipeline.
I will refer to various sources of knowledge and list them under the Reference section of the article.
Kafka is a distributed streaming platform. It is used to simplify data pipelines that are made up of a vast number of producers and consumers. Kafka gives you a stream, and you can plug in a processing framework. There are no plug-and-play components, so to customize it you need to learn the basics of Kafka.
- A producer is an application that sends data to Kafka.
- A message is a small to medium-sized piece of data.
- For Kafka, a message is a simple array of bytes.
- If you want to send a file to Kafka, create a producer application and send each line of the file as a message.
- In that case, a message is one line of text.
- If you want to send records, each row will be a message.
- If you want to send the result of a query, create a producer application, run the query, fetch the result, and send each row as a message.
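The "one line, one message" idea above can be sketched in a few lines of Python. This is only an illustration and does not talk to a real Kafka broker: the `send` argument is a hypothetical stand-in for whatever send call a real producer client provides, and the function names are my own.

```python
from typing import Callable, Iterable, List


def lines_to_messages(lines: Iterable[str]) -> List[bytes]:
    """Turn each non-empty line of text into one message payload (raw bytes)."""
    return [
        line.rstrip("\n").encode("utf-8")
        for line in lines
        if line.strip()
    ]


def produce(lines: Iterable[str], send: Callable[[bytes], None]) -> int:
    """Send one message per line and return how many messages were sent.

    `send` is a hypothetical callback standing in for a real producer client.
    """
    count = 0
    for payload in lines_to_messages(lines):
        send(payload)
        count += 1
    return count


# Stand-in for a broker: collect the payloads in a list.
sent: List[bytes] = []
file_lines = ["id,name\n", "1,alice\n", "2,bob\n"]
sent_count = produce(file_lines, sent.append)
```

The same shape works for query results: replace `file_lines` with the rows returned by the query, serialized to text.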
- For databases, connectors import data from a database into Kafka and export it as well.
- A cluster is a group of computers acting together for a common purpose.
- In a Kafka cluster, each computer runs one instance of the Kafka broker.
- A broker is a Kafka server.
- It is an agent that exchanges messages between producers and consumers.
- It receives data.
- A producer doesn't send data to a recipient's address; it sends the message to the Kafka server.
- Any application that is interested in that data (a consumer) can come forward and request it, provided it has permission to read it.
- If you want to read the file, create a consumer application and request the data from Kafka.
- The client application will receive the lines of the file as messages.
- A topic is a unique name for a stream of data (a Kafka stream).
- The data in a topic can be larger than the storage capacity of a single computer.
- The obvious solution is to distribute the data across different systems.
- Kafka breaks the data into partitions and stores them.
- When we create a topic, we pass the number of partitions as an argument.
- Every partition sits on a single system.
- An offset is a sequence number assigned to each message as it arrives in a partition.
- It starts from 0.
- There is no global offset across partitions.
- To locate a message, you need three things: the topic, the partition, and the offset.
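To make the topic, partition, and offset relationship concrete, here is a toy in-memory model. It is a sketch for intuition only and not how Kafka stores data; the `ToyTopic` class and the topic name `page-views` are made up for this example.

```python
from typing import List


class ToyTopic:
    """A toy topic: a named list of partitions, each an append-only log."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, message: bytes) -> int:
        """Append a message to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1  # offsets start at 0 within each partition

    def read(self, partition: int, offset: int) -> bytes:
        """Locate a message by partition and offset within this topic."""
        return self.partitions[partition][offset]


topic = ToyTopic("page-views", num_partitions=2)
first = topic.append(0, b"a")   # first message in partition 0
second = topic.append(0, b"b")  # second message in partition 0
other = topic.append(1, b"c")   # first message in partition 1
```

Note that `first` and `other` are both 0: each partition numbers its own messages, which is why an offset alone is not enough to find a message.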
- A consumer group is a group of consumers dividing the work among themselves.
e.g. they share the work of writing the data to a data center.
- Partitions and consumer groups are tools for scalability.
- The maximum number of active consumers in a group is the total number of partitions on the topic; any extra consumers sit idle.
- Kafka doesn't allow two or more consumers in the same group to read from the same partition simultaneously.
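The way a group divides partitions can be illustrated with a simple round-robin assignment. This is a simplified sketch and not Kafka's actual assignment logic; the function `assign_partitions` and the consumer names are hypothetical. It shows that each partition has exactly one owner within the group, and that consumers beyond the partition count get nothing to read.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer, round-robin."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by two consumers: each consumer reads two partitions.
two = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# Four partitions but five consumers: the fifth consumer stays idle.
five = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, owned in five.items() if not owned]
```

Adding consumers up to the number of partitions spreads the load; adding more than that does not help until the topic gets more partitions.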
To learn more about Apache Kafka, stay tuned.
Hope this helps! More posts from the ML series are on the way.
Rajiv Jha :)