
I would like to ask if my understanding of Kafka is correct.

For a really, really big data stream, a conventional database is not adequate, so people use things such as Hadoop or Storm. Does Kafka sit on top of said systems and provide ... directions for where the real-time data should go?

Loredra L

6 Answers


I don't think so.

Kafka is a messaging system, and it does not sit on top of a database.

You can compare Kafka with messaging systems like ActiveMQ, RabbitMQ etc.

From the Apache Kafka documentation page:

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

Key takeaways:

  1. Kafka maintains feeds of messages in categories called topics.
  2. We'll call processes that publish messages to a Kafka topic producers.
  3. We'll call processes that subscribe to topics and process the feed of published messages consumers.
  4. Kafka is run as a cluster comprised of one or more servers each of which is called a broker.


Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol.
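To make those terms concrete, here is a minimal sketch using the official Java client. The broker address, the topic name page-views, and the string serialization are assumptions for illustration only, not anything from the documentation quoted above:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaRolesDemo {
    public static void main(String[] args) {
        // Producer: a process that publishes messages to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // broker address (assumed)
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
        } // closing the producer flushes pending sends

        // Consumer: a process that subscribes to the topic and processes the feed.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group"); // consumers in a group share the topic's partitions
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```

Note that the producer and the consumer never talk to each other directly; both only talk to the brokers, which is exactly the loose coupling discussed in the comments below.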

Use Cases:

  1. Messaging: Kafka works well as a replacement for a more traditional message broker. In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
  2. Website Activity Tracking: the original use case for Kafka was to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
  3. Metrics: Kafka is often used for operational monitoring data, which involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
  4. Log Aggregation.
  5. Stream Processing.
  6. Event Sourcing: a style of application design where state changes are logged as a time-ordered sequence of records.
  7. Commit Log: Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data (a sketch of creating such a commit-log-style topic follows this list).
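For the commit-log use case in item 7, the relevant broker feature is log compaction, which retains the latest record per key instead of deleting by age. A hedged sketch of creating such a topic with the Java AdminClient; the topic name, partition count, and replication factor are assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CommitLogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the latest record per key,
            // which is what a commit-log / changelog topic needs.
            NewTopic topic = new NewTopic("account-state", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}
```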
Ravindra babu
  • Sorry, but I do not understand why we have Kafka for a task that seems like communication between a server and a client? – Loredra L May 17 '16 at 14:42
  • To provide loose coupling between two different enterprise services/systems. Sender and receiver services are loosely coupled via messaging integration. Visit these links: enterpriseintegrationpatterns.com and enterpriseintegrationpatterns.com/patterns/messaging – Ravindra babu May 17 '16 at 15:05
  • [Apache Kafka is not for Event Sourcing](https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c): The author claims that, "you can use Kafka as an event store or an event log, but it really isn’t a suitable tool for event sourcing.". The author says that loading the current state is expensive as you have to replay all past states. I'm not sure, though, if they simply want to advertise their system. – Martin Thoma Jul 31 '18 at 04:44
  • @MartinThoma, if you use it as an event source, you would most probably keep the latest state in some kind of cache, so whenever a new event comes in, this cache/state is updated. This is usually called a "projection", as it is the projection of all the events relevant to your entity. Only in case of a crash would you need to re-run all the events to get the projection, but again, depending on your application you can persist the cache from time to time and just replay the missing events. – Anderson Saunders Nov 30 '20 at 05:02

To fully understand Apache Kafka's role, you should get the wider picture and know Kafka's use cases. Modern data processing systems try to break with the classic application architecture. You can start with the kappa architecture overview:

In this architecture you don't store the current state of the world in any SQL or key-value database. All data is processed and stored as one or more series of events in an append-only, immutable log. Immutable events are easier to replicate and store in a distributed environment. Apache Kafka is a system that is used for storing these events and for brokering them between other system components.
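A minimal sketch of that idea: the "current state of the world" is a projection obtained by folding over the event log, so it can always be rebuilt by replaying the topic from the beginning. The topic name, the string types, and the "stop when caught up" shortcut are assumptions for illustration:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ProjectionRebuild {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "projection-rebuild");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest"); // replay from the first event

        Map<String, String> currentState = new HashMap<>(); // the "projection"
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // assumed event topic
            // Fold the immutable event log into current state:
            // later events for the same key overwrite earlier ones.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) break; // "caught up" - a simplification for this sketch
                for (ConsumerRecord<String, String> record : records) {
                    currentState.put(record.key(), record.value());
                }
            }
        }
        System.out.println("Rebuilt state: " + currentState);
    }
}
```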

mgosk

Use cases on Apache Kafka's official site: http://kafka.apache.org/documentation.html#uses

More use cases:

Kafka-Storm Pipeline: Kafka can be used with Apache Storm to build a data pipeline for high-speed filtering and pattern matching on the fly; a sketch follows.
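A hedged sketch of such a pipeline with the storm-kafka-client spout (assuming Storm 2.x APIs; the topic name and the filter pattern are made up for illustration, and the bolt just prints matches instead of emitting them downstream):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class KafkaStormPipeline {

    // A trivial bolt that keeps only messages matching a pattern.
    public static class FilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // The default spout translator emits fields: topic, partition, offset, key, value.
            String value = input.getStringByField("value");
            if (value != null && value.contains("ALERT")) { // pattern is an assumption
                System.out.println("matched: " + value);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt only prints; it emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        // Spout: continuously pulls records from the Kafka topic into the topology.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("localhost:9092", "raw-events").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig));
        builder.setBolt("filter", new FilterBolt()).shuffleGrouping("kafka-spout");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("kafka-storm-demo", new Config(), builder.createTopology());
            Thread.sleep(60_000); // let the local topology run for a minute, then shut down
        }
    }
}
```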

Vijay

Apache Kafka is not just a message broker. It was initially designed and implemented by LinkedIn in order to serve as a message queue. Since 2011, Kafka has been open sourced and quickly evolved into a distributed streaming platform, which is used for the implementation of real-time data pipelines and streaming applications.

It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Modern organisations have various data pipelines that facilitate the communication between systems or services. Things get a bit more complicated when a reasonable number of services need to communicate with each other in real time.

The architecture becomes complex, since various integrations are required to enable the inter-communication of these services. More precisely, for an architecture that encompasses m source and n target services, m × n distinct integrations need to be written. Also, every integration comes with a different specification, meaning that one might require a different protocol (HTTP, TCP, JDBC, etc.) or a different data representation (Binary, Apache Avro, JSON, etc.), making things even more challenging. Furthermore, source services might face increased load from connections that could potentially impact latency.

Apache Kafka leads to simpler and more manageable architectures by decoupling data pipelines. Kafka acts as a high-throughput distributed system where source services push streams of data, making them available for target services to pull in real time. Since every service then integrates only with Kafka, the m × n point-to-point integrations collapse to m + n.
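As an illustration of such a decoupled pipeline, here is a minimal Kafka Streams sketch: source services write to one topic, the stream application filters, and target services read the result topic without knowing anything about the sources. The topic names, the application id, and the placeholder predicate are all assumptions:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class DecoupledPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-filter"); // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Source services push to "payments"; this app filters and republishes.
        // Any number of target services can consume "priority-payments"
        // without any direct integration with the sources.
        KStream<String, String> payments = builder.stream("payments");
        payments.filter((key, value) -> value != null && value.contains("priority"))
                .to("priority-payments"); // predicate is a placeholder for real logic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```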

Also, a lot of open-source and enterprise-level User Interfaces for managing Kafka Clusters are available now. For more details refer to my answer to this question.

You can find more details about Apache Kafka and how it works in the blog post "Why Apache Kafka?"

Giorgos Myrianthous

Apache Kafka is an open-source software platform written in Scala and Java, mainly used for stream processing.

The use cases of Apache Kafka are:

  • Messaging
  • Website Activity Tracking
  • Metrics
  • Log Aggregation
  • Stream Processing
  • Event Sourcing
  • Commit Log

For more information, see the official Apache Kafka site: https://kafka.apache.org/uses

Manjunatha H C

Kafka is a highly scalable pub-sub messaging system. It acts as a transport layer guaranteeing exactly-once semantics, and Spark Streaming does the processing. The next question that comes to my mind is: Spark can already poll directories for files and even read from a socket or port, so how do Kafka and Spark work in tandem? I mean, does an application written in some language, instead of writing to a database for storage, feed the data directly to the port (or place the files, which would not really be real time and would rather be some kind of batch processing), from which it is then read by a Kafka producer and then, via the Kafka consumer API, read and processed by Spark Streaming?
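For what it's worth, the usual pattern avoids ports and file polling entirely: the application publishes to Kafka with a normal producer, and Spark subscribes to the topic through its built-in Kafka source. A hedged sketch using Spark Structured Streaming rather than the older DStream API (the topic name and local master are assumptions; it needs the spark-sql-kafka connector on the classpath):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaSparkBridge {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-spark-demo")
                .master("local[*]") // local run, for the sketch only
                .getOrCreate();

        // Spark subscribes to the Kafka topic directly via the "kafka" source;
        // no sockets or directory polling involved.
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "app-events") // assumed topic name
                .load();

        // Each row carries key, value, topic, partition, offset, timestamp.
        StreamingQuery query = events
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                .writeStream()
                .format("console") // print micro-batches; real jobs would aggregate or persist
                .start();

        query.awaitTermination();
    }
}
```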