When – and When Not – to Use Open Source Apache Cassandra, Kafka, Spark and Elasticsearch
Nearly every technology decision must meet two essential criteria: it must enable you to meet your business goals, and it must work well alongside the rest of your technology stack. When it comes to selecting data-layer technologies to build out your application architecture, open source Apache Cassandra, Apache Kafka, Apache Spark, and Elasticsearch continue to rise in popularity.
However, they’re not the right choice for every use case.
Let’s take a deeper look at each of these technologies, and some of the use cases that are – and are not – advantageous applications of these open source solutions.
Apache Cassandra
Originally created at Facebook in 2007, Cassandra combines a Dynamo-style distributed architecture with a Bigtable-style data model to provide a NoSQL data store that delivers high availability and high scalability.
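To make the Dynamo-style design a little more concrete, here is a minimal sketch of how a ring of nodes can decide which replicas hold a given key. It is an illustration only, not Cassandra's implementation: the node names, the `Ring` class, and the use of MD5 as the hash are all assumptions made for this example, and Cassandra's own partitioner and virtual nodes work differently in detail.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring: each node owns one position, keys go to the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sorted (token, node) pairs form the ring.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the nodes that would hold copies of this key."""
        start = bisect.bisect_right(self.tokens, token(partition_key))
        count = min(self.replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")  # three distinct nodes, the same ones on every call
```

Because every key is stored on several nodes and no node is special, losing one node still leaves other copies available, and adding a node only changes ownership for a slice of the ring.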
When You Should Use Apache Cassandra
Cassandra is an ideal choice for use cases that require the highest levels of always-on availability. The database is also particularly well suited to serving organizations that anticipate massive workloads, or that wish to ensure that their services can grow flexibly as workloads expand (and thus need the easy scalability that Cassandra provides). Cassandra offers reliable data redundancy and active-active operations across multiple data centers.
When You Shouldn’t
Cassandra is more resource-intensive than alternatives when tasked with data warehousing or pure analytics storage (even factoring in the use of available Spark connectors and Tableau and Hadoop plugins). Cassandra is also poorly suited to real-time analytics, especially in the form of end-user ad-hoc or custom queries, because the application-side code needed to support such queries can become convoluted. Additionally, Cassandra does not meet most ACID requirements.
Apache Kafka
First created by the technical team at LinkedIn, Apache Kafka provides a highly scalable and highly available streaming platform and message bus. Kafka functions as a distributed log: newly arriving messages are appended to the end of the log, and readers (consumers) consume them based on an offset that records their position.
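The log-and-offset idea can be sketched in a few lines of plain Python. This is a toy model under simplifying assumptions (one in-memory log, no brokers, no persistence); `PartitionLog` and `Consumer` are hypothetical names for this example and are not part of any Kafka client library.

```python
class PartitionLog:
    """Append-only list of messages; the list index plays the role of the offset."""

    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to the new message

    def read(self, offset: int, max_messages: int = 10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position, so consumers read independently of each other."""

    def __init__(self, log: PartitionLog):
        self.log = log
        self.offset = 0

    def poll(self, max_messages: int = 10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

billing = Consumer(log)
audit = Consumer(log)

first = billing.poll(max_messages=2)  # billing reads the first two messages
everything = audit.poll()             # audit still sees all three
rest = billing.poll()                 # billing resumes from its own offset
```

Reading does not remove anything from the log, which is why several consumers can process the same messages at their own pace.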
When You Should Use Apache Kafka
Apache Kafka is generally a smart choice for use cases that involve microservices and service-oriented architecture. Kafka can also serve as a highly effective work queue that’s able to coordinate separate work paths, conserving compute power by listening and waiting until work arrives. The platform’s stream processing capabilities are useful for anomaly detection, roll-ups, and aggregations, as well as for passing metrics through. Kafka is also a highly capable option for event sourcing, for data reconciliation across various microservices, and for providing an external commit log for distributed systems. Additional appropriate use cases include log aggregation, data masking and filtering, data enrichment, and fraud detection.
When You Shouldn’t
While it might be tempting in some cases, it can be ill-advised to use Kafka as a database or source-of-record, at least without a very solid understanding of Kafka’s limitations and properties for this use case. A true database will almost always be simpler to operate and more flexible. Kafka is a similarly inappropriate choice for in-order processing across an entire topic, because ordering is guaranteed only within a single partition. In any use case where the objective is to deliver data packets to their destination as quickly as possible, such as real-time audio and video or other lossy data streams, organizations should use purpose-built solutions instead of Kafka.
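The ordering limitation is easier to see with a small model of a topic split into partitions. The sketch below is an assumption-laden illustration: the partition count, the event data, and the CRC32-based `partition_for` function are invented for this example and stand in for whatever partitioner a real client uses.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def partition_for(key: str) -> int:
    # CRC32 stands in for the client's partitioner; it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("alice", 1), ("bob", 1), ("alice", 2), ("carol", 1), ("bob", 2), ("alice", 3)]

# Each event is appended to the partition chosen by its key.
for key, seq in events:
    partitions[partition_for(key)].append((key, seq))


def sequence_for(key: str):
    """Read back one key's events from the single partition that holds them."""
    return [seq for k, seq in partitions[partition_for(key)] if k == key]
```

Events that share a key always land in the same partition, so `sequence_for("alice")` comes back in the order it was written. Nothing in this model, however, records how events in different partitions were ordered relative to one another, which is why topic-wide ordering needs either a single partition or a different tool.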
Apache Spark
A general-purpose cluster computing framework suited to use cases involving large data volumes, Apache Spark divides data and runs computation on those segments, such that workers perform all possible work up until they require data from other workers. This design gives Spark tremendous scalability and availability, while also making it highly resilient against data loss.
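The divide-then-exchange pattern described above can be sketched without Spark itself. The following is plain Python, not the Spark API: the function names (`split`, `local_count`, `shuffle`, `reduce_bucket`) and the sample lines are assumptions chosen for this word-count illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import Counter, defaultdict


def split(records, num_partitions):
    """Divide the input into roughly equal partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def local_count(partition):
    """Work a single worker can do without talking to anyone else."""
    return Counter(word for line in partition for word in line.split())


def shuffle(partials, num_reducers):
    """Route each word's partial counts to one reducer (the data exchange step)."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for word, count in partial.items():
            # A simple deterministic hash decides which reducer owns the word.
            buckets[sum(word.encode("utf-8")) % num_reducers][word].append(count)
    return buckets


def reduce_bucket(bucket):
    """Combine the partial counts each reducer received."""
    return {word: sum(counts) for word, counts in bucket.items()}


lines = ["spark kafka spark", "cassandra kafka", "spark elasticsearch"]
partials = [local_count(p) for p in split(lines, num_partitions=2)]
result = {}
for bucket in shuffle(partials, num_reducers=2):
    result.update(reduce_bucket(bucket))
```

Each partition is counted independently, and data moves between workers only at the shuffle step, when partial counts for the same word must be brought together.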
When You Should Use Apache Spark
Spark lends itself to use cases involving large-scale analytics, especially cases where data arrives via multiple sources. Spark is a powerful solution for ETL or any use case that includes moving data between systems, either when used to continuously populate a data warehouse or data lake from transactional data stores, or in one-time scenarios like database or system migrations. Organizations building machine learning pipelines atop existing data, working with streaming workloads that can tolerate higher latency, or performing interactive, ad-hoc, or exploratory analysis will find Spark a strong fit. Spark can also help organizations meet their compliance needs by offering data masking, data filtering, and auditing of large data sets.
When You Shouldn’t
In general, Spark isn’t going to be the best choice for use cases involving real-time or low-latency processing. (Apache Kafka or other technologies deliver superior end-to-end latency for these needs, including real-time stream processing.) When working with small or single datasets, Spark is usually more than is needed. Also, when it comes to data warehousing and data lakes, it’s better to use a higher-level technology rather than Apache Spark directly, although such products built on Spark do exist.
Elasticsearch
Elasticsearch offers a full-text search engine that features a wide range of functionality to search and analyse unstructured data. The technology offers linearly scalable search in close to real time, provides a robust drop-in search replacement, and delivers significant analytics capabilities.
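Full-text search engines of this kind are generally built around an inverted index, which maps each term to the documents that contain it. The sketch below is a deliberately crude illustration of that idea, not Elasticsearch's API: the `InvertedIndex` class, the `tokenize` helper, and the sample documents are assumptions made for this example, and it omits relevance scoring, sharding, and real text analysis.

```python
import re
from collections import defaultdict


def tokenize(text: str):
    """Lowercase and split on non-alphanumeric characters (a very crude analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id: int, text: str):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str):
        """Return ids of documents containing every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "Payment failed: card declined")
index.add(2, "Payment succeeded")
index.add(3, "Login failed for user")

hits = index.search("failed payment")  # only document 1 contains both terms
```

Because a query only touches the postings for its own terms, lookups stay fast as the collection grows, which is part of what makes this structure a good fit for search and log analysis and a poor fit for relational or transactional workloads.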
When You Should Use Elasticsearch
Elasticsearch is strongly suited to use cases requiring full-text search, geographic search, scraping and combining public data, logging and log analysis, visualizations, and small-volume event data and metrics.
When You Shouldn’t
Elasticsearch should not be used as a database or source-of-record, with relational data, or to meet ACID requirements.
Selecting Complementary Technologies
Choosing the best mix of technologies for your organization (whether open source or otherwise) entails more than just evaluating the solutions themselves. Decision makers must also envision how the organization will adopt and use each solution as part of its technology stack. Apache Cassandra, Apache Kafka, Apache Spark, and Elasticsearch are a particularly complementary set of technologies that make sense for organizations to use together, and their open source nature offers freedom from license fees and vendor lock-in. By teaming these technologies and realizing their combined advantages, organizations can achieve their goals and develop applications that are highly scalable, available, portable, and resilient.
Ben Slater is the Chief Product Officer at Instaclustr, which provides a managed service platform of open source technologies such as Apache Cassandra, Apache Spark, Elasticsearch and Apache Kafka.