
IBM to Buy DataStax for Database, GenAI Capabilities

IBM today announced its intent to acquire DataStax, the longtime backer of the Apache Cassandra database that has recently broadened its reach into streaming data and generative AI. IBM cited DataStax’s capability to manage unstructured data as well as its vector database, which is used for developing RAG solutions.
Apache Cassandra was originally developed at Facebook in 2008 to serve the fledgling social network’s need for a highly scalable, fault-tolerant database to store big data generated by users on its website. Facebook was a big user and creator in the nascent big data ecosystem, building its social media empire atop non-relational technology like Apache Hadoop and HBase, another NoSQL data store, as well as Apache Hive, which it created to make Hadoop look like a relational database. (Facebook would eventually move back to using relational databases, specifically MySQL, but that is another story.)
Cassandra, which technically is a wide-column store that favors data availability and reliability (at the expense of data consistency), became a top-level project at the Apache Software Foundation in 2010. That is the same year that Jonathan Ellis and Matt Pfeil co-founded a company in Austin, Texas, called Riptano, which they quickly renamed DataStax.
At first, DataStax followed the typical commercial open-source business model, offering an enterprise version of Apache Cassandra called DataStax Enterprise (DSE). The company, which had moved to Santa Clara, California by 2014, attracted customers from the Fortune 500, such as FedEx, Capital One, and Verizon. It had raised $106 million in venture capital at an $830 million valuation, and was on pace for an IPO in the 2015 or 2016 timeframe.
That IPO never happened, as MongoDB dominated the NoSQL space and went public in 2017. In May 2020, DataStax launched Astra DB, a fully managed version of Cassandra running in the cloud atop Kubernetes, giving customers the scalability and availability benefits of the NoSQL database but without the management responsibilities (like many distributed systems, Cassandra can be difficult to manage). Later that year, it released K8ssandra, an open source version of the database running atop the resource manager.
Soon, the company started branching beyond NoSQL databases. In 2021, it launched Astra Streaming, an event streaming platform based on Apache Pulsar, a publish and subscribe (pub-sub) data platform that competes with Apache Kafka. In 2023, DataStax bought Kaskada, an AI startup that helped to automate tedious feature engineering tasks, and made the software open source under the Luna ML brand.
DataStax further bolstered its generative AI capabilities in 2023 with the launch of a vector store in Astra DB. Vector stores emerged as critical tools for building retrieval-augmented generation (RAG) pipelines to bolster the accuracy of large language model (LLM) output in generative AI applications. Then in 2024, DataStax further fleshed out its RAG story when it nabbed Langflow, which developed an open source framework for building RAG pipelines.
All of the accumulated capabilities that DataStax built and bought obviously caught the eye of IBM. Big Blue, which has been rallying its business to some degree on the back of its watsonx AI offerings, cited open source projects like Apache Cassandra, Apache Pulsar, Langflow, and OpenSearch (a fork of Elasticsearch and Kibana) in its press release announcing the acquisition.
IBM is particularly enamored of how DataStax has built its unstructured data management capabilities under a single product. While it didn’t mention DataStax’s Hyper-Converged Data Platform (HCDP) by name, it seems clear that IBM is banking on harnessing the tech to help customers turn unstructured data into winning AI applications.
“Unstructured data represents a treasure trove of untapped business intelligence, representing 93% of all enterprise data in 2024, according to IDC,” Ritika Gunnar, IBM’s general manager of data and AI, says in a blog post. “Harnessing the power of this data within generative AI applications is essential. But to do that, enterprises must first make order out of data chaos.”
According to Gunnar, IBM wants to bring DataStax’s open source offerings together with its watsonx portfolio of products, specifically Apache Iceberg, Apache Spark, Velox, and Presto, to help customers leverage large amounts of unstructured data.
“The data infrastructure required for AI is much more than just vector,” Gunnar writes. “Many modalities of data–JSON, time-series, key/value, tabular, graph–need to come together to make the data ingest and search accurate and relevant. By having them built into a simplified and scalable solution (thanks to generative AI) users don’t have to stitch together a multitude of data representations to gain value from their enterprise data.”
In his own blog post, DataStax CEO Chet Kapoor discussed how DataStax and IBM have worked together with open source software (OSS) since 2020, including deploying DataStax products atop the IBM OpenShift platform.
“We respect the leadership and stewardship that IBM has demonstrated with OSS and the great OSS companies that have found a home at IBM, like Red Hat and others, and we’re excited to become part of a company that understands the power of openness,” Kapoor writes. “With our technologies and IBM’s watsonx.data, their hybrid, open data lakehouse, we will be able to bring vector and AI search to the entire data estate and make IBM’s capabilities available to every developer.”
Terms of the deal, which is expected to close in the second quarter, were not disclosed. DataStax was valued at $1.6 billion during its most recent funding round, in June 2022. The company has raised $342.6 million over several rounds. It has hundreds of paying customers, according to IBM.