Now and Then: The Evolution of In-Memory Computing
The history of data warehousing, big data, and analytics can be described as a constant challenge to process and analyze ever-increasing volumes of data in ever-shorter amounts of time. Fundamentally, the single biggest factor affecting our ability to process data is the speed at which we can access it, and then do something with it. Data warehousing architectures have addressed this over the years by adopting massively parallel processing (MPP) designs, but the lower latency of in-memory computing, combined with expanded workload capabilities, promises breakthrough performance and real-time access to big data, provided the price and capacity challenges can be addressed.
Only recently has using memory as both working space and primary data storage become viable. The first in-memory database systems from the 1990s were not "big data" or "analytics" systems, because they could not handle the volume of data required for true analytics over very large data sets.
Today we have systems that can accommodate terabytes of RAM. Coupled with innovative compression techniques, this makes it possible to store hundreds of terabytes in memory, albeit at a significant price premium over traditional or solid-state storage options. In short, we can now do big data analytics on purely in-memory systems. More importantly, in-memory systems open the door to new possibilities, with the potential to combine transactional processing (OLTP) with analytic processing (OLAP) in the same system, for a hybrid OLTAP system!
Combining OLTP and OLAP processing is a difficult problem to solve, as the workload characteristics differ hugely. Most OLTP systems deal with high volumes of short transactions over small, discrete units of data, and organize the data in a row-based structure. Most OLAP systems, on the other hand, deal with much longer running queries over much larger datasets and require a divide-and-conquer approach (MPP) to be effective. Many of these also use a columnar data structure that is optimized for analytic processing.
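To make the row-versus-column distinction concrete, here is a minimal sketch (an illustration, not any vendor's actual storage engine) of the same table held in both layouts, with the access pattern each favors:

```python
# Row store: each record's fields sit together -- good for OLTP point lookups.
rows = [
    {"id": 1, "region": "EU", "amount": 120},
    {"id": 2, "region": "US", "amount": 340},
    {"id": 3, "region": "EU", "amount": 75},
]

# Column store: each field's values sit together -- good for OLAP scans.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120, 340, 75],
}

# OLTP-style access: fetch one complete record by key (one row, all fields).
order = next(r for r in rows if r["id"] == 2)

# OLAP-style access: aggregate one field over all records. Only the contiguous
# "amount" column is touched; "id" and "region" are never read.
total = sum(columns["amount"])
```

In a real columnar engine the contiguous per-column layout is also what enables the aggressive compression mentioned above, since values in a single column tend to be similar.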
When we add in-memory capabilities to the mix, the lines become blurred. In-memory storage provides OLTP-like response times for operational data warehousing queries, and makes it possible to run transactional processing on the same physical infrastructure that also runs analytics.
A Converged Future
Most of today’s analytic environments are largely limited to the latency of batch analytics. Not only are the reports that are run on them submitted in overnight batches, but the data is refreshed through a batch ETL process. The most current data many of these systems can provide is yesterday’s. A hybrid approach combining in-memory and streaming capabilities with traditional data warehousing systems allows us to solve this problem at scale and usher in the next wave of responsiveness and personalization, while also providing huge opportunities for converged systems that can handle a much more diverse set of workloads.
In-memory computing tends to be a trade-off between performance, scalability, and cost. In 2013, the average price of RAM per GB was more than 100 times that of disk. Practically speaking, that means that while you might be able to build a 100TB in-memory data warehouse, the raw storage will cost roughly 100 times more than a traditional, disk-based solution! The elimination of indexes and aggregate tables reduces the raw storage requirements significantly, but in-memory systems still require traditional persistent storage too, so the cost offset is not as great as one might think.
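A quick back-of-envelope calculation makes the trade-off tangible. The per-GB prices and the 4:1 compression ratio below are illustrative assumptions; only the ~100x RAM-versus-disk ratio comes from the 2013 figure above:

```python
# Assumed prices per GB -- illustrative only, not quoted market rates.
disk_per_gb = 0.05                 # $/GB for disk (assumption)
ram_per_gb = disk_per_gb * 100     # ~100x premium, per the 2013 figure

capacity_gb = 100 * 1024           # a 100 TB warehouse

disk_cost = capacity_gb * disk_per_gb
ram_cost = capacity_gb * ram_per_gb

# Compression and dropping indexes/aggregates shrink the in-memory footprint,
# but even at an assumed 4:1 ratio the raw-storage gap remains large.
compression_ratio = 4
ram_cost_compressed = ram_cost / compression_ratio
```

Even with generous compression, the in-memory bill in this sketch is still 25x the disk bill, which is the gap the hybrid approaches below are designed to close.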
The trade-off comes down to data volume, and how quickly you need to access it. While the cost differential between DRAM and traditional storage will decrease over time, there is still a wide enough discrepancy that an all-in-memory approach is not feasible for multi-terabyte and petabyte-scale analytics systems, even when taking into account compression, elimination of traditional constructs like indexes, and other efficiencies gained from in-memory approaches.
A more intelligent approach is to use hybrid models where frequently accessed data (hot data) is moved onto faster media (in-memory), and less frequently accessed data is left on slower, higher capacity media. Many vendors do just that with traditional disk and SSD flash storage; fewer have extended this capability to in-memory.
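The hot/cold tiering idea can be sketched in a few lines. This is a minimal illustration of the concept, not any vendor's design: a small LRU-managed in-memory tier backed by a larger "cold" store standing in for disk or SSD.

```python
from collections import OrderedDict

class TieredStore:
    """Hot/cold tiering sketch: recently used keys live in the hot
    (in-memory) tier; everything else is demoted to the cold tier."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # in-memory tier, kept in LRU order
        self.cold = {}             # stands in for disk/SSD
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self.cold.pop(key, None)   # keep the tiers exclusive
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        self._evict()

    def get(self, key):
        if key in self.hot:            # hot hit: memory-speed access
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]         # cold miss: "disk" read...
        self.put(key, value)           # ...then promote to the hot tier
        return value

    def _evict(self):
        # Demote the least recently used keys once the hot tier overflows.
        while len(self.hot) > self.hot_capacity:
            key, value = self.hot.popitem(last=False)
            self.cold[key] = value
```

With a hot capacity of two, inserting "a", "b", and "c" demotes "a" to the cold tier; reading "a" again promotes it back and demotes "b" instead. Real systems add persistence, concurrency, and smarter placement policies, but the promote-on-access, demote-on-pressure cycle is the core of the idea.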
While the hype today is on in-memory analytics systems, the real performance gains and opportunities for optimization are going to come in two key areas:
- Systems that move beyond in-memory and start to leverage the built-in cache inside the CPU itself.
- Systems that can deliver converged OLTAP platforms, performing transactional and analytic processing on the same system.
While in-memory access is roughly 40,000x faster than disk access, making it far better for low-latency applications where immediate response is a necessity, CPU cache adds another 15-40x performance boost, if it can be effectively leveraged. Practically speaking, this means that data in a CPU's cache can be accessed 600,000 times faster than data on disk!
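The compounding above, made explicit (the multipliers are the article's rough figures, not measurements):

```python
# Rough speedup multipliers from the paragraph above.
disk_vs_memory = 40_000   # RAM access ~40,000x faster than disk
memory_vs_cache = 15      # CPU cache: a further 15-40x over RAM

# At the conservative (15x) end of the cache range, the multipliers compound:
cache_vs_disk = disk_vs_memory * memory_vs_cache
print(cache_vs_disk)  # 600000
```

At the optimistic 40x end of the cache range, the same compounding would put cache roughly 1.6 million times faster than disk.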
And when we add converged capabilities to the mix, we will gain huge efficiencies by eliminating a significant amount of data movement from OLTP to OLAP systems, thus providing near real-time analytics on the most recent data.
It’s really the only thing that makes sense in a big data world: better-than-in-memory performance, without the constraints of a pure in-memory system, and near real-time analytics on transactional data. The future looks bright!
About the author: Adam Ronthal has more than 17 years of experience in technical operations, system administration, and data warehousing and analytics. An IBMer, he is currently involved in Big Data & Cloud Strategy. You can follow Adam on Twitter at @ARonthal.