Blurred Storage Lines: Clouds That Appear Like On-Prem
Cloud storage is growing fast as companies look to take advantage of low-cost and flexible storage options for terabytes and petabytes of data. But for all the convenience of cloud storage, sometimes it’s just better when data is closer. That’s helping to drive adoption of distributed file systems from vendors like LucidLink and Kmesh that can make cloud-resident data appear like it’s stored on-prem.
Interest in object stores is surging for good reasons, according to Peter Thompson, the founder and CEO of LucidLink, a San Francisco-based storage startup.
“It’s cost effective. It’s highly durable. It’s elastic,” he says. “And because of the way that companies are working — where we have distributed workforces and distributed workflows — it’s contributing to this need to be able to access data from multiple locations as quickly as possible. Those are all converging at the same time.”
But as good as object stores are, they typically lack several capabilities that big companies demand – namely speed and consistency. That should effectively relegate object stores to backup and archive roles.
However, because developers can manipulate data so simply through S3’s native REST API, object stores are being used for a variety of other applications for which they are perhaps not architecturally best suited.
That’s the functionality gap that LucidLink is targeting with its product, a distributed file system that makes any S3-compatible object store, whether in the cloud or on-premises, look and function like a fast local file system, no matter where users are located around the globe.
The LucidLink Filespaces product essentially functions as a software-based acceleration gateway. The customer installs Filespaces on an x86 server and points the local Filespaces file system at a remote cloud object store. As users start accessing data in the object store, Filespaces starts streaming the requested file.
Thompson doesn’t claim to be fully solving the problem of making data appear to be local even when it’s stored in a data center halfway across the globe. “You have to live within the rules of physics,” he says. “But what we’re addressing is the problem of latency, as it relates to cloud, both in terms of distance and also in terms of object storage. Those are the two things we’re really addressing.”
The Filespaces software uses techniques like pre-fetching and local caching to keep the hottest data close to the user. It also avoids NFS, which is too chatty for use over WANs, and provides compression and encryption.
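The pre-fetching and caching idea can be illustrated with a small sketch. This is not LucidLink’s implementation: the `BlockCache` class, the `fetch_block` callback, the block-index scheme, and the read-ahead depth are all assumptions made for illustration. The sketch keeps recently used blocks in a least-recently-used (LRU) cache and, on each read, also pulls the next few blocks so that sequential reads are served locally.

```python
from collections import OrderedDict
from typing import Callable


class BlockCache:
    """Hypothetical LRU block cache with sequential read-ahead."""

    def __init__(self, fetch_block: Callable[[int], bytes],
                 capacity: int = 4, read_ahead: int = 2) -> None:
        self.fetch_block = fetch_block      # stands in for a remote object-store read
        self.capacity = capacity            # maximum number of blocks kept locally
        self.read_ahead = read_ahead        # how many following blocks to pre-fetch
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.remote_fetches = 0             # counts trips to the remote store

    def _load(self, index: int) -> bytes:
        if index in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(index)
            return self.blocks[index]
        # Cache miss: fetch from the remote store and keep a local copy.
        data = self.fetch_block(index)
        self.remote_fetches += 1
        self.blocks[index] = data
        if len(self.blocks) > self.capacity:
            # Evict the least recently used block.
            self.blocks.popitem(last=False)
        return data

    def read(self, index: int) -> bytes:
        data = self._load(index)
        # Pre-fetch the blocks a sequential reader is likely to ask for next.
        for nxt in range(index + 1, index + 1 + self.read_ahead):
            self._load(nxt)
        return data


if __name__ == "__main__":
    cache = BlockCache(lambda i: f"block-{i}".encode())
    for i in range(3):
        cache.read(i)
    print("remote fetches for 3 sequential reads:", cache.remote_fetches)
```

In this toy version the first read pays for three remote fetches, and each following sequential read pays for only one, because the block it needs has already been pulled in.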
It’s very much like Netflix, Thompson tells Datanami. “You can think of us as Netflix, but instead of streaming movies, we’re providing reads and writes,” he says. “It’s like Netflix for general files.”
The software does not synchronize the local data store with the cloud object store, which would create thorny consistency issues, Thompson says.
“That source of truth is the data lake in the cloud,” Thompson says. “We’re not synchronizing entire data sets or files. We’re synchronizing metadata, so that when the application is requesting a particular block, they can see whether it’s stale or not, and then we just stream the relevant part of that file on demand.”
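A rough sketch of the approach Thompson describes, in which only metadata is kept in sync and blocks are streamed on demand, might look like the following. Everything here is hypothetical: the `ObjectStore` and `Client` classes, the per-block version counter, and the in-memory dictionaries are stand-ins chosen for illustration, not LucidLink’s actual design. In particular, the direct version lookup stands in for metadata that a real system would synchronize to each client.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

BlockKey = Tuple[str, int]  # (file path, block index)


@dataclass
class ObjectStore:
    """Stand-in for the cloud object store, the single source of truth."""
    versions: Dict[BlockKey, int] = field(default_factory=dict)
    data: Dict[BlockKey, bytes] = field(default_factory=dict)

    def write(self, path: str, block: int, payload: bytes) -> None:
        key = (path, block)
        # Every write bumps the block's version in the metadata.
        self.versions[key] = self.versions.get(key, 0) + 1
        self.data[key] = payload


@dataclass
class Client:
    """Hypothetical client that caches blocks and checks metadata for staleness."""
    store: ObjectStore
    cached: Dict[BlockKey, Tuple[int, bytes]] = field(default_factory=dict)
    streamed: int = 0  # number of blocks pulled from the store

    def read(self, path: str, block: int) -> bytes:
        key = (path, block)
        current = self.store.versions[key]  # small metadata check, no payload
        hit = self.cached.get(key)
        if hit is not None and hit[0] == current:
            return hit[1]                   # cached copy is still fresh
        # Stale or missing: stream only this block, not the whole file.
        payload = self.store.data[key]
        self.streamed += 1
        self.cached[key] = (current, payload)
        return payload


if __name__ == "__main__":
    store = ObjectStore()
    store.write("footage.mov", 0, b"first cut")
    editor = Client(store)
    editor.read("footage.mov", 0)               # streamed from the store
    editor.read("footage.mov", 0)               # served from the local cache
    store.write("footage.mov", 0, b"second cut")
    print(editor.read("footage.mov", 0), editor.streamed)
```

The point of the design is that a client never holds a competing copy of the truth: a cached block is used only while its version matches the metadata, so there is no whole-file synchronization to reconcile.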
The company has about 200 active instances currently, according to Thompson, who says LucidLink is most effective with bigger files, such as CAD/CAM files, video footage, radiology images, and other types of massive, unstructured data. Customers are primarily in media and entertainment, oil and gas, and healthcare industries, but not exclusively.
One of LucidLink’s customers is a consortium of wineries, tasting rooms, and other businesses in the wine industry. The company tried to keep all of its data straight by using a file sharing service that synchronized data across on-prem and cloud locations. That worked for smaller files and small data sets, but as the company grew, it proved too difficult to keep the files synced.
“What we hear from customers is there’s a long tail to regular file-based access,” Thompson says. “Everybody is talking about the cloud and moving to the cloud. But when we talk to customers [they’re concerned with] how to do it easily and with as little disruption as possible.”
A Global Mount
Another company targeting this space is Kmesh, which is based in Santa Clara, California. Kmesh develops a Lustre-based distributed file system that blends the benefits of centralized data lakes based on object stores and distributed repositories of data.
“Our thesis is the last 20 years has been focused on centralized data lake,” says Jeff Kim, Kmesh CEO. “Put all of your data into a single repository, mostly on prem, and then operate all of your applications and users and stuff from this centralized data lake. Our view is the next 20 years we’re going to see a mass transformation of de-centralized data lakes into distributed data ponds, across both on prem, multi cloud and now edge compute.”
Enterprises today are trying to figure out how to do this effectively: how to move data around, orchestrate it, and set up data policies across all of these compute environments, Kim says. “What Kmesh does is we offer software to allow customer to make that transformation easier,” he says.
Customers can use the Kmesh file system to streamline access to other data stores, including NoSQL databases like Cassandra and MongoDB, a Spark implementation, or even log histories or configuration files. No matter where those data stores are located, they can be made available in a single global namespace via Kmesh, Kim says.
“Instead of that [data] residing in a single location, you can spread it out across the globe or across compute environments and still have the ability to manage it like a single repository,” Kim says. “You don’t have to change your application, your functionality. You can basically distribute this at the file system level and it doesn’t matter what you have on top.”
Kmesh lets users access data stored in a variety of underlying file systems or databases as if it were local, even though the data may actually be located in the cloud. The software provides a range of pre-fetching and caching capabilities to give customers the flexibility they need, Kim says.
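The single global namespace Kim describes can be pictured as a mount table that maps path prefixes to wherever the data actually lives. The sketch below is an illustration only and says nothing about how Kmesh implements it: the `GlobalNamespace` class, the longest-prefix matching rule, and the backend names are all assumptions.

```python
from typing import Dict, Tuple


class GlobalNamespace:
    """Hypothetical mount table: one path tree, many physical locations."""

    def __init__(self) -> None:
        self.mounts: Dict[str, str] = {}  # path prefix -> backend name

    def mount(self, prefix: str, backend: str) -> None:
        # Normalize so every prefix ends with exactly one slash.
        self.mounts[prefix.rstrip("/") + "/"] = backend

    def resolve(self, path: str) -> Tuple[str, str]:
        """Return (backend, path relative to that backend's mount point)."""
        candidates = [p for p in self.mounts if path.startswith(p)]
        if not candidates:
            raise FileNotFoundError(path)
        # The most specific (longest) matching prefix wins.
        best = max(candidates, key=len)
        return self.mounts[best], path[len(best):]


if __name__ == "__main__":
    ns = GlobalNamespace()
    ns.mount("/data", "cloud-us-east")        # hypothetical cloud region
    ns.mount("/data/eu", "onprem-frankfurt")  # hypothetical on-prem site
    print(ns.resolve("/data/eu/logs/app.log"))
    print(ns.resolve("/data/spark/job1/part-0"))
```

Applications see one tree under `/data`, while a rule such as the `/data/eu` mount can keep a subset of the data in a particular location, which is the kind of placement that data locality requirements call for.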
Data locality laws that have sprung up in the wake of GDPR are one of the bigger drivers of business for Kmesh. So is the desire among customers to avoid lock-in with cloud vendors. “We’re seeing that quite a bit,” Kim says. “You can absolutely go to AWS, use all AWS features and go all in on AWS. Later on though if you want to work with Azure or Google cloud or a new edge compute from Vapor or Ericsson, the system that they built would have to be completely overhauled because all your technology is based on the AWS software stack.”
Kim tells his customers to take care to future-proof their applications so they’re not locked into a single vendor. “Don’t use Amazon’s RDS [Relational Database Service] because it’s completely black box. There’s nothing you can do outside of Amazon with that. Better to use Cassandra, which works everywhere, and then use something like us so you can move and mix and do whatever you need to do.”
While Kmesh’s software is based on Lustre, most customers don’t care about that, as they are just trying to solve business challenges. The one exception is government labs, which are looking for more effective ways to store their data.
“We took Lustre, which was primarily built for supercomputer environments, and we ported it to the cloud,” Kim says. “Why it took three years to get there was because a lot of software that was built for HPC supercomputers we had to adapt to work in a cloud environment over a WAN, and [move it] from microsecond environments to the millisecond environment.”
Kmesh and LucidLink are examples of a new breed of file system that’s emerging to tackle today’s big data challenges. While cloud object stores are poised to capture many storage workloads, the diversity of today’s applications requires a mixed approach, and distributed file systems appear poised to help deliver it.