February 15, 2022

Coiled Finds Traction in Deploying Dask at Scale

When a data scientist is done playing around with a model and wants to run it at scale, she has several options. One potential avenue is Dask, the open source framework that parallelizes Python code. And since 2020, when the creator of Dask launched Coiled, data scientists have had a place to get technical support too.

There’s no shortage of Python code in the world today, both for data science and general computing use cases. The language ascended to the number one position on the TIOBE Index in 2021, and it’s by far the most popular language for data science and machine learning work today.

This is great news for Matthew Rocklin, the author of Dask. Rocklin originally released Dask in January 2015 to provide a way to scale up Python code to run on distributed clusters. From NumPy and Pandas to scikit-learn and PyTorch, there has been major growth in the Python data ecosystem, and Dask adoption has grown with it.

But there’s a lot that goes into distributed applications, and managing Dask applications (including applications that use the Dask engine as well as the pre-built Pandas environment) isn’t always easy, according to Rocklin.

“The classic story we see today is that some companies are using Dask for three to six months,” Rocklin tells Datanami. “It’s usually a data science or engineering team. They’re using their laptops. And they really like it.”

At some point, a confident member of the data science or engineering team decides to run Dask on a bigger data set in the cloud, Rocklin says. They start the new project in the cloud, and then they run into trouble.

“Then they realize, oh, this is actually kind of hard,” he says. “This is tricky. And so they want help in a few ways.”

That’s where Coiled comes in.

Dasking in the Cloud

Coiled is the commercial outfit that runs Dask in the cloud via the software as a service (SaaS) delivery method. Coiled spins up Dask environments when customers need large clusters, and spins them back down when the work is over.

Rocklin explains the basic Coiled workflow:

“When the Python user is in their notebook, a laptop or in some other cloud system like SageMaker, and they want to scale this out. [They say] ‘I’ve run some code locally. I’m having a good time. And I want to operate on my full data set.’

“They give us enough permissions to operate within the cloud environment in a very safe and secure way… They import Coiled. They ask Coiled for a Dask cluster with 500 machines of a certain type. We present those machines to them in about a minute. They then go off to do that work. They produce some plot, they turn the cluster down, and they go off on their merry way.

“We make it very, very easy to get large-scale computer resources at the drop of a hat.”
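The lifecycle Rocklin describes (bring up workers, run the job on the full data set, shut everything down) can be illustrated with a minimal sketch. This is not Coiled’s or Dask’s API: it uses only Python’s standard library, with a local thread pool standing in for the cloud cluster, and the names `temporary_cluster`, `partition_total`, and `run_at_scale` are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def temporary_cluster(n_workers):
    """Hypothetical stand-in for a cloud cluster: start workers, yield, shut down."""
    pool = ThreadPoolExecutor(max_workers=n_workers)  # "spin up"
    try:
        yield pool
    finally:
        pool.shutdown(wait=True)  # "spin down" once the work is over


def partition_total(partition):
    """Work done on one chunk of the full data set (here, a simple sum)."""
    return sum(partition)


def run_at_scale(partitions, n_workers=4):
    """Fan the partitions out to the workers and combine the partial results."""
    with temporary_cluster(n_workers) as cluster:
        partials = list(cluster.map(partition_total, partitions))
    return sum(partials)


if __name__ == "__main__":
    # Toy "full data set" split into four partitions.
    data = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]
    print(run_at_scale(data))
```

The context manager guarantees the workers are released even if the job fails, which mirrors the "spin up, then spin back down" pattern described above; in a real deployment the pool would be replaced by a distributed cluster rather than local threads.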

Customers turn to Coiled when they find their data scientists or data engineers have turned into DevOps engineers. That’s not really their forte, so by adopting Dask and Coiled, they can automate much of that operations work and get back to the Python-based data science or data engineering work that they’re paid to do and, frankly, would prefer to do.

Rocklin sees two groups of users being attracted to Coiled: small teams that want to get their data scientist back, and large teams that are laser-focused on cutting costs.

“On small teams, cost is not a major factor. The major cost is actually that half of an FTE they sacrifice for DevOps. They’re just trying to get that person back,” he says. “But when you go to the Fortune 50, Fortune 100 companies, it becomes critical and costs become a major aspect.”

In addition to spinning up Dask servers, Coiled provides observability into the Dask environment.

“I can tell you exactly how much money you spent parsing CSV files and how much you would have saved if you switched to Parquet,” Rocklin says. “Dask gives you a lot of intelligence, a lot of visibility into what your computations are doing.”

Coiled for Growth

Currently, Coiled runs on AWS and Google Cloud, and Rocklin is working on supporting Microsoft Azure in the near future. The company itself is about a 50-person, geographically distributed team, although Rocklin and the company are based in Austin, Texas. Coiled itself was spun out of Anaconda, the Python data science platform company that is also based in Austin.

Matthew Rocklin is the CEO and founder of Coiled and the creator of Dask

Coiled is not the only company looking to bring automated Dask environments to the public cloud. An outfit called Saturn Cloud has a similar offering. Google Cloud and Microsoft Azure also have Dask-as-a-service offerings. But in Rocklin’s view, the main competitor is roll-your-own software development programs.

“The competition we most often see is do-it-yourself,” he says. “Dask open source is good enough that many companies can do this themselves and it’s on us to make sure that we’re providing a better experience than that, a more efficient experience than that.”

Also competing with the Dask-Coiled combo are Apache Spark and its commercial backer, Databricks. Together they present more formidable competition to Coiled, which raised $21 million in a Series A round of funding last year to go along with a $5 million seed round.

“Coiled is definitely a young company, but it’s attached to a very mature open source project,” Rocklin says. “The product is up there, and it does what we need it to do.”

As proof, Rocklin cited a recent survey by the Python Software Foundation about developers’ product usage. “Eleven percent said they use Spark for big data and 5% say they use Dask,” Rocklin says. “We’re definitely second place, but not by much. We’re a heavy hitter in terms of usage.”

But in Rocklin’s view, Dask has a big advantage over Spark: It’s more easy-going, and less finicky about who and what it works with, particularly in the Python data ecosystem.

“While you can use the Python language with Spark, those libraries don’t easily work with Spark,” he says. “Spark is a bit too opinionated. It’s got its own way of doing things. It’s got its own DataFrame. It’s got its own machine learning library. It’s got its own libraries for this stuff.

“Dask on the other hand is not opinionated,” Rocklin continues. “Dask uses the existing Python libraries. People like those libraries. That’s the part they like. No one really likes the Python language itself… Everyone understands it. It’s the lowest common denominator. But the value of it is all those libraries that have been built over the last couple decades.”

Open Data Ecosystem

Coiled provides technical support for Dask users, including office hours where they can get access to Dask experts. That’s valuable today, especially as companies are trying to navigate an increasingly complex landscape of tools, Rocklin says.

Dask has deep roots in the Python ecosystem, which trends more toward classic data science use cases. Rocklin is seeing more interest emerge among Python coders for data engineering use cases. That’s the opposite of Databricks, which started out with a heavier focus on data engineering in Spark and is now trying to move more toward data science, he says.

“The question is, where do those two things mix?” he says. “What’s nice is that we’re leaving the era of the all-in-one platform. I’m a little bit bearish on Databricks as a result. Instead, I think we’re going to see lots of different technologies, lots of different companies co-exist.”

This dynamic has created an affinity between Coiled and Snowflake, Rocklin says.

“We provide very disparate services, and we make sure that our technologies work well together,” he says. “And also Snowflake is generally a much better SQL experience than what Dask or Coiled provides. But Dask and Coiled provide a much better machine learning experience, a much better ad hoc computing experience, a much better Python experience. So the two technologies complement each other well.”

Snowflake isn’t usually listed among the companies pursuing an open data ecosystem, but as Rocklin sees it, it plays an important role in the emerging big data field.

“I think we’re going to see a lot of customers go towards not an all-in-one platform, but go towards a sort of mix and match best of the technology stack,” he says. “The cloud makes it very easy for all these technologies to coexist.”
