Databricks Bucks the Herd with Dolly, a Slim New LLM You Can Train Yourself
Databricks is getting into the large language model (LLM) game with Dolly, a slim new language model that customers can train themselves on their own data residing in Databricks’ lakehouse. Despite the sheepish name, Dolly shows Databricks is not blindly following the generative AI herd.
Many of the LLMs gaining attention these days, such as OpenAI’s GPT-3 and Google’s LaMDA, sport hundreds of billions of parameters and take tens of thousands of GPU hours to train. Because of the cost of training these massive models, most early AI adopters simply use the LLMs trained by the tech giants. Rather than training their own LLMs on their own custom data, they put their LLM efforts into crafting the right prompts to send to the models via APIs.
Databricks is hoping to change that approach with Dolly, which is much smaller than LLMs like GPT-3 (let alone the massive new GPT-4) and requires far fewer computational resources to train. According to a Databricks blog post published today, Dolly features only 6 billion parameters (compared to GPT-3’s 175 billion), which helps to make it “cheap to build,” the company says.
“We’re in the earliest days of the democratization of AI for the enterprise, and much work remains to be done,” Databricks execs Ali Ghodsi, Matei Zaharia, and several others wrote in the blog, “but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.”
Databricks is taking a more targeted approach with Dolly than others have taken with their LLMs. Instead of creating a massive model from scratch and then spending months training it on a giant corpus of data culled from the Internet, Databricks took a pre-existing model off the shelf and spent three hours training it on a much smaller amount of high-quality data. The exercise shows that an off-the-shelf model can deliver some of the same capabilities users have seen with ChatGPT, namely its instruction-following abilities, without the enormous cost.
Dolly is an open source clone of Alpaca, an LLM developed at Stanford, which itself was inspired by LLaMA, an LLM created and open sourced by Facebook AI Research (FAIR) at Meta. Because it’s a clone, the folks at Databricks named it Dolly, after the sheep that was the first mammal to be cloned from an adult cell.
What’s unique about Alpaca is that the Stanford researchers were able to demonstrate “ChatGPT-like interactivity” with a training set composed of just 50,000 human-like questions and answers, the Databricks execs say.
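To make that format concrete, here is what one Alpaca-style training record looks like, written as a Python dictionary. The field names mirror the published dataset’s schema; the content is invented for illustration and is not an actual entry.

```python
# An illustrative Alpaca-style record: an instruction, optional context
# ("input"), and the desired response ("output"). Invented, not a real entry.
record = {
    "instruction": "Give three reasons a company might fine-tune its own LLM.",
    "input": "",  # many records leave the optional context field empty
    "output": (
        "1) Sensitive data stays in-house. 2) The model can be adapted to the "
        "company's domain. 3) Cost and behavior stay under the company's control."
    ),
}
```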
“Dolly works by taking an existing open source 6 billion parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca,” they wrote in the blog.
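The blog doesn’t reproduce the training code in-line, but the recipe it describes (take an existing 6-billion-parameter EleutherAI model and fine-tune it on Alpaca instruction data) can be sketched with the Hugging Face libraries. The model and dataset identifiers, prompt template, and hyperparameters below are assumptions for illustration, not Databricks’ actual configuration:

```python
# A minimal sketch of the Dolly recipe: fine-tune an existing 6B-parameter
# EleutherAI model on Alpaca-style instruction data. The model and dataset
# IDs, prompt template, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J ships without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def to_example(rec):
    # Render each record as a single instruction-following training sequence.
    text = (f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Response:\n{rec['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

train_ds = load_dataset("tatsu-lab/alpaca", split="train").map(
    to_example, remove_columns=["instruction", "input", "output", "text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dolly-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=train_ds,
    # Causal-LM collation: labels are the input_ids themselves (mlm=False).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("dolly-sketch")           # final weights for generation later
tokenizer.save_pretrained("dolly-sketch")
```

A 6-billion-parameter model won’t fit through this loop on a single consumer GPU; Databricks’ three-hour figure implies server-class hardware and memory optimizations that the sketch omits for brevity.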
Despite training on a fraction of the data and having nearly 30x fewer parameters than GPT-3, Dolly was able to show “many of the same qualitative capabilities, including text generation, brainstorming and open Q&A” found in far larger LLMs, but without the huge training cost.
“Whereas the work from the Alpaca team showed that state-of-the-art models could be coaxed into high quality instruction-following behavior,” the Databricks team wrote, “we find that even years-old open source models with much earlier architectures exhibit striking behaviors when fine tuned on a small corpus of instruction training data.”
The company has open sourced Dolly. It’s also released a Databricks notebook that customers can use to build Dolly themselves on Databricks.
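Once a model like this is fine-tuned, querying it is an ordinary text-generation call. Here is a hypothetical example against the checkpoint saved by the sketch above; the prompt template must match the one used during training:

```python
# Hypothetical generation against the "dolly-sketch" checkpoint from above;
# the Instruction/Response template mirrors the training prompt format.
from transformers import pipeline

generate = pipeline("text-generation", model="dolly-sketch")
prompt = ("### Instruction:\nBrainstorm five names for a data engineering "
          "newsletter.\n\n### Response:\n")
print(generate(prompt, max_new_tokens=128, do_sample=True)[0]["generated_text"])
```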
Databricks has been quietly watching the generative AI show from the sidelines, but today’s announcement indicates it’s ready to join the action. The company says that in the coming months it will make a series of announcements geared toward helping its clients make use of LLMs. As Dolly indicates, the focus will be on enabling customers to run LLMs themselves.
“There are many reasons a company would prefer to build their own model rather than sending data to a centralized LLM provider that serves a proprietary model behind an API,” the Databricks folks say. “For many companies, the problems and datasets most likely to benefit from AI represent their most sensitive and proprietary intellectual property, and handing it over to a third party may be unpalatable. Furthermore, organizations may have different tradeoffs in terms of model quality, cost, and desired behavior. We believe that most ML users are best served long term by directly owning their models.”