AtScale Claims Text-to-SQL Breakthrough with Semantic Layer
One of the bottlenecks in getting value out of generative AI is the difficulty in turning natural language into SQL queries. Without detailed contextual understanding of the data, the text is converted into SQL that doesn’t give the right answer. But thanks to the use of its semantic layer, AtScale claims that it has achieved a breakthrough in text-to-SQL processing that could pave the wave for more natural language query adoption in GenAI.
AtScale got its start in the advanced analytics world by providing an OLAP layer that helped to accelerate SQL queries in big data environments. The company’s query engine was used to speed throughput in some of the world’s biggest data warehousing and data lake environments. As the big data market has evolved in recent years, AtScale has shifted its focus to its semantic layer, which sits between the business intelligence (BI) tool and the data warehouse.
The semantic layer has emerged as a critical component in advanced analytic systems, particularly as the scale, user-base, and importance of automated decision-making systems increased. By defining the key metrics that a business will use, the semantic layer ensures that a diverse user community working with large and diverse data sets can still get the right answers.
The importance of a semantic layer is not always obvious. For instance, a question such as “What was our total sales last month by region?” may seem, at first glance, to be relatively simple and straightforward. However, without concrete definitions for what each of those words mean in terms of the organizations’ specific data, it’s actually quite easy to get wrong answers.
Semantic layers have become even more critical in the advanced analytics space since the GenAI revolution started back in 2022. While large language models (LLMs) like OpenAI’s GPT-4 can generate decent SQL based on natural language input, the odds that the generated SQL will provide accurate answers are slim without the specific business context provided by a semantic layer.
AtScale sought to quantify the difference between in accuracy between using an LLM-generated SQL queries with and without the semantic layer. It set up a test that utilized the Google Gemini Pro 1.5 LLM against the TPC-DS dataset. It shared the results of the test in its new white paper, titled “Enabling Natural Language Prompting with AtScale Semantic Layer and Generative AI.”
For the first test without the semantic layer, AtScale configured the system to using few-shot prompting against the source schema of the TPC-DS dataset. The system is configured to return “not sure” if the LLM deems the question unsolvable. If the query generates an SQL error, the query is retried three times before it’s deemed unsolvable. If the query runs without an error, a result is returned and then manually checked for correctness. This system demonstrated a 20% total accuracy rate, the company said.
The second system used the same model and dataset, but was configured differently. There were two key differences, according to Jeff Curran, the data science team lead at AtScale.
“First, instead of the DDL being provided in the prompt, metadata about the semantic model’s logical query table is provided. It is important to specify here that only metadata from the Semantic Layer is provided to the LLM, no data from the underlying warehouse tables is sent to the LLM,” Curran wrote in the white paper.
“Second, in this system, the generated queries are submitted against the AtScale Query Engine instead of the data warehouse. The final change is that AtScale determines the validity of the SQL syntax for the query made against it, as opposed to the data warehouse,” he wrote.
When AtScale implemented its semantic layer and OLAP engine to the mix, the accuracy rate of the generated SQL queries jumped to 92.5%, according to the paper. What’s more, the semantic layer-based system got 100% of the easier questions correct; it only generated erroneous data with the most complex queries among the 40 questions.
This level of accuracy makes GenAI-powered natural language query (NLQ) systems useful in business settings, the company says.
“Our integration of AtScale’s Semantic Layer and Query Engine with LLMs marks a significant milestone in NLP and data analytics,” David Mariani, the CTO and co-founder of AtScale, said in a press release. “By feeding the LLM with relevant business context, we can achieve a level of accuracy previously unattainable, making Text-to-SQL solutions trusted in everyday business use.”
“For example, a question like ‘What was the sum of web sales for each product brand in the year 2002?’ requires SQL that defines the ‘sum of web sales’ KPI, as well as joins between the underlying web_sales, date_dim, and item_dim tables,” Curran wrote. “This level of high schema and question complexity questions was unsolvable for the control NLQ system, but returned the correct result when using the AtScale backed NLQ system.”
While prompt engineering and RAG can provide some context for LLMs and help them towards the right answer, there are questions about just how far users can trust the LLMs. AtScale notes that the LLM sometimes hallucinated data that didn’t exist, or ignored orders to use certain filters.
“Even though the column names were provided, there were cases where the LLM referenced column names that did not exist in the table,” Curran wrote. “The name generated by the LLM was always a simplified version of a column from the provided metadata.”
With a few tweaks and some fine-tuning, the accuracy rate could be bumped up even higher, Curran wrote.
“We believe that a set of additional training data designed to prepare an LLM for work with the AtScale Query Engine could vastly improve performance on even higher complexity question sets,” he wrote. “In conclusion, the AtScale Semantic Layer provides a viable solution to accomplishing basic NLQ tasks.”
Related Items:
Is the Universal Semantic Layer the Next Big Data Battleground?
AtScale Announces Major Upgrade To Its Semantic Layer Platform
Why a Universal Semantic Layer is the Key to Unlock Value from Your Data