Foundation models help extract data insights
In every job I’ve had, data has played a central role in the company’s success. From my time at companies like Google and Slack, I saw how crucial it is to have a firm understanding of how a product is being used. The key to this information is quality data. Misinterpret the data, and it will have a direct impact on your business.
Christina Janzer, SVP of Research & Analytics at Slack, agrees. According to Christina, “Many companies, including Slack, face data hygiene challenges. Having low-quality data prevents data scientists from doing their jobs, and if you start by unknowingly looking at the wrong data, you’ll never succeed.” At Slack, we once ran an experiment and misclassified users during the analysis. We concluded that the experiment was negative, but when we realized the error, we saw that it was, in fact, positive!
One startup attempting to leverage data insights more effectively is Numbers Station, a company started by Stanford academics who are using generative AI to gather and interpret data. Its co-founder, Ines Chami, holds a PhD in computational and mathematical engineering from Stanford, where she co-authored a seminal paper on how foundation models can be used to clean and integrate data. (The code and experiments from the paper are publicly available.)
Foundation models’ role in the future of business intelligence
What stands out most to me from my conversation with Ines is how LLMs are poised to unlock insights from data for a wider set of users than ever before. This is a big win for business leaders who might lack the technical expertise needed to efficiently structure, clean, and analyze troves of data.
Historically, enterprise data has been incredibly difficult to query and analyze because it's often messy and structured in ways that are hard to comprehend. Currently, Numbers Station is building LLMs to help organizations capture valuable insights from their data. “We’re building LLMs that can ingest a variety of knowledge sources to build a more holistic view of the data,” Ines told me. “This obviously unlocks a lot of other use cases, alongside self-service analytics.”
How Numbers Station works
Building enterprise-specific knowledge layers is only the first step in Numbers Station’s journey to its ultimate goal: democratizing access to data via personalized, private LLMs. Already, the team has developed proprietary technology that automatically creates semantic layers from various knowledge sources within an organization. Today, the product lets its primary audience, non-technical business users, translate business questions expressed in natural language into code grounded in their data schema and data model. “Overall, the product makes the time to value much shorter and gives more organizations the ability to leverage semantic layers in their BI pipelines,” said Ines.
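To make the idea concrete, a semantic layer can be thought of as a small catalog that maps an organization's business terms to vetted definitions over its data schema. The snippet below is a minimal illustrative sketch, not Numbers Station's actual format; every table, field, and metric name here is a hypothetical example.

```python
# A toy semantic layer: business metrics mapped to vetted definitions.
# All table, column, and metric names below are hypothetical.
semantic_layer = {
    "metrics": {
        "active_customers": {
            "description": "Customers with at least one order in the last 30 days",
            "sql": "COUNT(DISTINCT customer_id)",
            "filters": ["last_order_date >= CURRENT_DATE - INTERVAL '30' DAY"],
            "source_table": "orders",
        }
    }
}

def describe_metric(layer, name):
    """Return a human-readable summary of a curated metric."""
    m = layer["metrics"][name]
    return f"{name}: {m['description']} (from {m['source_table']})"

print(describe_metric(semantic_layer, "active_customers"))
```

Because the definitions live in one curated place, every downstream query that mentions “active customers” resolves to the same logic.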
A lot of this work means extracting information from lines of SQL code. For example, if a Numbers Station customer is attempting to access data that involves the number of active customers, Numbers Station will save the definition of an “active customer” as used by the organization. It does this by extracting that information from the query and recording it in the metric fields of a semantic layer. This extraction is the first step in automatically laying out an initial semantic layer.
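As a simplified sketch of this extraction step (not Numbers Station's implementation, which uses LLMs rather than pattern matching), one could mine an existing query for the organization's working definition of a metric and record it as a semantic-layer entry. The query and field names are hypothetical:

```python
import re

# Hypothetical query embodying an organization's definition of "active customer".
query = """
SELECT COUNT(DISTINCT customer_id) AS active_customers
FROM orders
WHERE last_order_date >= CURRENT_DATE - INTERVAL '30' DAY
"""

def extract_metric(sql):
    """Pull the aliased aggregate and its WHERE clause out of a query.

    A real system would use a proper SQL parser (or an LLM); this regex
    sketch only handles the single query shape shown above.
    """
    expr, name = re.search(r"SELECT\s+(.+?)\s+AS\s+(\w+)", sql, re.S).groups()
    where = re.search(r"WHERE\s+(.+)", sql, re.S).group(1).strip()
    return {name: {"sql": expr.strip(), "filter": where}}

metric = extract_metric(query)
print(metric["active_customers"])
```

The extracted entry then seeds the initial semantic layer, which humans curate before it is trusted.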
It’s important to note that, like many companies using foundation models as a starting point, Numbers Station is not yet fully automated. A critical part of its process requires human oversight to interact with the semantic layer, either by curating the metrics or adding new metrics. Once this semantic layer has been established, it can be used as a part of Numbers Station’s analytics copilot.
Now, when a user asks a question in natural language, Numbers Station taps into the semantic layer to provide an appropriate response. Any code that’s generated as a reply is drawn directly from the semantic layer that was created for the enterprise, which means that the data definitions return trusted results. “Whenever we generate code, it’s not just randomly made up code from the internet where the definition of active customer might be different,” said Ines.
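The flow Ines describes, grounding generated code in the curated layer rather than generating it freely, can be sketched as follows. This is a hypothetical illustration with made-up metric names, not Numbers Station's actual pipeline:

```python
# Hypothetical sketch: answer a natural-language question by emitting SQL
# drawn from a curated semantic layer instead of free-form generation.
semantic_layer = {
    "active_customers": {
        "sql": "COUNT(DISTINCT customer_id)",
        "table": "orders",
        "filter": "last_order_date >= CURRENT_DATE - INTERVAL '30' DAY",
    }
}

def answer(question):
    """Match a known metric in the question, then emit its vetted SQL."""
    for name, m in semantic_layer.items():
        if name.replace("_", " ") in question.lower():
            return (f"SELECT {m['sql']} AS {name}\n"
                    f"FROM {m['table']}\n"
                    f"WHERE {m['filter']}")
    return None  # unknown term: defer to a human or a generic LLM prompt

print(answer("How many active customers did we have last month?"))
```

Because the SQL comes from the layer, “active customer” always means what the organization decided it means.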
What matters most for LLMs and data insight
It may seem like Numbers Station is positioning itself to compete with software that specializes in providing business intelligence insights like Tableau or AtScale. But Ines says this isn’t Numbers Station's end goal. “We’re not trying to reinvent all of Tableau’s visualization features,” she said. “Instead, we want to create a path to visualization from natural language. Numbers Station is not a business intelligence tool. We’re a platform for LLM-powered applications in the modern data stack.”
While working on her PhD, Ines spent a lot of time building and cleaning knowledge graphs, work that is critical to successfully applying foundation models to unstructured data. “What we're now realizing even more is that creating an accurate knowledge source is really the value that complements LLMs,” she said. “It goes back to data-centric AI. It's all about the data. And if I have an accurate knowledge source, then I'm going to unlock more capabilities downstream.”
LLMs are powerful tools that could transform how companies extract and organize data insights, and many companies are tackling this space. I’m eager to see how this technology evolves.