July 24, 2023

Cleanlab Emerges with $5 million to Automate Data Curation for LLMs and the Modern AI Stack

Cleanlab unleashes AI’s full potential by automatically turning unreliable real-world data into reliable insights and models for large image, text, and tabular datasets

Today Cleanlab, the automated solution for boosting the accuracy of enterprise artificial intelligence (AI), LLM, and analytics solutions, announced its $5 million seed investment round led by Bain Capital Ventures. The flagship product, Cleanlab Studio, is the only enterprise solution for evaluating and correcting errors in both large structured data (e.g., tabular data and spreadsheets) and large unstructured data (e.g., visual data, LLM-generated data, conversational data, etc.).

Today most companies are adopting AI models and business intelligence (BI) solutions, but they aren’t utilizing the full range of their data to train those models. Data and label quality issues like outliers, label errors, and data shift often make the data too poor to be useful input for reliable business intelligence, training of ML models, or fine-tuning of LLMs.

Inaccurate data costs U.S. businesses $3.1 trillion per year and growing, according to research from IBM. Using Cleanlab, organizations like Amazon, Google, Walmart, Deloitte, Wells Fargo and many others have dramatically cut costs and time spent on data quality by automating the correction of errors in their datasets. Cleanlab is designed to work with most kinds of datasets including text, images, and tabular/CSV/JSON data.

Cleanlab solves this problem for enterprises by analyzing unreliable, real-world datasets to find and fix errors, producing an improved dataset with new AI-generated labels. This frees up precious engineering resources to focus on problem solving, not data curation and model training.

Cleanlab has already created the most popular open-source library for data-centric AI, used by thousands of data scientists to automatically diagnose issues in real-world data through algorithms running on top of any existing ML model. However, diagnosis alone doesn’t work for companies that don’t have the model or interfaces to fix the issues they’ve identified. To serve this broader market, the company introduced Cleanlab Studio, an enterprise application that seamlessly handles correcting data issues and reliable model deployment.

Curtis Northcutt, Jonas Mueller and Anish Athalye, all three PhDs from MIT, founded Cleanlab after working on a new area of AI known as “confident learning”, invented by Northcutt during his PhD at MIT while working with Isaac Chuang (pioneer of the quantum computer).
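To give a rough sense of the idea behind confident learning, the following is a heavily simplified sketch, not Cleanlab's actual implementation or API. It assumes you already have a model's predicted class probabilities for each example. The function name `find_label_issues_sketch` and the toy data are hypothetical, chosen only for illustration: each class gets a confidence threshold equal to the model's average predicted probability on examples carrying that label, and an example is flagged when the model confidently prefers a different class than the one it was given.

```python
from typing import List, Sequence


def find_label_issues_sketch(
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
) -> List[int]:
    """Return indices of examples whose given label looks suspicious.

    Simplified illustration only: a per-class threshold is the mean predicted
    probability of class k over the examples labeled k.
    """
    num_classes = len(pred_probs[0])

    # Per-class confidence thresholds (1.0 if a class has no labeled examples,
    # so that class is almost never counted as "confident").
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # The model is not confident about any class; skip.
        best = max(confident, key=lambda k: p[k])
        if best != y:
            issues.append(i)  # Model confidently disagrees with the given label.
    return issues


# Toy data (hypothetical): example 2 is labeled 0, but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.15, 0.85],
]
issues = find_label_issues_sketch(labels, pred_probs)
print(issues)  # [2]
```

In practice the predicted probabilities should come from a model that did not see the example during training (for instance via cross-validation); otherwise the model may simply memorize the wrong labels.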

Using Cleanlab Studio, both individual data scientists and enterprise teams get more value out of their data by automating the process of finding and fixing outliers, label issues, and other data issues in image, text, and tabular datasets, enabling them to train more reliable models and derive more accurate analytics and insights. Unlike other solutions in this space, Cleanlab Studio handles model training for you with state-of-the-art auto-ML, requires no hyper-parameter tuning or model selection, no code, and no machine learning expertise to deliver an improved dataset, ML model, and business insights in significantly less time.

“We often forget that like humans, artificially intelligent solutions embody imperfection. The next evolution of AI is being able to characterize this imperfection: understanding, finding, and fixing errors in the data it’s trained on. Everyone can relate to Cleanlab because it works like how you do: if you are taught wrong things, you perform worse on the exam. Cleanlab automates data curation and correction to produce more accurate models in less time,” said Cleanlab AI co-founder & CEO Curtis Northcutt. “We don’t guarantee perfection. We guarantee improvement. Cleanlab breaks AI’s glass ceiling by providing accessibility and reliability for AI solutions.”

“A major risk with LLMs is ‘garbage-in, garbage-out’ in that if they’re trained on messy data that contains bias, inaccuracy, or nonsensical information, their outputs will often contain similar issues,” said Aaref Hilaly, partner at Bain Capital Ventures. “There’s also great opportunity in better data curation, since LLM performance is still largely data-bound, as DeepMind’s Chinchilla paper (and others) have shown. Cleanlab is the easiest way to curate data for training and fine-tuning, and an integral part of the emerging infra stack that supports modern AI.”

“Cleanlab helped us improve accuracy by 28%, while reducing the number of labeled transactions required to train the model by more than 98%,” said David Muelas Recuenco, Expert Data Scientist at BBVA (Banco Bilbao Vizcaya Argentaria), one of the largest financial institutions in the world, when discussing how Cleanlab reduced the bank’s costs for dataset curation and model training by over 98%.

“Using Cleanlab AI, we’ve increased model accuracy by 15 percent, and reduced training iterations by one-third,” said Steven Gawthorpe, Senior Managing Consultant Data Scientist at Berkeley Research Group. “Our team has been extremely impressed with the accuracy, speed and ease-of-use that Cleanlab provides.”

Prior to Cleanlab, Co-Founder and Chief Scientist Jonas Mueller built Amazon’s auto-ML solution, used by all AWS auto-ML jobs today. Co-Founder and CTO Anish Athalye holds 5k+ citations for several groundbreaking works demonstrating where AI solutions are broken and how to improve them. By coupling Curtis’s work to auto-fix issues in most datasets with Jonas’s work to auto-train ML models on any dataset and Anish’s work in secure systems, the team was able to create Cleanlab Studio and pursue its mission to make AI more accessible and more effective for humanity.

Cleanlab Studio integrates with most common data and ML workflows, uploads large datasets as fast as internet bandwidth allows, and scales for enterprises.

On June 1, 2023, Databricks announced its partnership with Cleanlab to bring automatic data correction to both structured and unstructured datasets via the Databricks platform through the Cleanlab Studio integration.

In 2021, Cleanlab was nominated for the best paper award at NeurIPS. In 2022, Cleanlab published 5 peer-reviewed papers at NeurIPS and ICML conferences/workshops, and in 2023, Cleanlab’s executive team taught MIT’s course on Data-centric AI.

Cleanlab is actively working with organizations training large models or developing business intelligence and analytics solutions on image, text, tabular, and other types of data.