Worries about dodgy data have been with us since the dawn of spreadsheets. Now that GenAI has exploded onto the scene, questionable inputs are being blamed for the hallucinations and other odd behaviors that routinely plague LLMs.
Data quality determines the reliability of AI outputs. If you don’t have full confidence in the strings, floats, bools, chars, enums, and arrays you’re feeding into a machine learning model, you can’t be 100% sure of the answers it spits out or the inferences it makes.
To trust AI, you need trusted data. How can MLOps teams ensure their training sets are always fit for purpose?
Key Takeaways
- Data sets with incomplete, out-of-date, or unreadable information are being blamed for high-profile AI malfunctions: hallucinations, made-up facts, or doubling down when falsehoods are challenged.
- With the arrival of Big Data analytics, you might think we’d have solved the data quality issue by now. But problems persist.
- Just this week, Scale AI, a startup built around data labeling and quality for AI, closed an Nvidia-backed $1 billion funding round.
- Better technology to manage data might help, but tools can only do so much. Chief Data Officers and Data Science teams need to work together and create a company-wide culture of data quality assurance.
Garbage In. Garbage Out.
GenAI’s plausibility problem is getting harder to ignore.
A November 2023 study by Vectara, an AI startup founded by ex-Google employees, found that the frequency of GenAI hallucinations ranged from 3% for ChatGPT to 5% for Meta’s AI systems and 8% for Anthropic’s Claude 2. Google’s PaLM came in highest at an eye-watering 27%.
Figures from machine learning platform Aporia suggest that may be the tip of the iceberg.
Nearly 90% of MLOps professionals working on AI projects told researchers that their models display signs of hallucination. Aporia’s survey also found that 93% of those engineers encounter issues daily or weekly.
That’s a lot of unreliable inferencing. While part of the blame rests with the models themselves, experts say poor training data is often the fail point.
Lorraine Barnes, UK Generative AI lead at Deloitte, told Techopedia that GenAI tools read patterns and “sometimes these patterns lead to unexpected or inaccurate outputs, even if the training data itself is high quality.”
But she adds that the importance of fit-for-purpose data “cannot be overstated — especially in AI applications. Unlike traditional applications, AI often makes decisions or generates content that can directly impact business outcomes.
“If the data fed into the AI model is flawed, biased, or incomplete, the resulting decisions or outputs will be flawed as well.”
Why Dodgy Data Is Behind Those AI Hallucinations
“There’s a common assumption that the data (companies) have accumulated over the years is AI-ready, but that’s not the case,” writes Joseph Ours, a Partner at Centric Consulting. “The reality is that no one has truly AI-ready data, at least not yet.”
He says firms have typically collected data primarily for immediate operational needs or to power analytics tools driven by humans. “This often leads to limited and gap-filled datasets. They might be rich in specific operational aspects but missing other potential dimensions.”
In layman’s terms, that means corporate data sets can be riddled with duplicates and out-of-date information, or stored in hard-to-read formats. They can also be incomplete, of uncertain lineage, or collected in ways that skirt data privacy regulations.
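To make that concrete, here’s a minimal sketch of the kind of audit that surfaces those problems in a tabular training set. It uses pandas; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical customer extract; file and column names are illustrative only.
df = pd.read_csv("customer_records.csv")

# Duplicates: exact repeats over-weight some patterns during training.
dup_count = df.duplicated().sum()

# Missing values: per-column gaps hint at incomplete collection.
missing_ratio = df.isna().mean().sort_values(ascending=False)

# Staleness: flag records that haven't been touched in two years.
df["last_updated"] = pd.to_datetime(df["last_updated"], errors="coerce")
stale_share = (df["last_updated"] < pd.Timestamp.now() - pd.DateOffset(years=2)).mean()

print(f"duplicate rows: {dup_count}")
print(f"share of stale records: {stale_share:.1%}")
print("worst columns for missing values:")
print(missing_ratio.head())
```

None of this is sophisticated, which is rather the point: much of the rot described above is detectable with a few lines of code, provided someone is tasked with looking.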
Bad data is both an AI and a business risk.
An Innovation Killer
If an LLM is trained on data that’s inaccurate, incomprehensible, or missing key details, it’s easy to see how it could make errors. Complicating matters further is the tendency to ‘overfit’: rather than learning generalizable patterns, the model memorizes the quirks of a sub-optimal training set. Its ability to generalize to new data is compromised, creating the basis for hallucinations.
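The symptom is easy to reproduce at small scale. Here’s a minimal sketch using scikit-learn and synthetic data: an unconstrained model memorizes a small, noisy training set almost perfectly, then stumbles on data it hasn’t seen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset with 20% label noise stands in for a flawed data set.
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A depth-unlimited tree is free to memorize every training example.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(f"train accuracy: {model.score(X_train, y_train):.2f}")  # near-perfect
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")    # noticeably worse
```

The gap between training and test accuracy is the fingerprint of overfitting; in an LLM, the analogous failure surfaces as confident answers that don’t survive contact with the real world.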
Sue Daley, Director for Tech and Innovation at techUK, told Techopedia that Generative AI’s effectiveness “hinges on the quality of data it’s trained on.
“Incomplete or inaccurate data may result in lower model accuracy, higher likelihood of bias, and in incorrect or factually inaccurate outputs.”
That would make bad data an innovation killer. As far back as 2018, Gartner famously predicted that 85% of AI projects would deliver erroneous outcomes due to bias in data, algorithms, or the teams managing them. Last year, the Harvard Business Review said plainly that ‘most AI projects fail,’ citing the ‘availability, quantity, freshness, and overall quality of data’ as a key barrier.
While the AI juggernaut continues to dominate tech headlines, at the corporate level a lingering lack of trust could still hinder investment in R&D.
Maintaining Data Quality Is Hard
With all the emphasis on business intelligence and analytics over the last two decades, why is data quality still a concern?
George Johnston, data, privacy and analytics lead at Deloitte, told Techopedia there are several challenges in getting data quality right, but four stand out:
- There’s So Much of It: “The sheer volume and variety of data formats available today make it difficult to manage, clean, and maintain.”
- Siloes Make Unifying Data More Difficult: “Integrating data from diverse sources and systems often involves resolving inconsistencies, dealing with missing values, and [ensuring] data compatibility.” (A minimal illustration of this follows the list.)
- Legacy Systems Can’t Keep Up: “Many organizations rely on outdated systems that were not designed to maintain data quality at the scale and complexity observed in today’s system landscape.”
- Budgets, Tools, and Skill Sets May Be Lacking: “Getting data quality right requires resources, expertise, and tools. Many organizations struggle to secure the necessary funding for their data initiatives. This is often due to competing priorities, with other investments promising more immediate returns.”
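On the siloes point, the grunt work Johnston describes looks something like the pandas sketch below: two systems holding overlapping records under different names, units, and gaps, which must be reconciled before the data is fit for training. The schemas and the exchange rate are hypothetical.

```python
import pandas as pd

# Two hypothetical silos describing the same customers under different schemas.
crm = pd.DataFrame({"cust_id": [1, 2], "annual_rev_usd": [1200.0, None]})
billing = pd.DataFrame({"customer": [1, 2, 3], "revenue_gbp": [950.0, 800.0, 400.0]})

# Resolve naming and unit inconsistencies before merging (rate is illustrative).
billing = billing.rename(columns={"customer": "cust_id"})
billing["annual_rev_usd"] = billing.pop("revenue_gbp") * 1.27

# An outer merge keeps records that only one system knows about.
merged = crm.merge(billing, on="cust_id", how="outer", suffixes=("_crm", "_billing"))

# Handle missing values: prefer the CRM figure, fall back to billing.
merged["annual_rev_usd"] = merged["annual_rev_usd_crm"].fillna(
    merged["annual_rev_usd_billing"]
)
print(merged[["cust_id", "annual_rev_usd"]])
```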
AI Needs Data. Does Data Need AI?
So what’s the fix? Some say the symbiotic relationship between data and AI points the way.
Data management and governance vendors have begun adding ‘AI-powered’ to their descriptors. These platforms automate the time- and labor-intensive work of cleaning, extracting, integrating, cataloging, labeling, and securing data. Adding an AI engine scales their capabilities to meet the intense data and computational demands of LLMs.
Investors have also woken up to the issue. Earlier this week, Scale AI, a startup specializing in data quality for AI applications, announced a $1 billion funding round backed by Nvidia (NVDA) and Amazon (AMZN).
New and better solutions for data management are part of the answer, but people and processes factor in too.
techUK’s Daley says fixing data quality demands a holistic approach:
“It involves significant investment in better infrastructure and tools, but also changes in organizational culture, data governance practices, and regulatory guidance. Given the systemic nature of these issues, addressing them will require data governance and management to be put at the top of each organization’s agenda, and should be supported by the next government’s policy agenda.”
Deloitte’s Johnston points to more focused coordination between Chief Data Officers (CDOs) and Machine Learning Operations (MLOps) teams as a way to embed data quality into corporate culture. He says:
“A key point of intersection between the CDO and MLOps team lies in the iterative process of identifying and prioritizing data quality where it truly matters for specific AI projects. Rather than embarking on an exhaustive enterprise-wide data quality remediation programme, a more effective approach is to focus on improving the quality of data directly relevant to specific use cases.”
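In code, that targeted approach might amount to a small set of validation rules scoped to the fields one model actually consumes, run before every training job, rather than an enterprise-wide cleanup. The thresholds, file, and column names below are hypothetical.

```python
import pandas as pd

# Quality gates only for the fields a specific churn model consumes.
REQUIRED_COLUMNS = ["cust_id", "tenure_months", "monthly_spend"]

def check_training_data(df: pd.DataFrame) -> list[str]:
    """Return the data-quality failures that matter for this use case."""
    failures = []
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            failures.append(f"missing required column: {col}")
        elif df[col].isna().mean() > 0.05:  # 5% tolerance is illustrative
            failures.append(f"{col}: more than 5% missing values")
    if "tenure_months" in df.columns and (df["tenure_months"] < 0).any():
        failures.append("tenure_months: negative values found")
    return failures

df = pd.read_csv("churn_training_set.csv")  # hypothetical extract
problems = check_training_data(df)
if problems:
    raise ValueError("training run blocked:\n" + "\n".join(problems))
```

Checks like these fail fast and cheaply, and they give CDOs and MLOps teams a shared, auditable definition of ‘good enough’ for each model.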
The Bottom Line
Deloitte’s Lorraine Barnes says careful data curation is essential if machine learning models are to be effective and reliable in real-world applications.
“Just as a chef carefully selects the freshest, highest-quality ingredients to create a culinary masterpiece, AI developers must curate and pre-process data to ensure its accuracy, relevance, and representativeness.”
One way or another, tech vendors, CDOs, and data science teams need to deal with GenAI’s data dilemma. Those HAL 9000 memes aren’t going to go away on their own.
References
- Information and Data Quality in Spreadsheets (ResearchGate)
- Cut the Bull… Detecting Hallucinations in Large Language Models (Vectara)
- 2024 AI & ML Report: Evolution of Models & Solutions (Aporia)
- Lorraine Barnes | Deloitte UK (Deloitte)
- No One’s Data is Ready for AI – Yet (Centric Consulting)
- Sue Daley – techUK (LinkedIn)
- Gartner Says Nearly Half of CIOs Are Planning to Deploy Artificial Intelligence (Gartner)
- Keep Your AI Projects on Track (Harvard Business Review)
- George Johnston (LinkedIn)
- How AI Is Improving Data Management (MIT Sloan Management Review)
- Accelerate the Development of AI Applications | Scale AI (Scale AI)
- Scale AI valued at $14 bln in Nvidia, Amazon-backed funding round (Reuters)
- I’m sorry Dave (Reddit)