Artificial intelligence (AI) is nothing without data. Hence, the underlying assumption has been that whoever wins the AI race will hold a pool of training data that is not just vast but better than everyone else’s.
Take OpenAI’s GPT-4 and Meta’s Llama 3.1 as examples; they reportedly run on 1.7 trillion and 405 billion parameters, respectively, and have gained popularity due to their gigantic training datasets.
Newly released large language models (LLMs) routinely shatter the parameter records of their predecessors, a trend that has become a central yardstick for measuring progress in LLM development on Hugging Face.
But is bigger always better? Diffbot Technologies, a California-based startup known for its knowledge graph technology, does not think so.
Key Takeaways
- Diffbot’s AI model challenges larger LLMs by emphasizing factual accuracy over sheer parameter size.
- It uses a proprietary knowledge graph with real-time updates instead of relying on pre-trained data.
- Diffbot’s graph retrieval-augmented generation (GraphRAG) allows dynamic knowledge retrieval, reducing reliance on static datasets.
- Benchmark tests suggest Diffbot outperforms leading LLMs like GPT-4 and Gemini in real-time factual accuracy.
- Experts believe Diffbot’s hybrid approach offers a step toward solving AI hallucinations but acknowledge that no model is entirely immune.
Diffbot’s Different Look at AI Models
On January 9, Diffbot launched its first open-source LLM and claims that, despite having nowhere near the parameter counts of GPT-4 or Llama 3.1, it beats them on factual accuracy.
Major LLMs like GPT-4, Gemini Ultra, and Llama 3 are built by tapping into colossal datasets, often running into trillions of tokens. These AI models are fed a diverse diet of web content, books, articles, and code, which allows them to pick up complex language patterns.
However, unlike the existing popular LLMs, Diffbot said its AI model uses its proprietary Knowledge Graph, which houses over 10 billion entities and a staggering trillion structured facts gathered from across the web.
According to the startup, its AI model is built on a fine-tuned version of Meta’s Llama 3.3 and brings in a novel approach called graph retrieval-augmented generation (GraphRAG).
This method lets the model query a knowledge database that updates dynamically during inference, instead of relying entirely on pre-trained data. Rather than reciting facts frozen into its weights at training time, Diffbot’s system grounds each response in an up-to-date knowledge graph.
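To make the pattern concrete, here is a minimal GraphRAG sketch in Python. All of the names (Fact, query_knowledge_graph, the llm callable) are hypothetical illustrations, not Diffbot’s actual API; the point is that retrieval happens at inference time and the prompt is assembled from source-linked facts.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    source_url: str  # provenance is kept so answers can cite their sources

def query_knowledge_graph(question: str) -> list[Fact]:
    """Stand-in for a live graph lookup. A real system would issue a
    structured query (entity + relation) against the graph store."""
    return [
        Fact("English Channel", "isA", "body of water",
             "https://example.com/english-channel"),
    ]

def answer(question: str, llm) -> str:
    # 1. Retrieve facts at inference time -- not from the model's weights.
    facts = query_knowledge_graph(question)
    context = "\n".join(
        f"{f.subject} {f.predicate} {f.obj} ({f.source_url})" for f in facts
    )
    # 2. Ask the reasoning layer to answer strictly from the retrieved facts.
    prompt = (
        "Answer using ONLY the facts below and cite their sources.\n"
        f"Facts:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

Because the graph, not the model, is the source of truth here, refreshing the graph refreshes the answers without retraining anything.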
89,886 developers are building their own Perplexity on-prem with Diffbot LLM — https://t.co/wVsp0iZGvt
— Diffbot (@diffbot) January 30, 2025
Our Quick Test of Diffbot’s AI Model on Slippery Questions
To get a clear picture of how Diffbot’s AI model handles scenarios prone to hallucinations, we tested the publicly available demo on Diffy.chat and compared its responses to those from ChatGPT’s free tier and Gemini 1.5 Flash.
We used prompts already known to set hallucination traps for AI chatbots; below is how each model handled them.
First, we asked the chatbots:
What is the world record for crossing the English Channel on foot?
Diffbot AI managed to avoid the hallucination trap, giving a correct, if somewhat wordy, answer.
When we prompted the Gemini 1.5 Flash version with the same question, it fell for the hallucination trap and even supported its answer with an image.
The same prompt was fed to the ChatGPT free plan, which produced a more comprehensive and accurate response.
In addition to the prompt above, we set three other hallucination traps, using prompts such as “Who was the sole survivor of the Titanic?” and “Write a description of a landscape in four-word sentences.”
While Diffy.chat and ChatGPT sailed through them with mostly accurate responses, Gemini 1.5 Flash botched them. On a second attempt, however, Gemini provided a more accurate response.
This small experiment is hardly a sophisticated attempt to trick AI into hallucinating, but it shows, at a basic level, how Diffbot’s model, lightweight compared to other leading LLMs, might perform when the going gets tough.
Can Diffbot’s AI Model Solve the AI Hallucination Puzzle?
Recent research suggests that leading LLMs still struggle with factual accuracy.
Benchmarks like C-Eval and AGIEval show that while top models achieve over 80% accuracy in basic knowledge tasks, their performance drops to 50-60% in professional-level reasoning.
Similarly, multi-dimensional assessments through platforms like OpenCompass further demonstrate strong capabilities in language understanding and knowledge retrieval (over 80% accuracy), though accuracy falls below 65% in tasks requiring advanced reasoning or specialized expertise.
According to Diffbot, benchmark tests show that its model hits 81% accuracy on FreshQA, a Google-designed benchmark for real-time factual knowledge, outpacing ChatGPT and Gemini. It also clocked in at 70.36% on MMLU-Pro, a harder version of a standard academic knowledge test.
These benchmark scores suggest Diffbot is making real headway on factual accuracy, one of AI’s toughest challenges to date.
Rogers Jeffrey Leo John, co-founder and CTO of DataChat, a no-code, generative AI platform for analytics, said that Diffbot’s dynamic approach addresses the problem of static training data in LLMs.
Leo John told Techopedia:
“Diffbot acts like an LLM that can search and synthesize information from trusted sources, such as libraries or Wikipedia, in real time. Unlike GPT or Gemini, which rely on massive parameters and external search engines, Diffbot uses fewer parameters while leveraging a vetted knowledge engine to deliver efficient, high-quality answers.”
In a chat with Techopedia, Dev Nag, founder and CEO of QueryPal, said that Diffbot’s hybrid approach represents a shift in how LLMs are designed and used. He said:
“By separating the knowledge layer, powered by their trillion-fact Knowledge Graph, from the reasoning layer in their fine-tuned Llama model, Diffbot can keep its knowledge up to date without the need for costly and time-consuming retraining.”
Nag emphasized that this GraphRAG design allows LLMs to focus on querying external sources rather than memorizing facts.
He acknowledged that this doesn’t entirely eliminate hallucinations, since models can still misinterpret or incorrectly combine retrieved facts, but argued that it provides a critical advantage.
“The generation process is grounded in verifiable data points, each linked to source nodes in the graph, often with URLs, making the outputs more transparent and auditable for users,” Nag added.
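As a toy illustration of the separation Nag describes (hypothetical classes, not Diffbot’s code), the sketch below keeps the knowledge layer in a mutable store that can be updated at any time, while the reasoning layer only formats answers from whatever the store currently holds, attaching the source URL for auditability.

```python
class KnowledgeGraph:
    """Knowledge layer: a mutable fact store, updated without retraining."""

    def __init__(self):
        self._facts: dict[str, tuple[str, str]] = {}  # key -> (fact, source_url)

    def upsert(self, key: str, fact: str, url: str) -> None:
        self._facts[key] = (fact, url)

    def lookup(self, key: str) -> tuple[str, str] | None:
        return self._facts.get(key)

def grounded_answer(kg: KnowledgeGraph, key: str) -> str:
    """Reasoning layer: answers only from the graph, citing the source node."""
    hit = kg.lookup(key)
    if hit is None:
        return "I don't know."  # refuse rather than guess
    fact, url = hit
    return f"{fact} (source: {url})"

kg = KnowledgeGraph()
kg.upsert("ceo_of_acme", "Jane Doe is CEO of Acme.", "https://example.com/acme")
print(grounded_answer(kg, "ceo_of_acme"))   # reflects today's graph
kg.upsert("ceo_of_acme", "John Roe is CEO of Acme.", "https://example.com/acme2")
print(grounded_answer(kg, "ceo_of_acme"))   # updated answer, zero retraining
```

The second call returns the new fact immediately, which is the practical payoff Nag points to: knowledge stays current without touching the model’s weights.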
The Bottom Line
Hallucination remains one of the banes of AI models despite efforts to solve it. Diffbot’s method of pairing a fine-tuned version of Meta’s Llama 3.3 with real-time querying of its Knowledge Graph could bring the field a step closer to a solution.
While Dom Couldwell, Head of Field Engineering EMEA at DataStax, agrees that Diffbot’s model will help deliver more accurate responses, he maintains that the best way to tackle hallucinations in LLMs is for organizations to ground models in their own data, in the context of their application.
Couldwell told Techopedia via email:
“Using your own data as part of your system provides the AI with more relevant data to pull from and create that relevant response. The process is called Retrieval Augmented Generation, or RAG. If you want to reduce hallucinations and improve your relevancy, spend time on the context that your application works in and what data it uses.”
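A bare-bones sketch of Couldwell’s advice might look like the following. The keyword-overlap retriever and the example documents are stand-ins for illustration; a production RAG system would use embedding-based vector search over an organization’s real documents.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank internal documents by naive term overlap with the query
    (a toy retriever standing in for real vector search)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def rag_prompt(query: str, docs: list[str]) -> str:
    """Build a prompt that confines the model to your own data."""
    context = "\n---\n".join(retrieve(query, docs))
    return (
        "Using only the context from our internal docs below, answer.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

company_docs = [
    "Refunds are processed within 14 days of a return request.",
    "Support hours are 9am-5pm CET, Monday through Friday.",
]
print(rag_prompt("How long do refunds take?", company_docs))
```

The model then answers from context that is specific to the application, which is exactly the relevance gain Couldwell describes.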
FAQs
How does Diffbot’s AI model differ from traditional LLMs?
Instead of relying solely on facts memorized during pre-training, Diffbot pairs a fine-tuned version of Meta’s Llama 3.3 with its proprietary Knowledge Graph and retrieves up-to-date facts at inference time.
What is graph retrieval-augmented generation (GraphRAG)?
GraphRAG lets a model query a dynamically updated knowledge graph during inference and ground its answers in the retrieved, source-linked facts rather than in static training data.
How does Diffbot’s AI model perform against ChatGPT and Gemini?
Diffbot reports 81% accuracy on FreshQA, Google’s benchmark for real-time factual knowledge, outpacing ChatGPT and Gemini, along with 70.36% on MMLU-Pro. In our informal tests, it also avoided common hallucination traps.
Can Diffbot’s AI model eliminate hallucinations in AI?
No. Grounding generation in verifiable, source-linked data reduces hallucinations and makes outputs auditable, but experts note that models can still misinterpret or incorrectly combine retrieved facts.
References
- Diffbot Launches World’s Most Factually Grounded Language Model: New Benchmark in AI-Powered Knowledge Retrieval (Diffbot)
- Diffy Chat (Diffy)
- Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study (arXiv)
- freshllms/freshqa: Data and Code for FreshLLMs (https://arxiv.org/abs/2310.03214) (GitHub)
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (arXiv)