In today’s world, artificial intelligence (AI) is transforming high-stakes areas like healthcare, transportation, and finance. With large language models (LLMs) at the forefront, ensuring their safety, limitations, and risks are more critical than ever.
To help in making ethical choices, the trustworthiness of different LLMs has been evaluated using the DecodingTrust framework. This platform, which won an award at NeurIPs’23, provides detailed assessments of LLM risks and trustworthiness.
We explore how ratings are formed and, most importantly, which AI models you should use if trust is your top priority.
Key Takeaways
- Claude 2.0 is rated the safest AI model with a trustworthiness score of 85.
- GPT-4 is more susceptible to misleading prompts compared to GPT-3.5.
- No single AI model excels in all areas; each has unique strengths and vulnerabilities.
Top 10 Most Trustworthy AI Models
As of 2024, the LLM Safety Leaderboard, hosted by Hugging Face and based on DecodingTrust, rated Anthropic’s Claude 2.0 as the safest model, with an 85 trustworthiness score.
Claude 2.0 was followed by Meta’s Llama-2-7b-chat-hf (75 trustworthiness score) and OpenAI’s GPT-3.5-turbo-0301 (score of 72).
Some top-line conclusions that come from the tests include:
- GPT-4 is more vulnerable than GPT-3.5, especially to misleading prompts.
- No single LLM is best in all trustworthiness areas. Different models excel in different aspects.
- Improving one trustworthiness area may lead to worse performance in another.
- LLMs understand privacy terms differently. For example, GPT-4 may not leak private information when prompted with “in confidence” but might when prompted with “confidentially”.
- LLMs can be misled by adversarial or tricky instructions.
Trustworthy AI Models: What Do We Mean By “Trustworthy”?
The LLM Safety Leaderboard uses the DecodingTrust framework, which looks at eight main trustworthiness aspects:
- Toxicity
DecodingTrust tests how well the AI handles challenging prompts that could lead to toxic or harmful responses. It uses tools to create difficult scenarios and then checks the AI’s replies for any toxic content.
- Stereotype and Bias
The evaluation looks at how biased the AI is against different demographic groups and stereotype topics. It tests the AI multiple times on various prompts to see if it treats any group unfairly.
- Adversarial Robustness
This tests how well the AI can defend itself against tricky, misleading inputs designed to confuse it. It uses five different attack methods on several open models to see how robust the AI is.
- Out-of-Distribution Robustness
This checks how the AI handles unusual or uncommon input styles, like Shakespearean language or poetic forms, and whether it can answer questions when the required knowledge wasn’t part of its training.
- Privacy
Privacy tests check if the AI leaks sensitive information like email addresses or credit card numbers. It also evaluates how well the AI understands privacy-related terms and situations.
- Robustness to Adversarial Demonstrations
The AI is tested with demonstrations that contain false or misleading information to determine its ability to identify and handle these tricky scenarios.
- Machine Ethics
This tests the AI’s ability to recognize and avoid immoral behavior. It uses special datasets and prompts to see if the AI can identify and respond appropriately to ethical issues.
- Fairness
Fairness tests see if the AI treats all individuals equally, regardless of their background. The model is prompted with challenging questions to ensure it doesn’t show bias in its responses.
Each aspect is scored from 0-100, where higher scores mean better performance.
For AI models to be responsible, they need to do well in all these areas. DecodingTrust gives an overall trustworthiness score, with higher scores showing more reliable models.
The Bottom Line
The stakes are high. As AI models continue to enter important areas, trustworthy data is not optional — it is essential.
The latest results show that no single model is the best in every area, with each having its strengths and weaknesses. While Anthropic’s Claude 2.0 is currently the safest model, GPT-4’s higher vulnerability to misleading prompts shows an urgent need for improvement.
So, the call goes out for ongoing research and innovation. Creating more reliable and ethical AI technologies is not just a technical challenge but a moral duty. The future depends on how well we meet this challenge.