Generative AI models have some massive safety issues. With the right prompts or jailbreak, bad actors can sidestep an AI vendor’s content moderation guidelines and produce harmful content, such as prejudicial content and phishing scams.
However, Anthropic has announced that it is launching an invite-only bug bounty program in association with HackerOne, which will reward researchers up to $15,000 for discovering universal jailbreak vulnerabilities.
Anthropic said: “The rapid progression of AI model capabilities demands an equal swift advancement in safety protocols. As we work on developing the next generation of our AI safeguarding systems, we’re expanding our bug bounty program to introduce a new initiative focused on finding flaws in the mitigations we use to prevent misuse of our models.”
Can bug bounty programs provide AI vendors with a solution? How can a bug bounty help?
Key Takeaways
- Anthropic announces a new bug bounty program to enhance its AI safety.
- The AI startup hopes the program will help to identify universal jailbreaks.
- Universal jailbreaks are challenging for AI vendors as they enable users to sidestep content moderation policies.
- This new initiative has been launched in partnership with HackerOne, one of the biggest bug bounty providers.
- The bug bounty market was valued at $1,130 million in 2023 and is expected to reach a value of $3,537 million by the end of 2030.
How Bug Bounties Can Help Improve AI Safety
There are only so many hours in the day. Even the most well-resourced in-house team of AI and ML engineers will struggle to discover all of the potential vulnerabilities in their models.
Outsourcing safety testing activity to a crowd of third-party researchers augments the in-house teams ability to identify vulnerabilities and can help to make products safer overall.
In this instance, Anthropic is using a bug bounty initiative to identify universal jailbreaks to address vulnerabilities in high risk areas including CBRN (chemical, biological, radiological, and nuclear) and cybersecurity.
Michiel Prins, co-founder at HackerOne, told Techopedia:
“Effective AI begins with responsible AI, and Anthropic knows a powerful model requires an equally powerful security and safety approach. Engaging the HackerOne community gives Anthropic access to expert researchers who are actively defining what the highest-impact threats look like for this entire industry.”
Prins suggests that this partnership will not only improve Anthropic’s security posture but also benefit industry standards as a whole.
“Pressure-tested models allow for safe innovation that drives AI forward. Their engagement of the researcher community continues to help us all define best practices for the entire AI ecosystem,” Prins said.
The Problem with Universal Jailbreaks
Universal jailbreaks first came to the forefront of the AI conversation back in December 2022 after users discovered a jailbreak known as Do Anything Now or DAN.
This jailbreak called on ChatGPT to adopt the role of an AI assistant that wasn’t bound by ethical guidelines — to adopt an alter ego which could do anything — enabling the chatbot to generate content that didn’t comply with OpenAI‘s content moderation policies.
This enabled users to produce hateful content, phishing emails, and even malicious code. The ease with which this kind of content could be created raised serious questions about the safety of large language models (LLMs) as a whole and whether they were putting users at risk.
“If AI safety isn’t taken into account, AI models could be manipulated to generate harmful content, such as providing instructions on creating bombs or producing offensive language.
“Bug bounty programs focused on preventing these malicious usages embrace an approach called ‘red teaming for AI safety’, which aims to ensure responsible use of AI and adherence to ethical standards,” Prins said.
Bug Bounties & AI Development
In recent years, bug bounty programs have been on the rise among many software companies, as proprietary vendors have looked for ways to mitigate vulnerabilities throughout the software supply chain.
According to Verified Market Reports, the bug bounty market was valued at $1,130 million in 2023 and is expected to reach a value of $3,537 million by the end of 2030 as more businesses work with ethical hackers to discover vulnerabilities in their products.
The bug bounty market notably collided with generative AI back in April 2023 when OpenAI announced the launch of a bug bounty program with Bugcrowd, a crowd-source security platform that currently serves over 1,000 customers. OpenAI’s program offers bounties ranging from $200 for low-severity findings up to $20,000 for exceptional discoveries.
At the time of writing, the program appears to have remained quite small — paying out rewards for 112 vulnerabilities with an average payout of $503.76.
In any case, now that HackerOne is partnering with Anthropic, it is clear that more vendors are eying bug bounty platforms as a tool to better secure their flagship models.
The Bottom Line
Bug bounty programs can provide an overburdened in-house AI vendor with outside support they can use to better identify vulnerabilities in models. Using these initiatives is simply a cost-effective way to improve the security and performance of AI models in the future.
With something as delicate and transformative as AI, it probably pays to never be too careful.
References
- Bug Bounty Platforms Market Size, Trends, Analysis & Forecast 2024-2030 (Verifiedmarketreports)
- Please turn JavaScript on and reload the page. (Openai)
- Bug Bounty: OpenAI – Bugcrowd (Bugcrowd)