From OpenAI and Google to Microsoft, the entire AI developer community uses public vector databases and large language model (LLM) services that, yes, anyone can access.
Despite being vital components of the AI supply chain, vector databases and LLM services are often overlooked when it comes to cybersecurity. Now a new study from Legit Security has found that these components are rife with vulnerabilities, data breaches, cybersecurity risks, and exposed sensitive data.
Techopedia explores a deeply under-represented risk vector for any company that uses AI within its workflow.
Popular Public Vector Databases and LLM Tools Plagued with Risks
On August 28, Legit Security revealed that a scan of vector databases used by artificial intelligence developers had turned up dozens of secrets (passwords and API keys) on the servers, including OpenAI API keys, Pinecone (vector database SaaS) API keys, GitHub access tokens, and URLs containing database passwords.
The most popular vector databases, Milvus, Qdrant, Chroma, and Weaviate, primarily used in retrieval-augmented generation (RAG), were found to be prone to data leakage and data poisoning as well as susceptible to security vulnerability exploitation.
Legit Security found about 30 servers that had publicly accessible corporate or private data, including company private email conversations, customer PII and product serial numbers, financial records, and candidate resumes and contact information.
Three vector databases from two of the most popular platforms in engineering services, fashion, and the industrial equipment sector contained documents, media summaries, customer details, and purchase information.
The investigation also revealed that vector databases are susceptible to data poisoning, which puts at risk applications such as medical chatbots that rely on databases of patient information.
Regarding LLMs, Legit found that the widely used no-code LLM service Flowise can access all kinds of sensitive data, like private company files, LLM models, application configurations, and prompts. Flowise often integrates with external services, like OpenAI API, AWS Bedrock, and even Confluence or GitHub, which increases security risks.
Legit Security scanned 959 Flowise servers and found that 438 of them (45%) were vulnerable.
Why Would an AI Tech for Internal Use Be Public?
Naphtali Deutsch, Security Researcher at Legit Security, spoke to Techopedia about the report and the problems with vector databases today.
“A vector database lets AI models search this data by similarity, retrieve the closest match to any new information, and generate a more accurate response based on it. This purpose is for internal use, so databases are public out of convenience or misconfiguration.
“Public databases can lead to private data leakage and data poisoning because if security measures are not applied, anyone can access the information stored inside.”
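To make the similarity search Deutsch describes concrete, here is a minimal sketch in Python. It is an illustration only, not how any of the named databases is implemented: the toy three-dimensional vectors, the `store` contents, and the function names are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_match(query, store):
    """Return the key of the stored vector most similar to the query."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))

# Hypothetical toy embeddings; real ones typically have hundreds of dimensions.
store = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "candidate resume": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]  # hypothetical embedding of a user question
best = closest_match(query, store)
print(best)
```

In this toy setup the query lands closest to "refund policy". The same lookup works for whoever sends the query, which is why an unauthenticated store hands its contents to any visitor.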
Deutsch explained that most people ignore the security risks and prefer to expose their databases publicly for easier access.
“The risk is much lower when using private databases instead.”
Deutsch warned organizations and developers to restrict access to their databases or AI models to their private networks only and disallow anonymous access.
“In addition, it is recommended that you mask sensitive information from your data before using it in AI applications,” Deutsch said.
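As a rough illustration of the masking Deutsch recommends, the Python sketch below replaces email addresses and key-like strings with placeholders before text is embedded or stored. The two regular expressions, the placeholder labels, and the sample string are assumptions chosen for the example. They are not an exhaustive or vendor-confirmed list of secret formats, and a production pipeline would likely rely on a dedicated detection tool.

```python
import re

# Illustrative patterns only (assumptions, not a complete list of formats).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SECRET": re.compile(r"\b(?:sk|ghp)[-_][A-Za-z0-9_-]{16,}"),
}

def mask_sensitive(text):
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Obviously fake sample data.
sample = "Contact jane.doe@example.com, key sk-EXAMPLEKEY0000000000"
masked = mask_sensitive(sample)
print(masked)
```

Masking at ingestion means that even if the database is later exposed, the stored text no longer contains the original values.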
Bruno Kurtic, co-founder, President, and CEO of Bedrock Security, a data security company, explained to Techopedia the risks associated with developers not implementing best practices.
“The biggest risk is sensitive data leakage, especially the data used during the ingestion and building of the vectors. If unsanctioned data (e.g., sensitive data) is used, it may eventually be revealed in the final application (e.g., a query to a chatbot).
“An attacker could leverage this by looking for similar vectors that may have been developed with the raw sensitive data training, enabling the attacker to gain a wider picture of sensitive data beyond the initial raw data used.”
Basic Cybersecurity Principles Neglected in the AI Hype Fog
A recent Arize report found that 1 in 5 Fortune 500 companies mention AI in their financial reports, but not for its benefits. The report concludes that the number of Fortune 500 companies citing AI as a risk factor in their financial reports has increased by 473.5% since 2022.
Only 30.1% of teams deploying large language models are implementing observability, despite most of them wanting better tracing and debugging workflows.
Adam Hevenor, director of product management of AI at Aerospike, a scalable, millisecond latency, real-time database provider, also spoke to Techopedia about how the AI hype is transforming security and privacy frameworks.
“There are no really good reasons to have a public vector database.”
Hevenor discussed how administrators can strengthen security by leveraging encryption features that make data unreadable even if an attacker manages to access it.
The big question remains: If it is widely known that public vector databases present risks, why are they still in use? Hevenor from Aerospike answered that question.
“All the databases described offer authentication mechanisms,” Hevenor said. “But they were left off probably due to convenience and urgency in developing an AI application quickly or were mistakenly not enabled.”
AI developers, pushed by the pressure of the AI hype, seem to be forgetting that basic principles like authentication, encryption, isolation, or Zero Trust security must be applied to AI.
“It can be tempting to make exemptions to security policy on the most cutting-edge projects, but this is what leads to mistakes like making your vector database publicly accessible,” Hevenor said.
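What "disallow anonymous access" can look like in practice is sketched below in Python. This is a generic, hypothetical check and not the authentication API of Milvus, Qdrant, Chroma, Weaviate, or Flowise: the `x-api-key` header name and the `is_authorized` function are assumptions for illustration.

```python
import hmac

def is_authorized(headers, expected_key):
    """Reject requests that carry no API key or the wrong one."""
    supplied = headers.get("x-api-key", "")  # hypothetical header name
    if not supplied or not expected_key:
        return False  # anonymous access is never allowed
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())

print(is_authorized({"x-api-key": "s3cret"}, "s3cret"))  # valid key
print(is_authorized({}, "s3cret"))                       # anonymous request
```

The design choice worth noting is the default: a request with no credentials is refused rather than waved through, the opposite of the "left off for convenience" setups Hevenor describes.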
Observability, Governance, and Developing Models In-House
AI models that are built and managed in-house are the most secure and have been shown to generate the highest return on investment. However, given the speed at which AI is being integrated into the business arena, organizations often choose to use third-party AI providers, public vector databases, and other risky resources.
A recent study by Sapio Research for Hewlett Packard Enterprise (HPE) found that understanding of the AI environment is still low. In the U.K. and Ireland, only 6% of organizations can run real-time data pushes/pulls to enable innovation and external data monetization.
More concerning, just 29% of companies have set up data governance models and can run advanced analytics. Matt Armstrong-Barnes, chief technologist for AI at Hewlett Packard Enterprise, said:
“Businesses are investing in AI without first taking a holistic view of the technology and how to implement it. Diving in before considering whether they are set up to benefit from AI and who needs to be involved in its roll-out will lead to misalignment between departments and fragmentation that limits its potential.”
Ashley Manraj, Chief Technology Officer at Pvotal, a low-code, event-driven, micro-service-based, cloud-native platform, spoke to Techopedia about how developers and leaders should consider building their AI.
“Instead of relying on public vector databases, organizations can leverage their existing private databases like PostgreSQL (advanced open-source database) and augment them with vector search capabilities,” Manraj said.
By maintaining data within private infrastructure, this approach guarantees greater control over access permissions, security policies, and regulatory compliance.
“Sensitive data remains confined within your organization’s perimeter, mitigating risks associated with data breaches or unauthorized access,” Manraj added.
“By integrating vector search within your existing infrastructure, you retain the robust security and control offered by PostgreSQL while harnessing the power of vector-based AI applications, especially in powerful techniques like multilingual RAG.”
Manraj warned that developers’ security efforts must not stop at vector databases.
“Developers need to understand that the limitation here is not due to vector databases alone, but rather a lack of proper safeguards and practices built around their database use.”
Manraj explained that incorporating DevSecOps operational flows atop core AI workflows with a Zero Trust methodology, and developing AI with proper segregation of access between public and private operations, are key to mitigating the risks of generative AI.
The Bottom Line
As Legit Security unveils the vast risks and vulnerabilities that exist within LLM services and vector databases used by top companies around the world, a pattern emerges.
The AI hype has created a competitive rush that not only pressures in-house developers to take risks by neglecting security principles but also steers organizations away from true innovation and toward serious security risks like data leakage.
The sophisticated nature of AI technology is also challenging developers, who struggle to maintain visibility and governance across the supply chain. If not tamed, this trend will continue to pose significant problems both for the AI industry, which develops third-party solutions and public AI services, and for any organization or individual building its own AI.