Enterprises want it all, and they want it now, or at least within a few seconds. They want the benefits that GenAI can bring, like rapid content generation and strategic advice grounded in their own data.
It’s not surprising that GenAI adoption is skyrocketing. Economists at the University of Chicago found that over 50% of workers surveyed regularly use ChatGPT at work, rising to around three-quarters of respondents in roles like software development, marketing, and IT. Forrester, meanwhile, has reported that 73% of data and analytics decision-makers see a positive impact on their organizations from the use of AI.
The quest for GenAI-powered insights drives many businesses to connect their databases directly with an LLM. On the surface, it seems like a great idea: they can uncover new insights from proprietary information that was previously inaccessible and roll out helpful chatbots that answer customer questions without the wait for a human agent.
The idea is to create a seamless system for line-of-business (LOB) users to gain answers to their queries without having to know how to code, how to turn questions into SQL queries, or which visualizations are the best fits for their reports.
But a number of risks lurk under the surface. Directly joining your database with a publicly available LLM opens a Pandora’s box of potential data leaks, regulatory non-compliance, and incorrect responses, and it widens the attack surface for cyberattacks. It’s not always workable, either. Big databases can slow an LLM down so much that it’s not effective for enterprise use cases.
Is there a way for enterprises to tap into the benefits of using GenAI with proprietary data, without suffering the drawbacks?
Compromising Data Privacy, Compliance and Security
Unfortunately, public LLMs are far from secure. Samsung made headlines when employees leaked proprietary data by pasting it into ChatGPT. Prompts can expose sensitive data to users outside your company, especially if malicious actors hack or trick the LLM. What’s more, the big LLM developers use your prompts and uploads to train their models.
Even if all the data remains unseen by unauthorized eyes, simply sharing it with a cloud-hosted LLM can breach regulations like GDPR, CCPA, and HIPAA, which impose strict rules on where and how data is stored and processed. Additionally, every connection point to your database is a potential entry point for cyberattacks, and that includes the APIs used to reach LLMs.
“The privacy risks are extreme, in my opinion,” explains Avi Perez, CTO and co-founder of Pyramid Analytics. “Because you’re effectively sharing your top-secret corporate information that is completely private and frankly, let’s say, offline, and you’re sending it to a public service that hosts the chatbot and asking it to analyze it.”
The dangers here are real, Perez continues. “And that opens up the business to all kinds of issues – anywhere from someone sniffing the question on the receiving end, to the vendor that hosts the AI LLM capturing that question with the hints of data inside it, or the data sets inside it, all the way through to questions about the quality of the LLM’s mathematical or analytical responses to data.”
With specialized setups, it’s possible to share just the metadata instead of connecting the LLM directly to your entire database. The LLM then processes your questions and generates queries that you can run on your data without exposing it to unauthorized access. This is the architecture that some advanced business intelligence platforms employ.
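As a rough sketch of that pattern, the Python snippet below sends only the schema to the model and runs the generated query locally. Here, call_llm() is a placeholder for whichever model API you use, and the prompt wording is purely illustrative:

```python
# A minimal sketch of the metadata-only pattern: the LLM sees the schema,
# never the rows. call_llm() is a placeholder, not a real library call.
import sqlite3

def extract_schema(conn: sqlite3.Connection) -> str:
    """Collect table definitions only -- no row data leaves the database."""
    cur = conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
    return "\n".join(create_stmt for _, create_stmt in cur.fetchall())

def build_prompt(schema: str, question: str) -> str:
    return (
        "You are a SQL assistant. Given only this schema, write one "
        f"read-only SQLite query.\n\nSchema:\n{schema}\n\nQuestion: {question}"
    )

def answer(conn, question, call_llm):
    sql = call_llm(build_prompt(extract_schema(conn), question))
    # The generated query runs inside our own environment; the LLM never
    # receives the result set.
    return conn.execute(sql).fetchall()
```

The key property is that extract_schema() exposes table and column definitions, never rows, and the result set stays inside your own environment.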
Masking techniques, encryption, and tokenization can also be used to hide sensitive variables while preserving the data’s fundamental schema and structure.
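A minimal tokenization sketch, with invented field names and token format, might look like this; the vault mapping tokens back to real values never leaves your environment:

```python
# Sensitive fields are swapped for opaque tokens before anything is shown
# to an LLM; a private lookup table restores them afterwards.
import uuid

SENSITIVE_FIELDS = {"email", "ssn", "account_number"}  # illustrative

def tokenize(records: list[dict]) -> tuple[list[dict], dict]:
    vault = {}  # token -> real value; kept entirely in-house
    masked = []
    for record in records:
        safe = {}
        for field, value in record.items():
            if field in SENSITIVE_FIELDS:
                token = f"tok_{uuid.uuid4().hex[:8]}"
                vault[token] = value
                safe[field] = token  # structure preserved, value hidden
            else:
                safe[field] = value
        masked.append(safe)
    return masked, vault

def detokenize(text: str, vault: dict) -> str:
    """Restore real values in the LLM's response before showing it to users."""
    for token, value in vault.items():
        text = text.replace(token, str(value))
    return text
```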
Slow and Inefficient Workflows
Many business customers hold gigantic datasets, spanning millions of rows and hundreds of columns. Many services don’t even permit you to upload datasets of that magnitude, but if you try, you’ll find that it slows the system down dramatically.
It could take hours for all the data to be uploaded, and by the time your LLM has processed all those rows, the data may already be out of date. It’s one thing to use an LLM to write a single personalized email, but as the data grows, so does processing time. Token limits or quotas can create throughput bottlenecks and drive up latency.
Users would have to wait minutes for an answer, and nobody has that kind of patience. Certainly not your customers, who expect instant replies from customer support chatbots. Given that one reason for connecting the LLM to your database is greater efficiency and more meaningful insights, this approach is tantamount to shooting yourself in the foot.
Query translators offer a way over this hurdle. These tools convert a natural language prompt or description into an SQL statement. You can review the code, then execute the query in your own secure environment. The LLM only sees your question and your schema, never the data, thereby protecting your database without compromising on real-time analysis.
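Here is one shape such a review gate could take in Python. The validate_sql() check is deliberately simple and illustrative; in practice you would pair it with a read-only database role and proper permissions:

```python
# A minimal sketch of a review gate between the LLM and your database:
# only vetted, single, read-only statements are allowed to run.
import re

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE)\b", re.I)

def validate_sql(sql: str) -> str:
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("Multiple statements are not allowed")
    if not statement.upper().startswith("SELECT"):
        raise ValueError("Only SELECT statements may run")
    if FORBIDDEN.search(statement):
        raise ValueError("Statement contains a write/DDL keyword")
    return statement

def run_translated_query(conn, llm_sql: str):
    safe_sql = validate_sql(llm_sql)
    print(f"About to run:\n{safe_sql}")  # human review point before execution
    return conn.execute(safe_sql).fetchall()
```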
Unreliable Outputs
Your databases might not be organized well enough for the LLM to do its job effectively, and your source applications could deliver data in formats that LLMs can’t parse. A data scientist can ensure the data is clean and structured so the LLM can make sense of it; remove that layer and you invite hallucinations and errors. When you connect the LLM directly to unprocessed data, it doesn’t necessarily know what to do. You’d also have to update the integration whenever the source app changes, and debugging would be a never-ending challenge.
It doesn’t help that few LOB users know how to write prompts, code queries, or configure LLMs. If they word a query in a way that produces irrelevant responses, they won’t know the outcome is inaccurate. They might also use phrasing that causes the LLM to delete data or to generate an unnecessarily performance-intensive query that drives up costs.
“The quality of the output heavily depends on the relevance and quality of the information retrieved,” says Zive’s Stefanie Dankert. “If the underlying knowledge base is poorly organized or out-of-date, the answers provided by the LLM can be inaccurate or irrelevant. Unfortunately, this is almost always the case in reality, and only small companies are typically able to keep all their internal data and knowledge structured manually.”
One solution is to apply sandboxing. This is where you build a controlled environment for LLMs to run SQL queries on a sample or synthetic database using anonymized, aggregated, placeholder, or even AI-generated data. The sandbox can mimic a real database while keeping your actual data securely isolated from the LLM.
It’s an effective way to generate, test, and validate SQL queries. Even if the data is modified or accessed by unauthorized personnel, the actual risks are slim to none.
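As an illustration, a throwaway sandbox can be as simple as an in-memory database seeded with synthetic rows that mirror the production schema; the orders table below is invented for the example:

```python
# A minimal sketch of a sandbox: an isolated in-memory database filled
# with synthetic rows, so LLM-generated SQL can be exercised without
# ever touching real data. The schema here is illustrative.
import random
import sqlite3

def build_sandbox() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # isolated and disposable
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    regions = ["north", "south", "east", "west"]
    synthetic = [
        (i, random.choice(regions), round(random.uniform(10, 500), 2))
        for i in range(1000)
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", synthetic)
    return conn

def test_generated_sql(sql: str) -> list:
    """Run an LLM-generated query against synthetic data before production use."""
    sandbox = build_sandbox()
    try:
        return sandbox.execute(sql).fetchall()
    finally:
        sandbox.close()
```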
LLMs and Databases Need to Maintain a Healthy Distance
It’s true that AI brings a lot more power to business analytics, driving productivity, democratizing access to insights, and speeding up response times. But connecting it directly to your database doesn’t necessarily bring the extra benefits you might expect. It’s important to use workarounds that establish secure fences between the database and the LLM so that you can enjoy the advantages of GenAI analysis without the drawbacks.
The opinions expressed in this post belong to the individual contributors and do not necessarily reflect the views of Information Security Buzz.