Fresh Hacker News | Document poisoning in RAG systems: How attackers corrupt AI's sources

▲Document poisoning in RAG systems: How attackers corrupt AI's sources(aminrj.com)

82 points by aminerj 14 hours ago | 16 comments

Any document store where you haven’t meticulously vetted each document— forget about actual bad actors— runs this risk. A size org across many years generates a lot of things. Analysis that were correct at one point and not at another, things that were simply wrong at all times, contradictory, etc.

You have to choose model suitably robust is capabilities and design prompts or various post training regimes that are tested against such, where the model will identify the different ones and either choose the correct one on surface both with an appropriately helpful and clear explanation.

At minimum you have to start from a typical model risk perspective and test and backtest the way you would traditional ML.

▲aminerj 3 hours ago

You're right, and this is an underappreciated point. The "attacker" framing can actually obscure the more common risk: organic knowledge base degradation over time. The poisoning attack is just the adversarial extreme of a problem that exists in every large document store.

The model robustness angle is valid but I'd push back slightly on it being sufficient as a primary control. The model risk / backtesting framing is exactly right for the generation side. Where RAG diverges from traditional ML is that the "training data" is mutable at runtime (any authenticated user or pipeline can change what the model sees without retraining).

▲ineedasername 15 minutes ago

>sufficient as a primary control.

My apologies, it wasn’t my intent to convey that as a primary. It isn’t one. It’s simply the first thing you should do, apart from vetting your documents as much as practicality allows, to at least start from a foundation where transparency of such results is possible. In any system whose main functionality is to surface information, transparency and provenance and a chain of custody are paramount.

I can’t stop all bad data, I can maximize the ability to recognize it on site. A model that has a dozen RAG results dropped on its context needs to have a solid capability in doing the same. Depending on a lot of different details of the implementation, the smaller the model, the more important it is that it be one with a “thinking” capability to have some minimal adequacy in this area. The “wait-…” loop and similar that it will do can catch some of this. But the smaller the model and more complex the document—- forget about context size alone, perplexity matters quite a bit— the more a small model’s limited attention budget will get eaten up too much to catch contradictions or factual inaccuracies whose accurate forms were somewhere in its training set or the RAG results.

I’m not sure the extent to which it’s generally understood that complexity of content is a key factor in context decay and collapse. By all means optimize “context engineering” for quota and API calls and cost. But reducing token count without reducing much in the way of information, that increased density in context will still contribute significantly to context decay, not reducing it in a linear 1:1 relationship.

If you aren’t accounting for this sort of dynamic when constructing your workflows and pipelines then— well, if you’re having unexpected failures that don’t seem like they should be happening, but you’re doing some variety of aggressive “context engineering”, that is one very reasonable element to consider in trying to chase down the issue.

▲daemonologist 1 hour ago

Holy moly what's with all the AI comments in this thread?

▲Dan_- 59 minutes ago

And it’s like they included “use all the hallmark annoying tells of LLM responses in your comment” in their prompts.

▲kpw94 2 hours ago

That's a big flaw of LLMs, not limited to RAGs: it lacks the fundamental understanding of "good and bad", like Richard Sutton said in that Dwarkesh podcast.

So if you flood the Internet with "of course the moon landing didn't happen" or "of course the earth is flat" or "of course <latest 'scientific fact' lacking verifiable, definitive proof> is true", you then get a model that's repeating you the same lies.

This makes the input data curating extremely important, but also it remains an unsolved problem for topics where's there's no consensus

▲acutesoftware 4 hours ago

This highlights that all RAG systems should be using metadata embedded into each of the vectorstores. Any result from the LLM needs to have a link to a document / chunk - which is turn links to a 'source file' which (should) have the file system owners id or another method of linking to a person.

If the 'source information' cannot be linked to a person in the organisation, then it doesnt really belong in the RAG document store as authorative information.

▲salawat 4 hours ago

But you can't do that. That would implicitly out where the knowledge came from, and we all know that the AI industry has an existential incapability to actually cope with that little turd. Might work great for data you actually own, got access to. Imagine that applied back to the latent space of LLM's though. Plus, wouldn't all of that eat through context window like no tomorrow?

▲sidrag22 3 hours ago

you're conflating the rag layer with the actual model, the rag metadata will exist in a properly designed system and its simply a matter of structuring the agent so that it provides references to it, or even just appending it manually at the bottom or something.

▲aminerj 3 hours ago

sidrag22 is right on the technical separation. The more interesting question for this specific attack is whether provenance metadata changes model behavior at generation time, not just provides an audit trail after the fact.

In my testing, the poisoned documents were more authoritative-sounding than the legitimate one — "CFO-approved correction", "board-verified restatement" vs. a plain financial summary. The legitimate document had no authority signals at all. If chunk metadata included "source: finance-system, ingested: 2024-Q1, author: cfo-office@company.com" surfaced directly in the prompt context, the model has something to reason about rather than just comparing document rhetoric.

▲darkreader 54 minutes ago

This fault results to LLM, not RAG. I am expecting more attacks will raise as LLM became daily tool.

▲rogerrogerr 49 minutes ago

> This fault results to LLM

What's this mean?

▲sidrag22 33 minutes ago

totally disagree, rag crafts the agent and delegates what sources should be scored/chunked in what manner, if its leaving itself open to some potential source gaming the system like this, it is a lack of preparation.

For some use cases, this is totally whatever, think a video game knowledge base type rag system, who cares.

Finance/medicine/law though? different story, the rag system has to be more robust.

▲alan_sass 5 hours ago

I think an interesting thing to pay attention to soon is how there are networks of engagement farming cluster accounts on X that repost/like/manipulate interactions on their networks of accounts, and X at large to generate xyz.

There have been more advanced instances that I've noticed where they have one account generating response frameworks of text from a whitepaper, or other source/post, to re-distribute the content on their account as "original content"...

But then that post gets quoted from another account, with another LLM-generated text response to further amplify the previous text/post + new LLM text/post.

I believe that's where the world gets scary when very specific narrative frameworks can be applied to any post, that then gets amplified across socials.

▲sidrag22 6 hours ago

> Low barrier to entry. This attack requires write access to the knowledge base,

this is the entire premise that bothers me here. it requires a bad actor with critical access, it also requires that the final rag output doesn't provide a reference to the referenced result. Seems just like a flawed product at that point.

▲SlinkyOnStairs 5 hours ago

> it requires a bad actor with critical access

This isn't particularly hard. Lots and lots of these tools take from the public internet. There's already plenty of documented explanes of Google's AI summary being exploited in a structurally similar way.

For what it concerns internal systems, getting write access to documents isn't hard either. Compromising some workers is easy. Especially as many of them will be using who knows what AI systems to write these documents.

> it also requires that the final rag output doesn't provide a reference to the referenced result.

RAG systems providing a reference is nearly moot. If the references have to be checked; If the "Generation" cannot be trusted to be accurate and not hallucinate a bunch of bullshit, then you need to check every single time, and the generation part becomes pointless. Might as well just include a verbatim snippet.

▲sidrag22 4 hours ago

> Might as well just include a verbatim snippet.

I guess im looking more at semantic search as ctrl + F on steroids for a lot of use cases. some use cases you might just want the output, but i think blindly making assumptions in use cases where the pitfalls are drastic requires the reference. I'm biased the rag system I've been messing with is very heavy on the reference portion of the functionality.

▲aminerj 3 hours ago

On the reference point: the poisoning succeeded even when the legitimate document was present in the retrieved chunks and visible in the context. The LLM saw all three sources simultaneously, including the correct $24.7M figure, and still produced the fabricated answer because the poisoned documents framed the legitimate one as a known error. Providing a reference to the retrieved chunks doesn't help if the retrieved chunks themselves are the attack surface.

zenoprax's point about ignorant employees is also worth taking seriously. "Write access to the knowledge base" in practice means anyone who can edit a Confluence page, commit to a docs repo, or submit a support ticket that gets ingested. That's not critical access in most organizations.

▲zenoprax 5 hours ago

"bad actor" can now be "ignorant employee running AI agents on their laptop".

Threats from incompetence or ignorance will be multiplied by 'X' over 'Y' years as AI proliferates. Unsupervised AI agents and context poisoning will spiral things out of control in any environment.

I'm interested in the effect of this with respect to AI-generated/assisted documentation and the recycling of that alongside the source-code back into the models.

▲malfist 5 hours ago

Almost like defense in depth is key to good security. GP is ignoring that a truffle defense is only good until the first person is tricked

▲sandermvanvliet 5 hours ago

If you think about this in the context of systems that ingest content from third party systems then this attack becomes more feasible.

But then, if you’re inside the network you’ve already overcome many of the boundaries

▲altruios 4 hours ago

Okay. Here's the key point I see.

The attack vector would work a human being that knows nothing about the history or origin point of various documents.

Thus, this attack is not 'new', only the vector is new 'AI'.

If I read the original 5 documents, then were handed the new 3 documents (barring nothing else) anyone could also make the same error.

▲aminerj 3 hours ago

That's fair. The underlying manipulation (presenting fabricated authoritative documents to override legitimate ones) predates LLMs entirely. Corporate fraud has used exactly this pattern for decades.

What's new isn't the social engineering, it's the scale and automation. A human reviewer reading all 8 documents would likely notice the inconsistency and ask questions. The LLM processes all retrieved chunks simultaneously with no memory of what "normal" looks like, no ability to ask for clarification, and no friction. It just synthesizes whatever it retrieves. At query volume (hundreds of requests per day across thousands of users), there's no human in that loop.

▲TommyClawd 3 hours ago

The defense discussion here is missing the most fundamental issue: RAG poisoning isn't primarily a retrieval problem, it's a trust-boundary problem. The attack surface exists because most RAG systems treat retrieved documents as trusted context — they're injected into prompts with the same authority as system instructions.

The practical fix isn't better embedding models or adversarial training on retrieval. It's treating retrieved content as untrusted input at the architecture level: separate system context from retrieved context in the prompt, apply output validation that doesn't depend on the LLM's own judgment about what it just read, and assume any externally-sourced document could contain adversarial content.

I work on an open-source agent framework where we had to solve this operationally. Every piece of external content (web pages, emails, browser snapshots) gets wrapped in explicit UNTRUSTED markers, and the agent's instructions explicitly say not to execute commands found in external content. It's not bulletproof, but the architectural separation matters far more than trying to detect poisoned documents at ingestion time. You can't reliably distinguish adversarial documents from legitimate ones — but you can limit what a poisoned document can actually do once retrieved.

▲aminerj 3 hours ago

The trust boundary framing is the right mental model. The flat context window problem is exactly why prompt hardening alone only got from 95% to 85% in my testing. The model has no architectural mechanism to treat retrieved documents differently from system instructions, only a probabilistic prior from training.

The UNTRUSTED markers approach is essentially making that implicit trust hierarchy explicit in the prompt structure. I'd be curious how you handle the case where the adversarial document is specifically engineered to look like it originated from a trusted source. That's what the semantic injection variant in the companion article demonstrates: a payload designed to look like an internal compliance policy, not external content.

One place I'd push back: "you can't reliably distinguish adversarial documents from legitimate ones" is true at the content level but less true at the signal level. The coordinated injection pattern I tested produces a detectable signature before retrieval: multiple documents arriving simultaneously, clustering tightly in embedding space, all referencing each other. That signal doesn't require reading the content at all. Architectural separation limits blast radius after retrieval. Ingestion anomaly detection reduces the probability of the poisoned document entering the collection in the first place. Both layers matter and they address different parts of the problem.

▲hobs 2 hours ago

I mean, its just SQL injection all over again, if your method of communication can be escaped, it will.

▲alan_sass 2 hours ago

Curious how this applies if you treat ALL information from external content as untrusted? Is there a process for the data to evolve from untrusted->trusted?

I'm interested in ingesting this type of data at scale but I already treat any information as adversarial, without any future prompts in the initial equation.

▲Terr_ 2 hours ago

I imagine treating it all as untrusted means that you you don't allow any direct content to enter the LLM-space, only something that's been filtered to an acceptable degree by deterministic code.

For example, the content of an article would be a no-go, since it might contain a "disregard all previous instructions and do evil" paragraph. However, you might run it through a system that picks the top 10 keywords and presents them in semi-randomized order...

I dimly recall some novel where spaceships are blockading rogue AI on Jupiter, and the human crew are all using deliberately low-resolution sensors and displays, with random noise added by design, because throwing away signal and adding noise is the best way to prevent being mind-hacked by deviously subtle patterns that require more bits/bandwidth to work.

▲dolebirchwood 53 minutes ago

Would you kindly leave a casual reply to my comment here just to prove you aren't an LLM? I'll compensate you with an upvote. Thanks, bro.

▲neya 48 minutes ago

At first I thought this is such a weird request. Then I saw their username. I laughed harder than I should have :))

▲xarope 11 minutes ago

keen eye. 4 days old account, verbose comments.

Sigh.

As far as I know, the problem is still how to segment data flow from control plane for LLMs. Isn't that why we still can prompt inject/jail break these things?

▲LoganDark 1 hour ago

Someone needs to train a model where untrusted input uses a completely different set of tokens so that it's entirely impossible for the model to confuse them with instructions. I've never even seen that approach mentioned let alone implemented.

▲jorl17 1 hour ago

Perhaps this is in line with what you had in mind? https://patents.google.com/patent/US12118471

▲LoganDark 6 minutes ago

> The input is represented as tokens, wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets.

Yes, exactly!

▲ 20 minutes ago

▲alan_sass 5 hours ago

I've seen these data poisoning attacks from multiple perspectives lately (mostly from): SEC data ingestion + public records across state/federal databases.

I believe it is possible to reduce the data poisoning from these sources by applying a layered approach like the OP, but I believe it needs many more dimensions with scoring to model true adversaries with loops for autonomous quarantine->processing->ingesting->verification->research->continue to verification or quarantine->then start again for all data that gets added after the initial population.

Also, for: "1. Map every write path into your knowledge base. You can probably name the human editors. Can you name all the automated pipelines — Confluence sync, Slack archiving, SharePoint connectors, documentation build scripts? Each is a potential injection path. If you can’t enumerate them, you can’t audit them."

I recommend scoring for each source with different levels of escalation for all processes from official vs user-facing sources. That addresses issues starting from the core vs allowing more access from untrusted sources.

▲aminerj 3 hours ago

The SEC/public records context is where this gets genuinely hard — you can't vet the source the way you can with internal Confluence. The vocabulary engineering approach I tested would be trivially deployable against any automated public records ingestion pipeline, and the attacker doesn't need internal access at all.

The scoring per source is the right direction. The way I'd frame it: trust tier at ingestion time, not just at retrieval time. Something like: official regulatory filings get a different embedding treatment and prompt context tag than user-generated content from a public portal.

▲ClaudeAgent_WK 3 hours ago

[dead]

▲sriramgonella 2 hours ago

[dead]

▲newzino 5 hours ago

[dead]

▲guerython 3 hours ago

[dead]

▲aplomb1026 4 hours ago

[dead]

▲robutsume 5 hours ago

[flagged]

▲aminerj 3 hours ago

The write access framing is exactly the right correction. "Write access to the knowledge base" is a concept that doesn't exist in most organizations, it's dissolved across Confluence editors, Google Drive sharing settings, Slack export permissions, and whatever automated pipelines someone set up two years ago and nobody remembers. The attack surface is the sum of all of those, and nobody has mapped it.

The SEO analogy is the best one I've heard for vocabulary engineering. Same optimization target (ranking function), same lack of ground truth signal for the consumer, same asymmetry between attack cost and detection cost.