A few months ago, a group of researchers ran a quiet experiment. They took 4,061 receipts and forms (real ones, drawn from public datasets in nine languages) and used two off-the-shelf AI image tools (Gemini 2.5 Flash Image and Ideogram v2 Edit) to alter the numbers. The amounts changed. Everything else stayed the same.
Then they pointed three of the best-known forgery detection systems at the doctored documents and asked a simple question: can you tell which ones are fake?
The answer was: not really.
That experiment, published in February 2026 as a benchmark called AIForge-Doc, is one of two recent preprints suggesting that the tools businesses rely on to spot tampered documents and AI-generated images are starting to fall behind the tools used to create them. The findings aren't peer-reviewed yet, but the direction of travel matters, especially if you're a finance leader whose payment controls assume that a "looks legitimate" PDF invoice is sufficient evidence.
Here's what the research actually shows, and what it might change for AP teams.
What the research found
The AIForge-Doc paper (arXiv preprint 2602.20569) tested three representative detectors against AI-inpainted financial documents:
- TruFor, a general-purpose forensic detector, dropped from an AUC of 0.96 on the older NIST16 benchmark of Photoshop-style forgeries to 0.751 on AI-inpainted documents. (AUC, the area under the ROC curve, runs from 0.5 for random guessing to 1.0 for perfect detection.) Still better than chance, but a long way from reliable.
- DocTamper, a document-specific detector, fell from 0.98 in-distribution to 0.563, which is barely above guessing.
- GPT-4o, asked to judge documents zero-shot, scored 0.509, which is statistically indistinguishable from a coin flip.
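Those AUC numbers have a concrete interpretation: the probability that a detector scores a randomly chosen forgery as more suspicious than a randomly chosen genuine document. A few lines of plain Python (the function and toy data here are illustrative, not from the paper) show why 0.509 is effectively a coin flip:

```python
import random

def auc(labels, scores):
    """AUC via the rank-sum formulation: the probability that a randomly
    chosen forgery (label 1) gets a higher suspicion score than a randomly
    chosen genuine document (label 0), with ties counting half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A detector that separates forgeries cleanly scores 1.0:
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # → 1.0

# A "detector" that assigns scores at random hovers around 0.5 —
# which is what "statistically indistinguishable from a coin flip" means.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(2000)]
scores = [random.random() for _ in labels]
print(round(auc(labels, scores), 2))
```

On this reading, DocTamper's 0.563 means it picks the forgery out of a forged/genuine pair barely more often than flipping a coin would.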
A second preprint posted in April 2026 (arXiv preprint 2604.25213, "When the Forger Is the Judge") put the same kind of question to OpenAI's image model GPT-Image-2 and found that even the model that produces the forgery cannot reliably recognise the documents it has tampered with.
That paper is only weeks old and has not been independently scrutinised yet, so treat it as an early signal rather than settled science.
What the research doesn't say
Before drawing strategic conclusions, it's worth being clear about the limits.
- Both papers are preprints. Neither has been peer-reviewed or accepted at a conference. The methodology in AIForge-Doc looks sound (real datasets, named tools, pixel-level masks, public detectors), but it hasn't been stress-tested by the broader research community yet.
- The benchmark is narrow. It covers numeric-field tampering on receipts and forms, using two specific inpainting APIs. It doesn't prove that all detectors are useless against all AI-generated content.
- TruFor at 0.751 is degraded, not broken. General-purpose forensic tools still outperform document-specific ones and still beat random. They're just nowhere near where they were on traditional Photoshop forgeries.
The honest summary: the strongest claim the evidence will support is that current detection tools degrade sharply against modern diffusion-based document tampering, and that the newest generative models produce forgeries with fewer of the artefacts those detectors were trained on. The headline version ("AI detectors can no longer tell the difference") might overstate the finding, but the trajectory is real.
Why this matters for finance leaders
The reason this research lands harder for finance teams than, say, social-media platforms is that the documents being forged are exactly the documents AP teams handle every day: invoices, receipts, vendor onboarding forms, bank-detail change requests and statutory declarations.
Three implications worth thinking through:
- Visual authenticity is no longer a control. If your AP process treats a clean-looking PDF invoice as confirmation that the underlying transaction is legitimate, you're relying on a signal that researchers have just shown is unreliable against current AI tools. The same applies to scanned bank-detail change forms, letterheads and screenshots used to support remittance changes.
- Image-based identity verification is exposed. Onboarding flows that rely on a photographed driver's licence or passport sit on the same shaky ground. Simonchik's work focuses on this directly. Underground markets are reportedly already selling AI-generated identity documents for as little as US$15.
- The detection arms race is structurally tilted. Generators improve faster than detectors because every new generative model is a new attack surface, and detectors trained on yesterday's artefacts struggle on tomorrow's outputs. This is unlikely to reverse.
None of this means AP fraud is suddenly inevitable. It means the controls that prevent it have to shift away from "does this look real?" toward "is this actually true?"
What to do about it
The practical response for finance teams isn't to buy a better detector. It's to stop relying on the document as the source of truth and start verifying the underlying facts.
- Verify vendor bank details out of band. Don't accept account changes via email, document upload or invoice header. Confirm them via a phone call to a number sourced from your own records, not from the request. Eftsure customers do this automatically: every vendor's banking details are independently verified and continuously monitored against a cross-checked database.
- Treat any change to vendor data as a high-risk event. New vendor onboarding and bank-detail changes are where impersonation attacks land. Apply tighter controls there than to routine invoice processing.
- Add real-time intelligence at the payment layer. Document review happens early in the AP process. Fraud often slips through because by the time someone in approvals second-guesses an invoice, the controls upstream have already cleared it. Real-time alerts at the moment of payment catch the cases that document checks miss.
- Don't rely on a single layer. The lesson from AIForge-Doc isn't "TruFor is bad" or "DocTamper is bad". It's that any single check, no matter how sophisticated, is a single point of failure against an attacker using current tools. Layered controls (verification, segregation of duties, monitoring and intervention) are what survive.
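The case for layering can be put in back-of-envelope numbers. If each control misses some fraction of fraudulent payments, and the controls fail independently (an optimistic assumption real-world controls only approximate), the chance a fraud slips past all of them is the product of the individual miss rates. The rates below are illustrative, not measured:

```python
def combined_miss_rate(miss_rates):
    """Probability a fraudulent payment evades every layer, assuming
    each layer's misses are independent of the others'."""
    out = 1.0
    for m in miss_rates:
        out *= m
    return out

# One strong detector that misses a quarter of forgeries:
print(combined_miss_rate([0.25]))              # → 0.25

# Three independent layers that each miss a quarter:
print(combined_miss_rate([0.25, 0.25, 0.25]))  # → 0.015625, about 1.6%
```

The point isn't the specific figures; it's that modest, imperfect layers compound, while a single sophisticated check fails all at once.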
Eftsure's position has always been that payment fraud is a finance problem before it's an IT problem, and that the controls that work are the ones built into the payment workflow, not bolted onto it.
This research doesn't change that argument, but it sharpens it.
The takeaway
The research is early, and current detectors aren't dead. But the trajectory is clear enough that finance leaders should stop treating "the document looks legitimate" as evidence of anything. The vendor exists or doesn't. The bank account belongs to them or doesn't. Those are facts you can verify. What the document looks like, increasingly, is whatever an attacker wants it to look like.
If you want to know how Eftsure verifies vendors and protects payments at the moment they leave your account, book a chat with us.