GPT-5.4 tops AI accounting test but errors remain high
DualEntry has published results from an independent benchmark of artificial intelligence systems on real accounting workflows, placing OpenAI's GPT-5.4 first among 19 models with 77.3% accuracy across 101 tasks.
The assessment focused on day-to-day finance work rather than general knowledge. It covered transaction classification, journal entry creation, bank reconciliation, financial reporting, and month-end close, along with accounts payable, accounts receivable, and a set of accounting knowledge questions.
Despite GPT-5.4's lead, the benchmark highlighted persistent reliability gaps for firms considering deeper automation in finance. No model scored above 80% overall, and most failed more than one-third of tasks, according to DualEntry's breakdown.
Gemini 3.1 Pro ranked second at 66%, more than 11 points behind GPT-5.4. Z.ai's GLM-5 and MiniMax M2.5 followed at 65.3% each. Anthropic's Claude Sonnet 4.6 scored 63.4%, and Claude Haiku 4.5 scored 61.4%.
Other models in the published top 10 included Claude Sonnet 4.5 at 59.4%, OpenAI's GPT-5.2 at 58.4%, GPT-5.1 at 57.4%, and Qwen3 Coder Next at 57.4%.
Older models lag
The results showed a sharp gap between newer reasoning models and older generations. GPT-4 scored 19.8% on the same task set, and a separate GPT-4-0613 entry also scored 19.8%.
Several other models posted low results, including Claude Opus 4.6 at 38.6%, Nemotron Nano 12B at 32.7%, and Gemini 2.5 Flash Lite at 27.7%.
The test used deterministic binary grading, with each answer marked correct or incorrect. DualEntry allowed multiple runs per model to calculate overall accuracy and difficulty tiers. Each model ran in an isolated environment without connections to external financial systems.
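The grading scheme described above can be sketched in a few lines. This is an illustrative reconstruction, not DualEntry's actual harness: the function names and data shapes are assumptions, but the logic matches the description, with each answer marked strictly correct or incorrect and multiple runs averaged into an overall accuracy figure.

```python
def grade_run(answers, expected):
    """Deterministic binary grading: each answer is 1 (correct) or 0 (incorrect)."""
    return [1 if a == e else 0 for a, e in zip(answers, expected)]

def overall_accuracy(runs, expected):
    """Average per-run accuracy across multiple runs of the same model."""
    per_run = [sum(grade_run(r, expected)) / len(expected) for r in runs]
    return sum(per_run) / len(per_run)

# Hypothetical example: two runs over three graded tasks.
expected = ["debit", "credit", "debit"]
runs = [
    ["debit", "credit", "credit"],  # 2 of 3 correct
    ["debit", "credit", "debit"],   # 3 of 3 correct
]
print(round(overall_accuracy(runs, expected), 3))  # 0.833
```

Under this kind of scheme a partially right journal entry earns no credit, which is one reason headline scores sit well below what a token-level similarity metric would report.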
DualEntry framed the results as a warning against assuming strong language generation translates into dependable accounting execution. AI systems can draft structured outputs such as transaction mappings and journal entries, but finance teams still need validated records, balanced entries, reconciliations that resolve, and reports that can withstand audit scrutiny.
"Even the best model still fails about one in four accounting tasks. That's why AI needs workflow guardrails before it can run financial operations autonomously," said Santiago Nestares, Cofounder of DualEntry.
Workflow coverage
The 101 tasks were divided into eight categories. Five had 13 tasks each: transaction classification, journal entry creation, accounts payable, financial reporting, and AI accounting knowledge. The remaining three, accounts receivable, bank reconciliation, and month-end close, had 12 tasks each.
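The stated counts can be checked against the 101-task total. A minimal sketch, with category names taken from the article and counts as reported:

```python
# Per-category task counts as reported in the benchmark summary.
task_counts = {
    "transaction classification": 13,
    "journal entry creation": 13,
    "accounts payable": 13,
    "accounts receivable": 12,
    "bank reconciliation": 12,
    "month-end close": 12,
    "financial reporting": 13,
    "AI accounting knowledge": 13,
}
print(sum(task_counts.values()))  # 101
```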
The tasks used a provisioned chart of accounts and minimal context, aiming to simulate environments where staff and systems must act on sparse transaction descriptions, incomplete memos, and accounting policies that vary by business.
The headline results suggest today's leading models still make errors that require review by finance staff. In practical terms, 77.3% accuracy means roughly a quarter of tasks would be completed incorrectly, producing misclassifications, imbalanced entries, or unreconciled differences if the work is not checked.
That failure pattern matters for businesses experimenting with AI in close processes, particularly with large transaction volumes and time pressure. It also matters in regulated environments where audit trails and controls must coexist with operational speed.
Providers tested
The benchmark included models from OpenAI, Google, Anthropic, Alibaba, Zhipu AI, MiniMax, Moonshot AI, and Nvidia. DualEntry evaluated 19 models in total and published a broader leaderboard beyond the top 10 highlighted in its summary.
The data also showed uneven performance across model families. Some clustered in the high-50s to mid-60s, while others fell well below 40%. The spread suggests model choice can materially change outcomes on similar accounting tasks, even before firms add their own prompts, policy rules, and internal checks.
DualEntry develops an AI-native ERP product and positions the benchmark as part of a broader discussion about deploying AI in core financial processes. Nestares argued that drafting ability alone is not enough for operational finance.
"Large language models are powerful drafting tools, but finance doesn't run on drafts; it runs on validated records," Nestares said. "The benchmark shows that AI can accelerate accounting workflows, but without system-level controls and validation, errors can quickly cascade through financial reporting."