Can AI Judge Economic Risk? IMF Tests GPT on Global Surveillance Reports
An IMF study finds that advanced AI models can reliably extract factual information and approximate human judgments when reviewing complex economic reports, achieving up to 75–80 percent accuracy in recent years. However, the models remain overly optimistic and struggle with nuanced evaluations, meaning they can assist economists but not replace them.
At the International Monetary Fund, economists spend months preparing and reviewing detailed reports on the economic and financial health of countries. These Article IV staff reports assess growth, debt, banking system risks and policy responses. They are central to global economic surveillance.
Now, the IMF has tested whether artificial intelligence can help with that work. In a new study, IMF economists Paola Ganum and Tohid Atashbar examined whether large language models such as GPT-4o, GPT-4.1, GPT-o1 and GPT-5 can analyze these complex reports with accuracy close to that of human experts.
Instead of testing AI on simple questions, the researchers gave it 543 real IMF reports published between 2016 and 2024. The goal was clear: Can AI meaningfully assist economists in reviewing macrofinancial risks?
Putting GPT to the Test
Each report was fed into different GPT models using an automated system. The AI was asked to complete the same structured questionnaire used by IMF economists.
It had to rate each report from 1 to 4 in three key areas:
- How well macrofinancial risks were integrated into the main economic analysis
- How clearly systemic risks and vulnerabilities were identified
- Whether policy advice properly addressed those risks
The models also answered 40 yes-or-no questions. These covered factual issues such as whether the report discussed banking sector weaknesses, financial soundness indicators, emerging financial risks or regulatory policies.
Human economist ratings served as the benchmark. The AI's answers were then compared against them using measures such as accuracy and agreement rates.
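The comparison described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the IMF's actual pipeline: the helper functions, the sample answers, and the ratings below are all made up to show how exact-match accuracy on yes-or-no questions and "within half a point" agreement on 1-to-4 ratings could be computed.

```python
# Hypothetical sketch of the benchmark comparison (illustrative only).

def accuracy(ai_answers, human_answers):
    """Share of questions where the model's answer matches the human's exactly."""
    matches = sum(a == h for a, h in zip(ai_answers, human_answers))
    return matches / len(human_answers)

def within_half_point(ai_ratings, human_ratings):
    """Share of 1-4 ratings where the model is within 0.5 of the human score."""
    close = sum(abs(a - h) <= 0.5 for a, h in zip(ai_ratings, human_ratings))
    return close / len(human_ratings)

# Made-up yes/no answers standing in for a few of the 40 factual questions
human_yes_no = [True, False, True, True, False]
ai_yes_no = [True, False, True, False, False]

# Made-up 1-4 ratings standing in for the three judgment areas
human_ratings = [3, 2, 4]
ai_ratings = [3.5, 3, 4]

print(accuracy(ai_yes_no, human_yes_no))            # 0.8
print(within_half_point(ai_ratings, human_ratings))
```

In practice a study like this would also report agreement statistics that correct for chance, but simple match rates are enough to convey the idea.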
Where AI Performs Well
The results show that AI performs best on structured, factual questions. On the 40 yes-or-no questions, advanced models such as GPT-o1 and GPT-5 matched human answers roughly 75 to 80 percent of the time in the most recent years.
This suggests that AI is quite reliable when it comes to identifying whether specific topics are mentioned in long reports. If the task is to extract information, the technology is already strong.
Performance on rating tasks has improved significantly over time. Early versions such as GPT-4o struggled, especially with older reports. But newer models, particularly GPT-o1 and GPT-4.1, reached accuracy levels in the mid-70 percent range for recent reports. In many cases, the AI's rating was within half a point of the human score.
The researchers also found that AI results are fairly consistent. When the same report was run again using the same settings, the answers were largely stable. That reduces concerns about randomness in high-stakes analysis.
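A consistency check like the one described can be sketched as follows. Again, this is a hypothetical illustration rather than the study's code: it runs the same questionnaire several times (the example answers are made up) and measures, per question, how often the runs agree with the most common answer.

```python
# Hypothetical repeated-run stability check (illustrative only).
from collections import Counter

def answer_stability(runs):
    """Average share of runs agreeing with the modal answer, per question."""
    n_runs = len(runs)
    n_questions = len(runs[0])
    total = 0.0
    for q in range(n_questions):
        answers = [run[q] for run in runs]
        modal_count = Counter(answers).most_common(1)[0][1]
        total += modal_count / n_runs
    return total / n_questions

# Three made-up runs of the same report over four yes/no questions
runs = [
    ["yes", "no", "yes", "yes"],
    ["yes", "no", "yes", "no"],
    ["yes", "no", "yes", "yes"],
]
print(answer_stability(runs))
```

A score of 1.0 would mean every run answered every question identically; values near 1.0 indicate the kind of stability the researchers report.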
The Optimism Problem
Despite these gains, one clear pattern emerged: AI tends to be more generous than human reviewers.
Across years and country groups, the models gave higher ratings on average than IMF economists did. They were less likely to assign low scores and more likely to cluster in the upper-middle range.
Why? One explanation is that large language models are trained to be helpful and non-confrontational, which may make them less critical. Another possibility is that human economists implicitly compare reports against each other, while AI evaluates each document on its own without that broader context.
The models also struggled more with complex, open-ended questions. When deeper judgment and interpretation were required, agreement with human experts dropped sharply. Exact rating matches were rare, even when the AI was generally close.
A Tool, Not a Replacement
The study does not suggest that AI can replace economists. Instead, it points to a more realistic role: AI as a support tool.
The models are particularly strong at scanning long documents, extracting factual information and flagging key topics. That alone could save time in large-scale review exercises. But nuanced judgment, context and critical assessment still require human expertise.
Interestingly, AI ratings were influenced by similar factors as human ratings. Reports that included well-articulated systemic risk discussions or recent financial sector reviews tended to receive higher scores from both humans and machines. This suggests the AI is not guessing randomly, but identifying meaningful signals.
The overall message is balanced. Large language models have improved quickly and can now assist with complex economic analysis. However, they still show bias and struggle with deeper reasoning tasks.
For global institutions facing growing workloads, the future may lie in collaboration rather than replacement. Economists bring experience, context and judgment. AI brings speed and consistency. Together, they may strengthen the way the world monitors financial risks.
First published in: Devdiscourse