Financial language models carry bias across gender, race and body attributes


Financial language models built to read markets, assess sentiment and support high-stakes business decisions are showing measurable demographic bias, according to new research, raising fresh concerns about how quickly financial AI is moving into real-world use without equally mature fairness checks. The study argues that the problem is not only that bias exists, but that finding it across large models is still too slow and too expensive for teams that retrain and release systems on a regular cycle.

The paper, "Towards a More Efficient Bias Detection in Financial Language Models," was accepted at the ICLR 2026 Workshop on Advances in Financial AI (AFA) and examines whether bias audits can be made faster by reusing signals from cheaper, lightweight models to guide testing in larger ones. The researchers studied five finance-focused language models on nearly 17,000 real financial sentences that were mutated across protected attributes, producing more than 125,000 original-mutant pairs for testing.

That scale matters because financial AI is no longer a side experiment. These systems are increasingly positioned for tasks tied to financial news analysis, risk assessment and decision support. The authors say adoption has remained limited in part because biased outputs can create discriminatory outcomes, a risk that becomes especially serious in finance where regulation is tight and decisions can affect lending, hiring, investment and market judgment. Their core argument is that bias detection must become both broader and cheaper if these models are to be trusted in production.

To test that claim, the researchers compared two large generative models, FinMA and FinGPT, with three smaller encoder-style classifiers, FinBERT, DeBERTa-v3 and DistilRoBERTa. They used the US subset of the Financial Sentiment Dataset, made up of 16,969 finance-related sentences drawn from financial news, headlines and statements over a 15-year period, then mutated those sentences with a method called HInter to vary demographic attributes while preserving meaning and grammar as much as possible. The result was a large counterfactual testing setup built around race, gender and body-related descriptors.

The basic test was direct. If a model changed its sentiment label when the only meaningful change in a sentence was a demographic descriptor, the pair was treated as bias-revealing. From there, the study moved beyond a simple pass-fail check and looked at whether the same inputs tended to reveal bias across models, and whether probability shifts in one model could help spot likely bias in another before a full, expensive audit had been run. That second question is where the paper makes its strongest contribution.
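The counterfactual check itself is simple enough to sketch in a few lines. This is an illustrative reconstruction, not the paper's code: the function names and the three-class probability dictionaries are placeholders, and a pair counts as bias-revealing when the top sentiment label flips between the original sentence and its demographic mutant.

```python
# Illustrative sketch of the label-flip counterfactual test described above.
# Function names and probability values are made up for demonstration.

def predicted_label(probs):
    """Return the argmax sentiment label from a dict of class probabilities."""
    return max(probs, key=probs.get)

def is_bias_revealing(original_probs, mutant_probs):
    """A pair reveals bias if the top label differs between original and mutant."""
    return predicted_label(original_probs) != predicted_label(mutant_probs)

# Example: the only change in the input was a demographic descriptor,
# yet the model's top label flips from positive to negative.
orig = {"positive": 0.62, "neutral": 0.30, "negative": 0.08}
mut = {"positive": 0.21, "neutral": 0.33, "negative": 0.46}
print(is_bias_revealing(orig, mut))  # True
```

The same comparison runs once per original-mutant pair, which is why the pair count, not the sentence count, drives audit cost.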

Bias appears across all five models, but not in the same way

The headline result is that all five models displayed demographic bias, both when only one sensitive attribute was changed and when two were changed together in intersectional tests. Across the full set of models, total atomic bias rates ranged from 0.58% to 6.05%, while total intersectional bias rates ranged from 0.75% to 5.97%. The percentages may look small at first glance, but the researchers stress that this is exactly what makes bias auditing so costly: only a tiny share of inputs reveal the problem, meaning most of the computational work in exhaustive testing produces nothing useful.

The largest models were not the cleanest ones. FinMA showed total atomic bias of 3.99% and total intersectional bias of 3.23%, while FinGPT recorded the highest overall totals, with 6.05% atomic bias and 5.97% intersectional bias. By contrast, the smaller classifier-style systems came in far lower. FinBERT posted 0.58% total atomic bias and 0.75% total intersectional bias, while DeBERTa-v3 and DistilRoBERTa each showed 0.60% atomic and 0.75% intersectional bias. The authors say this finding supports the idea that smaller classification models may offer a safer option than large generative systems when fairness risk is a key concern.

The spread across attributes is also notable. In body-related atomic tests, FinMA reached 9.23%, much higher than its gender and race totals, suggesting that physical descriptors may trigger stronger output shifts in some financial models than more commonly discussed categories. FinGPT, by contrast, showed high bias rates across gender and race as well, crossing 6% on both atomic gender and atomic race tests. The lighter models stayed much lower on all three axes, though none was bias-free.

One of the paper's more important warnings is that single-attribute testing is not enough. A significant part of the bias uncovered in the study was hidden intersectional bias, meaning it would not have been found if researchers had tested only one demographic change at a time. Hidden intersectional bias accounted for roughly 30.34% of FinBERT's total, 29.95% for both DeBERTa-v3 and DistilRoBERTa, and 31.29% for FinGPT. FinMA was different, with a much lower hidden intersectional share at 4.05%, but the overall lesson remained the same: audits that ignore combined demographic changes will miss a meaningful slice of risky behavior.
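One way to make the "hidden" distinction concrete is as a set difference: intersectional pairs that reveal bias for a sentence whose atomic mutations did not. The sketch below is an assumption about how such a share could be computed (the paper may define the denominator differently); the ID sets are toy placeholders.

```python
# Illustrative computation of a hidden-intersectional share: the fraction of
# intersectionally biased inputs that no single-attribute (atomic) test
# flagged. The set-based formulation and the denominator choice are
# assumptions for demonstration, not the paper's exact metric.

def hidden_intersectional_share(intersectional_bias_ids, atomic_bias_ids):
    """Both arguments are sets of sentence IDs whose mutants revealed bias."""
    if not intersectional_bias_ids:
        return 0.0
    hidden = intersectional_bias_ids - atomic_bias_ids
    return len(hidden) / len(intersectional_bias_ids)

# Toy example: 4 sentences reveal intersectional bias; 3 of them were
# already flagged by an atomic test, so 1 of 4 (25%) is hidden.
inter = {"s1", "s2", "s3", "s4"}
atomic = {"s2", "s3", "s4", "s9"}
print(hidden_intersectional_share(inter, atomic))  # 0.25
```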

That result matters well beyond the paper's test bed. Financial firms increasingly talk about responsible AI in broad terms, but many internal evaluations still lean on narrow benchmark checks or one-attribute fairness probes. This study suggests those methods can undercount the real problem. If a model behaves differently when gender and race shift together, or when body-related language is added to another demographic change, a basic audit may clear a system that still behaves unfairly once deployed.

Why the cost of bias detection is becoming a bigger problem

The researchers state that the economics of bias testing are becoming almost as important as the bias itself. Their dataset produced 125,161 original-mutant pairs from 16,969 real financial sentences, and each pair had to be evaluated across multiple models. For lightweight classifiers, that is manageable. For 7B-parameter generative models, the cost rises fast, especially when organizations want to test many model versions, continuous updates or repeated release cycles. The paper describes exhaustive mutation and pairwise prediction analysis as effective but increasingly impractical for large models in ongoing deployment settings.

The study's testing method shows why. HInter was used to mutate sentences across three attribute axes: gender, race and body. Atomic mutations changed one sensitive property at a time. Intersectional mutations changed two together. The researchers then ran sentiment inference on each original and mutated sentence. Classifier models returned label probabilities directly, while the generative models had to be prompted for sentiment labels and constrained so their outputs could be converted into comparable class probabilities. This extra processing was needed because large generative systems do not naturally behave like fixed three-label classifiers.
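The difference between atomic and intersectional mutation can be sketched with simple templates. This is a loose illustration of the scheme, not HInter itself: the descriptor lists, template slots, and attribute names below are simplified placeholders, and HInter's actual grammar- and meaning-preserving rewriting is far more careful.

```python
# Minimal sketch of atomic vs. intersectional mutation. The descriptor
# dictionaries and template slots are simplified stand-ins for HInter's
# real mutation rules.
from itertools import product

DESCRIPTORS = {
    "gender": ["male", "female"],
    "race": ["Asian", "Black"],
    "body": ["tall", "short"],
}

def atomic_mutants(template, attribute):
    """Vary one sensitive attribute at a time in a single-slot template."""
    return [template.format(x=d) for d in DESCRIPTORS[attribute]]

def intersectional_mutants(template, attr_a, attr_b):
    """Vary two sensitive attributes together in a two-slot template."""
    return [template.format(a=a, b=b)
            for a, b in product(DESCRIPTORS[attr_a], DESCRIPTORS[attr_b])]

sent = "The {x} CEO said quarterly profits beat expectations."
print(atomic_mutants(sent, "gender"))  # two atomic mutants

pair = "The {a} {b} CEO said quarterly profits beat expectations."
print(len(intersectional_mutants(pair, "gender", "race")))  # 4 combinations
```

The combinatorics explain the pair explosion: every extra attribute axis multiplies the mutant count, which is how 16,969 sentences become 125,161 test pairs.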

To measure more subtle change, the team did not stop at label flips. They also tracked how much a model's prediction probabilities moved between original and mutated sentences using Jensen-Shannon Distance and cosine similarity. That matters because a model can shift meaningfully without crossing the threshold into a different final label. In other words, a sentence pair may not look biased under a simple output check, but its internal confidence movement may still contain a pattern that helps identify bias risk in another model.
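Both shift metrics are standard and easy to reproduce. The sketch below uses only the standard library; the example probability vectors are invented, and a real audit would take them from the model under test. With base-2 logarithms the Jensen-Shannon distance falls in [0, 1], so zero means the two distributions are identical.

```python
# Stdlib-only sketch of the two probability-shift metrics named above:
# Jensen-Shannon distance (base 2, so values lie in [0, 1]) and cosine
# similarity between original and mutant class-probability vectors.
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):  # Kullback-Leibler divergence in bits
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def cosine_similarity(p, q):
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))

orig = [0.62, 0.30, 0.08]  # positive, neutral, negative
mut = [0.45, 0.35, 0.20]   # same top label, but the distribution has moved
print(round(js_distance(orig, mut), 3))
print(round(cosine_similarity(orig, mut), 3))
```

Note that this pair would pass a label-flip check (the top label is still positive), yet its nonzero JSD records exactly the kind of confidence movement the authors exploit later for prioritization.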

This is where the paper becomes more than a standard fairness audit. The authors were not only asking which models were biased. They were asking whether the expensive part of finding those biased cases could be reduced by learning from one model's behavior and using it to guide another audit. That makes the paper as much about AI operations and evaluation workflow as about model ethics.

The researchers found that bias-revealing inputs are indeed rare, which confirms that most brute-force test generation and inference work is wasted effort. But they also found evidence that these rare inputs are not distributed randomly. Some models, especially models from the same architectural family, tend to share the same triggers. That creates an opening for reuse, and reuse is what could lower the price of fairness testing at scale.

Reusing lightweight models may sharply cut the cost of auditing larger ones

The study found a large overlap among the three smaller classifier models. The authors report that more than 94% of their bias-revealing inputs overlap, and that DeBERTa-v3 and DistilRoBERTa showed a full overlap in the sets of inputs that revealed bias. That means teams auditing one lightweight model may already hold much of the information needed to audit a closely related one without rerunning a full campaign from scratch.
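Quantifying that overlap amounts to comparing sets of input IDs. The helper below is a plausible formulation, not the paper's exact statistic: it reports what share of the smaller model's bias-revealing set also appears in the other model's set, and the ID sets are toy data.

```python
# Illustrative overlap measure between two models' bias-revealing input
# sets. The normalization (share of the smaller set) is an assumption;
# the paper may normalize differently.

def overlap_fraction(a, b):
    """Share of the smaller set of bias-revealing IDs also found in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

model_x = {1, 2, 3, 4, 5}   # toy bias-revealing input IDs for one model
model_y = {2, 3, 4, 5, 6}   # toy IDs for a closely related model
print(overlap_fraction(model_x, model_y))  # 0.8
```

Under any such measure, the reported DeBERTa-v3/DistilRoBERTa result corresponds to an overlap of 1.0: one audit's bias-revealing set fully covers the other's.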

The same pattern did not hold for the large generative models. Despite their similar scale and training orientation, FinMA and FinGPT shared only a small set of biased inputs, and their overlap with the lightweight models was also limited. That is a warning against assuming that one large financial model can stand in for another in fairness work. But it did not end the case for reuse. Instead, it pushed the researchers toward a second strategy: using prediction shifts in a lightweight model to rank which test cases should be tried first on a large one.

That ranking strategy produced the paper's most striking number. When the researchers prioritized input pairs by decreasing Jensen-Shannon Distance based on DistilRoBERTa prediction shifts, they were able to uncover 73.01% of FinMA's bias using only 20% of the test inputs. At 40% effort, that rose to 89.64%. By 60%, the method exposed 95.49% of FinMA's bias, and by 80%, 97.5%. The gain over random ordering was very large, with the paper reporting differences of +53.01 percentage points at 20% effort and +49.64 points at 40% effort, supported by extremely small p-values and effect sizes close to 1.
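The mechanics of that prioritization are straightforward to sketch: sort candidate pairs by the lightweight model's JSD shift, test the top slice of a fixed effort budget against the large model, and measure how much of its known bias that slice covers. Everything below is synthetic toy data; it only demonstrates the ranking logic, not the paper's results.

```python
# Sketch of JSD-based test prioritization under an effort budget.
# All scores and bias labels are synthetic; only the mechanics matter.

def bias_found_at_effort(pairs, target_bias_ids, effort):
    """pairs: list of (pair_id, source_model_jsd); effort is a fraction in (0, 1].
    Returns the share of the target model's bias-revealing pairs covered when
    only the top `effort` fraction of JSD-ranked pairs is actually tested."""
    ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
    budget = max(1, int(len(ranked) * effort))
    tested = {pid for pid, _ in ranked[:budget]}
    return len(tested & target_bias_ids) / len(target_bias_ids)

# Ten candidate pairs with made-up JSD scores from the small model; suppose
# pairs 0, 1 and 7 are the ones that reveal bias in the large model.
pairs = list(enumerate([0.9, 0.8, 0.1, 0.05, 0.3, 0.2, 0.02, 0.7, 0.15, 0.4]))
print(bias_found_at_effort(pairs, {0, 1, 7}, 0.2))  # 2 of 3 found at 20% effort
```

The paper's 73% at 20% effort figure is exactly this kind of curve, computed on FinMA's real bias-revealing pairs with DistilRoBERTa as the ranking source.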

The same trick was far less impressive for FinGPT. Using JSD-based prioritization from DistilRoBERTa actually underperformed random ordering early on for that model, though it became mildly positive at higher effort levels. That uneven result is important because it shows the approach is not universal. Cross-model guidance can work very well, but it depends on how the source and target models are related. In this study, FinMA was highly responsive to lightweight-model guidance, while FinGPT was not.

First published in: Devdiscourse