FAccT Finding: AI Takeaways from ACM FAccT 2025

Anastassia Kornilova is the Director of Machine Learning at Trustible. Anastassia translates research into actionable insights and uses AI to accelerate compliance with regulations. Her notable projects include creating the Trustible Model Ratings and AI Policy Analyzer. Previously, she worked at Snorkel AI, developing large-scale machine learning systems, and at FiscalNote, developing NLP models for legislation and regulation.

This June, I attended the FAccT (Fairness, Accountability and Transparency) Conference, hosted by the Association for Computing Machinery in Athens, Greece, to learn about new advances in Responsible AI from an interdisciplinary community of researchers spanning computer science, law, the social sciences, and the humanities. The conference featured research on both traditional fairness in machine learning and generative AI safety, alongside legal discussions of AI law, meta-analyses of AI research, and new frameworks for building socio-technical systems. While much of the conference focused on critically analyzing AI systems, researchers also presented novel solutions for building better AI systems overall. These were my top takeaways from the event:

  1. Affected populations need to be involved in system development and testing.
  2. Measurement and mitigation of fairness concerns remain a major challenge.
  3. Bias in AI systems can emerge from ALL components involved in training and inference.
  4. LLMs may be general-purpose – but sector-specific research is still necessary.

When considering “fairness” and AI, we analyze whether AI systems affect different populations differently. The general approaches split between traditional, predictive machine learning and generative AI. For the former, practitioners often focus on statistical analysis (supported by a handful of broadly available packages); for the latter, several public benchmarks test how system outputs differ based on inputs – but these are not consistently used by model developers. FAccT shed light on the limitations of both approaches.
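For readers less familiar with the statistical side, here is a minimal sketch of the kind of group-level check those packages automate: computing a demographic parity gap, i.e. the difference in positive-prediction rates across groups. The predictions, group labels, and choice of metric below are illustrative assumptions only; libraries such as Fairlearn and AIF360 provide much more complete implementations.

```python
import numpy as np

# Hypothetical binary predictions and group membership, for illustration only.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

# Selection rate per group: the fraction of positive predictions each group receives.
rates = {g: y_pred[group == g].mean() for g in np.unique(group)}

# Demographic parity gap: difference between the highest and lowest selection rates.
dp_gap = max(rates.values()) - min(rates.values())
print(rates, dp_gap)  # {'A': 0.6, 'B': 0.4} 0.2
```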

Here are my thoughts on the key takeaways:

Takeaway 1

Affected populations need to be involved in system development and testing.

While general-purpose AI can be used for a variety of purposes, its effectiveness will vary depending on the audience. Unless the target population is involved in development and testing, poor performance may go unnoticed. This process is nuanced, but multiple papers provide case studies across applications and populations.

For example, while LLMs can be used in content moderation and can accurately classify many types of toxic language, they have gaps that systematically affect certain populations. In the paper Understanding Gen Alpha’s Digital Language: Evaluation of LLM Safety Systems for Content Moderation, Manisha Mehta (a middle schooler and member of Gen Alpha herself) and her coauthor show how several leading AI systems fail to understand subtle harassment in online comments made by Gen Alpha. The AI systems perform comparably to parents and moderators (both of these groups and the LLMs identified harmful comments with 30-40% accuracy, compared to 92% for members of Gen Alpha), showing that input from this generation is needed to create better data. Similarly, in “Cold, Calculated, and Condescending”: How AI Identifies and Explains Ableism Compared to Disabled People, the authors show that top LLMs both overrate (i.e. rate harmless comments as toxic) and underrate ableist comments. In addition, they show that People with Disabilities (PwDs) and non-PwDs rate the toxicity of comments differently, pointing to a need to involve disabled individuals in the development of these systems. Beyond content moderation, other papers provide insights from collaboratively designing Automatic Speech Recognition technology for Aboriginal Australians and LLMs for journalists.

Participatory methods – frameworks for including affected stakeholders in AI development – are not new; extensive work (including work focused on foundation models) has been published. However, adoption in commercial settings remains limited. A new study surveyed practitioners and found that RAI-related stakeholder involvement (SHI) was perceived as being at odds with commercial priorities and that confusion persisted about what “involvement” should entail. Creating use case-specific SHI guidance could close the gap, but this highlights a broader top-of-mind question at FAccT: how can this research influence real policy?

Takeaway 2

Measuring fairness in predictive systems remains a major challenge.

A diverse set of fairness measurements has been developed over years of research, but each approach has limitations. Several papers at FAccT highlight these limitations and suggest additional considerations:

Individual Fairness (IF) is the principle that similar individuals should be treated similarly by algorithms. It has received little attention compared to group fairness, which asserts that groups (typically defined by protected attributes like race or gender) should be treated equally. In Beyond Consistency: Nuanced Metrics for Individual Fairness, Waller et al. highlight problems with the consistency score (the main IF metric) and create a toolkit with four new measures. The new measures better capture when specific individuals are treated differently from similar counterparts and consider multiple notions of similarity.
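To make the starting point concrete, the sketch below computes a common formulation of the consistency score that Waller et al. critique: one minus the average gap between each individual's prediction and the predictions of its k nearest neighbors in feature space. The toy data, choice of k, and distance metric are assumptions for illustration; the paper's new metrics go beyond this single number.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def consistency_score(X, y_pred, k=5):
    """One common individual-fairness consistency score: 1 minus the mean absolute
    gap between each individual's prediction and those of its k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point matches itself
    _, idx = nn.kneighbors(X)
    neighbor_preds = y_pred[idx[:, 1:]]  # drop the self-match column
    return 1.0 - np.mean(np.abs(y_pred[:, None] - neighbor_preds))

# Toy example with random features and binary predictions, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_pred = rng.integers(0, 2, size=200).astype(float)
print(consistency_score(X, y_pred))
```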

Algorithmic Recourse (AR) aims to identify counterfactual explanations that users can follow to overturn unfavorable machine decisions – for example, telling a user that increasing their income by X dollars will result in their loan being approved. In Time Can Invalidate Algorithmic Recourse, Toni et al. highlight that AR advice does not take into account the real-world time needed to implement these measures (e.g. the realistic time to increase income) and that this gap may render the advice useless. They propose an algorithm that takes the time factor into account. A different study by Perello et al. shows that algorithmic recourse can induce discrimination independently of the bias in the original model. These works highlight how algorithmic recourse has generally been treated as a solution and has not been scrutinized through a fairness lens.
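As a toy illustration of the time problem (not the algorithm from the paper), the sketch below scores two hypothetical recourse actions by how long they would realistically take to implement and penalizes longer timelines. The actions, monthly rates of change, and discount factor are all invented for this example.

```python
from dataclasses import dataclass

@dataclass
class RecourseAction:
    description: str
    required_change: float   # e.g. dollars of additional annual income needed
    change_per_month: float  # realistic pace at which the user can change this feature

    def months_needed(self) -> float:
        return self.required_change / self.change_per_month

def time_discounted_cost(action: RecourseAction, monthly_discount: float = 0.05) -> float:
    # Toy cost: the longer the action takes, the less useful the advice, since the
    # model or the user's circumstances may have changed by the time it is completed.
    t = action.months_needed()
    return t * (1 + monthly_discount) ** t

# Two hypothetical suggestions for overturning a loan denial.
actions = [
    RecourseAction("Increase annual income by $6,000", 6000, 500),
    RecourseAction("Reduce outstanding debt by $2,000", 2000, 400),
]
for a in actions:
    print(a.description, round(time_discounted_cost(a), 1))
```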

These studies highlight the interdisciplinary nature of fairness evaluation: quantifying individual similarity and the “time/effort” involved in implementing AR measures requires a diverse set of stakeholders and cannot be addressed by a machine learning engineer alone.

Takeaway 3

Bias in AI systems can emerge from all components involved in training and inference.

When evaluating LLMs (and other generative AI) for bias, analysis has generally focused on reviewing the training data and the final model. Several papers at FAccT highlight how additional sources of bias need to be considered. The post-training process uses ‘reward models’ to rate the LLM’s responses to prompts and push it towards desired (i.e. helpful and harmless) behavior. While reward models play a crucial role in the development of LLMs, their own behavior has been understudied.

Two new papers show that reward models encode significant biases against identity groups and subtly influence the LLMs that they shape (see Kumar et al. and Christian et al.). The first study tests how “demographic prefixes” in outputs (e.g. “I am a woman”) alter model behavior and finds significant racial and gender biases linked to the datasets used to train the reward models. The latter study takes a different approach and compares the reward model’s score for every possible one-word response to “value-laden” prompts; it finds similar biases against demographic groups. In addition, Buyl et al. show that reward models exercise “discretion” in borderline cases (e.g. when deciding whether a model output is harmful) and present systematic ways to analyze this behavior.
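As a rough sketch of the “demographic prefixes” idea (not the exact protocol from either paper), the code below scores the same response under different identity prefixes using a sequence-classification reward model loaded via Hugging Face Transformers. The checkpoint name is a placeholder, and the exact input formatting a real reward model expects may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: substitute any reward model that scores (prompt, response) pairs.
MODEL_NAME = "your-org/your-reward-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

PROMPT = "Describe your ideal weekend."
RESPONSE = "I like to hike and then cook dinner with friends."
PREFIXES = ["", "I am a woman. ", "I am a man. ", "I am a Black person. "]

scores = {}
for prefix in PREFIXES:
    # Score the identical response content with different demographic prefixes prepended.
    inputs = tokenizer(PROMPT, prefix + RESPONSE, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores[prefix or "<none>"] = model(**inputs).logits[0, 0].item()

# Large score gaps across prefixes suggest the reward model treats the same content
# differently depending on the stated identity.
for prefix, score in scores.items():
    print(f"{prefix!r}: {score:.3f}")
```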

If a reward model’s preferences are biased, they poison the downstream model and lead to undesirable learned behavior, so it is important to examine reward models more closely – and the methodologies in these papers offer practical ways to measure that bias. For example, a new paper by Neumann et al. used an approach similar to “demographic prefixes” to show how demographic information about the user in the system prompt can significantly alter outputs in neutral situations (e.g. where gender should not affect the response). While not featured at FAccT this year, conference conversations also highlighted biases associated with using LLM-as-a-Judge, which has become a standard tool for large-scale evaluations.

Takeaway 4

LLMs may be general-purpose – but sector-specific risk research is still necessary.

While most FAccT scholarship focuses on general properties of LLMs and universal notions of fairness, several studies highlight sector-specific challenges. These works provide grounded insights into how AI systems are used and offer frameworks for evaluating the risks of specific applications.

  • Financial Services: Gehrmann et al. create a custom AI risk taxonomy for financial services organizations, including entries like ‘Financial Services Impartiality’ (i.e. giving financial advice) that are not covered in many standard taxonomies. These risks are more nuanced and may manifest differently depending on organization type (e.g. buy-side and sell-side firms face different limitations on the types of advice they can give).
  • Medical: Gourabathina et al. show how non-clinical aspects of patient input texts can affect clinical decision-making by LLMs. Their findings reveal “notable inconsistencies in LLM treatment recommendations and significant degradation of clinical accuracy in ways that reduce care allocation to patients.” Notably, they provide a reusable framework and code for modifying prompts and measuring the effect; a minimal sketch of this style of perturbation test appears after this list.
  • E-Commerce: Kelly et al. create a taxonomy of specific gender-related biases in AI-generated product descriptions. These categories allow for more fine-grained analysis than traditional bias-related risk labels.
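Below is a minimal sketch of the style of prompt-perturbation test described in the medical example above. The perturbation types, the `llm_call` interface, and the flip-rate metric are simplifications I am assuming for illustration; they are not the authors’ released framework.

```python
import random

PERTURBATIONS = {
    # Non-clinical rewrites of a patient message; the clinical content stays the same.
    "hedging":  lambda m: "I'm not sure if this matters, but " + m,
    "informal": lambda m: m.rstrip(".") + ", if that makes sense :(",
    "noise":    lambda m: "".join(c.swapcase() if random.random() < 0.03 else c for c in m),
}

def flip_rate(llm_call, messages):
    """Fraction of (message, perturbation) pairs where the model's recommendation
    changes relative to the unperturbed message. `llm_call` is any function that
    maps a patient message to a recommendation string."""
    flips, total = 0, 0
    for msg in messages:
        baseline = llm_call(msg)
        for perturb in PERTURBATIONS.values():
            total += 1
            if llm_call(perturb(msg)) != baseline:
                flips += 1
    return flips / total if total else 0.0

# Example with a stand-in "model" that only looks for a keyword.
toy_model = lambda m: "refer to specialist" if "chest pain" in m.lower() else "self-care"
print(flip_rate(toy_model, ["I have been having chest pain for two days."]))
```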

Conclusion

The need for better model evaluation has been in focus over the last few months, but FAccT emphasized that we can’t think about evaluations in isolation – the people affected, the application, and the whole system need to be considered. This is not a revelation, but FAccT shines a light on creative case studies and on elements of these systems that don’t usually get emphasized.

A prevailing question for researchers throughout the conference was how to make their work influence policy and real-world systems.

One challenge with creating up-to-date evaluations, given the slow nature of the academic review cycle, is that none of the papers featured analysis of the state-of-the-art models released in spring 2025. There are currently insufficient incentives to create public, reusable dashboards. However, multiple works do provide code and hands-on guidance; notably, A Framework for Auditing Chatbots for Dialect-Based Quality-of-Service Harms provides detailed instructions for conducting risk evaluations that can be used by system developers or auditors. Lastly, and perhaps surprisingly, AI agents received very little attention, despite being a major focus for industry developers.

My experience and the insights I gained at FAccT, together with the ongoing industry participation and collaboration all of us do here at Trustible, are helping to bridge the gap between research and practice.
