How ‘Gender Shades’ Sheds Light on Bias in Machine Learning

To engage my AI and Data Protection students in a critical exploration of AI bias, I wanted a paper that cuts through the ubiquitous grey of opinion and belief and offers evidence rather than the everyday commentary we encounter on machine learning bias. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” by Buolamwini and Gebru (2018) is that paper. It is an essential read for anyone studying AI, bringing clarity and depth to a field often clouded by subjective viewpoints.

Gender Shades confronts a critical and often overlooked issue in the realm of Artificial Intelligence (AI): the inherent biases in commercial gender classification systems. This is particularly significant given the rapid integration of AI in various facets of society. The research is grounded in the emerging recognition that AI, if not carefully developed and monitored, can perpetuate and exacerbate existing societal biases. Buolamwini and Gebru uniquely focus on the intersectionality of bias, examining how AI systems perform across different combinations of gender and skin type, a perspective that was relatively underexplored in AI research up until their study.



The methodology adopted by Buolamwini and Gebru is both comprehensive and innovative. They evaluate three leading commercial gender classification systems: those developed by IBM, Microsoft, and Face++. To assess the performance of these systems, the authors use the Pilot Parliaments Benchmark (PPB), a dataset they created which includes a balanced representation of genders and skin tones. The PPB consists of 1,270 images of parliamentarians from three African and three European countries, designed to balance the scales in terms of skin type and gender representation, a significant departure from the usual datasets that are skewed towards lighter-skinned individuals. This methodological approach not only provides a more accurate assessment of the AI systems but also sets a new standard for evaluating AI bias.
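To make the idea of an intersectional evaluation concrete, here is a minimal Python sketch of how accuracy can be broken down by gender and skin type together. It is an illustration only, not the authors’ code: the record fields (`gender`, `skin_type`, `predicted`), the function name and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """Return {(gender, skin_type): accuracy} for labelled predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        # Group on both attributes at once, so each intersection is scored separately.
        key = (record["gender"], record["skin_type"])
        total[key] += 1
        if record["predicted"] == record["gender"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Hypothetical toy records, NOT data from the paper.
toy_records = [
    {"gender": "female", "skin_type": "darker", "predicted": "male"},
    {"gender": "female", "skin_type": "darker", "predicted": "female"},
    {"gender": "female", "skin_type": "lighter", "predicted": "female"},
    {"gender": "male", "skin_type": "darker", "predicted": "male"},
    {"gender": "male", "skin_type": "lighter", "predicted": "male"},
    {"gender": "male", "skin_type": "lighter", "predicted": "male"},
]

for group, accuracy in sorted(accuracy_by_subgroup(toy_records).items()):
    print(group, f"{accuracy:.0%}")
```

In this toy data the overall accuracy is five out of six, which looks respectable, yet the darker-skinned female subgroup sits at 50%. A single headline figure can hide exactly the kind of gap the paper set out to measure.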



The results of the study are both revealing and concerning. All three systems were most accurate, or very nearly so, for lighter-skinned males: 99.7% for IBM, 100% for Microsoft, and 99.2% for Face++. In stark contrast, the accuracy rates for darker-skinned females were significantly lower: 65.3% for IBM, 79.2% for Microsoft, and 65.5% for Face++. These figures highlight a glaring disparity: while the systems nearly perfected gender classification for lighter-skinned males, they misclassified darker-skinned females in roughly one case in five to one case in three.

Breaking these results down further, accuracy for lighter-skinned females was high but still below that of their male counterparts: 92.9% for IBM, 98.3% for Microsoft, and 94.0% for Face++. For darker-skinned males, every system performed far better than it did for darker-skinned females, with accuracies of 88.0% for IBM, 94.0% for Microsoft, and 99.3% for Face++, although the IBM and Microsoft figures still lagged behind those for lighter-skinned males.
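The size of the gap is easier to appreciate as arithmetic. The short Python sketch below takes only the four IBM accuracy figures quoted above and converts them into error rates and a best-to-worst gap. The variable names are my own, and the calculation is a simple illustration, not a metric defined in the paper.

```python
# IBM accuracy figures (percent) as quoted in the text above.
ibm_accuracy = {
    ("male", "lighter"): 99.7,
    ("female", "lighter"): 92.9,
    ("male", "darker"): 88.0,
    ("female", "darker"): 65.3,
}

# Error rate is simply 100 minus accuracy, rounded to one decimal place.
error_rate = {group: round(100.0 - acc, 1) for group, acc in ibm_accuracy.items()}

best = max(ibm_accuracy, key=ibm_accuracy.get)
worst = min(ibm_accuracy, key=ibm_accuracy.get)
gap = round(ibm_accuracy[best] - ibm_accuracy[worst], 1)

print("Best subgroup:", best, "error", error_rate[best], "%")
print("Worst subgroup:", worst, "error", error_rate[worst], "%")
print("Accuracy gap:", gap, "percentage points")
```

On these figures the gap between the best and worst served subgroups is 34.4 percentage points, with an error rate of 0.3% for lighter-skinned males against 34.7% for darker-skinned females.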

These statistics are alarming because they clearly illustrate systemic bias within these AI systems. The training data behind commercial systems is not public, but the pattern points to an underrepresentation of darker-skinned females in the data used to build and benchmark them, which would explain why the systems are less accurate for this group. This not only raises questions about the fairness and inclusivity of AI systems but also has real-world consequences, as these technologies are increasingly used in critical domains like security, employment, and law enforcement.



The implications of Buolamwini and Gebru’s findings extend far beyond the realm of technology into the broader social context. The significant discrepancies in AI performance across different demographics underscore a form of digital discrimination, where certain groups are more likely to be misidentified or excluded by automated systems. This bias in AI can reinforce and amplify existing societal prejudices, leading to a range of negative outcomes, from unfair job screening processes to biased law enforcement practices.

The paper’s findings demand an urgent re-evaluation of how AI systems are trained and deployed. It’s crucial for developers to use diverse datasets that represent the full spectrum of human diversity. Moreover, the findings advocate for increased transparency in AI development. Companies should be required to disclose the performance of their systems across different demographics, enabling users to understand and critique the technology they interact with.



In conclusion, “Gender Shades” by Buolamwini and Gebru is a seminal work in the field of AI ethics, shedding light on the pervasive issue of bias in AI, particularly in gender classification systems. The paper serves as a clarion call for the AI community to adopt more inclusive practices in the development of AI technologies. It emphasises the necessity of diversity, not only in datasets but also in the teams that create and refine these AI systems, to ensure a variety of perspectives are considered.

Based on their findings, Buolamwini and Gebru propose several key recommendations:


1 – Diverse Datasets

The authors emphasise the importance of using diverse datasets that include a balanced representation of all genders and skin tones. This would help in training AI systems that are more accurate and less biased.

2 – Intersectional Analysis

They advocate for intersectional analyses in AI testing. This involves considering multiple axes of identity (such as race, gender, age) simultaneously to understand how overlapping identities impact AI performance.

3 – Transparency and Accountability

The paper calls for increased transparency from companies developing AI technologies. They suggest that companies should disclose the demographics of the datasets used in training their models and the performance accuracy of their systems across different demographic groups.

4 – Inclusive Development Teams

The authors recommend the inclusion of diverse perspectives in AI development teams. This diversity can help in recognizing and addressing potential biases that might not be evident to a more homogenous group.

5 – Regulatory Oversight

Finally, they propose that there should be regulatory oversight to ensure that AI systems are fair and do not perpetuate existing biases.

These recommendations are aimed at guiding the AI community towards the development of more ethical, fair, and inclusive technologies. The paper “Gender Shades” not only exposes critical flaws in current AI systems but also provides a roadmap for addressing these challenges.
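The first and third recommendations both depend on knowing who is actually in a dataset. As a rough Python sketch of what such a composition check might look like, the example below counts each gender and skin type subgroup and flags any that fall well below an equal share. The function names, the `tolerance` threshold and the toy composition are all hypothetical choices for illustration; they do not come from the paper.

```python
from collections import Counter


def subgroup_shares(labels):
    """Return each subgroup's share of the dataset as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, tolerance=0.5):
    """Flag subgroups whose share is below `tolerance` times an equal split.

    The 0.5 default is an arbitrary illustrative threshold, not a standard.
    """
    equal_share = 1 / len(shares)
    return sorted(g for g, share in shares.items() if share < tolerance * equal_share)


# Hypothetical dataset composition, NOT taken from any real dataset.
toy_labels = (
    [("male", "lighter")] * 60
    + [("female", "lighter")] * 25
    + [("male", "darker")] * 10
    + [("female", "darker")] * 5
)

shares = subgroup_shares(toy_labels)
print(shares)
print("Underrepresented:", underrepresented(shares))
```

With four subgroups an equal split would be 25% each, so the default threshold flags anything under 12.5%. In the toy composition that catches both darker-skinned subgroups. Publishing a table like `shares` alongside per-subgroup accuracy is one simple way a vendor could act on the transparency recommendation.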

The authors provide practitioners with evidence and proposals that could lead to more equitable and just AI systems. This paper is a foundational text for those interested in understanding and rectifying AI bias, offering both a rigorous analysis of the problem and a hopeful pathway towards more ethical AI development.



Buolamwini, J. & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, Proceedings of Machine Learning Research 81:77-91.


By Nigel Gooding

LLM Information Rights Law & Practice. FBCS

PG Dip Information Rights Law and Practice

PG Cert Data Protection Law and Information Governance

PG Cert Management
