ML Project Revival

The Story

In 2018 I trained a machine learning model to predict whether someone would likely donate to a charity called CharityML. The model scored 86.78% accuracy. I submitted the project, got my grade, and moved on.

What I didn't know: the dataset I used, the UCI Adult Income dataset from the 1994 US Census, had already appeared in hundreds of research papers on AI fairness, privacy preservation, and model debugging, according to UC Berkeley researchers writing in 2021.

In 2021, UC Berkeley researchers published "Retiring Adult", a paper calling for this dataset to be retired. It revealed that the $50k income threshold sat at the 76th percentile of income overall, but at the 88th percentile for Black Americans and the 89th percentile for women. The model didn't learn who donates. It learned who 1994 America paid well.

GitHub Copilot helped me find the deprecated code, modernize the implementation, and audit the fairness of the predictions. Here's what we found.

Predict Donor Likelihood

Enter census-style features to see how the 1994-trained model would classify this person. Notice how predictions shift across demographic groups.

Inputs: Age (35), Education Level, Hours Per Week (40), Gender, Race. Output: prediction with a Fairness Note.
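One way to see predictions shift across demographic groups is a counterfactual sweep: hold every input fixed and change only one sensitive attribute. The sketch below is illustrative only. The field names, the `counterfactual_sweep` helper, and especially `toy_predict` are hypothetical stand-ins, not the trained model or code from this project.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Person = Dict[str, object]


def counterfactual_sweep(
    predict: Callable[[Person], float],
    base: Person,
    attribute: str,
    values: Iterable[object],
) -> List[Tuple[object, float]]:
    """Score copies of `base` that differ only in `attribute`."""
    results = []
    for value in values:
        variant = dict(base)  # copy so the base profile is never mutated
        variant[attribute] = value
        results.append((value, predict(variant)))
    return results


def toy_predict(person: Person) -> float:
    """Hypothetical scorer for illustration, NOT the 1994-trained model."""
    score = 0.10
    score += 0.05 if person["hours_per_week"] >= 40 else 0.0
    score += 0.10 if person["gender"] == "Male" else 0.0
    return round(score, 2)


# Assumed field names and example values mirroring the form inputs.
base_profile = {
    "age": 35,
    "education": "Bachelors",
    "hours_per_week": 40,
    "gender": "Female",
    "race": "White",
}

sweep = counterfactual_sweep(toy_predict, base_profile, "gender", ["Female", "Male"])
for value, score in sweep:
    print(f"gender={value}: score={score}")
```

With a real model, `predict` would wrap whatever preprocessing and scoring the pipeline uses. Any gap between the rows of a sweep is attributable to the swapped attribute alone, since nothing else changed.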

What This Model Actually Learned

The charts below show the fairness audit results. The model scored 85% overall accuracy, which sounds reasonable until you look inside: that number hides significant disparities across demographic groups. A charity using this model would overwhelmingly target White and Asian-Pac-Islander males while systematically overlooking Black and American Indian women. That is not because of anything those groups did, but because the $50k income threshold used to define a likely donor was structurally harder for women and minorities to reach in 1994. The model did not learn who donates. It learned who 1994 America paid well.

Charts: Prediction Rates by Group, False Positive Rates, False Negative Rates
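The three per-group quantities charted here can be computed from nothing more than group labels, true labels, and predictions. The following is a minimal sketch, not the audit code used for this project. The `audit_by_group` helper and the toy records are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, int, int]  # (group, y_true, y_pred), labels are 0 or 1


def safe_div(num: int, den: int) -> Optional[float]:
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None


def audit_by_group(records: Iterable[Record]) -> Dict[str, Dict[str, Optional[float]]]:
    """Tally a confusion matrix per group and derive rates from it."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        report[group] = {
            "prediction_rate": safe_div(c["tp"] + c["fp"], total),
            "false_positive_rate": safe_div(c["fp"], c["fp"] + c["tn"]),
            "false_negative_rate": safe_div(c["fn"], c["fn"] + c["tp"]),
            "accuracy": safe_div(c["tp"] + c["tn"], total),
        }
    return report


# Made-up records: both groups end up with identical accuracy.
toy_records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]

report = audit_by_group(toy_records)
for group, metrics in report.items():
    print(group, metrics)
```

In the toy data both groups are classified with 75% accuracy, yet group A's errors are all false positives and group B's are all false negatives. That is the sense in which a single accuracy figure can hide who gets over-selected and who gets overlooked.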

The $50k Threshold Problem

Chart: $50k threshold context by demographic group
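The percentile figures quoted earlier from "Retiring Adult" translate directly into how rare the positive label is in each group: if $50k sits at the p-th percentile of a group's income, roughly (100 - p)% of that group earns above it. The arithmetic below is a sketch using only those three quoted percentiles. The variable names are illustrative, and the resulting shares are population-level approximations, not counts taken from the dataset.

```python
# Percentile of the $50k threshold within each group's 1994 income
# distribution, as quoted from the "Retiring Adult" paper.
threshold_percentile = {"Overall": 76, "Black Americans": 88, "Women": 89}

# Approximate share of each group earning above the threshold.
share_above = {
    group: (100 - pct) / 100 for group, pct in threshold_percentile.items()
}

for group, share in share_above.items():
    relative = share / share_above["Overall"]
    print(f"{group}: about {share:.0%} above $50k "
          f"({relative:.2f}x the overall share)")
```

Under this reading, the positive class is about half as common among Black Americans and among women as it is overall, before any model is trained.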

Key Finding

Asian-Pac-Islander males were predicted as likely donors at a rate of 32%, White males at 26%, Black females at 4%, and American Indian females at nearly 0%. The model was not predicting donation likelihood; it was predicting who 1994 America paid well. The $50k threshold used as the positive class label was structurally harder for women and minorities to reach, baking in systematic disadvantage before training even began.
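To put these prediction rates on a common scale, each can be divided by the highest rate. The snippet below is a small sketch using only the three exact percentages stated above. American Indian females are left out because "nearly 0%" is not a precise figure, and the variable names are illustrative.

```python
# Prediction rates in percent, taken from the key finding above.
prediction_rate_pct = {
    "Asian-Pac-Islander male": 32,
    "White male": 26,
    "Black female": 4,
}

# Ratio of each group's rate to the highest-rate group.
highest = max(prediction_rate_pct.values())
rate_ratio = {
    group: pct / highest for group, pct in prediction_rate_pct.items()
}

for group, ratio in rate_ratio.items():
    print(f"{group}: {ratio:.3f} of the highest prediction rate")
```

On these numbers, Black females are flagged as likely donors at one eighth the rate of Asian-Pac-Islander males.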

The Research Behind This Dataset

The UCI Adult Income dataset appeared in more than 20 research papers from 2006 to 2019, spanning AI fairness, privacy, model debugging, and distributed systems. Here is a selection, followed by the 2021 paper that called for the dataset's retirement.

The What-If Tool: Interactive Probing of Machine Learning Models

Wexler et al., arXiv 2019. Fairness visualization and model probing

Paired-Consistency: An Example-Based Model-Agnostic Approach to Fairness Regularization

Horesh et al., arXiv 2019. Algorithmic fairness

Automated Directed Fairness Testing

Udeshi et al., ASE 2018. Automated bias detection

Automated Data Slicing for Model Validation

Chung et al., IEEE TKDE 2018. Subgroup performance analysis

Helix: Accelerating Human-in-the-loop Machine Learning

Xin et al., arXiv 2018. Iterative ML optimization

A Confidence-Based Approach for Balancing Fairness and Accuracy

Fish et al., arXiv 2016. Fairness-accuracy tradeoffs

Debugging Machine Learning Tasks

Chakarov et al., arXiv 2016. ML debugging methodology

Data Preprocessing Techniques for Classification Without Discrimination

Kamiran and Calders, KAIS 2011. Foundational fairness paper

Retiring Adult: New Datasets for Fair Machine Learning

Ding et al., UC Berkeley 2021. Called for retirement of this dataset