Class Imbalance in ML


TL;DR

Imbalance is fine if your objective, loss, and metrics match the problem. It’s a problem when the minority class matters more or your loss/metric ignores that asymmetry. Fix with re-sampling or weights; calibrate probabilities; always evaluate on the original distribution.

Why it shows up

Real data is skewed. Fraud, churn, or a rare disease can sit at 0.01%. A trivial classifier that predicts the majority class can hit high accuracy and still be useless.
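A minimal sketch of that trap, on synthetic labels with a 0.01% positive rate (all numbers here are illustrative):

```python
# Majority-class baseline at a 0.01% positive rate: high accuracy, zero recall.
n = 100_000
labels = [0] * n
for i in range(0, n, 10_000):     # 1 positive per 10,000 -> 0.01% positive rate
    labels[i] = 1

preds = [0] * n                   # trivial classifier: always predict the majority

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
recall = sum(p == y == 1 for p, y in zip(preds, labels)) / sum(labels)

print(f"accuracy={accuracy:.2%}, recall={recall:.0%}")  # 99.99% accuracy, 0% recall
```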

OLS mental model

Think OLS: if most observations have \(y=0\), a few \(y=1\) points won’t move the fit much. The loss is dominated by the zeros. That intuition generalizes to other models whose loss aggregates per-example errors.

In classical stats courses, this isn’t a “problem” because the sample is treated as a random draw. If the true \(P(y=1)=0.01\%\), imbalance is expected and estimators remain unbiased under correct specification. In practice, however, we often care about decisions where minority errors cost more.
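A toy numpy illustration of the OLS intuition: with an intercept-only design and a 0.01% positive rate, the least-squares fit is just the base rate, so the single positive barely moves the prediction off zero.

```python
import numpy as np

n = 10_000
y = np.zeros(n)
y[0] = 1.0                        # one positive among 10,000 (0.01%)
X = np.ones((n, 1))               # intercept-only design matrix

# OLS: beta = (X'X)^{-1} X'y; with only an intercept, beta is the mean of y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[0])                    # ~0.0001: the fit sits on the majority
```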

When is imbalance a practical problem?

It becomes one when the minority class carries a higher misclassification cost, when the loss or metric you optimize ignores that asymmetry, or when there are so few minority examples that the model cannot learn their structure at all. If loss, metric, and decision costs are aligned with the problem, imbalance by itself is not a defect.

Fixes

Downsample the majority

Train on fewer majority examples to rebalance. This biases the model toward a more balanced world; correct for it by up-weighting the retained majority examples by the inverse of the sampling rate. Treat the sampling rate and the weights as hyper-parameters.
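A numpy sketch of downsampling with weight correction, using an illustrative 1% positive rate and a hypothetical keep rate of 0.1:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 990 + [1] * 10)           # 1% positives

keep_rate = 0.1                              # hyper-parameter: keep 10% of majority
majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)
kept = rng.choice(majority, size=int(keep_rate * majority.size), replace=False)

train_idx = np.concatenate([kept, minority])
# Up-weight retained majority examples by 1/keep_rate so the weighted
# class ratio matches the original distribution.
weights = np.where(y[train_idx] == 0, 1.0 / keep_rate, 1.0)

print(weights[y[train_idx] == 0].sum())      # 990.0: original majority mass restored
```

Passing `weights` as the `sample_weight` argument most estimators accept undoes the bias the downsampling introduced, while keeping the training set small.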

Oversample the minority

Duplicate minority examples (or simple variants). Conceptually similar to increasing their sample weights.
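A sketch of oversampling by duplication with numpy, balancing synthetic labels up to the majority count:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)           # 1% positives
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Draw minority indices with replacement until the classes are balanced.
extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
train_idx = np.concatenate([majority, minority, extra])

print(np.bincount(y[train_idx]))             # [990 990]
```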

Sample weights

Directly weight classes. A simple starting point is inverse frequency:

\begin{equation} w_i = \frac{N}{n_{c_i}}, \end{equation}

where \(N\) is the total number of samples and \(n_{c_i}\) is the number of samples in the class of example \(i\).

Tune weights like any other hyper-parameter.
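The inverse-frequency rule above amounts to a couple of lines of numpy on synthetic labels:

```python
import numpy as np

y = np.array([0] * 990 + [1] * 10)           # 1% positives
N = y.size
counts = np.bincount(y)                      # samples per class
w = N / counts[y]                            # w_i = N / n_{class(i)}

print(w[y == 0][0], w[y == 1][0])            # ~1.01 for majority, 100.0 for minority
```

With these weights each class contributes the same total mass to the loss, which is why inverse frequency is a common default before tuning.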

Probability calibration

If you re-sampled or re-weighted during training, the model’s predicted probabilities no longer match real-world frequencies; calibrate them on held-out data from the original distribution. Common choices: Platt scaling, isotonic regression, or beta calibration.
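Beyond fitted calibrators, when the skew was introduced by downsampling the majority at a known keep rate \(\beta\), the standard odds correction maps sampled-world probabilities back to the original distribution in closed form (a sketch; the function name is mine):

```python
def correct_downsampled_prob(p_s: float, beta: float) -> float:
    """Map a probability from a model trained on majority-downsampled data
    back to the original distribution. In sampled data the odds are inflated
    by 1/beta, so multiplying the odds by beta undoes the shift:
        p = beta * p_s / (beta * p_s - p_s + 1)
    where beta is the majority keep rate used during training."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

# A model trained with beta=0.1 that says 50% really means ~9.1%.
print(correct_downsampled_prob(0.5, 0.1))
```

With `beta=1` (no downsampling) the correction is the identity, as expected.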

Evaluate on the original distribution

Even if you re-sample or re-weight for training, validate and report metrics on the unaltered data to reflect reality.
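A pure-Python sketch of that final step: precision and recall computed on a held-out set at the original 1% base rate (the predictions are made up for illustration):

```python
# Metrics belong on the original distribution, even if training was rebalanced.
y_true = [1] * 10 + [0] * 990                       # real-world 1% positive rate
y_pred = [1] * 8 + [0] * 2 + [1] * 40 + [0] * 950   # hypothetical model output

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))

precision = tp / (tp + fp)                          # 8 / 48 ~ 0.17
recall = tp / (tp + fn)                             # 8 / 10 = 0.8
print(precision, recall)
```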