Class Imbalance in ML


TL;DR

Imbalance is fine if your objective, loss, and metrics match the problem. It’s a problem when the minority class matters more or your loss/metric ignores that asymmetry. Fix with re-sampling or weights; calibrate probabilities; always evaluate on the original distribution.

Why it shows up

Real data is skewed. Fraud, churn, or a rare disease can sit at 0.01%. A trivial classifier that predicts the majority class can hit high accuracy and still be useless.
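A minimal sketch of that trap, on synthetic labels with a 0.01% positive rate (all numbers here are illustrative):

```python
# Majority-class baseline at a 0.01% positive rate: high accuracy, zero recall.
n = 100_000
labels = [0] * n
for i in range(0, n, 10_000):     # 1 positive per 10,000 -> 0.01% positive rate
    labels[i] = 1

preds = [0] * n                   # trivial classifier: always predict the majority

accuracy = sum(p == y for p, y in zip(preds, labels)) / n
recall = sum(p == y == 1 for p, y in zip(preds, labels)) / sum(labels)

print(f"accuracy={accuracy:.2%}, recall={recall:.0%}")  # 99.99% accuracy, 0% recall
```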

OLS mental model

Think OLS: if most observations have \(y=0\), a few \(y=1\) points won’t move the fit much. The loss is dominated by the zeros. That intuition generalizes to other models whose loss aggregates per-example errors.

In classical stats courses, this isn’t a “problem” because the sample is treated as a random draw. If the true \(P(y=1)=0.01\%\), imbalance is expected and estimators remain unbiased under correct specification. In practice, however, we often care about decisions where minority errors cost more.
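A toy numpy illustration of the OLS intuition: with an intercept-only design and a 0.01% positive rate, the least-squares fit is just the base rate, so the single positive barely moves the prediction off zero.

```python
import numpy as np

n = 10_000
y = np.zeros(n)
y[0] = 1.0                        # one positive among 10,000 (0.01%)
X = np.ones((n, 1))               # intercept-only design matrix

# OLS: beta = (X'X)^{-1} X'y; with only an intercept, beta is the mean of y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[0])                    # ~0.0001: the fit sits on the majority
```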

When is imbalance a practical problem?

It becomes one when the minority class carries a higher misclassification cost, when the loss or metric you optimize ignores that asymmetry, or when there are so few minority examples that the model cannot learn their structure at all. If loss, metric, and decision costs are aligned with the problem, imbalance by itself is not a defect.

Fixes

Downsample the majority

Train on fewer majority examples to rebalance. This biases the model toward a more balanced world; correct for it by up-weighting the retained majority examples by the inverse of the sampling rate. Treat the sampling rate and the weights as hyper-parameters.
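A numpy sketch of downsampling with weight correction, using an illustrative 1% positive rate and a hypothetical keep rate of 0.1:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 990 + [1] * 10)           # 1% positives

keep_rate = 0.1                              # hyper-parameter: keep 10% of majority
majority = np.flatnonzero(y == 0)
minority = np.flatnonzero(y == 1)
kept = rng.choice(majority, size=int(keep_rate * majority.size), replace=False)

train_idx = np.concatenate([kept, minority])
# Up-weight retained majority examples by 1/keep_rate so the weighted
# class ratio matches the original distribution.
weights = np.where(y[train_idx] == 0, 1.0 / keep_rate, 1.0)

print(weights[y[train_idx] == 0].sum())      # 990.0: original majority mass restored
```

Passing `weights` as the `sample_weight` argument most estimators accept undoes the bias the downsampling introduced, while keeping the training set small.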

Oversample the minority

Duplicate minority examples (or simple variants). Conceptually similar to increasing their sample weights.
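A sketch of oversampling by duplication with numpy, balancing synthetic labels up to the majority count:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)           # 1% positives
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Draw minority indices with replacement until the classes are balanced.
extra = rng.choice(minority, size=majority.size - minority.size, replace=True)
train_idx = np.concatenate([majority, minority, extra])

print(np.bincount(y[train_idx]))             # [990 990]
```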

Sample weights

Directly weight classes. A simple starting point is inverse frequency:

\begin{equation} w_i = \frac{N}{n_{c_i}}, \end{equation}

where \(N\) is the total number of samples and \(n_{c_i}\) is the number of samples in the class of example \(i\).

Tune weights like any other hyper-parameter.
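The inverse-frequency rule above amounts to a couple of lines of numpy on synthetic labels:

```python
import numpy as np

y = np.array([0] * 990 + [1] * 10)           # 1% positives
N = y.size
counts = np.bincount(y)                      # samples per class
w = N / counts[y]                            # w_i = N / n_{class(i)}

print(w[y == 0][0], w[y == 1][0])            # ~1.01 for majority, 100.0 for minority
```

With these weights each class contributes the same total mass to the loss, which is why inverse frequency is a common default before tuning.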

Probability calibration

If you re-sampled or re-weighted during training, the model’s predicted probabilities no longer match real-world frequencies; calibrate them on held-out data from the original distribution. Common choices: Platt scaling, isotonic regression, or beta calibration.
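Beyond fitted calibrators, when the skew was introduced by downsampling the majority at a known keep rate \(\beta\), the standard odds correction maps sampled-world probabilities back to the original distribution in closed form (a sketch; the function name is mine):

```python
def correct_downsampled_prob(p_s: float, beta: float) -> float:
    """Map a probability from a model trained on majority-downsampled data
    back to the original distribution. In sampled data the odds are inflated
    by 1/beta, so multiplying the odds by beta undoes the shift:
        p = beta * p_s / (beta * p_s - p_s + 1)
    where beta is the majority keep rate used during training."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

# A model trained with beta=0.1 that says 50% really means ~9.1%.
print(correct_downsampled_prob(0.5, 0.1))
```

With `beta=1` (no downsampling) the correction is the identity, as expected.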

Evaluate on the original distribution

Even if you re-sample or re-weight for training, validate and report metrics on the unaltered data to reflect reality.
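A pure-Python sketch of that final step: precision and recall computed on a held-out set at the original 1% base rate (the predictions are made up for illustration):

```python
# Metrics belong on the original distribution, even if training was rebalanced.
y_true = [1] * 10 + [0] * 990                       # real-world 1% positive rate
y_pred = [1] * 8 + [0] * 2 + [1] * 40 + [0] * 950   # hypothetical model output

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))

precision = tp / (tp + fp)                          # 8 / 48 ~ 0.17
recall = tp / (tp + fn)                             # 8 / 10 = 0.8
print(precision, recall)
```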