Imbalance is fine if your objective, loss, and metrics match the problem. It’s a problem when the minority class matters more or your loss/metric ignores that asymmetry. Fix with re-sampling or weights; calibrate probabilities; always evaluate on the original distribution.
Real data is skewed. Fraud, churn, or a rare disease can sit at 0.01%. A trivial classifier that predicts the majority class can hit high accuracy and still be useless.
Think OLS: if most observations have \(y=0\), a few \(y=1\) points won’t move the fit much. The loss is dominated by the zeros. That intuition generalizes to other models whose loss aggregates per-example errors.
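A quick numerical sketch of that intuition, on synthetic data (all names here are illustrative):

```python
# OLS on a heavily imbalanced 0/1 target: the squared loss is dominated by the zeros.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=(n, 1))
y = np.zeros(n)
pos = rng.choice(n, size=10, replace=False)  # ~0.01% positives
y[pos] = 1.0
X[pos] += 3.0                                # positives are shifted, i.e. separable in principle

# OLS fit via least squares
Xb = np.hstack([np.ones((n, 1)), X])
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
preds = Xb @ beta

print(beta)              # the slope stays tiny: the zeros dominate the fit
print(preds[pos].max())  # even true positives get predictions close to 0
```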
In classical stats courses, this isn’t a “problem” because the sample is treated as a random draw. If the true \(P(y=1)=0.01\%\), imbalance is expected and estimators remain unbiased under correct specification. In practice, however, we often care about decisions where errors on the minority class cost more.
Train on fewer majority examples to rebalance. This induces bias toward a more balanced world. Correct it by up-weighting the retained majority examples with sample weights (e.g. by the inverse of the sampling rate). Treat the sampling rate and the weights as hyper-parameters.
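A minimal sketch of under-sampling plus a corrective weight, assuming a pandas DataFrame `df` with a binary `label` column (illustrative names) and a scikit-learn model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

keep_rate = 0.05  # fraction of majority rows kept; tune as a hyper-parameter

majority = df[df["label"] == 0].sample(frac=keep_rate, random_state=0)
minority = df[df["label"] == 1]
train = pd.concat([majority, minority])

# Up-weight the retained majority rows by 1/keep_rate so the loss
# still reflects the original class frequencies.
weights = np.where(train["label"] == 0, 1.0 / keep_rate, 1.0)

model = LogisticRegression()
model.fit(train.drop(columns="label"), train["label"], sample_weight=weights)
```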
Duplicate minority examples (or simple variants). Conceptually similar to increasing their sample weights.
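A sketch of duplication with scikit-learn's `resample`, again assuming the illustrative `df` with a `label` column:

```python
import pandas as pd
from sklearn.utils import resample

minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Duplicate minority rows with replacement; how many copies is a hyper-parameter.
minority_up = resample(minority, replace=True, n_samples=10 * len(minority), random_state=0)
train = pd.concat([majority, minority_up])
```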
Directly weight classes. A simple starting point is inverse frequency:
\begin{equation} w_c = \frac{N}{n_c}, \end{equation}
where \(N\) is the total number of training samples and \(n_c\) the number of samples in class \(c\).
Tune weights like any other hyper-parameter.
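One way this might look with scikit-learn, assuming illustrative arrays `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / n_c for c, n_c in zip(classes, counts)}  # w_c = N / n_c

# sklearn's "balanced" mode uses the same idea, normalized by the number of classes.
balanced = compute_class_weight("balanced", classes=classes, y=y)

model = LogisticRegression(class_weight=weights)  # or class_weight="balanced"
model.fit(X, y)
```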
If downstream decisions use predicted probabilities, calibrate the model’s outputs so they match real-world frequencies; re-sampling and re-weighting distort them. Common choices: isotonic regression, beta calibration, or Platt scaling.
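A sketch with scikit-learn's `CalibratedClassifierCV` (isotonic regression here; `method="sigmoid"` gives Platt scaling), assuming illustrative splits `X_train`, `y_train`, `X_test`:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(class_weight="balanced")

# Cross-validated calibration; the held-out folds reflect the original class frequencies.
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]  # probabilities on the real-world scale
```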
Even if you re-sample or re-weight for training, validate and report metrics on the unaltered data to reflect reality.
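For instance, assuming an illustrative fitted `model` and an untouched test split `X_test`, `y_test`:

```python
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score

# Score on the original, un-resampled distribution, whatever was done at training time.
probs = model.predict_proba(X_test)[:, 1]

print("PR-AUC :", average_precision_score(y_test, probs))
print("ROC-AUC:", roc_auc_score(y_test, probs))
print(classification_report(y_test, (probs > 0.5).astype(int)))
```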