ML Sample Weights

Introduction

Sample weights in machine learning are per-row importance knobs. They scale how much each observation moves the loss, the gradients, and the final model. In applied ML, choosing them well is often one of the highest-leverage modeling decisions.

Given a model \(f(x;\theta)\) and per-row weights \(w_i \ge 0\), each observation’s contribution to the objective is scaled by its weight:

\begin{equation} \min_{\theta} L(\theta) = \sum_{i=1}^{n} w_i \,\ell(y_i, f(x_i;\theta)) + \lambda \Omega(\theta) \end{equation}

where \(\lambda \Omega(\theta)\) is regularization. Intuitively, this is like replicating row \(i\) about \(w_i\) times.
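To make the objective concrete, here is a minimal sketch using scikit-learn's `sample_weight` argument; the data, feature count, and weights are all illustrative, and the hand-computed loss at the end just mirrors the weighted objective above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: 1,000 rows, 5 features, per-row weights (all values are made up).
X = rng.normal(size=(1000, 5))
y = X @ np.array([0.5, -1.0, 0.0, 2.0, 0.3]) + rng.normal(scale=0.1, size=1000)
w = rng.integers(1, 100, size=1000).astype(float)  # e.g. trial counts per row

# Most scikit-learn estimators accept per-row weights directly in fit().
model = Ridge(alpha=1.0)
model.fit(X, y, sample_weight=w)

# Written out by hand: each squared error is scaled by w_i before summing,
# plus the ridge penalty on the coefficients.
pred = model.predict(X)
weighted_loss = np.sum(w * (y - pred) ** 2) + 1.0 * np.sum(model.coef_ ** 2)
print(weighted_loss)
```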

What Weights?

Weights decide how much each row moves the loss. For example, if we train a model on social media conversion rates (CVR, sales divided by views), we need to consider both the statistical signal and the business objective when choosing the weights.

Statistical Signal

Small denominators are noisy. For CVR \(\hat p = s/n\) with binomial trials:

\begin{equation} \mathrm{Var}(\hat p) = \frac{p(1-p)}{n} \end{equation}

A row with \(n=10{,}000\) views is more reliable than one with \(n=50\). Setting \(w_i = n_i\) mirrors the information content of the binomial likelihood and is equivalent to training on the underlying Bernoulli trials in aggregate.
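As a quick numerical check (the numbers are illustrative), the sketch below evaluates the variance formula at \(n = 50\) versus \(n = 10{,}000\), and then verifies, using log loss as the example loss, that scoring an aggregated row with weight \(n\) equals the summed Bernoulli log loss over the underlying trials.

```python
import numpy as np

# Same true rate, very different trial counts: the standard error shrinks as 1/sqrt(n).
p = 0.05
for n in (50, 10_000):
    se = np.sqrt(p * (1 - p) / n)          # Var(p_hat) = p(1-p)/n
    print(f"n={n:>6}: std err of CVR = {se:.4f}")

# Weighted log loss on an aggregated row (target = s/n, weight = n) equals the
# summed Bernoulli log loss over the n underlying trials, for any prediction f.
n, s, f = 200, 9, 0.06
p_hat = s / n
weighted = -n * (p_hat * np.log(f) + (1 - p_hat) * np.log(1 - f))
expanded = -(s * np.log(f) + (n - s) * np.log(1 - f))
print(np.isclose(weighted, expanded))  # True
```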

Practical options when \(n_i\) is heavy-tailed:

  1. Cap the weight: \(w_i = \min\{n_i, c\}\).
  2. Temper it: \(w_i = n_i^{\beta}\) with \(\beta \in [0.5, 1]\) (\(\beta = 0.5\) gives \(w_i = \sqrt{n_i}\)).
  3. Do both: \(w_i = \min\{n_i^{\beta}, c\}\).

If you smooth CVR first, keep weights tied to trials. With a Beta prior \(\text{Beta}(a,b)\):

\begin{equation} \tilde p_i = \frac{s_i + a}{n_i + a + b} \end{equation}

Use \(\tilde p_i\) as the target and keep \(w_i \approx n_i\) or a tempered variant.
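A minimal sketch of the smoothing step; the column names and the Beta(2, 38) prior (centered on roughly a 5% CVR) are illustrative assumptions, not fixed choices.

```python
import numpy as np
import pandas as pd

# Illustrative aggregated CVR rows; column names are made up.
df = pd.DataFrame({
    "views": [50, 800, 12_000, 25],
    "sales": [0, 30, 540, 3],
})

a, b = 2.0, 38.0  # Beta(a, b) prior with mean a / (a + b) = 0.05

# Smoothed target: Beta posterior mean instead of the raw rate s/n.
df["cvr_smoothed"] = (df["sales"] + a) / (df["views"] + a + b)

# Weight stays tied to trials; a tempered variant (beta = 0.5 here) damps whales.
df["weight"] = df["views"] ** 0.5
print(df)
```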

Intuition: weighting by trials handles heteroskedasticity. Low-\(n\) rows are louder per observation but less trustworthy, so weights turn their volume down in proportion to their noise.

Business Objective

On the other hand, the loss should reflect what you care about commercially, not only statistical purity.

Common choices include multiplying weights by a business-value term or normalizing within groups so that no single group dominates. For example, to give each brand equal total influence:

\begin{equation} w_i \propto \frac{n_i}{N_{b(i)}} \end{equation}

where \(b(i)\) is the brand of row \(i\) and \(N_{b(i)}\) is that brand's total trials. This equalizes brand influence while preserving within-brand trial information.
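In pandas this is a one-line groupby; the brand and views columns below are made up for illustration.

```python
import pandas as pd

# Illustrative rows: one row per post, tagged with a brand.
df = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "views": [10_000, 5_000, 5_000, 300, 200],
})

# N_{b(i)}: total trials for the brand the row belongs to.
brand_total = df.groupby("brand")["views"].transform("sum")

# w_i proportional to n_i / N_{b(i)}: each brand's rows sum to the same total weight.
df["weight"] = df["views"] / brand_total
print(df.groupby("brand")["weight"].sum())  # 1.0 for every brand
```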

A simple recipe for CVR rows (sketched in code after the list):

  1. Smooth the rate to avoid 0 and 1 (e.g., the Beta posterior mean above).
  2. Use \(w_i = \min\{\sqrt{n_i}, c\}\) or, more generally, \(w_i = \min\{n_i^{\beta}, c\}\) with \(\beta \in [0.5, 1]\), to avoid whale domination.
  3. Optionally multiply by brand or value terms to reflect goals.
  4. Normalize to keep training dynamics stable:

\begin{equation} \sum_i w_i = n \end{equation}
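Putting the four steps together, a minimal sketch; the prior parameters, the exponent \(\beta\), the cap \(c\), the column names, and the optional per-row value term are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def cvr_targets_and_weights(df, a=2.0, b=38.0, beta=0.5, cap=100.0):
    """Sketch of the recipe: smooth the rate, temper and cap the trial weight,
    apply an optional business multiplier, then normalize so weights sum to n."""
    out = df.copy()

    # 1. Smooth the rate with the Beta posterior mean to avoid exact 0s and 1s.
    out["target"] = (out["sales"] + a) / (out["views"] + a + b)

    # 2. Temper and cap the trial-based weight to avoid whale domination.
    out["weight"] = np.minimum(out["views"] ** beta, cap)

    # 3. Optionally multiply by a business-value term (here a per-row value column).
    if "value" in out:
        out["weight"] *= out["value"]

    # 4. Normalize so the weights sum to the number of rows.
    out["weight"] *= len(out) / out["weight"].sum()
    return out

# Illustrative usage with made-up columns.
df = pd.DataFrame({
    "views": [50, 800, 12_000, 25],
    "sales": [0, 30, 540, 3],
    "value": [1.0, 1.0, 2.0, 0.5],
})
print(cvr_targets_and_weights(df))
```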

Rule of thumb: start with trials for statistical honesty, then apply the lightest business-motivated tempering needed to match how you measure success.

Conclusion

When training and deploying commercial ML models, you have to consider both statistically sound choices and what makes sense from a business perspective. In the end, the loss function, sample weights included, should be constructed so that optimizing it advances the business objectives.