ML Sample Weights

Introduction

Sample weights in machine learning are per-row importance knobs. They scale how much each observation moves the loss, the gradients, and the final model. In applied ML, choosing them well is often one of the highest-leverage modeling decisions.

Given a model \(f(x;\theta)\) and per-row weights \(w_i \ge 0\), each observation’s contribution to the objective is scaled by its weight:

\begin{equation} \min_{\theta} L(\theta) = \sum_{i=1}^{n} w_i \,\ell(y_i, f(x_i;\theta)) + \lambda \Omega(\theta) \end{equation}

where \(\lambda \Omega(\theta)\) is regularization. Intuitively, this is like replicating row \(i\) about \(w_i\) times.
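To make the objective concrete, here is a minimal sketch using scikit-learn's `sample_weight` argument; the data, feature count, and weights are all illustrative, and the hand-computed loss at the end just mirrors the weighted objective above.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: 1,000 rows, 5 features, per-row weights (all values are made up).
X = rng.normal(size=(1000, 5))
y = X @ np.array([0.5, -1.0, 0.0, 2.0, 0.3]) + rng.normal(scale=0.1, size=1000)
w = rng.integers(1, 100, size=1000).astype(float)  # e.g. trial counts per row

# Most scikit-learn estimators accept per-row weights directly in fit().
model = Ridge(alpha=1.0)
model.fit(X, y, sample_weight=w)

# Written out by hand: each squared error is scaled by w_i before summing,
# plus the ridge penalty on the coefficients.
pred = model.predict(X)
weighted_loss = np.sum(w * (y - pred) ** 2) + 1.0 * np.sum(model.coef_ ** 2)
print(weighted_loss)
```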

What Weights?

Weights decide how much each row moves the loss. For example, if we train a model on social media conversion rates (CVR, sales divided by views), we need to consider both the statistical signal and the business objective when choosing the weights.

Statistical Signal

Small denominators are noisy. For CVR \(\hat p = s/n\) with binomial trials:

\begin{equation} \mathrm{Var}(\hat p) = \frac{p(1-p)}{n} \end{equation}

A row with \(n=10{,}000\) views is more reliable than one with \(n=50\). Setting \(w_i = n_i\) mirrors the information content of the binomial likelihood and is equivalent to training on the underlying Bernoulli trials in aggregate.
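As a quick numerical check (the numbers are illustrative), the sketch below evaluates the variance formula at \(n = 50\) versus \(n = 10{,}000\), and then verifies, using log loss as the example loss, that scoring an aggregated row with weight \(n\) equals the summed Bernoulli log loss over the underlying trials.

```python
import numpy as np

# Same true rate, very different trial counts: the standard error shrinks as 1/sqrt(n).
p = 0.05
for n in (50, 10_000):
    se = np.sqrt(p * (1 - p) / n)          # Var(p_hat) = p(1-p)/n
    print(f"n={n:>6}: std err of CVR = {se:.4f}")

# Weighted log loss on an aggregated row (target = s/n, weight = n) equals the
# summed Bernoulli log loss over the n underlying trials, for any prediction f.
n, s, f = 200, 9, 0.06
p_hat = s / n
weighted = -n * (p_hat * np.log(f) + (1 - p_hat) * np.log(1 - f))
expanded = -(s * np.log(f) + (n - s) * np.log(1 - f))
print(np.isclose(weighted, expanded))  # True
```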

Practical options when \(n_i\) is heavy-tailed:

  1. Cap the weight: \(w_i = \min\{n_i, c\}\).
  2. Temper it: \(w_i = n_i^{\beta}\) with \(\beta \in [0.5, 1]\) (\(\beta = 0.5\) gives \(w_i = \sqrt{n_i}\)).
  3. Do both: \(w_i = \min\{n_i^{\beta}, c\}\).

If you smooth CVR first, keep weights tied to trials. With a Beta prior \(\text{Beta}(a,b)\):

\begin{equation} \tilde p_i = \frac{s_i + a}{n_i + a + b} \end{equation}

Use \(\tilde p_i\) as the target and keep \(w_i \approx n_i\) or a tempered variant.
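A minimal sketch of the smoothing step; the column names and the Beta(2, 38) prior (centered on roughly a 5% CVR) are illustrative assumptions, not fixed choices.

```python
import numpy as np
import pandas as pd

# Illustrative aggregated CVR rows; column names are made up.
df = pd.DataFrame({
    "views": [50, 800, 12_000, 25],
    "sales": [0, 30, 540, 3],
})

a, b = 2.0, 38.0  # Beta(a, b) prior with mean a / (a + b) = 0.05

# Smoothed target: Beta posterior mean instead of the raw rate s/n.
df["cvr_smoothed"] = (df["sales"] + a) / (df["views"] + a + b)

# Weight stays tied to trials; a tempered variant (beta = 0.5 here) damps whales.
df["weight"] = df["views"] ** 0.5
print(df)
```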

Intuition: weighting by trials handles heteroskedasticity. Low-\(n\) rows are louder per observation but less trustworthy, so weights turn their volume down in proportion to their noise.

Business Objective

On the other hand, the loss should reflect what you care about commercially, not only statistical purity.

Common choices include multiplying weights by a business-value term or normalizing within groups so that no single group dominates. For example, to give each brand equal total influence:

\begin{equation} w_i \propto \frac{n_i}{N_{b(i)}} \end{equation}

where \(b(i)\) is the brand of row \(i\) and \(N_{b(i)}\) is that brand's total trials. This equalizes brand influence while preserving within-brand trial information.
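In pandas this is a one-line groupby; the brand and views columns below are made up for illustration.

```python
import pandas as pd

# Illustrative rows: one row per post, tagged with a brand.
df = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "views": [10_000, 5_000, 5_000, 300, 200],
})

# N_{b(i)}: total trials for the brand the row belongs to.
brand_total = df.groupby("brand")["views"].transform("sum")

# w_i proportional to n_i / N_{b(i)}: each brand's rows sum to the same total weight.
df["weight"] = df["views"] / brand_total
print(df.groupby("brand")["weight"].sum())  # 1.0 for every brand
```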

A simple recipe for CVR rows (sketched in code after the list):

  1. Smooth the rate to avoid 0 and 1 (e.g., the Beta posterior mean above).
  2. Use \(w_i = \min\{\sqrt{n_i}, c\}\) or, more generally, \(w_i = \min\{n_i^{\beta}, c\}\) with \(\beta \in [0.5, 1]\), to avoid whale domination.
  3. Optionally multiply by brand or value terms to reflect goals.
  4. Normalize to keep training dynamics stable:

\begin{equation} \sum_i w_i = n \end{equation}
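Putting the four steps together, a minimal sketch; the prior parameters, the exponent \(\beta\), the cap \(c\), the column names, and the optional per-row value term are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def cvr_targets_and_weights(df, a=2.0, b=38.0, beta=0.5, cap=100.0):
    """Sketch of the recipe: smooth the rate, temper and cap the trial weight,
    apply an optional business multiplier, then normalize so weights sum to n."""
    out = df.copy()

    # 1. Smooth the rate with the Beta posterior mean to avoid exact 0s and 1s.
    out["target"] = (out["sales"] + a) / (out["views"] + a + b)

    # 2. Temper and cap the trial-based weight to avoid whale domination.
    out["weight"] = np.minimum(out["views"] ** beta, cap)

    # 3. Optionally multiply by a business-value term (here a per-row value column).
    if "value" in out:
        out["weight"] *= out["value"]

    # 4. Normalize so the weights sum to the number of rows.
    out["weight"] *= len(out) / out["weight"].sum()
    return out

# Illustrative usage with made-up columns.
df = pd.DataFrame({
    "views": [50, 800, 12_000, 25],
    "sales": [0, 30, 540, 3],
    "value": [1.0, 1.0, 2.0, 0.5],
})
print(cvr_targets_and_weights(df))
```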

Rule of thumb: start with trials for statistical honesty, then apply the lightest business-motivated tempering needed to match how you measure success.

Conclusion

When training and deploying commercial ML models, you have to consider both statistically sound choices and what makes sense from a business perspective. In the end, the loss function, sample weights included, should be constructed so that optimizing it advances the business objectives.