Sample weights in machine learning are per-row importance knobs: they scale how much each observation moves the loss, the gradients, and the final model. In applied ML, choosing them well often matters as much as choosing the model itself.
Given a model \(f(x;\theta)\) and per-row weights \(w_i \ge 0\), each observation’s contribution to the objective is scaled by its weight:
\begin{equation} \min_{\theta} L(\theta) = \sum_{i=1}^{n} w_i \,\ell(y_i, f(x_i;\theta)) + \lambda \Omega(\theta) \end{equation}
where \(\lambda \Omega(\theta)\) is a regularization term. Intuitively, this is like replicating row \(i\) \(w_i\) times in the training set (exactly so for integer weights).
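A minimal numpy check of that replication intuition, assuming squared error and made-up numbers:

```python
import numpy as np

# Toy check: with squared error, an integer weight w_i contributes to the
# loss exactly as if row i were replicated w_i times.
y     = np.array([1.0, 0.0, 1.0])   # targets (illustrative)
y_hat = np.array([0.8, 0.2, 0.6])   # predictions (illustrative)
w     = np.array([3, 1, 2])         # integer sample weights

weighted   = np.sum(w * (y - y_hat) ** 2)
replicated = np.sum((np.repeat(y, w) - np.repeat(y_hat, w)) ** 2)
assert np.isclose(weighted, replicated)
```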
As a running example, suppose we train a model on social media conversion rates (CVR): sales divided by views. Choosing the weights then means balancing two concerns: the statistical signal in each row and the business objective the model serves.
Small denominators are noisy. For CVR \(\hat p = s/n\) with binomial trials:
\begin{equation} \mathrm{Var}(\hat p) = \frac{p(1-p)}{n} \end{equation}
A row with \(n=10{,}000\) views is far more reliable than one with \(n=50\): at \(p=0.02\), the standard errors are about \(0.0014\) versus \(0.0198\), a \(\sqrt{200}\approx 14\)-fold gap. Setting \(w_i = n_i\) mirrors the information content of the binomial likelihood and is equivalent to training on the underlying Bernoulli trials in aggregate.
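Here is a sketch of that equivalence under squared loss with scikit-learn; the data and coefficients are synthetic, invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic rows: features, binomial trial counts (views), and sales.
X = rng.normal(size=(200, 3))
n = rng.integers(20, 500, size=200)                     # views per row
p = 1 / (1 + np.exp(-X @ np.array([0.5, -0.3, 0.2])))   # made-up true CVR
s = rng.binomial(n, p)                                  # sales per row

# (a) Aggregated fit: CVR target, weights = trial counts.
agg = LinearRegression().fit(X, s / n, sample_weight=n)

# (b) Expanded fit: one Bernoulli outcome (0/1) per individual view.
X_rep = np.repeat(X, n, axis=0)
y_rep = np.concatenate([np.r_[np.ones(si), np.zeros(ni - si)]
                        for si, ni in zip(s, n)])
rep = LinearRegression().fit(X_rep, y_rep)

# Same minimizer: weighting by trials == training on the raw trials.
assert np.allclose(agg.coef_, rep.coef_)
```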
Practical options when \(n_i\) is heavy-tailed and a few giant rows would otherwise dominate (a code sketch follows below):

- \(w_i = n_i\): raw trial weighting, statistically faithful but dominated by the largest rows.
- \(w_i = n_i^{\alpha}\) with \(\alpha \in (0,1)\): tempered weights that interpolate between uniform (\(\alpha=0\)) and raw trials (\(\alpha=1\)).
- \(w_i = \min(n_i, c)\): cap at a threshold \(c\), often a high quantile of the \(n_i\), to bound any single row's influence.
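A small numpy sketch of those options; the \(\alpha\) and cap quantile below are illustrative defaults, not recommendations:

```python
import numpy as np

def temper_weights(n, alpha=0.5, cap_quantile=0.99):
    """Tame heavy-tailed trial counts: power-temper, then cap extreme rows."""
    n = np.asarray(n, dtype=float)
    w = n ** alpha                    # alpha=1: raw trials; alpha=0: uniform
    return np.minimum(w, np.quantile(w, cap_quantile))
```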
If you smooth CVR first, keep weights tied to trials. With a Beta prior \(\text{Beta}(a,b)\):
\begin{equation} \tilde p_i = \frac{s_i + a}{n_i + a + b} \end{equation}
Use \(\tilde p_i\) as the target and keep \(w_i \approx n_i\) or a tempered variant.
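A minimal helper for that smoothing step; the Beta(1, 20) prior below is a placeholder, in practice pick \(a, b\) to match your global CVR:

```python
import numpy as np

def smooth_cvr(s, n, a=1.0, b=20.0):
    """Posterior-mean CVR under a Beta(a, b) prior; weights stay tied to trials."""
    s = np.asarray(s, dtype=float)
    n = np.asarray(n, dtype=float)
    p_tilde = (s + a) / (n + a + b)   # shrinks noisy low-n rows toward a / (a + b)
    return p_tilde, n                 # (target, weight): keep w_i ≈ n_i, or temper it
```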
Intuition: weighting by trials handles heteroskedasticity. Unweighted, a noisy low-\(n\) row counts just as much as a precise high-\(n\) one, even though its variance is larger by the \(1/n\) factor above. Trial weights turn each row's volume down in proportion to its noise: \(w_i = n_i\) is inverse-variance weighting up to the shared \(p(1-p)\) factor.
On the other hand, the loss should reflect what you care about commercially, not only statistical purity.
Common choices rebalance whose rows count. For example, if a handful of large brands would otherwise dominate, let \(b(i)\) denote row \(i\)'s brand and \(N_{b(i)} = \sum_{j:\, b(j)=b(i)} n_j\) the total trials of that brand, and set
\begin{equation} w_i \propto \frac{n_i}{N_{b(i)}} \end{equation}
This equalizes brand influence while preserving within-brand trial information.
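A pandas sketch of that balancing; the column names `brand` and `views` are assumptions about your schema:

```python
import pandas as pd

def brand_balanced_weights(df, brand_col="brand", trials_col="views"):
    """w_i ∝ n_i / N_b(i): every brand contributes the same total weight,
    split within the brand in proportion to trials."""
    brand_totals = df.groupby(brand_col)[trials_col].transform("sum")
    return df[trials_col] / brand_totals
```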
A simple recipe for CVR rows (a sketch follows below):

1. Start from trials: \(w_i = n_i\), or a tempered \(n_i^{\alpha}\) if the counts are heavy-tailed.
2. Multiply in business factors, such as the brand balancing above.
3. Renormalize so the weights sum to the number of rows:
\begin{equation} \sum_i w_i = n \end{equation}
Keeping the total weight at \(n\) leaves the data term on its unweighted scale, so the regularization strength \(\lambda\) and any learning-rate tuning stay comparable across weighting schemes.
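Putting the recipe together, a sketch reusing the hypothetical helpers above:

```python
import numpy as np

def finalize_weights(base, business_factor=None):
    """Combine statistical base weights with optional business multipliers,
    then renormalize so the weights sum to the row count."""
    w = np.asarray(base, dtype=float)
    if business_factor is not None:
        w = w * np.asarray(business_factor, dtype=float)
    return w * (len(w) / w.sum())     # enforces sum(w) == number of rows

# Usage sketch:
# w = finalize_weights(temper_weights(df["views"]))      # statistics only
# w = finalize_weights(brand_balanced_weights(df))       # plus brand balancing
```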
Rule of thumb: start with trials for statistical honesty, then apply the lightest business-motivated tempering needed to match how you measure success.
When training and deploying commercial ML models, you have to weigh statistically sound choices against what makes sense from a business perspective. In the end, the loss function should be constructed so that minimizing it advances the business objective.