I recently ran into a practical modeling issue at work: I needed to learn from multiple cohorts (brands) whose conversion rate (CVR) distributions differ in both mean and variance. The goal was not a brand-specific point estimate, but a stable 0–100 CVR score. Since this is a fairly common modeling problem, I thought I’d write a post about it.
Two problems appeared immediately: (i) extreme CVRs for low-exposure rows (think 0s or 1s) and (ii) brand base-rate shifts (a furniture brand is not a mobile game). The fix is a short, repeatable target-engineering recipe: smooth the raw rates with Empirical Bayes, move to the logit scale, center by brand, and reweight for exposure.
The model then predicts on this centered target; at inference we add back the global mean, transform back to probabilities, and map to a percentile for the 0–100 score.
So, in conclusion: we smooth first to avoid boundary issues, transform to the logit scale for additivity and variance control, then center by brand and handle brand imbalance; at prediction time we map back with the sigmoid.
Suppose two influencers have different exposures: one records 1 sale out of 10 views, another 30 out of 1 000. Is 0.10 really better than 0.03, or is the first just noise from a tiny denominator? Rather than arbitrarily filtering by a minimum \(n\), use Empirical Bayes to shrink noisy rates toward a sensible prior.
Let the observed CVR be \(\hat p = \frac{k}{n}\) with
\begin{equation} k \sim \mathrm{Binomial}(n,\theta), \qquad \theta \in (0,1). \end{equation}
\(\theta\) is the true CVR (a probability). Place a Beta prior on \(\theta\),
\begin{equation} \theta \sim \mathrm{Beta}(\alpha,\beta). \end{equation}
The Beta prior is conjugate to the Binomial, giving closed-form updates; \(\alpha\) and \(\beta\) act like pseudo-successes and pseudo-failures. This keeps all rates in \((0,1)\) and stabilizes extremes (0 or 1).
So the posterior mean is
\begin{equation} \tilde p \;=\; \mathbb{E}[\theta \mid k,n] \;=\; \frac{k+\alpha}{n+\alpha+\beta}. \end{equation}
This is shrinkage: when \(n\) is small, \(\tilde p\) is pulled toward the prior mean \(\frac{\alpha}{\alpha+\beta}\); as \(n\) grows, \(\tilde p\) approaches \(k/n\).
In practice, once \(\alpha\) and \(\beta\) are chosen (for example estimated from the pooled rates, in the empirical-Bayes spirit), the smoothing is a one-line computation.
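A minimal sketch with the two influencers from above; the prior \(\alpha=3,\ \beta=97\) (prior mean 3%) is purely illustrative, not fitted:

import numpy as np

# Illustrative prior: mean alpha / (alpha + beta) = 0.03, strength alpha + beta = 100 pseudo-trials.
alpha, beta = 3.0, 97.0

k = np.array([1, 30])      # conversions
n = np.array([10, 1000])   # views

raw = k / n                                  # [0.100, 0.030]
p_tilde = (k + alpha) / (n + alpha + beta)   # [~0.036, 0.030]

The noisy 1-out-of-10 row is pulled from 0.10 down to about 0.036, while the well-measured 30-out-of-1 000 row stays essentially at 0.03.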
We use the logit because it puts probabilities on the right geometry for learning. Probabilities live in \([0,1]\) and do not add nicely; their log-odds \(\operatorname{logit}(p)=\log\frac{p}{1-p}\) live on \(\mathbb{R}\) and add. De-meaning by brand on this scale, \(y'=\operatorname{logit}(p)-\operatorname{logit}(p_b)\), is a log-odds ratio that reads as “CVR above the brand baseline.” Using \(p-p_b\) or \(p/p_b\) instead lacks this invariance and behaves poorly near 0 or 1. Also, binomial noise is heteroskedastic (\(\mathrm{Var}(\hat p)\approx \tfrac{p(1-p)}{n}\)), so training on logits with sample weights \(w=n\) better matches the binomial likelihood and prevents tiny-\(n\) rows from steering the model.
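As a quick sanity check with toy numbers: for small rates, doubling the CVR adds roughly \(\ln 2\) on the logit scale whatever the baseline,
\begin{equation} \operatorname{logit}(0.01)-\operatorname{logit}(0.005) \approx 0.70, \qquad \operatorname{logit}(0.10)-\operatorname{logit}(0.05) \approx 0.75, \end{equation}
whereas the raw differences \(p-p_b\) are 0.005 and 0.05, an order of magnitude apart.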
Binomial data are heteroskedastic, so we handle the variance with the logit transform and exposure weights. First, note that Empirical-Bayes smoothing keeps the smoothed rate strictly away from 0 and 1:
\begin{equation} 0 \;<\; \frac{\alpha}{\,n+\alpha+\beta\,} \;\le\; \tilde p \;\le\; 1 - \frac{\beta}{\,n+\alpha+\beta\,} \;<\; 1. \end{equation}
Here \(n\) is the number of trials for the row (views). Zero-conversion rows are handled via \(\tilde p\); when they come from low exposure, they also carry little weight, since \(w=n\).
Now, work on the logit scale and weight by exposure:
\begin{equation} y \;=\; \operatorname{logit}(\tilde p) \;=\; \ln \left(\frac{\tilde p}{1-\tilde p}\right), \qquad w \;=\; n. \end{equation}
Train a tree regressor on \(y\) with sample weights \(w\); this approximates a binomial-likelihood fit and reduces sensitivity to low-exposure extremes. For numerical safety you may want to clip \(\tilde p \in [\varepsilon,\,1-\varepsilon]\) (e.g., \(\varepsilon=10^{-4}\)) before the logit. EB smoothing already solves most of this, but in tree regressors extreme targets can dominate early splits and inflate the loss, effectively overfitting noise. Start without clipping; if training turns out to be unstable, add it.
At prediction time, map back with the sigmoid \(\sigma(y)=\frac{1}{1+e^{-y}}\) to obtain probabilities.
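Continuing the sketch above (\(\tilde p\) and \(n\) as before; the model object, feature matrices, and \(\varepsilon\) are placeholders):

import numpy as np

eps = 1e-4                                # optional guard; often unnecessary after EB smoothing
p_safe = np.clip(p_tilde, eps, 1 - eps)

y = np.log(p_safe / (1 - p_safe))         # logit target
w = n                                     # exposure weights, passed as sample_weight

# model.fit(X, y, sample_weight=w)        # any tree regressor that accepts sample weights
# p_hat = 1.0 / (1.0 + np.exp(-model.predict(X_new)))   # sigmoid back to probabilities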
Additivity lives on logits. After smoothing, compute a weighted global mean and shrunk brand means:
\begin{equation} m_0 \;=\; \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad \bar y_b^* \;=\; \frac{\sum_{i \in b} w_i y_i \;+\; \kappa\, m_0}{\sum_{i \in b} w_i \;+\; \kappa}, \end{equation}
with a small shrinkage \(\kappa\) (e.g. \(\kappa=200\)) so tiny brands do not produce noisy means. Center the target:
\begin{equation} y_i' \;=\; y_i \;-\; \bar y_b^*. \end{equation}
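A sketch of the centering step with pandas, reusing the \(y\) and \(w\) arrays from above and assuming a per-row brand label (column names are hypothetical):

import pandas as pd

kappa = 200.0                                            # shrinkage pseudo-weight for brand means

df = pd.DataFrame({"brand": brand, "y": y, "w": w})
df["wy"] = df["w"] * df["y"]

m0 = df["wy"].sum() / df["w"].sum()                      # weighted global mean

g = df.groupby("brand")[["wy", "w"]].sum()               # per-brand weighted sums
brand_mean = (g["wy"] + kappa * m0) / (g["w"] + kappa)   # shrunk brand means
df["y_centered"] = df["y"] - df["brand"].map(brand_mean)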
Train the model on \(y'\) with weights \(w\). At inference, add back the global mean to return to a universal scale and map to a percentile:
\begin{equation} \hat y \;=\; \hat y' + m_0, \qquad \hat p \;=\; \sigma(\hat y), \qquad \text{Score} \;=\; \big\lfloor 100 \cdot \widehat{F}(\hat p)\big\rfloor, \end{equation}
where \(\widehat{F}\) is a fixed reference CDF (e.g. stored quantiles) used to convert probabilities to percentiles.
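At inference, the mapping back to a 0–100 score could look like the sketch below; the stored quantiles stand in for \(\widehat{F}\) and are assumed to be computed once on a reference set and frozen:

import numpy as np

ref_quantiles = np.quantile(p_tilde, np.linspace(0.0, 1.0, 1001))   # fixed reference CDF

def to_score(y_centered_pred):
    y_hat = y_centered_pred + m0                      # add back the global mean
    p_hat = 1.0 / (1.0 + np.exp(-y_hat))              # sigmoid back to probability
    pct = np.searchsorted(ref_quantiles, p_hat) / (len(ref_quantiles) - 1)
    return np.floor(100 * np.clip(pct, 0.0, 1.0)).astype(int)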
Brands contribute different total exposure \(S_b=\sum_{i\in b} n_i\). To avoid letting the largest brand dominate, apply a brand reweighting factor \(c_b\) in addition to \(w=n\):
\begin{equation} c_b \;=\; \sqrt{\frac{\mathrm{median}_b\, S_b}{S_b}} \quad \text{(cap, e.g. } c_b \le 5\text{)}, \qquad w_i^* \;=\; c_b \cdot n_i. \end{equation}
Use \(w_i^*\) as the training sample weights, and keep leave-one-brand-out (LOBO) splits for evaluation.
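Continuing the pandas sketch, the reweighting might look like this (the cap of 5 follows the formula above; df is the frame from the centering step):

import numpy as np

S = df.groupby("brand")["w"].sum()              # total exposure per brand (w = n per row)
c = np.sqrt(S.median() / S).clip(upper=5.0)     # brand factor c_b, capped at 5
df["w_star"] = df["brand"].map(c) * df["w"]     # final training sample weights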
For aggregated success-rate data like CVR, a robust recipe with tree models is: smooth with Empirical Bayes, move to the logit scale, center by brand, and weight by exposure.
The weights we have discussed are per-row sample weights passed to the trainer. They are not hyperparameters of the model itself, but multipliers on each training example’s contribution to the loss function, and hence to the gradient/Hessian calculations used to grow the trees. Think of \(w_i = n_i\) as “replicating that row \(n_i\) times”. E.g.,
from xgboost import XGBRegressor

model = XGBRegressor(
    objective="reg:squarederror",
    n_estimators=500, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    learning_rate=0.05, reg_lambda=1.0, tree_method="hist",
)
model.fit(X_train, y_train, sample_weight=w_train)
When choosing sample weights, it helps to be clear about the question you want the model to answer. Setting \(w_i = n_i\) (exposures) makes the training loss proportional to binomial variance; this is correct if you care about per-impression CVR, but it also means high-exposure influencers dominate the fit. Setting \(w_i = 1\) instead gives equal weight per influencer, so small accounts matter just as much as large ones, though at the cost of chasing noise. A practical compromise is \(w_i = \sqrt{n_i}\), which still down-weights tiny denominators but prevents large accounts from completely swamping the training. In short: use \(w=n\) for per-impression modeling, \(w=1\) for per-influencer fairness, and \(w=\sqrt{n}\) if you want a balance.
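Concretely, switching between these regimes is just a different w_train (n_train is the per-row exposure vector; the names are assumed from the snippet above):

import numpy as np

w_train = n_train                                 # per-impression view: matches binomial variance
# w_train = np.ones_like(n_train, dtype=float)    # per-influencer fairness
# w_train = np.sqrt(n_train)                      # compromise between the two

model.fit(X_train, y_train, sample_weight=w_train)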