Target Engineering and Bayesian Shrinkage

Resources

Some other good resources on the topic:

Empirical Bayes Method

Logit

Introduction

I recently ran into a practical modeling issue at work: I needed to learn from multiple cohorts (brands) whose conversion rate (CVR) distributions differ in both mean and variance. The goal was not a brand-specific point estimate, but a stable 0–100 CVR score. Since this is a fairly common modeling problem, I thought I’d write a post about it.

Two problems appeared immediately: (i) extreme CVRs for low-exposure rows (think 0s or 1s) and (ii) brand base-rate shifts (a furniture brand is not a mobile game). The fix is a short, repeatable target-engineering recipe:

  1. Empirical-Bayes smoothing of CVR to tame small-\(n\) noise.
  2. Logit transform to move to an additive scale appropriate for binomial data.
  3. Brand-level de-meaning on the logit scale (with shrinkage) to remove base-rate differences.
  4. Brand reweighting, so that large-sample brands do not dominate.

The model then predicts on this centered target; at inference we add back the global mean, transform back to probabilities, and map to a percentile for the 0–100 score.

In short: smooth first to avoid boundary issues, transform to the logit scale for additivity and variance control, then center by brand and reweight for brand imbalance; map back with the sigmoid at prediction time.

1. Empirical Bayes Method

Suppose two influencers have different exposures: one records 1 sale out of 10 views, another 30 out of 1 000. Is 0.10 really better than 0.03, or is the first just noise from a tiny denominator? Rather than arbitrarily filtering by a minimum \(n\), use Empirical Bayes to shrink noisy rates toward a sensible prior.

Let CVR be \(Y = \frac{k}{n}\) with

\begin{equation} k \sim \mathrm{Binomial}(n,\theta), \qquad \theta \in (0,1). \end{equation}

\(\theta\) is the true CVR (a probability). Place a Beta prior on \(\theta\),

\begin{equation} \theta \sim \mathrm{Beta}(\alpha,\beta). \end{equation}

The Beta prior is conjugate to the Binomial, giving closed-form updates; \(\alpha\) and \(\beta\) act like pseudo-successes and pseudo-failures. This keeps all rates in \((0,1)\) and stabilizes extremes (0 or 1).

The posterior is \(\mathrm{Beta}(k+\alpha,\; n-k+\beta)\), so the posterior mean is

\begin{equation} \tilde p \;=\; \mathbb{E}[\theta \mid k,n] \;=\; \frac{k+\alpha}{n+\alpha+\beta}. \end{equation}

This is shrinkage: when \(n\) is small, \(\tilde p\) is pulled toward the prior mean \(\frac{\alpha}{\alpha+\beta}\); as \(n\) grows, \(\tilde p\) approaches \(k/n\).
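For instance, with a hypothetical prior \(\alpha=2,\ \beta=50\) (prior mean \(\approx 0.038\)), the two influencers from the opening example become

\begin{equation} \tilde p_1 \;=\; \frac{1+2}{10+52} \approx 0.048, \qquad \tilde p_2 \;=\; \frac{30+2}{1000+52} \approx 0.030, \end{equation}

so the noisy \(0.10\) is pulled strongly toward the prior mean, while the well-measured \(0.03\) barely moves.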

In practice (see the sketch after these steps):

  1. Estimate \(\alpha,\beta\) from the training data (e.g., MLE or method of moments).
  2. Replace each raw rate \(k/n\) with the posterior mean \(\tilde p\).
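A minimal sketch of both steps, assuming hypothetical NumPy arrays k (conversions) and n (views) and a method-of-moments fit on the raw training rates (which ignores the extra binomial sampling noise in each rate):

import numpy as np

def fit_beta_mom(k, n):
    # Method-of-moments Beta prior fit on the raw rates k/n.
    # Treats the observed rates as draws from the prior; a simple starting point.
    p = k / n
    m, v = p.mean(), p.var()
    prec = m * (1 - m) / v - 1        # alpha + beta
    return m * prec, (1 - m) * prec   # alpha, beta

def eb_smooth(k, n, alpha, beta):
    # Posterior mean (k + alpha) / (n + alpha + beta).
    return (k + alpha) / (n + alpha + beta)

# Toy check with the assumed prior alpha=2, beta=50 from the worked example above.
k = np.array([1, 30])
n = np.array([10, 1000])
print(eb_smooth(k, n, 2.0, 50.0))   # ~[0.048, 0.030]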

2. Logit Transformation

We use the logit because it puts probabilities on the right geometry for learning. Probabilities live in \([0,1]\) and do not add nicely; their log-odds \(\operatorname{logit}(p)=\log\frac{p}{1-p}\) live on \(\mathbb{R}\) and add. De-meaning by brand on this scale, \(y'=\operatorname{logit}(p)-\operatorname{logit}(p_b)\), is a log-odds ratio that reads as “CVR above the brand baseline.” Doing \(p-p_b\) or \(p/p_b\) lacks this invariance and behaves poorly near 0 or 1. Also, binomial noise is heteroskedastic (\(\mathrm{Var}(\hat p)\approx \tfrac{p(1-p)}{n}\)), so training on logits with sample weights \(w=n\) better matches the binomial likelihood and prevents tiny-\(n\) rows from steering the model.

Before taking logits, note that Empirical-Bayes smoothing already keeps the rate strictly inside \((0,1)\):

\begin{equation} 0 \;<\; \frac{\alpha}{\,n+\alpha+\beta\,} \;\le\; \tilde p \;\le\; 1 - \frac{\beta}{\,n+\alpha+\beta\,} \;<\; 1. \end{equation}

Here \(n\) is the number of trials for the row (views). Zero-conversion rows are handled via \(\tilde p\); when such rows also have small \(n\), they receive little weight because \(w=n\).

Now, work on the logit scale and weight by exposure:

\begin{equation} y \;=\; \operatorname{logit}(\tilde p) \;=\; \ln \left(\frac{\tilde p}{1-\tilde p}\right), \qquad w \;=\; n. \end{equation}

Train a tree regressor on \(y\) with sample weights \(w\); this approximates a binomial-likelihood fit and reduces sensitivity to low-exposure extremes. For numerical safety you may want to clip \(\tilde p \in [\varepsilon,\,1-\varepsilon]\) (e.g., \(\varepsilon=10^{-4}\)) before the logit. EB smoothing already handles most of this, but in tree regressors extreme targets can dominate early splits and inflate the loss, effectively overfitting noise. Start without clipping, and add it if training proves unstable.

At prediction, map back with the sigmoid to obtain probabilities.
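A minimal sketch of the transform and its inverse, assuming NumPy arrays p_tilde (smoothed rates) and n (exposures); the clipping threshold is an assumption:

import numpy as np

EPS = 1e-4  # optional clipping threshold; start without clipping, add it if training is unstable

def to_logit_target(p_tilde, n, clip=False):
    # Training target y = logit(p_tilde) with exposure weights w = n.
    p = np.clip(p_tilde, EPS, 1 - EPS) if clip else p_tilde
    y = np.log(p / (1 - p))
    w = n.astype(float)
    return y, w

def to_probability(y_hat):
    # Map model outputs back to probabilities with the sigmoid.
    return 1.0 / (1.0 + np.exp(-y_hat))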

3. De-mean by Brand on the Logit Scale

Additivity lives on logits. After smoothing, compute a weighted global mean and shrunk brand means:

\begin{equation} m_0 \;=\; \frac{\sum_i w_i y_i}{\sum_i w_i}, \qquad \bar y_b^* \;=\; \frac{\sum_{i \in b} w_i y_i \;+\; \kappa\, m_0}{\sum_{i \in b} w_i \;+\; \kappa}, \end{equation}

with a shrinkage strength \(\kappa\) in the same units as the weights (e.g. \(\kappa=200\)) so tiny brands do not produce noisy means. Center the target:

\begin{equation} y_i' \;=\; y_i \;-\; \bar y_b^*. \end{equation}

Train the model on \(y'\) with weights \(w\). At inference, add back the global mean to return to a universal scale and map to a percentile:

\begin{equation} \hat y \;=\; \hat y' + m_0, \qquad \hat p \;=\; \sigma(\hat y), \qquad \text{Score} \;=\; \big\lfloor 100 \cdot \widehat{F}(\hat p)\big\rfloor, \end{equation}

where \(\widehat{F}\) is a fixed reference CDF (e.g. stored quantiles) used to convert probabilities to percentiles.
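A minimal sketch of the centering and the inference-time mapping, assuming a pandas DataFrame with hypothetical columns brand, y (logit target), and w (weights), plus a stored reference sample of probabilities standing in for \(\widehat{F}\):

import numpy as np

KAPPA = 200.0  # shrinkage strength (assumed value)

def center_by_brand(df):
    # Weighted global mean m0 and shrunk brand means on the logit scale.
    m0 = np.average(df["y"], weights=df["w"])
    sums = df.assign(wy=df["w"] * df["y"]).groupby("brand")[["wy", "w"]].sum()
    brand_mean = (sums["wy"] + KAPPA * m0) / (sums["w"] + KAPPA)
    y_centered = df["y"] - df["brand"].map(brand_mean)
    return y_centered, m0

def to_score(y_pred_centered, m0, reference_probs):
    # Add back the global mean, apply the sigmoid, then map to a 0-100 percentile
    # against a fixed reference sample of probabilities (the stored quantiles).
    p_hat = 1.0 / (1.0 + np.exp(-(np.asarray(y_pred_centered) + m0)))
    ref = np.sort(np.asarray(reference_probs))
    cdf = np.searchsorted(ref, p_hat, side="right") / len(ref)
    return np.floor(100.0 * cdf).astype(int)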

4. Handling Brand Imbalance

Brands contribute different total exposure \(S_b=\sum_{i\in b} n_i\). To avoid letting the largest brand dominate, apply a brand reweighting factor \(c_b\) in addition to \(w=n\):

\begin{equation} c_b \;=\; \sqrt{\frac{\mathrm{median}_b\, S_b}{S_b}} \quad \text{(cap, e.g. } c_b \le 5\text{)}, \qquad w_i^* \;=\; c_b \cdot n_i. \end{equation}

Use \(w^*\) as the training sample weights, and use leave-one-brand-out (LOBO) cross-validation for evaluation.
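A minimal sketch of the reweighting, assuming a pandas DataFrame with hypothetical columns brand and n (exposure per row); the cap value is an assumption:

import numpy as np

CB_CAP = 5.0  # cap on the brand reweighting factor (assumed value)

def brand_balanced_weights(df):
    # w* = c_b * n with c_b = sqrt(median_b S_b / S_b), capped at CB_CAP.
    S_b = df.groupby("brand")["n"].transform("sum")   # total brand exposure, broadcast per row
    S_med = df.groupby("brand")["n"].sum().median()   # median brand exposure
    c_b = np.minimum(np.sqrt(S_med / S_b), CB_CAP)
    return c_b * df["n"]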

Conclusion

For aggregated success-rate data like CVR, a robust recipe with tree models is: smooth raw rates with Empirical Bayes, move to the logit scale and weight by exposure, de-mean by brand with shrinkage, reweight to balance brands, and at prediction time add back the global mean, apply the sigmoid, and map to a 0–100 percentile score.

Appendix

A Note on Sample Weights

The weights we have discussed are per-row sample weights you provide to the trainer that scale the loss. These weights are not hyperparameters of the model itself, but multipliers that scale the contribution of each training example to the loss function and thus to the gradient/Hessian calculations used to grow the trees. Think of \(w_i = n_i\) as “replicating that row \(n_i\) times”. E.g.,

from xgboost import XGBRegressor

# Standard tree-model hyperparameters; the per-row weights enter only through fit().
model = XGBRegressor(
    objective="reg:squarederror",
    n_estimators=500, max_depth=6,
    subsample=0.8, colsample_bytree=0.8,
    learning_rate=0.05, reg_lambda=1.0, tree_method="hist"
)

# sample_weight scales each row's contribution to the squared-error loss
# (and hence to the gradients/Hessians used to grow the trees).
model.fit(X_train, y_train, sample_weight=w_train)

When choosing sample weights, it helps to be clear about the question you want the model to answer. Setting \(w_i = n_i\) (exposures) makes the training loss proportional to binomial variance; this is correct if you care about per-impression CVR, but it also means high-exposure influencers dominate the fit. Setting \(w_i = 1\) instead gives equal weight per influencer, so small accounts matter just as much as large ones, though at the cost of chasing noise. A practical compromise is \(w_i = \sqrt{n_i}\), which still down-weights tiny denominators but prevents large accounts from completely swamping the training. In short: use \(w=n\) for per-impression modeling, \(w=1\) for per-influencer fairness, and \(w=\sqrt{n}\) if you want a balance.
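As a quick sketch of the three options (n here is a hypothetical array of exposures per row):

import numpy as np

n = np.array([10, 250, 1000, 50000])          # hypothetical exposures per influencer row

w_impression = n.astype(float)                # w = n: per-impression CVR, matches binomial variance
w_influencer = np.ones_like(n, dtype=float)   # w = 1: equal weight per influencer
w_balanced = np.sqrt(n)                       # w = sqrt(n): a compromise between the two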