Missing Values: A Feature, Not a Bug

Introduction

In most statistics courses, data is assumed complete. You invert \(X^\top X\), derive the OLS estimator \(\hat{\beta} = (X^\top X)^{-1} X^\top y\), and everything works. When missing data appears, you learn categories like MCAR, MAR, and MNAR, and then either drop rows or impute values. The goal is usually to recover an unbiased estimator.

In production ML systems, the situation is different. Missingness appears everywhere and often carries signal. A missing feature can indicate low effort, low trust, low experience, or some other structural difference in the underlying entity. And once a model runs in production, missing values are not just a theoretical annoyance. If the model cannot handle them, it cannot predict.

Tree-based models treat missingness as its own category and learn what to do with it. This is one of the reasons they work so well on messy real-world data.

A Cross-Sectional Example with a Continuous Variable

Suppose you want to predict conversion rate for products on an e-commerce marketplace. Your features include price, historical sales, review sentiment, and an image quality score from a computer vision model.

The image quality score lies between 0 and 1, but many products lack a main photo. Some sellers upload many images, some upload only one, and some upload none.
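
To make the setup concrete, here is a minimal synthetic sketch of such a dataset. Everything in it (column names, distributions, the 30% no-photo rate) is invented for illustration; the only property that matters is that `image_score` is sometimes missing and that the missingness is informative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic marketplace data; every column name and distribution here is invented.
df = pd.DataFrame({
    "price": rng.lognormal(mean=3.0, sigma=0.8, size=n),
    "historical_sales": rng.poisson(lam=20, size=n),
    "review_sentiment": rng.uniform(-1.0, 1.0, size=n),
    "image_score": rng.beta(5, 2, size=n),  # CV quality score in [0, 1]
})

# Assume roughly 30% of products have no main photo, hence no image score.
no_photo = rng.random(n) < 0.30
df.loc[no_photo, "image_score"] = np.nan

# Make the missingness informative: products without a photo convert less often.
p_convert = (0.05 + 0.10 * df["image_score"].fillna(0.0)
             + 0.02 * (df["review_sentiment"] + 1) / 2).clip(0, 1)
df["converted"] = (rng.random(n) < p_convert).astype(int)
```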

From a classical viewpoint, “image_score is missing” is a problem. In practice, it is useful: the missingness itself carries predictive signal.

How different modelling choices behave:

  1. Drop rows with missing image score. You bias the dataset toward high-effort sellers and lose useful structure.

  2. Drop the feature entirely. You throw away predictive information.

  3. Impute with a mean or median. You flatten the signal and pretend that “no image” equals “average image”.

  4. Use a tree-based model. XGBoost, LightGBM, and CatBoost learn how to route missingness directly; a comparison sketch follows below.

A common alternative is converting the image feature to a discrete variable, for example 1 to 5, and encoding missing as 0. This keeps missingness as its own category but loses some granularity. Tree models can learn the same structure without manual encoding.
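
Continuing the synthetic data from the sketch above, here is roughly how options 3 and 4 could be compared side by side, assuming `xgboost` and `scikit-learn` are installed. Hyperparameters are arbitrary; the point is the workflow, not the specific scores.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import xgboost as xgb

features = ["price", "historical_sales", "review_sentiment", "image_score"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["converted"], test_size=0.2, random_state=0
)

# Option 3: mean imputation pretends "no image" equals "average image".
mean_score = X_train["image_score"].mean()
X_train_imp = X_train.assign(image_score=X_train["image_score"].fillna(mean_score))
X_test_imp = X_test.assign(image_score=X_test["image_score"].fillna(mean_score))

# Option 4: pass the NaNs through and let XGBoost learn a default direction per split.
params = dict(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
model_imputed = xgb.XGBClassifier(**params).fit(X_train_imp, y_train)
model_native = xgb.XGBClassifier(**params).fit(X_train, y_train)

print("AUC, mean imputation:", roc_auc_score(y_test, model_imputed.predict_proba(X_test_imp)[:, 1]))
print("AUC, native missing :", roc_auc_score(y_test, model_native.predict_proba(X_test)[:, 1]))
```

The discrete 1-to-5 encoding described above can be tested the same way: bin `image_score` into integer levels and map NaN to 0 before fitting.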

Tree-Based Models and Missing Values

Traditional models like SVMs and neural networks cannot handle missing values directly. CART-style trees can: classic implementations rely on surrogate splits, while modern gradient-boosted trees learn explicit routing rules for missingness.

XGBoost is a widely used example. Each node holds a default direction for missing values. If the training data contains missing values for the split feature, XGBoost learns that direction by evaluating the gain from sending the missing rows left and from sending them right, and keeping whichever option improves the loss more.
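
One way to see those learned defaults, continuing the `model_native` fit from the sketch above, is to dump the trees: `trees_to_dataframe()` on the underlying booster includes a `Missing` column showing which child the missing values are routed to at each split.

```python
# Each split row lists its Yes/No children plus the child used for missing values.
tree_df = model_native.get_booster().trees_to_dataframe()
image_splits = tree_df[tree_df["Feature"] == "image_score"]
print(image_splits[["Tree", "Node", "Split", "Yes", "No", "Missing"]].head())
```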

If a feature has no missing values in the training data, the model has no basis for learning what missing means. Any missing value that appears in production will be routed along a fixed default branch. The model still predicts, but the behavior is not grounded in data. If you expect missingness in production, the training distribution should reflect that.
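
A quick way to see this, still using the synthetic example: train on rows where `image_score` is always present, then score a row with it set to NaN. The prediction goes through, but the routing comes from the built-in default branch rather than anything learned from data.

```python
# Train only on rows where image_score is present.
complete = df[df["image_score"].notna()]
model_complete = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(
    complete[features], complete["converted"]
)

# At inference, a product without a photo still gets a score, but the NaN
# falls back to each node's default branch, which no data ever validated.
one_row = complete[features].iloc[[0]].copy()
one_row["image_score"] = np.nan
print(model_complete.predict_proba(one_row)[:, 1])
```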

More details appear in section 3.4 (“Sparsity-aware Split Finding”) of the XGBoost paper by Chen and Guestrin. In practical terms, XGBoost checks both possible split directions for missing values during training and keeps the option with the lowest loss.
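
As a rough illustration of the idea (a simplified single-feature sketch, not the library's actual implementation): for every candidate threshold, score the split twice, once with the missing rows sent left and once with them sent right, and keep the best combination. The gain below uses the standard gradient/Hessian form but omits the 0.5 factor and the complexity penalty.

```python
import numpy as np

def best_split_with_default(x, grad, hess, lam=1.0):
    """Simplified sketch of sparsity-aware split finding for one feature."""
    def score(g, h):
        return g * g / (h + lam)

    missing = np.isnan(x)
    g_miss, h_miss = grad[missing].sum(), hess[missing].sum()
    g_tot, h_tot = grad.sum(), hess.sum()

    order = np.argsort(x[~missing])
    xs, gs, hs = x[~missing][order], grad[~missing][order], hess[~missing][order]

    best = (-np.inf, None, None)  # (gain, threshold, default direction)
    g_left = h_left = 0.0
    for i in range(len(xs) - 1):
        g_left += gs[i]
        h_left += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # no valid threshold between equal values
        thr = 0.5 * (xs[i] + xs[i + 1])
        parent = score(g_tot, h_tot)
        # Default direction = right: missing rows join the right child.
        gain_right = score(g_left, h_left) + score(g_tot - g_left, h_tot - h_left) - parent
        # Default direction = left: missing rows join the left child.
        gain_left = (score(g_left + g_miss, h_left + h_miss)
                     + score(g_tot - g_left - g_miss, h_tot - h_left - h_miss)
                     - parent)
        for gain, side in ((gain_right, "right"), (gain_left, "left")):
            if gain > best[0]:
                best = (gain, thr, side)
    return best
```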

A Practical Note

Missing values are often undesirable, but in many datasets they encode useful information. Tree-based models let you use that signal without heavy preprocessing or artificial imputation. The main requirement is that the pattern of missingness during training matches what the model will see in production. If your live system will encounter missing values for a feature that was always present in training, it is better to retrain with representative missingness than to rely on arbitrary default routing.