Embeddings for numerical features in tabular deep learning
Source: Mastodon
Transformer‑style models are now being equipped with dedicated embeddings for numeric columns, a shift that promises to close the long‑standing performance gap between deep learning and classic tree‑based methods on tabular data. A paper released this week by Yandex Research, titled “On Embeddings for Numerical Features in Tabular Deep Learning,” demonstrates that converting scalar values into high‑dimensional vectors before feeding them to the model’s backbone yields consistent gains across click‑through‑rate (CTR) prediction, fraud detection and credit‑scoring benchmarks.
The approach departs from the traditional multilayer‑perceptron (MLP) pipeline, in which raw numbers are simply concatenated with categorical embeddings. Instead, each numeric feature is passed through a small neural “embedding net” that learns a smooth mapping from the raw value to a dense vector. These vectors are then processed by a Transformer or a Deep & Cross architecture, allowing the model to capture non‑linear interactions and positional relationships that are hard to learn from raw scalars. The authors report up to 4 % relative improvement in AUC over state‑of‑the‑art MLP baselines and results comparable to gradient‑boosted trees, while retaining the scalability and end‑to‑end training advantages of deep nets.
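To make the idea concrete, here is a minimal NumPy sketch of one such per‑feature embedding scheme, in the spirit of the periodic embeddings studied in the paper: each scalar x is expanded into cos/sin features with per‑feature frequencies. The function name, shapes, and random initialization are illustrative assumptions; in a real model the frequencies would be trained end to end and the resulting vectors fed to the Transformer backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def periodic_embedding(x, frequencies):
    """Map each scalar feature to a dense vector.

    x:           (batch, n_features) raw numeric columns
    frequencies: (n_features, k) per-feature frequencies (learned in practice)
    returns:     (batch, n_features, 2 * k) one embedding vector per feature
    """
    # Broadcast to (batch, n_features, k): one angle per feature per frequency.
    angles = 2 * np.pi * frequencies[None, :, :] * x[:, :, None]
    # Concatenate cos and sin parts along the last axis.
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

n_features, k = 3, 4
frequencies = rng.normal(size=(n_features, k))  # would be learnable parameters
x = rng.normal(size=(8, n_features))            # a batch of 8 rows
tokens = periodic_embedding(x, frequencies)
print(tokens.shape)  # (8, 3, 8)
```

Each row of the table thus becomes a sequence of per‑feature “tokens,” which is exactly the input shape a Transformer backbone expects.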
The result matters for two reasons. First, it lowers the barrier for enterprises that have already invested in deep‑learning pipelines but have been reluctant to switch to tree ensembles for tabular workloads. Second, the technique dovetails with recent trends in large‑scale pre‑training, where embeddings serve as the lingua franca for heterogeneous data, opening the door to unified models that can ingest text, images, and structured fields simultaneously.
Looking ahead, the research community will likely explore standardised libraries for numeric embeddings—Yandex has already open‑sourced a PyTorch package, rtdl‑num‑embeddings, and early adopters are integrating it into AutoML platforms. Watch for follow‑up studies that benchmark these embeddings against emerging tabular Transformers such as TabNet‑v2 and DeepFM, and for cloud providers to roll out managed services that expose the technique to non‑technical data scientists.