TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders
Abstract
TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard rankings.
Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench
Community
Releasing TRL-Bench: a unified framework and library, a one-stop shop for tabular representation learning.
20 encoders · 16 tasks · 87 datasets across 3 suites
Built to make heterogeneous tabular models directly comparable, and reusable as embedding models.
Tabular encoders come in every shape: different input formats, training objectives, and output heads. So even two models built for the same job are hard to compare head-to-head.
We built TRL-Bench to make them comparable.
It unifies everything at the level of the representation. Each model is wrapped behind one shared interface that exports row, column, and table embeddings, and shared lightweight heads probe those embeddings under common task definitions, so 20 encoders from different paradigms finally sit on the same axes.
It's also a library: 20 different types of tabular models are adapted into embedding models that export row, column, and table embeddings for the community to reuse.
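To make the "one shared interface plus lightweight heads" idea concrete, here is a minimal sketch in Python. It is not TRL-Bench's actual API: the `TabularEncoder` protocol, the `HashedTextEncoder` toy model, the `Table` type, and the ridge probe are all hypothetical names chosen for illustration. The point is only that any encoder exposing the same three export methods can be scored by the same frozen-embedding probe.

```python
import zlib
from typing import Iterable, Protocol

import numpy as np

# Hypothetical minimal table type: column name -> list of cell values.
Table = dict[str, list]


class TabularEncoder(Protocol):
    """Assumed shared interface (illustrative, not the library's real one)."""

    def embed_rows(self, table: Table) -> np.ndarray: ...
    def embed_columns(self, table: Table) -> np.ndarray: ...
    def embed_table(self, table: Table) -> np.ndarray: ...


class HashedTextEncoder:
    """Toy stand-in encoder: hashes surface tokens into a unit-norm count vector."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def _embed_tokens(self, tokens: Iterable[object]) -> np.ndarray:
        vec = np.zeros(self.dim)
        for tok in tokens:
            # crc32 is deterministic across processes, unlike built-in hash().
            vec[zlib.crc32(str(tok).encode("utf-8")) % self.dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def embed_rows(self, table: Table) -> np.ndarray:
        names = list(table)
        n_rows = len(table[names[0]])
        return np.stack(
            [self._embed_tokens(f"{c}={table[c][i]}" for c in names) for i in range(n_rows)]
        )

    def embed_columns(self, table: Table) -> np.ndarray:
        return np.stack([self._embed_tokens([c, *table[c]]) for c in table])

    def embed_table(self, table: Table) -> np.ndarray:
        # Mean of column embeddings as a simple table-level summary.
        return self.embed_columns(table).mean(axis=0)


def fit_ridge_probe(X: np.ndarray, y: np.ndarray, n_classes: int, l2: float = 1e-2) -> np.ndarray:
    """Closed-form ridge regression onto one-hot labels (a lightweight linear head)."""
    Y = np.eye(n_classes)[y]
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)


def probe_row_accuracy(encoder: TabularEncoder, train: Table, y_train, test: Table, y_test, n_classes: int) -> float:
    """Score any encoder with the same head: the encoder stays frozen, only the probe is fit."""
    W = fit_ridge_probe(encoder.embed_rows(train), np.asarray(y_train), n_classes)
    preds = (encoder.embed_rows(test) @ W).argmax(axis=1)
    return float((preds == np.asarray(y_test)).mean())
```

Because the probe only sees exported embeddings, swapping in a different encoder changes nothing else in the evaluation, which is what puts models from different paradigms on the same axes.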
It spans three suites:
- TRL-CTbench: 13 column/table tasks covering schema, joinability, unionability, and grounding
- TRL-Rbench: multi-target row prediction (50 subtasks, 123 targets) plus record linkage (16 datasets)
- TRL-DLTE: a 47,772-table data-lake enrichment pipeline spanning all three granularities
The main takeaway is clear: there is no single best tabular encoder. Strengths are split across different table tasks, so the choice of tabular model should be task-aware.
We also find that:
- Off-the-shelf text encoders are surprisingly strong when the signal is in the surface text (column names and cell values); cross-table alignment and matching instead reward structure-aware specialists.
- Predicting a value inside a table and matching the same record across tables call for different encoders: one rewards adapting to a single table, the other rewards embeddings that stay comparable across tables.
- Stacking the best per-stage encoders does not give the best compositional pipeline, and neither does reusing one encoder end-to-end; the winning recipe matches a different specialist to each step (find related tables → align columns → match rows).
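The last finding, that picking the best encoder for each stage independently can miss the best overall pipeline, can be illustrated with a small sketch. Everything below is hypothetical: the encoder names `enc_a` and `enc_b`, the per-stage scores, and the interaction bonus in `toy_end_to_end` are made-up toy values, not results from the paper. The sketch only shows the mechanics of comparing a greedy per-stage pick against an exhaustive search over encoder triples.

```python
from itertools import product

STAGES = ("retrieve_tables", "align_columns", "match_rows")

# Made-up per-stage (marginal) scores for two hypothetical encoders.
MARGINAL = {
    "retrieve_tables": {"enc_a": 0.90, "enc_b": 0.80},
    "align_columns": {"enc_a": 0.50, "enc_b": 0.70},
    "match_rows": {"enc_a": 0.85, "enc_b": 0.80},
}


def toy_end_to_end(combo: tuple[str, ...]) -> float:
    """Toy end-to-end score: mean stage score plus a made-up interaction term.

    The bonus stands in for a non-additive effect, here assumed to occur
    when retrieval and column alignment use the same encoder.
    """
    base = sum(MARGINAL[stage][enc] for stage, enc in zip(STAGES, combo)) / len(STAGES)
    bonus = 0.10 if combo[0] == combo[1] else 0.0
    return base + bonus


def best_per_stage(marginal: dict[str, dict[str, float]]) -> tuple[str, ...]:
    """Greedy pick: the top-ranked encoder for each stage, chosen independently."""
    return tuple(max(marginal[stage], key=marginal[stage].get) for stage in STAGES)


def best_composition(encoders: list[str], score) -> tuple[str, ...]:
    """Exhaustive pick: score every encoder triple end-to-end and keep the best."""
    return max(product(encoders, repeat=len(STAGES)), key=score)


greedy = best_per_stage(MARGINAL)
searched = best_composition(["enc_a", "enc_b"], toy_end_to_end)
print("greedy:  ", greedy, round(toy_end_to_end(greedy), 4))
print("searched:", searched, round(toy_end_to_end(searched), 4))
```

In this toy setup the greedy pick and the searched pick differ, and the searched pipeline is also not a single encoder reused at every stage. With real measurements, the scoring function would be an actual end-to-end evaluation rather than a formula.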
TRL-Bench is meant to serve both as a diagnostic benchmark and as a practical library for building on tabular representations.
Paper: https://arxiv.org/abs/2606.09323
Website: https://logo-cuhksz.github.io/trl-bench.github.io/
Datasets: https://huggingface.co/datasets/logo-lab/trl-ctbench · trl-rbench · trl-dlte
Code: https://github.com/LOGO-CUHKSZ/TRL-Bench
Good work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks (2026)
- Towards Pretraining Text Encoders for TabPFN (2026)
- VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning (2026)
- TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding (2026)
- STRABLE: Benchmarking Tabular Machine Learning with Strings (2026)
- MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image (2026)
- TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention (2026)
