Papers
arxiv:2606.09323

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Published on Jun 8 · Submitted by Wei Pang on Jun 11

Abstract

TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard rankings.

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

Community

Paper submitter

📊 Releasing TRL-Bench: a unified framework and library that serves as a one-stop shop for tabular representation learning.
🧩 20 encoders · 16 tasks · 87 datasets across 3 suites
🔍 Built to make heterogeneous tabular models directly comparable and reusable as embedding models

Tabular encoders come in every shape: different input formats, training objectives, and output heads. So even two models built for the same job are hard to compare head-to-head.
We built TRL-Bench to make them comparable.

It unifies everything at the level of the representation. Each model is wrapped behind one shared interface that exports row-, column-, and table-level embeddings, and shared lightweight heads probe those embeddings under common task definitions, so 20 encoders from every paradigm finally sit on the same axes.
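To make the "one shared interface plus shared lightweight heads" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `TabularEncoder`, `HashingEncoder`, and `nearest_centroid_probe` are hypothetical names, not the TRL-Bench API, the hashed bag-of-tokens encoder stands in for a real model, and a nearest-centroid classifier stands in for whatever probe heads the benchmark actually uses.

```python
import hashlib
import math
from typing import Protocol, Sequence

Table = dict[str, list[str]]  # toy table type: column name -> cell values


class TabularEncoder(Protocol):
    """Hypothetical shared interface; not the actual TRL-Bench API."""

    def embed_rows(self, table: Table) -> list[list[float]]: ...
    def embed_columns(self, table: Table) -> list[list[float]]: ...
    def embed_table(self, table: Table) -> list[float]: ...


def _hash_embed(tokens: Sequence[str], dim: int = 16) -> list[float]:
    """Toy stand-in for a real model: L2-normalised hashed bag of tokens."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.lower().encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class HashingEncoder:
    """Toy encoder exposing all three granularities behind the interface."""

    def embed_rows(self, table: Table) -> list[list[float]]:
        n_rows = len(next(iter(table.values()), []))
        return [_hash_embed([table[c][i] for c in table]) for i in range(n_rows)]

    def embed_columns(self, table: Table) -> list[list[float]]:
        return [_hash_embed([name, *cells]) for name, cells in table.items()]

    def embed_table(self, table: Table) -> list[float]:
        tokens = [tok for name, cells in table.items() for tok in (name, *cells)]
        return _hash_embed(tokens)


def nearest_centroid_probe(train_x, train_y, test_x):
    """Shared lightweight head: predict the label of the closest class centroid."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda y: sq_dist(x, centroids[y])) for x in test_x]


if __name__ == "__main__":
    table = {"city": ["Paris", "Lyon"], "country": ["France", "France"]}
    encoder: TabularEncoder = HashingEncoder()
    print(len(encoder.embed_rows(table)), "row embeddings")
    print(len(encoder.embed_columns(table)), "column embeddings")
    print(len(encoder.embed_table(table)), "table embedding dimensions")
```

The point of the design is that the probe only ever sees exported embeddings, so swapping in a different encoder changes nothing downstream and any score difference can be attributed to the representation.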

It's also a library: 20 different types of tabular models are adapted into embedding models that export row, column, and table embeddings for the community to reuse.
It spans three suites:
🧩 TRL-CTbench: 13 column/table tasks (schema, joinability, unionability, grounding)
🔗 TRL-Rbench: multi-target row prediction (50 subtasks, 123 targets) + record linkage (16 datasets)
🌊 TRL-DLTE: a 47,772-table data-lake enrichment pipeline spanning all three granularities

The main takeaway is clear: there is no single best tabular encoder, because strengths are split across different table tasks. The choice of tabular model should therefore be task-aware.

We also find that:

📌 Off-the-shelf text encoders are surprisingly strong when the signal is in the surface text (column names and cell values); cross-table alignment and matching instead reward structure-aware specialists

📌 Predicting a value inside a table and matching the same record across tables call for different encoders: one rewards adapting to a single table, the other rewards embeddings that stay comparable across tables

📌 Stacking the best per-stage encoders does not give the best compositional pipeline, and neither does reusing one encoder end-to-end; the winning recipe matches a different specialist to each step (find related tables → align columns → match rows)
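The last finding, that per-stage rank alone does not determine end-to-end quality, can be illustrated with a toy search over three-stage pipelines. This is only a sketch under loud assumptions: the encoder names, the per-stage scores, and the `HANDOFF` penalty are all synthetic values invented for illustration, not results from the paper, and the multiplicative scoring rule is an assumed stand-in for actually running a pipeline end to end.

```python
from itertools import product

STAGES = ("retrieve_tables", "align_columns", "match_rows")

# Synthetic per-stage scores for hypothetical encoders; NOT benchmark results.
STAGE_SCORE = {
    "enc_A": {"retrieve_tables": 0.90, "align_columns": 0.60, "match_rows": 0.55},
    "enc_B": {"retrieve_tables": 0.70, "align_columns": 0.85, "match_rows": 0.60},
    "enc_C": {"retrieve_tables": 0.65, "align_columns": 0.70, "match_rows": 0.88},
    "enc_D": {"retrieve_tables": 0.60, "align_columns": 0.65, "match_rows": 0.80},
}

# Synthetic interaction term: an assumed penalty whenever enc_B's output
# feeds enc_C in the next stage. This is what makes the score non-additive.
HANDOFF = {("enc_B", "enc_C"): 0.7}


def end_to_end(pipeline):
    """Toy end-to-end score: product of stage scores times handoff factors."""
    score = 1.0
    for stage, encoder in zip(STAGES, pipeline):
        score *= STAGE_SCORE[encoder][stage]
    for upstream, downstream in zip(pipeline, pipeline[1:]):
        score *= HANDOFF.get((upstream, downstream), 1.0)
    return score


# Strategy 1: stack the encoder with the best marginal score at each stage.
greedy = tuple(max(STAGE_SCORE, key=lambda e: STAGE_SCORE[e][s]) for s in STAGES)

# Strategy 2: reuse the single best encoder for every stage.
single = max(((e,) * len(STAGES) for e in STAGE_SCORE), key=end_to_end)

# Strategy 3: search all stage-wise combinations and score them end to end.
best = max(product(STAGE_SCORE, repeat=len(STAGES)), key=end_to_end)

if __name__ == "__main__":
    for name, pipeline in (("greedy", greedy), ("single", single), ("searched", best)):
        print(f"{name:9s} {pipeline} -> {end_to_end(pipeline):.3f}")
```

With these made-up numbers the searched pipeline uses three different encoders and beats both the greedy stack and the best single-encoder pipeline, which is the qualitative pattern the finding describes. In practice the handoff effect is not known in advance, so candidate pipelines have to be evaluated end to end rather than ranked from per-stage tables.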

TRL-Bench is meant to serve both as a diagnostic benchmark and as a practical library for building on tabular representations.

📄 Paper: https://arxiv.org/abs/2606.09323
🌐 Website: https://logo-cuhksz.github.io/trl-bench.github.io/
🤗 Datasets: https://huggingface.co/datasets/logo-lab/trl-ctbench · trl-rbench · trl-dlte
💻 Code: https://github.com/LOGO-CUHKSZ/TRL-Bench

Good work!
