
Better TAMA Models with Limited Data

In [1], we show that competitive performance on table tasks can be achieved with a limited amount of instruction tuning data. This compact setup enables quick instruction tuning on top of advanced base models.

We present TAMA models built on Qwen2.5 and Qwen3. These models achieve strong results on the MMTU benchmark [2], outperforming recent table reasoning models [3] and competitive table LLMs such as TableGPT2 [4], which is tuned on 2.36M data points.

Notably, TAMA-QWen3 achieves the best overall score of 33.9, surpassing Qwen3-8B (32.9) and TableGPT2-7B (30.0).

| Model | Paper Source | Training Corpora Size | Base Model | Model Size | Overall | Table Understanding and QA | Table Transformation and Manipulation | Entity and Schema Matching | SQL and Table Navigation | Semantic Analysis and Relationships | Cell and Column Annotation | Error Detection | Formula Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TableLlama | TableLlama: Towards Open Large Generalist Models for Tables | 2M | Yukang/Llama-2-7b-longlora-8k | 7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TableLLM | TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios | 309K | codellama/CodeLlama-7b-Instruct-hf | 7B | 2.5 | 2.9 | 0.1 | 16.6 | 2.3 | 1.4 | 2.6 | 0.0 | 0.0 |
| TableBenchLLM-Llama-3.1-8B | TableBench: A Comprehensive and Complex Benchmark for Table Question Answering | 20K | meta-llama/Llama-3.1-8B | 8B | 3.4 | 0.1 | 3.1 | 23.3 | 0.8 | 2.1 | 0.0 | 0.0 | 0.0 |
| Llama-3.1-8B-Instruct | The Llama 3 Herd of Models | - | meta-llama/Llama-3.1-8B-Instruct | 8B | 25.3 | 38.1 | 17.1 | 67.8 | 26.2 | 28.0 | 24.9 | 4.5 | 0.1 |
| TAMA-vB | Rethinking Table Instruction Tuning | 2.6K | meta-llama/Llama-3.1-8B-Instruct | 8B | 21.1 | 30.9 | 15.2 | 73.7 | 14.3 | 20.9 | 18.7 | 5.9 | 0.1 |
| TAMA-vA | Rethinking Table Instruction Tuning | 2.6K | meta-llama/Llama-3.1-8B-Instruct | 8B | 16.9 | 22.6 | 13.5 | 41.8 | 12.7 | 21.1 | 17.1 | 6.3 | 0.1 |
| Qwen2.5-7B | Qwen2.5 Technical Report | - | Qwen/Qwen2.5-7B-Instruct | 7B | 28.5 | 38.5 | 18.6 | 87.3 | 30.5 | 32.3 | 23.2 | 7.9 | 0.3 |
| TableGPT2-7B | TableGPT2: A Large Multimodal Model with Tabular Data Integration | 2.36M | Qwen/Qwen2.5-7B-Instruct | 7B | 30.0 | 42.6 | 22.9 | 84.9 | 31.1 | 32.4 | 24.7 | 14.8 | 0.3 |
| Table-R1-Zero-8B | Table-R1: Inference-Time Scaling for Table Reasoning | 48.6K | meta-llama/Llama-3.1-8B-Instruct | 8B | 26.6 | 39.4 | 19.6 | 66.0 | 26.2 | 31.0 | 25.9 | 3.8 | 0.2 |
| TAMA-QWen2.5 | Rethinking Table Instruction Tuning | 2.6K | Qwen/Qwen2.5-7B-Instruct | 7B | 27.6 | 36.5 | 17.1 | 86.8 | 29.9 | 30.8 | 23.6 | 8.4 | 0.2 |
| Qwen3-8B | Qwen3 Technical Report | - | Qwen/Qwen3-8B | 8B | 32.9 | 37.4 | 25.7 | 83.2 | 28.9 | 38.9 | 27.6 | 14.2 | 0.6 |
| TAMA-QWen3 | Rethinking Table Instruction Tuning | 2.6K | Qwen/Qwen3-8B | 8B | **33.9** | 38.2 | 24.8 | 83.0 | 29.3 | 38.3 | 26.6 | 13.9 | 0.6 |

Evaluation Details

We adopt the official MMTU evaluation script to compute scores. The overall score is computed with the evaluation function described here; each category score is the arithmetic mean of the per-dataset scores within that category. For Qwen3-8B and TAMA-QWen3, thinking mode was disabled.
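The category aggregation above can be sketched in a few lines. This is a minimal illustration of the arithmetic-mean step only; the per-dataset numbers and category names below are hypothetical placeholders, not actual MMTU outputs, and the real per-dataset scores come from the official MMTU evaluation script.

```python
# Sketch of category-score aggregation: each category score is the
# arithmetic mean of the per-dataset scores in that category.
# All numbers here are illustrative placeholders.
from statistics import mean

# Hypothetical per-dataset scores, grouped by MMTU category.
per_dataset_scores = {
    "Table Understanding and QA": [0.40, 0.35, 0.39],
    "Entity and Schema Matching": [0.85, 0.81],
    "Error Detection": [0.14, 0.13],
}

# Arithmetic mean across the datasets of each category.
category_scores = {
    category: mean(scores)
    for category, scores in per_dataset_scores.items()
}

for category, score in category_scores.items():
    # Report on a 0-100 scale, as in the leaderboard table.
    print(f"{category}: {score * 100:.1f}")
```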


References

[1] Rethinking Table Instruction Tuning
[2] MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
[3] Table-R1: Inference-Time Scaling for Table Reasoning
[4] TableGPT2: A Large Multimodal Model with Tabular Data Integration

Model: MichiganNLP/TAMA-QWen3 · 8.19B params · BF16