Improve model card: Add metadata tags, correct license, and expand content

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +87 -11
README.md CHANGED
@@ -1,12 +1,24 @@
  ---
- license: mit
  base_model:
  - Qwen/Qwen2.5-14B-Instruct
  ---

-
  <h1 align="center"> ✨ DataMind </h1>

  ## 🔧 Installation

@@ -46,28 +58,88 @@ conda activate DataMind
  pip install -r requirements.txt
  ```

  ## 🧐 Evaluation

  > Note:
  >
- > - **Ensure** that your working directory is set to the **`eval`** folder in a virtual environment.
- > - If you have more questions, feel free to open an issue with us.
- > - If you need to use local model, you need to deploy it according to **(Optional)`local_model.sh`**.

- **Step 1: Prepare the parameter configuration**

- The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.

- You can also download our sft models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B) ,[DataMind-Qwen2.5-14B ](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

  Here is the example:
  **`config.yaml`**

  ```yaml
  api_key: your_api_key # your API key for the model with API service. No need for open-source models.
- data_root: /path/to/your/project/DataMind/eval/data # Root directory for data. (absolute path)
  ```

  **`run_eval.sh`**
@@ -97,7 +169,7 @@ CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
  --port $port # API port number, which is consistent with the `api_port` above.
  ```

- **Step 2: Run the shell script**

  **(Optional)** Deploy the local model if you need.

@@ -111,15 +183,19 @@ Run the shell script to start the process.
  bash run_eval.sh
  ```

  ## ✍️ Citation

  If you find our work helpful, please use the following citations.

- ```
  @article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},
 
  ---
  base_model:
  - Qwen/Qwen2.5-14B-Instruct
+ license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

  <h1 align="center"> ✨ DataMind </h1>

+ This repository contains the **DataMind** model, a fine-tuned Qwen2.5-14B-Instruct model presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).
+
+ Code: [https://github.com/zjunlp/DataMind](https://github.com/zjunlp/DataMind)
+
+ ## Abstract
+
+ Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.
+
+ ## 🔔 News
+
+ - **[2025-06]** We release a new paper: "[Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://arxiv.org/pdf/2506.19794)".
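+
+ ## 🚀 Quick Start
+
+ You can try the model directly with `transformers`. Below is a minimal, illustrative sketch (the checkpoint id `zjunlp/DataMind-Qwen2.5-14B` is taken from the links later in this card; adjust the prompt and generation settings to your needs):
+
+ ```python
+ # Hedged usage sketch: standard transformers chat inference.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "zjunlp/DataMind-Qwen2.5-14B"  # assumed checkpoint; see the model links below
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+
+ messages = [{"role": "user", "content": "Which feature in iris.csv best separates the three species?"}]
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
+ outputs = model.generate(inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
+ ```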

  ## 🔧 Installation

  pip install -r requirements.txt
  ```

+ ## 💻 Training
+
+ Our models were trained with the **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** framework, which provides an efficient fine-tuning workflow.
+
+ ##### 1. Training Data
+
+ Our training dataset is available at `train/datamind-da-dataset.json`.
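+
+ A quick way to sanity-check the data before training (a minimal sketch, assuming the file is a standard JSON array of SFT records, as LLaMA-Factory expects):
+
+ ```python
+ # Hedged sketch: inspect the SFT dataset shipped in this repo.
+ import json
+
+ with open("train/datamind-da-dataset.json") as f:
+     records = json.load(f)  # assumed: a JSON list of instruction-tuning records
+
+ print(len(records), "training examples")
+ print(json.dumps(records[0], ensure_ascii=False, indent=2)[:500])  # peek at the first record
+ ```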
+
+ ##### 2. Training Configuration
+
+ The following is an example configuration for full-parameter fine-tuning using DeepSpeed ZeRO-3. You can save it as a YAML file (e.g., `datamind_sft.yaml`).
+
+ ```yaml
+ ### model
+ model_name_or_path: Qwen/Qwen2.5-7B-Instruct # Or Qwen/Qwen2.5-14B-Instruct
+
+ ### method
+ stage: sft
+ do_train: true
+ finetuning_type: full
+ deepspeed: examples/deepspeed/ds_z3_config.json
+ flash_attn: fa2
+
+ ### dataset
+ dataset: datamind-da-dataset
+ template: qwen
+ cutoff_len: 8192
+ overwrite_cache: true
+ preprocessing_num_workers: 16
+
+ ### output
+ output_dir: checkpoints/your-model-name
+ logging_steps: 1
+ save_strategy: epoch
+ plot_loss: true
+ overwrite_output_dir: true
+ report_to: none
+
+ ### train
+ per_device_train_batch_size: 1
+ gradient_accumulation_steps: 4
+ learning_rate: 1.0e-5
+ num_train_epochs: 3.0
+ lr_scheduler_type: cosine
+ warmup_ratio: 0.1
+ bf16: true
+ ddp_timeout: 180000000
+ ```
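+
+ Note that LLaMA-Factory resolves `dataset: datamind-da-dataset` through its `data/dataset_info.json` registry, so the dataset file must be registered there before launching. A minimal sketch of that step (the alpaca-style entry is an assumption; consult the LLaMA-Factory docs for your record format):
+
+ ```python
+ # Hedged sketch: register the DataMind dataset with LLaMA-Factory.
+ # Paths are illustrative; adjust them to where you cloned LLaMA-Factory.
+ import json
+ from pathlib import Path
+
+ info_path = Path("LLaMA-Factory/data/dataset_info.json")
+ info = json.loads(info_path.read_text())
+ info["datamind-da-dataset"] = {"file_name": "datamind-da-dataset.json"}  # assumes default alpaca-style fields
+ info_path.write_text(json.dumps(info, indent=2, ensure_ascii=False))
+ ```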
+
+ ##### 3. Launch Training
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train datamind_sft.yaml
+ ```

  ## 🧐 Evaluation

  > Note:
  >
+ > - **Ensure** that your working directory is set to the **`eval`** folder in a virtual environment.
+ > - If you have any questions, feel free to open an issue.
+ > - If you want to use a local model, deploy it first via the optional **`local_model.sh`** script.

+ **Step 1: Download the evaluation datasets and our SFT models**
+
+ The evaluation datasets we use come from [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects the data at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.

+ You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

+ You can use the following `bash` script to download the datasets:
+
+ ```bash
+ bash download_eval_data.sh
+ ```
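+
+ A quick check that the files landed where `run_eval.sh` expects them (a minimal sketch using only the paths quoted above):
+
+ ```python
+ # Hedged sketch: verify the expected evaluation data layout.
+ from glob import glob
+
+ print(len(glob("data/QRData/benchmark/data/*.csv")), "QRData CSV files")
+ print(len(glob("data/DiscoveryBench/*.csv")), "DiscoveryBench CSV files")
+ ```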
+
+ **Step 2: Prepare the parameter configuration**

  Here is the example:
  **`config.yaml`**

  ```yaml
  api_key: your_api_key # your API key for the model with API service. No need for open-source models.
+ data_root: /path/to/your/project/DataMind/eval/data # Root directory for the data (must be an absolute path!)
  ```
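+
+ A small sanity check for this configuration before launching (a minimal sketch; requires PyYAML and assumes the keys shown above):
+
+ ```python
+ # Hedged sketch: validate config.yaml before running the evaluation.
+ import os
+ import yaml
+
+ cfg = yaml.safe_load(open("config.yaml"))
+ assert os.path.isabs(cfg["data_root"]), "data_root must be an absolute path"
+ print("config OK:", cfg["data_root"])
+ ```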

  **`run_eval.sh`**

  --port $port # API port number, which is consistent with the `api_port` above.
  ```
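+
+ A model deployed this way speaks vLLM's OpenAI-compatible API, so you can smoke-test it before running the full evaluation. A minimal client sketch (the port and served model name are assumptions; match them to your `local_model.sh` and `config.yaml` settings):
+
+ ```python
+ # Hedged sketch: query the locally served model through vLLM's OpenAI-compatible API.
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # port assumed; use your api_port
+ resp = client.chat.completions.create(
+     model="zjunlp/DataMind-Qwen2.5-14B",  # must match the served model name
+     messages=[{"role": "user", "content": "Hello!"}],
+ )
+ print(resp.choices[0].message.content)
+ ```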

+ **Step 3: Run the shell script**

  **(Optional)** Deploy the local model if you need.

  bash run_eval.sh
  ```

+ ## 🎉 Contributors
+
+ <a href="https://github.com/zjunlp/DataMind/graphs/contributors">
+ <img src="https://contrib.rocks/image?repo=zjunlp/DataMind" /></a>
+
+ We deeply appreciate the collaborative efforts of everyone involved. We will continue to enhance and maintain this repository over the long term. If you encounter any issues, feel free to report them to us!

  ## ✍️ Citation

  If you find our work helpful, please use the following citations.

+ ```bibtex
  @article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},