Improve model card: Add pipeline tag, library, correct license, paper abstract, and usage example (#1)
Commit 78d0bd813b80701f1f80255a3592cf633f678a08
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,12 +1,23 @@
---
-license: mit
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

## 🔧 Installation

@@ -46,7 +57,64 @@ conda activate DataMind
pip install -r requirements.txt
```

## 🧐 Evaluation

@@ -58,9 +126,9 @@ pip install -r requirements.txt

**Step 1: Prepare the parameter configuration**

-The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench).

Here is the example:
**`config.yaml`**

@@ -111,10 +179,6 @@ Run the shell script to start the process.
bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citations.

---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---

This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

**Paper Abstract:**
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).

<h1 align="center"> ✨ DataMind </h1>

## 🔧 Installation

pip install -r requirements.txt
```

## Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.

First, ensure you have the `transformers` library installed:

```bash
pip install transformers torch
```

Then, you can load and use the model as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B"  # Or zjunlp/DataMind-Qwen2.5-14B, if available

# Load the model and tokenizer.
# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs.
# Use device_map="auto" to automatically distribute the model across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply the chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,  # Ensure generation stops at the EOS token
)

# Decode only the newly generated tokens (strip the prompt) and print them
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```
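
A shorter variant, assuming a recent `transformers` release whose text-generation pipeline accepts chat-style message lists directly, is sketched below (the example prompt is illustrative):

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint.
pipe = pipeline(
    "text-generation",
    model="zjunlp/DataMind-Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Using pandas, report the number of missing values per column in 'sales_data.csv'."}
]

# The pipeline applies the chat template internally; the assistant reply is the
# last message of the returned conversation.
result = pipe(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```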

## 🧐 Evaluation

**Step 1: Prepare the parameter configuration**

The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
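
For example, a minimal sanity check that the benchmark files are in the expected locations (assuming you have already placed the two datasets under a local `data/` folder; the paths below are the ones quoted above) could look like this:

```python
# Check that the benchmark CSVs sit where the evaluation script expects them.
# Adjust the paths if your local layout differs.
import glob

qrdata_csvs = glob.glob("data/QRData/benchmark/data/*.csv")
discoverybench_csvs = glob.glob("data/DiscoveryBench/*.csv")
print(f"QRData CSV files found: {len(qrdata_csvs)}")
print(f"DiscoveryBench CSV files found: {len(discoverybench_csvs)}")
```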

You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).
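
For instance, a minimal sketch for fetching the 7B checkpoint programmatically with `huggingface_hub` (the files land in your default Hub cache unless you configure otherwise):

```python
# Download the DataMind-Qwen2.5-7B weights from the Hugging Face Hub and
# print the local path of the downloaded snapshot.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="zjunlp/DataMind-Qwen2.5-7B")
print(local_path)
```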

Here is the example:
**`config.yaml`**

bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citations.