nielsr (HF Staff) committed
Commit 78d0bd8 · verified · Parent: d46ecde

Improve model card: Add pipeline tag, library, correct license, paper abstract, and usage example

This PR significantly enhances the model card for DataMind-Qwen2.5-7B by:

- **Updating metadata:**
  - Changing the `license` from `mit` to `apache-2.0`, as indicated in the official GitHub repository.
  - Adding `pipeline_tag: text-generation` to ensure proper categorization and discoverability for this data analysis and code generation model.
  - Adding `library_name: transformers` to enable the "Use in Transformers" widget, providing easy access to inference code.
  - Adding relevant `tags` such as `data-analysis`, `code-generation`, and `qwen`.
- **Enriching content:**
  - Adding the paper title and its Hugging Face link for quick reference.
  - Including the paper abstract to provide a comprehensive overview of the model's research context and findings.
  - Adding a direct link to the GitHub repository.
  - Adding a "Usage" section with a practical Python code example for text generation (specifically for data analysis queries) using the `transformers` library.

These improvements make the model card more informative, discoverable, and user-friendly on the Hugging Face Hub.
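Taken together, the metadata updates listed above amount to a README frontmatter along these lines (a sketch assembled from the changes in this PR; exact field ordering on the Hub may differ):

```yaml
---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---
```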

Files changed (1): README.md (+72 −8)

README.md CHANGED
````diff
@@ -1,12 +1,23 @@
 ---
-license: mit
 base_model:
 - Qwen/Qwen2.5-7B-Instruct
+license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- data-analysis
+- code-generation
+- qwen
 ---
 
-<h1 align="center"> ✨ DataMind </h1>
+This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).
 
+**Paper Abstract:**
+Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.
+
+For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).
 
+<h1 align="center"> ✨ DataMind </h1>
 
 ## 🔧 Installation
 
@@ -46,7 +57,64 @@ conda activate DataMind
 pip install -r requirements.txt
 ```
 
+## Usage (Text Generation for Data Analysis)
+
+You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.
+
+First, ensure you have the `transformers` library installed:
+
+```bash
+pip install transformers torch
+```
+
+Then, you can load and use the model as follows:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+
+model_name = "zjunlp/DataMind-Qwen2.5-7B"  # Or zjunlp/DataMind-Qwen2.5-14B, if available
+
+# Load the model and tokenizer
+# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs
+# Use device_map="auto" to automatically distribute the model across available devices
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+# Example: Generate Python code for data analysis
+messages = [
+    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
+]
+
+# Apply chat template for Qwen models
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+# Generate response
+generated_ids = model.generate(
+    model_inputs.input_ids,
+    max_new_tokens=512,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.8,
+    repetition_penalty=1.05,
+    eos_token_id=tokenizer.eos_token_id,  # Ensure generation stops at EOS token
+)
+
+# Decode and print only the newly generated tokens (skip the prompt)
+response = tokenizer.decode(generated_ids[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True)
+print(response)
+```
 
 ## 🧐 Evaluation
 
@@ -58,9 +126,9 @@ pip install -r requirements.txt
 
 **Step 1: Prepare the parameter configuration**
 
-The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
+The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
 
-You can also download our sft models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B) ,[DataMind-Qwen2.5-14B ](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).
+You can also download our sft models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B) ,[DataMind-Qwen2.5-14B ](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).
 
 Here is the example:
 **`config.yaml`**
@@ -111,10 +179,6 @@ Run the shell script to start the process.
 bash run_eval.sh
 ```
 
-
-
-
-
 ## ✍️ Citation
 
 If you find our work helpful, please use the following citations.
````