Improve model card: Add pipeline tag, library, correct license, paper abstract, and usage example (#1)
Commit 78d0bd813b80701f1f80255a3592cf633f678a08
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,12 +1,23 @@
---
-license: mit
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

## 🔧 Installation

@@ -46,7 +57,64 @@ conda activate DataMind
pip install -r requirements.txt
```

## 🧐 Evaluation

@@ -58,9 +126,9 @@ pip install -r requirements.txt

**Step 1: Prepare the parameter configuration**

-The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench).

Here is the example:
**`config.yaml`**

@@ -111,10 +179,6 @@ Run the shell script to start the process.
bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citations.

---
base_model:
- Qwen/Qwen2.5-7B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- data-analysis
- code-generation
- qwen
---

This repository contains the **DataMind-Qwen2.5-7B** model, which was presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

**Paper Abstract:**
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official [DataMind GitHub repository](https://github.com/zjunlp/DataMind).

<h1 align="center"> ✨ DataMind </h1>

## 🔧 Installation

pip install -r requirements.txt
```

## Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face `transformers` library for text generation, particularly for data analysis and code generation tasks.

First, ensure you have the `transformers` library installed:

```bash
pip install transformers torch
```

Then, you can load and use the model as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B"  # Or zjunlp/DataMind-Qwen2.5-14B, if available

# Load the model and tokenizer.
# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs.
# Use device_map="auto" to automatically distribute the model across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply the chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,  # Ensure generation stops at the EOS token
)

# Decode only the newly generated tokens (strip the prompt) and print them
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```
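
A shorter variant, assuming a recent `transformers` release whose text-generation pipeline accepts chat-style message lists directly, is sketched below (the example prompt is illustrative):

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint.
pipe = pipeline(
    "text-generation",
    model="zjunlp/DataMind-Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Using pandas, report the number of missing values per column in 'sales_data.csv'."}
]

# The pipeline applies the chat template internally; the assistant reply is the
# last message of the returned conversation.
result = pipe(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```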

## 🧐 Evaluation

**Step 1: Prepare the parameter configuration**

The evaluation datasets we used are in [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.
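
For example, a minimal sanity check that the benchmark files are in the expected locations (assuming you have already placed the two datasets under a local `data/` folder; the paths below are the ones quoted above) could look like this:

```python
# Check that the benchmark CSVs sit where the evaluation script expects them.
# Adjust the paths if your local layout differs.
import glob

qrdata_csvs = glob.glob("data/QRData/benchmark/data/*.csv")
discoverybench_csvs = glob.glob("data/DiscoveryBench/*.csv")
print(f"QRData CSV files found: {len(qrdata_csvs)}")
print(f"DiscoveryBench CSV files found: {len(discoverybench_csvs)}")
```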

You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).
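
For instance, a minimal sketch for fetching the 7B checkpoint programmatically with `huggingface_hub` (the files land in your default Hub cache unless you configure otherwise):

```python
# Download the DataMind-Qwen2.5-7B weights from the Hugging Face Hub and
# print the local path of the downloaded snapshot.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="zjunlp/DataMind-Qwen2.5-7B")
print(local_path)
```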

Here is the example:
**`config.yaml`**

bash run_eval.sh
```

## ✍️ Citation

If you find our work helpful, please use the following citations.