Upload 10 files
- .gitattributes +4 -0
- README.md +220 -3
- README_zh.md +222 -0
- demo/demo1.png +3 -0
- demo/demo2.png +3 -0
- demo/demo3.png +3 -0
- demo/demo4.png +3 -0
- score/demo1.png +0 -0
- score/demo2.png +0 -0
- score/demo3.png +0 -0
- score/demo4.png +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+demo/demo1.png filter=lfs diff=lfs merge=lfs -text
+demo/demo2.png filter=lfs diff=lfs merge=lfs -text
+demo/demo3.png filter=lfs diff=lfs merge=lfs -text
+demo/demo4.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,3 +1,220 @@
<div align="center">

# 🌟 InnoSpark 🌟

[Homepage](https://innospark.aiecnu.cn/innospark/)
[Hugging Face](https://huggingface.co/sii-research)
[ELMES](https://github.com/Inno-Spark/elmes)

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 2px; border-radius: 10px; margin: 20px 0;">
<div style="background: white; padding: 20px; border-radius: 8px;">
<h3>🚀 Advanced Educational Large Language Model</h3>
</div>
</div>

**Language / 语言**: English | [中文](README_zh.md)

</div>

---

## 📖 Project Introduction

**InnoSpark** is an advanced educational large language model independently developed by the Shanghai Innovation Institute and East China Normal University to explore deep applications of artificial intelligence in education. Starting from the domestic Qwen large language model, we performed continued pre-training, domain-specific fine-tuning, and reinforcement learning for educational scenarios, and released InnoSpark-1.0.

## 🔗 Related Resources

### 📱 Main Products
- **Homepage**: [InnoSpark Official](https://innospark.aiecnu.cn/innospark/)
- **RM Model**: [InnoSpark-HPC-RM-32B](https://huggingface.co/sii-research/InnoSpark-HPC-RM-32B)
- **Educational Evaluation System**: [ELMES](https://github.com/Inno-Spark/elmes)

### 🤖 Model Series

| Model Version | Parameters | Link |
|---------------|------------|------|
| **InnoSpark-min** | 0.5B | [🔗 Download](https://huggingface.co/sii-research/InnoSpark-0.5B-0717) |
| **InnoSpark-turbo** | 7B | [🔗 Download](https://huggingface.co/sii-research/InnoSpark-7B-0715) |
| **InnoSpark-plus** | 72B | [🔗 Standard](https://huggingface.co/sii-research/InnoSpark-72B-0710) / [🔗 Reasoning](https://huggingface.co/sii-research/InnoSpark-R-72B-0701) |

### 📊 Datasets
- **Model Scoring Dataset**: [HPC-LLM-8k](https://huggingface.co/datasets/ECNU-InnoSpark/HPC-LLM-8k)
- **Human Scoring Dataset**: [HPC-Human-8k](https://huggingface.co/datasets/ECNU-InnoSpark/HPC-Human-8k)

## 🚀 Quickstart

Here is a code snippet showing how to load the tokenizer and model and how to generate content with `apply_chat_template`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "sii-research/InnoSpark-72B-0710",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("sii-research/InnoSpark-72B-0710")

prompt = "Introduce yourself in detail."
messages = [
    {"role": "system", "content": "You are InnoSpark(启创), created by Shanghai Innovation Institute (上海创智学院) and East China Normal University(华东师范大学). You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

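If you want token-by-token output in a terminal, a streamer can be attached to the same `generate` call. The sketch below is not part of the original quickstart; it simply reuses the `model`, `tokenizer`, and `model_inputs` defined above together with transformers' built-in `TextStreamer`:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated,
# skipping the echoed prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    streamer=streamer,
)
```
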
### vLLM

We recommend deploying the model on 4 A100 GPUs. You can start the vLLM OpenAI-compatible server with the following command in a terminal:

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name InnoSpark --model path/to/InnoSpark --gpu-memory-utilization 0.98 --tensor-parallel-size 4 --port 6000
```

Then you can query the server from the client side with the following code:

```python
import requests
import json

def Innospark_stream(inputs, history):
    url = 'http://localhost:6000/v1/chat/completions'

    history += [{"role": "user", "content": inputs}]

    headers = {"User-Agent": "vLLM Client"}

    pload = {
        "model": "InnoSpark",
        "stream": True,
        "messages": history
    }
    response = requests.post(url,
                             headers=headers,
                             json=pload,
                             stream=True)

    assistant_reply = ""
    for chunk in response.iter_lines(chunk_size=1,
                                     decode_unicode=False,
                                     delimiter=b"\n"):
        if chunk:
            string_data = chunk.decode("utf-8")
            try:
                # Each SSE line has the form "data: {...}"; strip the "data: " prefix.
                json_data = json.loads(string_data[6:])
                delta_content = json_data["choices"][0]["delta"]["content"]
                assistant_reply += delta_content
                yield delta_content
            except KeyError:
                # The first chunk carries only the role, not content.
                delta_content = json_data["choices"][0]["delta"]["role"]
            except json.JSONDecodeError:
                # The final line is "data: [DONE]"; store the full reply in history.
                history += [{
                    "role": "assistant",
                    "content": assistant_reply,
                    "tool_calls": []
                }]
                assert chunk.decode("utf-8")[6:] == '[DONE]'

inputs = 'hi'
history = []
for response_text in Innospark_stream(inputs, history):
    print(response_text, end='')
```

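Because the vLLM entrypoint above exposes an OpenAI-compatible API, you can also talk to the same server with the official `openai` Python client instead of hand-rolled `requests` calls. Below is a minimal non-streaming sketch, assuming the server from the previous step is running locally on port 6000 (the API key is only a placeholder; vLLM does not check it unless one is configured):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key unless --api-key is set.
client = OpenAI(base_url="http://localhost:6000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="InnoSpark",
    messages=[{"role": "user", "content": "Introduce yourself in detail."}],
    max_tokens=512,
)
print(completion.choices[0].message.content)
```
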
## 🌟 Core Features

### 🎯 Open Source Product Matrix

<div align="left">

**1. 📚 InnoSpark Model Series**
- Models at three parameter scales: min (0.5B), turbo (7B), and plus (72B), together with their corresponding reasoning (R) versions

**2. 🔍 ELMES Evaluation System**
- Education Language Model Evaluation System
- An automated evaluation system for educational tasks
- Helps continuously optimize large-model capabilities in teaching scenarios

**3. 🛠️ COCLP Data Cleaning Pipeline**
- Corpus Cleansing Pipeline
- A visual, node-based framework built on ComfyUI
- Supports OCR, audio/video transcription, format conversion, PII removal, text filtering, and more

**4. ⭐ HPC-RM Reward Model**
- Helpful, Personalization, and Creativity Reward Model
- Provides scores along three educational dimensions: helpfulness, personalization, and creativity
- Comes with matching model-scored and human-scored datasets

</div>

## 📈 Performance Results

We achieved the best performance across 4 key educational scenarios:

### 🏆 Evaluation Results

| Scenario | Performance |
|----------|-------------|
| 📝 Knowledge Explanation |  |
| 🧭 Guided Problem Solving |  |
| 📚 Interdisciplinary Lesson Plans |  |
| 🎭 Contextual Question Generation |  |

### 📊 Detailed Evaluation Tables

| Scenario | Evaluation Table |
|----------|------------------|
| 📝 Knowledge Explanation |  |
| 🧭 Guided Problem Solving |  |
| 📚 Interdisciplinary Lesson Plans |  |
| 🎭 Contextual Question Generation |  |

### 🎨 Application Examples

| Scenario | Demo |
|----------|------|
| 📖 Knowledge Explanation |  |
| 🎯 Guided Problem Solving |  |
| 🌟 Interdisciplinary Lesson Plans |  |
| 🎪 Contextual Question Generation |  |

## 🏛️ Technical Support

This project is jointly developed by East China Normal University and the Shanghai Innovation Institute. The reward model was trained with the SiiRL training framework provided by the Shanghai Innovation Institute.

## 📄 License

Please refer to the relevant model pages for specific license information.

---

<div align="center">

## 🤝 Contact & Collaboration

**East China Normal University**

[Homepage](https://innospark.aiecnu.cn/innospark/)
[Email](mailto:[email protected])

---

<sub>🚀 Empowering Education with AI</sub>

</div>

README_zh.md
ADDED
@@ -0,0 +1,222 @@
<div align="center">

# 🌟 启创·InnoSpark 🌟

[主页](https://innospark.aiecnu.cn/innospark/)
[Hugging Face](https://huggingface.co/sii-research)
[ELMES](https://github.com/Inno-Spark/elmes)

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 2px; border-radius: 10px; margin: 20px 0;">
<div style="background: white; padding: 20px; border-radius: 8px;">
<h3>🚀 先进教育大语言模型</h3>
</div>
</div>

**Language / 语言**: [English](README.md) | 中文

</div>

---

## 📖 项目简介

**启创·InnoSpark** 是由上海创智学院和华东师范大学自主研发的先进教育大模型,旨在探索人工智能技术在教育领域中的深度应用。该模型基于国产 Qwen 大语言模型进行二次预训练,并结合子域微调和教育场景的强化学习,推出了 InnoSpark-1.0 版本。

## 🔗 相关资源

### 📱 主要产品
- **主页**: [InnoSpark Official](https://innospark.aiecnu.cn/innospark/)
- **RM模型**: [InnoSpark-HPC-RM-32B](https://huggingface.co/sii-research/InnoSpark-HPC-RM-32B)
- **教育评测系统**: [ELMES](https://github.com/Inno-Spark/elmes)

### 🤖 模型系列

| 模型版本 | 参数规模 | 链接 |
|---------|---------|------|
| **InnoSpark-min** | 0.5B | [🔗 下载](https://huggingface.co/sii-research/InnoSpark-0.5B-0717) |
| **InnoSpark-turbo** | 7B | [🔗 下载](https://huggingface.co/sii-research/InnoSpark-7B-0715) |
| **InnoSpark-plus** | 72B | [🔗 标准版](https://huggingface.co/sii-research/InnoSpark-72B-0710) / [🔗 推理版](https://huggingface.co/sii-research/InnoSpark-R-72B-0701) |

### 📊 数据集
- **模型打分数据集**: [HPC-LLM-8k](https://huggingface.co/datasets/ECNU-InnoSpark/HPC-LLM-8k)
- **人工打分数据集**: [HPC-Human-8k](https://huggingface.co/datasets/ECNU-InnoSpark/HPC-Human-8k)

## 🚀 快速开始

这里提供了一个使用 `apply_chat_template` 的代码示例,展示如何加载分词器和模型以及如何生成内容。

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # 加载模型的设备

model = AutoModelForCausalLM.from_pretrained(
    "sii-research/InnoSpark-72B-0710",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("sii-research/InnoSpark-72B-0710")

prompt = "详细介绍一下你自己。"
messages = [
    {"role": "system", "content": "You are InnoSpark(启创), created by Shanghai Innovation Institute (上海创智学院) and East China Normal University(华东师范大学). You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

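如果希望在终端中逐 token 查看输出,可以为同一个 `generate` 调用附加一个 streamer。下面的示例并非原始快速开始的一部分,仅作参考草稿,复用上面已加载的 `model`、`tokenizer` 和 `model_inputs`,使用 transformers 自带的 `TextStreamer`:

```python
from transformers import TextStreamer

# 生成时将解码后的 token 实时打印到终端,
# 跳过回显的提示词和特殊 token。
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    streamer=streamer,
)
```
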
### vLLM 部署

我们推荐使用 4 块 A100 GPU 部署模型。您可以在终端中使用以下命令启动 vLLM 的 OpenAI 兼容服务端:

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name InnoSpark --model path/to/InnoSpark --gpu-memory-utilization 0.98 --tensor-parallel-size 4 --port 6000
```

然后,您可以使用以下客户端代码调用服务:

```python
import requests
import json

def Innospark_stream(inputs, history):
    url = 'http://localhost:6000/v1/chat/completions'

    history += [{"role": "user", "content": inputs}]

    headers = {"User-Agent": "vLLM Client"}

    pload = {
        "model": "InnoSpark",
        "stream": True,
        "messages": history
    }
    response = requests.post(url,
                             headers=headers,
                             json=pload,
                             stream=True)

    assistant_reply = ""
    for chunk in response.iter_lines(chunk_size=1,
                                     decode_unicode=False,
                                     delimiter=b"\n"):
        if chunk:
            string_data = chunk.decode("utf-8")
            try:
                # 每行 SSE 数据形如 "data: {...}",去掉 "data: " 前缀后解析。
                json_data = json.loads(string_data[6:])
                delta_content = json_data["choices"][0]["delta"]["content"]
                assistant_reply += delta_content
                yield delta_content
            except KeyError:
                # 第一个数据块只包含 role,不包含 content。
                delta_content = json_data["choices"][0]["delta"]["role"]
            except json.JSONDecodeError:
                # 最后一行是 "data: [DONE]",将完整回复写回 history。
                history += [{
                    "role": "assistant",
                    "content": assistant_reply,
                    "tool_calls": []
                }]
                assert chunk.decode("utf-8")[6:] == '[DONE]'

inputs = 'hi'
history = []
for response_text in Innospark_stream(inputs, history):
    print(response_text, end='')
```

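由于上面的 vLLM 入口提供的是 OpenAI 兼容 API,也可以改用官方 `openai` Python 客户端访问同一服务。下面是一个非流式的最小示例草稿,假设上一步的服务端已在本地 6000 端口运行(API key 仅为占位符,vLLM 在未配置时不会校验):

```python
from openai import OpenAI

# vLLM 的 OpenAI 兼容服务端在未设置 --api-key 时接受任意 API key。
client = OpenAI(base_url="http://localhost:6000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="InnoSpark",
    messages=[{"role": "user", "content": "详细介绍一下你自己。"}],
    max_tokens=512,
)
print(completion.choices[0].message.content)
```
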
## 🌟 核心特性

### 🎯 开源产品矩阵

<div align="left">

**1. 📚 InnoSpark模型系列**
- 包含6个不同参数规模的模型:min(0.5B)、turbo(7B)、plus(72B)及其对应的推理模型R版本

**2. 🔍 ELMES评估系统**
- Education Language Model Evaluation System
- 面向教育任务的自动化评估系统
- 助力教学场景下的大模型能力持续优化

**3. 🛠️ COCLP数据清洗管线**
- Corpus Cleansing Pipeline
- 基于ComfyUI的可视化节点式框架
- 支持OCR、音视频转录、格式转换、PII去除、文本过滤等功能

**4. ⭐ HPC-RM奖励模型**
- Helpful, Personalization, and Creativity Reward Model
- 提供有用性、个性化、创造力3个教育维度的打分
- 配套模型打分和人工打分数据集

</div>

## 📈 性能表现

我们在4个关键教育场景中均取得了最优表现:

### 🏆 评测结果

| 场景 | 表现 |
|------|------|
| 📝 知识点讲解 |  |
| 🧭 引导式讲题 |  |
| 📚 跨学科教案 |  |
| 🎭 情景化出题 |  |

### 📊 详细评估表格

| 场景 | 评估表格 |
|------|----------|
| 📝 知识点讲解 |  |
| 🧭 引导式讲题 |  |
| 📚 跨学科教案 |  |
| 🎭 情景化出题 |  |

### 🎨 应用示例

| 场景 | 演示 |
|------|------|
| 📖 知识点讲解 |  |
| 🎯 引导式讲题 |  |
| 🌟 跨学科教案 |  |
| 🎪 情景化出题 |  |

## 🏛️ 技术支持

本项目由华东师范大学智能教育学院和上海创智学院(Shanghai Innovation Institute)联合开发,奖励模型使用了上海创智学院提供的SiiRL训练框架进行训练。

## 📄 许可证

请查看相关模型页面了解具体的许可证信息。

---

<div align="center">

## 🤝 联系与合作

**华东师范大学**

[主页](https://innospark.aiecnu.cn/innospark/)
[邮箱](mailto:[email protected])

---

<sub>🚀 用AI赋能教育</sub>

</div>

demo/demo1.png
ADDED
(binary image, stored with Git LFS)

demo/demo2.png
ADDED
(binary image, stored with Git LFS)

demo/demo3.png
ADDED
(binary image, stored with Git LFS)

demo/demo4.png
ADDED
(binary image, stored with Git LFS)

score/demo1.png
ADDED
(binary image)

score/demo2.png
ADDED
(binary image)

score/demo3.png
ADDED
(binary image)

score/demo4.png
ADDED
(binary image)