<!DOCTYPE html>
<html>
<head>
<title>OpenSeek</title>
<meta charset="UTF-8">
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<style>
/* Basic styling for better readability */
body {
font-family: "Microsoft YaHei", sans-serif;
line-height: 1.6;
margin: 20px;
}
img {
max-width: 100%;
/* Make images responsive */
height: auto;
}
.center {
text-align: center;
}
table {
width: 100%;
border-collapse: collapse;
margin-bottom: 20px;
}
th,
td {
border: 1px solid #ddd;
padding: 8px;
text-align: left;
}
th {
background-color: #f2f2f2;
}
/* Add more styles as needed */
</style>
</head>
<body>
<div id="markdown-content" style="width:56vw;margin: 0 auto;"></div>
<script>
const markdownTextEN = `
<div align="center">
<img src="./openseek_logo.jpg" alt="OpenSeek Logo" width="150">
</div>
<div align="center">
OpenSeek aims to unite the global open source community to drive collaborative innovation in algorithms, data and systems to develop next-generation models that surpass DeepSeek.
<div>English | <span style="color: #0969DA; cursor: pointer;">简体中文</span></div>
[![GitHub license](https://img.shields.io/github/license/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/network)
[![GitHub issues](https://img.shields.io/github/issues/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/issues)
</div>
# 📌 Project Overview
OpenSeek is an open source project initiated by the Beijing Academy of Artificial Intelligence (BAAI), aiming to unite the global open source community to drive collaborative innovation in algorithms, data, and systems and to develop next-generation models that surpass DeepSeek. Drawing inspiration from large model initiatives such as BigScience and OPT, the project is dedicated to building an independent open source algorithmic innovation system. Since the open sourcing of the DeepSeek models, academia has produced numerous algorithmic improvements and breakthroughs, but these innovations often lack complete code implementations, the necessary computational resources, and high-quality data support. By uniting the open source community, OpenSeek hopes to explore mechanisms for constructing high-quality datasets, open source the entire large model training pipeline, build innovative training and inference code that supports AI chips beyond Nvidia, and promote independent technological innovation and application development.
**Core Objectives of OpenSeek:**
- Innovative data synthesis technology: Address the challenge of acquiring high-quality data and break through data barriers.
- Support for multiple AI chips: Reduce dependency on specific chips and improve model universality and adaptability.
- Build an independent open source algorithmic innovation system: Promote independent algorithmic innovation and technology sharing through open source collaboration.
**Project Repository:** https://github.com/FlagAI-Open/OpenSeek
# 📢 News
- 🔥[02/13/2025] Completed validation of the OpenSeek-PT-1T dataset on a 3B model; released model checkpoints, data ratios, training code with hyperparameters, and wandb logs.
# 👁 Project Highlights
- High-quality data open and accessible
  - Open source large-scale high-quality Chinese and English datasets (>4TB), covering a wide variety of data types and scenarios.
  - Open source high-quality dataset construction plans, supporting diverse high-quality data synthesis based on human data, helping developers achieve innovation at the data level.
- Multi-AI chip distributed training framework
  - Support for Triton operators and multi-chip training, compatible with various hardware architectures, ensuring efficient utilization of different devices.
  - Implement more efficient computation, communication, and memory access collaborative hybrid parallel schemes, providing cluster training logs and performance data to help developers optimize large-scale training tasks.
- Model structure optimization and improvement
  - Explore structural optimization at two model sizes, OpenSeek-Small and OpenSeek-Mid, to meet the needs of different application scenarios.
  - Provide training experience and optimization plans for small models to help developers achieve high-performance development and deployment in resource-constrained environments.
# ☎️ Open Source Co-construction Plan
As a member of the open source community, we deeply understand that the power of open source comes from the wisdom and enthusiasm of every developer. We firmly believe that through the joint efforts of global developers, every contribution will push the project towards maturity and perfection.
Welcome to check our [Contribution Guide](CONTRIBUTING.md) for more details.
Whether you are:
- A deep learning expert with experience in large model training;
- A data scientist dedicated to data construction and algorithm innovation;
- Or a beginner passionate about open source projects;
You can find a platform to showcase your talents at OpenSeek. You can contribute in the following ways:
- Code and technical solution contributions
  - If you have unique insights into training processes, code implementation, or performance optimization, feel free to submit a Pull Request and help us advance the project.
- Data, algorithm, and resource support
  - If you have high-quality datasets, innovative algorithms, or other valuable resources and wish to contribute in non-code forms, please contact us directly to discuss collaboration methods.
- Participate in technical discussions and documentation improvement
  - Share your insights, experiences, and suggestions to help us continuously improve project documentation and technical details.
Let's explore the infinite possibilities of large model training with the power of open source and promote continuous technological progress!
<div align="center">
<img src="./wechat.png" alt="wechat" width="200">
</div>
# ⏰ RoadMap
| Direction | One: Complete the creation of OpenSeek-data-1.3TB, support OpenSeek-Small distributed training | Two: Expand data scale and optimize distributed training performance, complete OpenSeek-Small training on the final version of OpenSeek-PT-1.3T data | Three: Support larger scale data and distributed training, complete OpenSeek-Mid training on OpenSeek-PT-8T data, achieve full process training support | Four: Upgrade multi-chip support, open source datasets and model weights |
|-----------|------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|-------------------------------------------------------------|
| Data | ☐ Build data processing + data synthesis pipeline<br>☐ Build OpenSeek-PT-1.3T-v0.1<br>☐ Construct OpenSeek-data-1.3T official version based on OpenSeek-Small data ratio experiment results | ☐ Expand data scale, build OpenSeek-PT-8T<br>☐ Construct Long-CoT-Backward synthetic dataset and verify effects | ☐ Build OpenSeek-Zero dataset<br>☐ Build OpenSeek-RL dataset<br>☐ Build OpenSeek-SFT dataset<br>☐ Construct Long-CoT-Forward synthetic dataset and verify effects | ☐ Release official version of OpenSeek series datasets<br>☐ Construct Long-CoT-RAG synthetic dataset and verify effects |
| Training | ☐ Validate 3B model effects on OpenSeek-PT-1.3T-v0.1 (Baseline)<br>☐ Complete experimental training of OpenSeek-Small (~100B) | ☐ Complete hyperparameter experiments for OpenSeek-Small<br>☐ Validate OpenSeek-PT-4T effects<br>☐ Complete full training of OpenSeek-Small on OpenSeek-PT-1.3T-v1.0 | ☐ Produce OpenSeek-Small-Zero<br>☐ Produce OpenSeek-Small-SFT<br>☐ Produce OpenSeek-Small-RL<br>☐ Complete hyperparameter experiments for OpenSeek-Mid<br>☐ Validate OpenSeek-PT-8T effects<br>☐ Complete full training of OpenSeek-Mid on OpenSeek-PT-8T | ☐ Produce OpenSeek-Mid-Zero<br>☐ Produce OpenSeek-Mid-SFT<br>☐ Produce OpenSeek-Mid-RL |
| System | ☐ Support the distributed training for MLA, DeepSeek MoE, MTP, Auxiliary-Loss-Free etc. <br>☐ Convert and load DeepSeek V3 parameters | ☐ Support Node-limited Routing MoE<br>☐ Support FP8 distributed training<br>☐ Integrate Triton-based operator library FlagGems | ☐ Support DualPipe pipeline parallelism<br>☐ Further optimize computation-communication overlap and memory optimization | ☐ Adapt training and precision alignment for different chips<br>☐ Implement customized parallel and optimization strategies for specific chips |
# 📚 Data
## 1. Data Source Preparation
The pre-training dataset is mainly composed of collected and curated open source datasets.
### English Common Crawl
- https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
- https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
### Chinese Common Crawl
- https://huggingface.co/datasets/BAAI/CCI3-HQ
- https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1
### Other Domains
#### Wiki & Books & arXiv
- English: https://huggingface.co/datasets/allenai/dolma
- Chinese: Self-built Chinese encyclopedia, books, and literature data
#### Math
- https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-math-corpus
- https://huggingface.co/datasets/EleutherAI/proof-pile-2
- https://huggingface.co/datasets/HuggingFaceTB/finemath
#### Code
- https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-code-corpus
- https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
- https://huggingface.co/datasets/bigcode/the-stack-v2
## 2. Data Synthesis
- **General Knowledge Tagging System Construction**: Refer to the paper "Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning". Based on Qwen2.5-72B, analyze common knowledge points involved in open source data in fields such as mathematics, code, and common sense Q&A, and construct a general knowledge tagging system.
- **Raw Corpus Annotation and Filtering**: Combine the knowledge tagging system and apply Qwen2.5-72B to tag the corpus. Sample and distinguish between corpus suitable for simple QA synthesis and those suitable for long CoT QA synthesis based on article knowledge point types.
- **Pre-training QA Data Synthesis**
1. Simple QA Synthesis: Extract Question-Answer pairs from raw corpus based on open source models.
2. Long-CoT-Backward Data Synthesis: Segment and summarize the original document, organize the CoT process, and summarize a query. Use {Query, CoT process, original document} as a training sample (a minimal record layout is sketched after this list).
3. Long-CoT-Forward Data Synthesis: On the basis of Backward data synthesis, call open source strong reasoning models to optimize and refine the CoT process in Backward data, and provide high-quality CoT answers corresponding to the query. Use {Query, optimized CoT process, model answer} as a training sample.
4. Long-CoT-RAG Data Synthesis: Refer to the paper "Search-o1: Agentic Search-Enhanced Large Reasoning Models". Collect open source instructions and provide high-quality responses to instructions using inference + RAG methods.
- **RL Data**: Based on the general knowledge tagging system, further sample high-quality reasoning type data (mathematics, code, difficult common sense, etc.) and non-reasoning data (writing, translation, etc.) from synthetic data.
- **Quality Filtering**: Use reward models, rule verification, etc., to score and filter data quality.
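
The sketch below illustrates the Long-CoT-Backward sample format described in step 2 above. The field names, the dataclass, and the naive joining of segment summaries are assumptions made for illustration only; they are not the project's actual synthesis code or schema.

    # Illustrative sketch of a Long-CoT-Backward training record.
    # Field names and the naive CoT assembly are assumptions, not the project's schema.
    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class BackwardCoTSample:
        query: str     # query summarized from the source document
        cot: str       # CoT process organized from the segment summaries
        document: str  # original source document

    def build_backward_sample(document: str, segment_summaries: list, query: str) -> dict:
        # In a real pipeline an LLM would organize and refine the CoT process;
        # here the segment summaries are simply joined in order.
        cot = " ".join(segment_summaries)
        return asdict(BackwardCoTSample(query=query, cot=cot, document=document))

    if __name__ == "__main__":
        sample = build_backward_sample(
            document="Full source article text ...",
            segment_summaries=["Summary of part 1.", "Summary of part 2."],
            query="What does the article conclude?",
        )
        print(json.dumps(sample, ensure_ascii=False, indent=2))
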
## 3. Data Preprocessing
### Deduplication
- **Global Fuzzy Deduplication Based on MinHash**
- https://github.com/huggingface/datatrove/blob/main/examples/minhash_deduplication.py
- **Exact Substring Deduplication**
- https://github.com/google-research/deduplicate-text-datasets
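
The datatrove example linked above implements the full, scalable deduplication pipeline; the snippet below is only a minimal, self-contained sketch of the MinHash idea itself. The shingle size, the number of hash functions, and any similarity threshold used for deduplication are illustrative choices, not the project's settings.

    # Minimal illustration of MinHash-based fuzzy deduplication (not the datatrove pipeline).
    # Shingle size, number of hash functions, and thresholds are illustrative values.
    import hashlib

    NUM_HASHES = 64
    SHINGLE_SIZE = 5

    def shingles(text: str, size: int = SHINGLE_SIZE) -> set:
        words = text.lower().split()
        return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}

    def minhash_signature(text: str) -> list:
        signature = []
        for seed in range(NUM_HASHES):
            # The minimum hash value over all shingles approximates one random permutation.
            values = [int(hashlib.sha1((str(seed) + s).encode("utf-8")).hexdigest()[:16], 16)
                      for s in shingles(text)]
            signature.append(min(values))
        return signature

    def estimated_jaccard(sig_a: list, sig_b: list) -> float:
        # The fraction of matching signature positions estimates the Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    if __name__ == "__main__":
        doc_a = "OpenSeek aims to unite the global open source community to drive collaborative innovation in algorithms data and systems"
        doc_b = "OpenSeek aims to unite the global open source community to drive joint innovation in algorithms data and systems"
        # Pairs whose estimated similarity exceeds a chosen threshold (e.g. around 0.8)
        # would be treated as near-duplicates and collapsed to a single copy.
        print("estimated Jaccard similarity:", estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
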
### Rule-based Filtering
Rule-based filtering is developed on top of the data-juicer tool (https://github.com/modelscope/data-juicer); the main rules include:
- Document character length
- Average sentence character length in documents
- Traditional Chinese to Simplified Chinese conversion
- Sensitive word and safety word filtering
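
data-juicer ships rules like these as configurable operators; the function below is only a stand-alone sketch of the rule types listed above, with placeholder thresholds and a placeholder blocklist rather than the project's actual configuration. The Traditional-to-Simplified conversion step is omitted here, since it is typically delegated to a dedicated conversion library.

    # Illustrative rule-based document filter; thresholds and blocklist are placeholders.
    MIN_DOC_CHARS = 200           # document character length rule
    MIN_AVG_SENTENCE_CHARS = 10   # average sentence character length rule
    BLOCKLIST = {"example-sensitive-word"}  # sensitive word / safety word rule

    def split_sentences(text: str) -> list:
        # Very rough sentence split on common Chinese and English terminators.
        sentences, current = [], ""
        for ch in text:
            current += ch
            if ch in ".!?。!?":
                sentences.append(current.strip())
                current = ""
        if current.strip():
            sentences.append(current.strip())
        return sentences

    def keep_document(text: str) -> bool:
        if len(text) < MIN_DOC_CHARS:
            return False
        sentences = split_sentences(text)
        avg_sentence_len = sum(len(s) for s in sentences) / max(1, len(sentences))
        if avg_sentence_len < MIN_AVG_SENTENCE_CHARS:
            return False
        lowered = text.lower()
        return not any(word in lowered for word in BLOCKLIST)
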
### Quality Classifier
- Chinese: quality estimation with a classifier based on education-level scoring
- English: quality estimation combining multiple education-level classifiers
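
The classifiers themselves are not described further in this document. As one hedged possibility of how such an education-level scorer could be applied, the sketch below assumes a fastText-style model file and label scheme (both hypothetical) and keeps documents above an illustrative threshold.

    # Hypothetical use of an education-level quality classifier.
    # The model file, label scheme, and threshold are assumptions for illustration only.
    import fasttext

    model = fasttext.load_model("edu_quality_classifier.bin")  # hypothetical model file

    def education_level(text: str) -> float:
        # fastText predicts on a single line, so collapse all whitespace first.
        labels, probs = model.predict(" ".join(text.split()), k=1)
        # Assume labels look like __label__0 ... __label__5 (estimated education level).
        return float(labels[0].replace("__label__", ""))

    def keep_for_pretraining(text: str, min_level: float = 2.5) -> bool:
        return education_level(text) >= min_level
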
# 🖥️ System
This project uses [FlagScale](https://github.com/FlagOpen/FlagScale.git) as its distributed training framework. FlagScale is an end-to-end framework for large models across multiple chip types, built entirely on open source technology by the Beijing Academy of Artificial Intelligence (BAAI) in collaboration with ecosystem partners; it maximizes computational resource efficiency while ensuring model effectiveness.
<div align="center">
<img src="./flagscale.png" alt="FlagScale Architecture" width="600">
</div>
The FlagScale architecture can be divided into three layers:
1. **Frontend**: Provides a unified user interface and automation tools, such as a unified launcher and auto-tuning, for a good user experience.
2. **Middleware**: Includes multiple high-performance execution engines, both self-developed and open source, covering training, compression, inference, and service stages, enhancing system flexibility and scalability.
3. **Backend**: Contains underlying operator libraries and communication libraries to ensure efficient and reliable performance, especially the Triton-based operator library [FlagGems](https://github.com/FlagOpen/FlagGems) and unified heterogeneous communication library [FlagCX](https://github.com/FlagOpen/FlagCX), enabling computation and communication across different chips.
This project will utilize the FlagScale framework and leverage the power of the open source community to develop the distributed training system technology behind DeepSeek V3 & R1, striving to ensure the system's stability and practical effectiveness in end-to-end training. On this basis, we hope to further explore the collaborative optimization of model algorithms and system efficiency, including:
- **Model Structure Improvement**: Further improve MLA, MTP, MoE, and related components to optimize model performance and training efficiency.
- **Computation and Communication Scheduling Optimization**: Develop general computation and communication scheduling technologies suitable for more chips, enhancing cross-hardware platform compatibility and computational efficiency.
- **Low Precision Training Optimization**: Explore more stable training schemes for low precision numerical formats like FP8 and develop corresponding operator optimizations to reduce computational costs and improve training stability.
Through these technological innovations, we hope to promote the efficiency, compatibility, and scalability of distributed training systems, providing stronger support for large-scale AI training.
# 🚀 Training
## Phase 1: V3 Pre-training
| Category | Data | ckpt | Evaluation Results | Training Hyperparameters | Wandb | Discussion |
|----------|------|------|--------------------|--------------------------|-------|------------|
| Content | Aquila-3B data validation model<br>OpenSeek-PT-1.3T v0.1 | -- | <img src="./3B-results.jpeg" alt="3B-results" width="180"><br> | seqlen: 4096<br>gbs: 8M<br>lr: 3.0e-3<br>lr_decay_style: WSD | <div style="width:240px;text-align:center"><img src="./3B-loss.png" alt="3B-results" width="240"><br>https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/aquila_3b_exp02-rank-63 </div>| -- |
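
The lr_decay_style entry above refers to a warmup-stable-decay (WSD) schedule. The function below is only a rough, self-contained sketch of that schedule's shape: the peak learning rate matches the 3.0e-3 listed in the table, but the warmup and decay fractions and the minimum-lr ratio are illustrative assumptions, not this run's settings.

    # Illustrative warmup-stable-decay (WSD) learning-rate schedule.
    # The warmup/decay fractions and minimum-lr ratio are placeholder values.
    def wsd_lr(step: int, total_steps: int, peak_lr: float = 3.0e-3,
               warmup_frac: float = 0.01, decay_frac: float = 0.1,
               min_lr_ratio: float = 0.1) -> float:
        warmup_steps = int(total_steps * warmup_frac)
        decay_steps = int(total_steps * decay_frac)
        stable_end = total_steps - decay_steps
        if step < warmup_steps:
            # Linear warmup from 0 up to the peak learning rate.
            return peak_lr * step / max(1, warmup_steps)
        if step < stable_end:
            # Stable phase: hold the peak learning rate constant.
            return peak_lr
        # Decay phase: linear decay from peak_lr down to min_lr_ratio * peak_lr.
        progress = min(1.0, (step - stable_end) / max(1, decay_steps))
        return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)

    if __name__ == "__main__":
        for s in (0, 1000, 50000, 95000, 100000):
            print(s, round(wsd_lr(s, total_steps=100000), 6))
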
# 📜 License Agreement
- Code is licensed under Apache 2.0
- Model weights are licensed under Apache 2.0
- Data is licensed under CC BY-SA 4.0
**Note**: Full reproduction requires at least 8 H100 GPUs, and it is recommended to use the SLURM cluster management system. Datasets need to be applied for or generated independently, and some sensitive data is not included in the open source package.
`
const markdownTextCN = `
<div align="center">
<img src="./openseek_logo.jpg" alt="OpenSeek Logo" width="150">
</div>
<div align="center">
OpenSeek旨在联合全球开源社区,推动算法、数据和系统的协同创新,开发出超越DeepSeek的下一代模型。
<div><span style="color: #0969DA; cursor: pointer;">English</span> | 简体中文</div>
[![GitHub license](https://img.shields.io/github/license/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/blob/main/LICENSE)
[![GitHub stars](https://img.shields.io/github/stars/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/stargazers)
[![GitHub forks](https://img.shields.io/github/forks/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/network)
[![GitHub issues](https://img.shields.io/github/issues/FlagAI-Open/OpenSeek)](https://github.com/FlagAI-Open/OpenSeek/issues)
</div>
# 📌项目概述
OpenSeek是由北京智源人工智能研究院(BAAI)发起的开源项目,旨在联合全球开源社区,推动算法、数据和系统的协同创新,开发出超越DeepSeek的下一代模型。 该项目从Bigscience和OPT等大模型计划中汲取灵感,致力于构建一个开源自主的算法创新体系。 自DeepSeek模型开源以来,学术界涌现出众多算法改进和突破,但这些创新往往缺乏完整的代码实现、必要的计算资源和高质量的数据支持。 OpenSeek项目期望通过联合开源社区,探索高质量数据集构建机制,推动大模型训练全流程的开源开放,构建创新的训练和推理代码以支持多种AI芯片,促进自主技术创新和应用发展。
**OpenSeek核心目标:**
- 创新数据合成技术:解决高质量数据获取的挑战,推动数据壁垒的突破。
- 支持多AI芯片:降低成本,减少对特定芯片的依赖,提升模型的通用性和适应性。
- 构建开源自主的算法创新体系:通过开源合作,促进算法的自主创新和技术共享。
**项目地址:** https://github.com/FlagAI-Open/OpenSeek
# 📢News
- 🔥[02/13/2025] 在3B尺寸模型上完成了OpenSeek-PT-1T数据集的效果验证,发布模型ckpt、数据配比、训练代码与超参以及wandb日志。
# 👁 项目核心亮点
- 高质量数据开源开放
  - 开源大规模高质量中英文数据集(>4TB),涵盖丰富多样的数据类型和场景。
  - 开源高质量数据集构建方案,支持基于人工数据进行多样性高质量数据合成,助力开发者在数据层面实现创新。
- 多AI芯片高性能分布式训练框架
  - 支持Triton算子,支持多元芯片训练,兼容多种硬件架构,确保不同设备的高效利用。
  - 实现更高效计算、通信与访存联合协同的混合并行方案,提供集群实训日志和性能数据,助力开发者优化大规模训练任务。
- 模型结构优化改进
  - 探索OpenSeek-small和OpenSeek-Mid等两个不同尺寸的模型结构优化,以满足不同应用场景的需求。
  - 提供小尺寸模型的训练经验与优化方案,帮助开发者在资源受限的环境中实现高性能开发部署。
# ☎️开源共建计划
作为开源社区的一员,我们深知开源的力量源自每一位开发者的智慧与热情。我们坚信,通过全球开发者的共同努力,每一份贡献都将推动项目不断迈向成熟与完善。
欢迎查看我们的[贡献指南](CONTRIBUTING.md)了解更多详细信息。
无论你是:
- 拥有大模型训练经验的深度学习专家;
- 致力于数据构建与算法创新的数据科学家;
- 亦或是对开源项目充满热情的初学者;
你都能在 OpenSeek 找到展示才华的平台。你可以通过以下方式贡献力量:
- 代码与技术方案贡献
  - 如果你对训练流程、代码实现或性能优化有独到见解,欢迎提交 Pull Request,与我们一起推动项目进展。
- 数据、算法与资源支持
  - 如果你拥有高质量数据集、创新算法或其他有价值的资源,并希望以非代码形式贡献力量,请直接联系我们,共同探讨合作方式。
- 参与技术讨论与文档完善
  - 分享你的见解、经验和建议,帮助我们不断完善项目文档和技术细节。
让我们一起用开源的力量探索大模型训练的无限可能,推动技术不断进步!
<div align="center">
<img src="./wechat.png" alt="wechat" width="200">
</div>
# ⏰ RoadMap
| 方向 | 一:完成制作OpenSeek-data-1.3TB,支持OpenSeek-Small分布式训练 | 二:扩展数据规模和优化分布式训练性能,在最终版OpenSeek-PT-1.3T数据上完成OpenSeek-small训练 | 三:支持更大规模数据和分布式训练,在OpenSeek-PT-8T数据上完成OpenSeek-Mid训练,实现全流程训练支持 | 四:升级多芯片支持,开源数据集和模型权重 |
|------|------|------|------|------|
| 数据 | ☐ 构建数据处理+数据合成的数据pipeline<br>☐ 构建OpenSeek-PT-1.3T-v0.1<br>☐ 基于OpenSeek-Small数据配比实验结果构建OpenSeek-data-1.3T 正式版 | ☐ 扩大数据规模, 构建OpenSeek-PT-8T<br>☐ 构建Long-CoT-Backward合成数据集并验证效果 | ☐ 构建 OpenSeek-Zero数据集<br>☐ 构建 OpenSeek-RL数据集<br>☐ 构建 OpenSeek-SFT数据集<br>☐ 构建Long-CoT-Forward合成数据集并验证效果 | ☐ 发布正式版本OpenSeek系列数据集<br>☐ 构建Long-CoT-RAG合成数据集并验证效果 |
| 训练 | ☐ 完成3B模型在OpenSeek-PT-1.3T-v0.1上的效果验证(Baseline)<br>☐ 完成OpenSeek-Small实验性训练(~100B) | ☐ 完成OpenSeek-Small的超参实验<br>☐ 验证OpenSeek-PT-4T效果<br>☐ 完成OpenSeek-Small在OpenSeek-PT-1.3T-v1.0的完整训练 | ☐ 完成OpenSeek-Small-Zero开发<br>☐ 完成OpenSeek-Small-SFT开发<br>☐ 完成OpenSeek-Small-RL开发<br>☐ 完成OpenSeek-Mid的超参实验<br>☐ 验证OpenSeek-PT-8T效果<br>☐ 完成OpenSeek-Mid在OpenSeek-PT-8T的完整训练 | ☐ 完成OpenSeek-Mid-Zero开发<br>☐ 完成OpenSeek-Mid-SFT开发<br>☐ 完成OpenSeek-Mid-RL开发 |
| 系统 | ☐ 对MLA、DeepSeek MoE、MTP、Auxiliary-Loss-Free等分布式训练支持<br>☐ DeepSeek V3参数转换并加载 | ☐ 支持Node-limited Routing MoE<br>☐ FP8分布式训练支持与验证<br>☐ 集成基于Triton的算子库FlagGems | ☐ DualPipe流水线并行支持<br>☐ 进一步计算通信重叠与显存优化 | ☐ 对不同芯片进行训练适配和精度对齐<br>☐ 针对特定芯片,实现定制化的并行策略和优化策略 |
# 📚 数据
## 1. 数据来源准备
预训练数据集主要由收集和筛选的开源数据集组成。
### 英文Common Crawl
- https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
- https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
### 中文Common Crawl
- https://huggingface.co/datasets/BAAI/CCI3-HQ
- https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1
### 其他Domain
#### Wiki & Books & arXiv
- 英文:https://huggingface.co/datasets/allenai/dolma
- 中文:自建的中文百科、图书和文献数据
#### Math
- https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-math-corpus
- https://huggingface.co/datasets/EleutherAI/proof-pile-2
- https://huggingface.co/datasets/HuggingFaceTB/finemath
#### Code
- https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-code-corpus
- https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
- https://huggingface.co/datasets/bigcode/the-stack-v2
## 2. 数据合成
- **通用知识标签体系构建**:参考论文Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning。基于Qwen2.5-72B,分析数学、代码、常识问答等领域开源数据涉及的常见知识点,构建通用知识标签体系。
- **原始语料标注、筛选**:结合知识标签体系,应用Qwen2.5-72B对语料进行打标。根据文章知识点类型采样、区分适合合成简单QA的语料与适合合成长CoT QA的语料。
- **预训练QA数据合成**
1. 简单QA合成:基于开源模型,从原始语料中抽取Question-Answer对。
2. Long-CoT-Backward数据合成:对原始文档进行分段摘要、组织CoT过程、总结Query。以 {Query, CoT过程, 原始文档} 作为一条训练样本。
3. Long-CoT-Forward数据合成:在Backward数据合成基础上,调用开源强推理模型,优化、精炼Backward数据中的CoT过程,重新给出Query对应的高质量CoT解答。以 {Query, 优化后的CoT过程, 模型回答} 作为一条训练样本。
4. Long-CoT-RAG数据合成:参考论文Search-o1: Agentic Search-Enhanced Large Reasoning Models。搜集开源指令,采用推理+RAG的方式给出指令的高质量回复。
- **RL数据**:基于通用知识标签体系,从合成数据中进一步采样高质量的推理类型数据(数学、代码、较难常识等)及非推理数据(写作、翻译等)。
- **质量过滤**:结合奖励模型、规则验证等对数据的质量进行打分及过滤。
## 3. 数据预处理
### 去重
- **基于MinHash的全局模糊去重**
- https://github.com/huggingface/datatrove/blob/main/examples/minhash_deduplication.py
- **Exact substring deduplication**
- https://github.com/google-research/deduplicate-text-datasets
### 规则过滤
基于 data-juicer 工具(https://github.com/modelscope/data-juicer)进行二次开发,主要规则包括:
- 文档字符长度
- 文档平均句子字符长度
- 繁体中文转简体中文
- 敏感词和安全词过滤
### 质量分类器
- 中文:基于教育水平的质量分类器进行预估
- 英文:综合多个教育水平的质量分类器进行综合预估
# 🖥️ 系统
本项目采用[FlagScale](https://github.com/FlagOpen/FlagScale.git) 作为分布式训练框架,该框架是由北京智源人工智能研究院(BAAI)联合生态伙伴完全基于开源技术构建的面向多种芯片的大模型端到端框架,在确保模型效果的同时,最大化计算资源的效率。
<div align="center">
<img src="./flagscale.png" alt="FlagScale Architecture" width="600">
</div>
FlagScale 架构可以分为三层:
1. **前端(Frontend)** 提供统一的用户界面和自动化工具,如统一启动器和自动调优,为用户提供良好的使用体验。
2. **中间件(Middleware)** 包括自研和开源的多个高性能执行引擎,涵盖训练、压缩、推理和服务等各个阶段,增强系统的灵活性和扩展性。
3. **后端(Backend)** 包含底层算子库和通信库,确保高效可靠的性能,尤其是基于Triton的算子库[FlagGems](https://github.com/FlagOpen/FlagGems)和异构统一通信库[FlagCX](https://github.com/FlagOpen/FlagCX),能够实现不同芯片上的计算与通信。
本项目将利用 FlagScale 框架,并结合开源社区的力量,致力于开发 DeepSeek V3 & R1 的分布式训练系统技术,并努力确保该系统在端到端训练过程中的稳定性和实际效果。在此基础上,我们希望进一步探索模型算法与系统效率协同优化的技术,包括:
- **模型结构改进**:进一步改进 MLA、MTP、MoE等,以优化模型性能和训练效率。
- **计算与通信调度优化**:研发适用于更多芯片的高通用性计算与通信调度技术,提升跨硬件平台的兼容性和计算效率。
- **低精度训练优化**:探索 FP8 等低精度数值格式的稳定训练方案,并开发相应的算子优化,以降低计算成本并提高训练稳定性。
通过这些技术创新,我们希望推动分布式训练系统的高效性、兼容性与可扩展性,为大规模 AI 训练提供更强的支撑。
# 🚀 训练
## 阶段1:V3预训练
| 类别 | 数据 | ckpt | 评测结果 | 训练超参 | Wandb | 讨论 |
|------|------|------|-----------|----------|--------|------|
| 内容 | Aquila-3B数据验证模型<br>OpenSeek-PT-1.3T v0.1 | -- | <img src="./3B-results.jpeg" alt="3B-results" width="180"><br> | seqlen: 4096<br>gbs: 8M<br>lr: 3.0e-3<br>lr_decay_style: WSD | <div style="width:240px;text-align:center"><img src="./3B-loss.png" alt="3B-results" width="240"><br>https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/aquila_3b_exp02-rank-63 </div>| -- |
# 📜 许可协议
- 代码采用Apache 2.0许可证
- 模型权重采用Apache 2.0许可协议
- 数据采用CC BY-SA 4.0许可协议
**注意事项**:完整开发需至少8张H100 GPU,建议使用SLURM集群管理系统。数据集需自行申请或生成,部分敏感数据不包含在开源包内。
`
// Render the English version by default.
const markdownContent = document.getElementById('markdown-content');
markdownContent.innerHTML = marked.parse(markdownTextEN);

// Switch languages only when the language-toggle text itself is clicked;
// clicks elsewhere in the rendered content leave the current language unchanged.
function changeLang(name) {
  if (name === "简体中文") {
    markdownContent.innerHTML = marked.parse(markdownTextCN);
  } else if (name === "English") {
    markdownContent.innerHTML = marked.parse(markdownTextEN);
  }
}
markdownContent.addEventListener("click", (event) => changeLang(event.target.innerHTML));
</script>
</body>
</html>