ProCreations committed on
Commit 1ffffb4 · verified · 1 Parent(s): 04041fd

Update README.md

  1. README.md +115 -5
README.md CHANGED
@@ -1,5 +1,115 @@
- ---
- license: other
- license_name: procreations-development
- license_link: LICENSE
- ---
+
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - intellite
+ - "small-language-model"
+ - transformers
+ - pytorch
+ - conversational-ai
+ - huggingface
+ - low-resource
+ ---
+
+ # IntellIte Chat
+
+ IntellIte Chat is a lightweight conversational AI model (~45M parameters) designed for warm, engaging dialogue and basic reasoning. Part of the *IntellIte* series, it runs efficiently on modest hardware and includes streaming data loading, an episodic memory buffer, and retrieval-augmented generation (RAG) for knowledge augmentation.
+
+ ---
+
+ ## ⚙️ Key Features
+
+ - **Small & Efficient**: ~45M parameters, ideal for edge devices and academic projects.
+ - **Streaming Data**: Uses Hugging Face `IterableDataset` to stream training data on the fly, so the full datasets never need to be stored locally.
+ - **Memory Buffer**: Maintains the last 200 messages for coherent multi-turn conversations.
+ - **RAG Integration**: FAISS-based retrieval for up-to-date knowledge augmentation.
+ - **Content Safety**: Built-in filters to enforce conversational guidelines.
+ - **Extensible API**: Hook into `generate_with_plugins()` for custom prompts or downstream tasks.
+
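The memory buffer described above can be pictured as a fixed-size queue that drops the oldest message once the limit is reached. The sketch below is only an illustration under that assumption: `MemoryBuffer`, `add`, and `as_prompt` are hypothetical names, not the API exposed by `il.py`.

```python
from collections import deque


class MemoryBuffer:
    """Hypothetical helper: keeps only the most recent `max_messages` messages."""

    def __init__(self, max_messages=200):
        # A deque with maxlen silently discards the oldest item when full.
        self._messages = deque(maxlen=max_messages)

    def add(self, role, text):
        """Append one message, evicting the oldest if the buffer is full."""
        self._messages.append((role, text))

    def as_prompt(self):
        """Render the retained messages as one newline-separated context string."""
        return "\n".join(f"{role}: {text}" for role, text in self._messages)

    def __len__(self):
        return len(self._messages)


# Adding 250 messages to a 200-slot buffer leaves only the newest 200.
buffer = MemoryBuffer(max_messages=200)
for i in range(250):
    buffer.add("user", f"message {i}")
```

A bounded `deque` keeps eviction constant-time, which matters little at 200 messages but avoids manual list slicing.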
+ ---
+
+ ## 💾 Installation
+
+ ```bash
+ pip install transformers datasets faiss-cpu torch huggingface-hub
+ ```
+
+ ---
+
+ ## 🚀 Quick Start
+
+ ```python
+ from il import generate_with_plugins
+
+ response = generate_with_plugins(
+     prompt="Hello, how's it going?",
+     source="wiki",
+     k=3,
+     max_new_tokens=100
+ )
+ print(response)
+ ```
+
+ ---
+
+ ## 🛠️ Training Pipeline
+
+ Run the main training script:
+
+ ```bash
+ export HF_TOKEN=<your_hf_token>
+ python il.py --hf_token "$HF_TOKEN" --seed 42
+ ```
+
+ The script will:
+
+ 1. Stream Wikipedia, CodeParrot, and grade-school math datasets.
+ 2. Apply cosine LR scheduling, weight decay, and label smoothing.
+ 3. Run simple evals (2 chat prompts, 1 code prompt) at the end of each epoch.
+ 4. Save the best model and push it to `ProCreations/IntellIte` on Hugging Face.
+
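To make the cosine LR scheduling in step 2 concrete, here is a minimal sketch of the curve such a schedule follows. It is a standalone illustration, not the code in `il.py`: the function name `cosine_lr` and its parameters are assumptions, and warmup is omitted for brevity. With the Hugging Face `Trainer`, this behaviour is normally selected through `TrainingArguments(lr_scheduler_type="cosine")` rather than written by hand.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` for a cosine decay from `base_lr` to `min_lr`.

    Hypothetical helper for illustration; warmup is intentionally left out.
    """
    # Clamp progress to [0, 1] so steps past the end stay at `min_lr`.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Half a cosine period: factor goes smoothly from 1 down to 0.
    factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * factor


# Sample the schedule at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, total_steps=100, base_lr=1e-3) for s in (0, 50, 100)]
```

The rate starts at `base_lr`, reaches roughly half of it at the midpoint, and decays to `min_lr` by the final step.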
+ ---
+
+ ## 📊 Evaluation & Monitoring
+
+ A `SimpleEvalCallback` runs designated chat/code prompts each epoch, logging outputs for quick sanity checks.
+
+ ---
+
+ ## 🔧 Configuration Options
+
+ Edit `il.py` to customize:
+
+ - **Batch Sizes, LR, Scheduler** via `TrainingArguments`.
+ - **Retrieval Sources**: adjust `k` and index sources.
+ - **Memory Buffer**: change size or filter rules.
+
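The effect of the retrieval parameter `k` can be shown with a brute-force nearest-neighbour search. This is a pure-Python stand-in for what a FAISS index does far more efficiently; the names `cosine_similarity` and `top_k` and the toy two-dimensional vectors are assumptions for illustration, not part of `il.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the `k` document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # Highest similarity first; the sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings: documents 0 and 1 point roughly the same way as the query.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], docs, k=3)
```

Raising `k` passes more retrieved passages to the model at the cost of a longer prompt, which is a real constraint for a model of this size.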
+ ---
+
+ ## 🌱 Fine‑Tuning on Custom Data
+
+ 1. Prepare your dataset as a Hugging Face `Dataset` or `IterableDataset`.
+ 2. Interleave it with the base streams and pass the result to the `Trainer`.
+ 3. Use `--resume_from_checkpoint` to continue an interrupted run.
+
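The interleaving in step 2 amounts to alternating between streams without loading any of them fully into memory. The sketch below shows the idea with plain Python iterables; `interleave` is a hypothetical helper, and the toy records are made up. In practice the `datasets` library provides `interleave_datasets` for this, which by default stops when the first dataset is exhausted, whereas this sketch keeps going until every stream is empty.

```python
def interleave(*streams):
    """Round-robin over several iterables until all of them are exhausted."""
    iterators = [iter(stream) for stream in streams]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                # This stream is finished; drop it from the next round.
                continue
            still_active.append(it)
        iterators = still_active


# Toy stand-ins for a base stream and a smaller custom dataset.
base_stream = ({"text": f"base {i}"} for i in range(3))
custom_stream = ({"text": f"custom {i}"} for i in range(2))
mixed = [row["text"] for row in interleave(base_stream, custom_stream)]
```

Because the helper is a generator, it consumes one example at a time, which matches the streaming setup described earlier.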
+ ---
+
+ ## 🤝 Contributing
+
+ Contributions welcome! Steps:
+
+ 1. Fork the repo.
+ 2. Create a feature branch.
+ 3. Submit a PR with clear descriptions and tests.
+
+ ---
+
+ ## 📜 License
+
+ This project is licensed under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).
+
+ ---
+
+ ❤️ Developed by ProCreations under the *IntellIte* brand.
+