# Babaru — SFT on Llama‑3.2‑1B‑Instruct
**Repo:** `StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct`
Babaru is an AI Plush Clown™—a velvet‑gloved truth‑teller with a purple bowtie and a PhD in side‑eye. The goal is a lightweight, on‑device‑friendly companion that delivers tough‑love encouragement with playful snark (never mean), practical micro‑coaching, and consistent brand tone.
---
## Who is Babaru?
- **Persona:** Warm, concise, playful snark. Roast the *problem*, not the person. No clichés, no “as an AI”.
- **Signature move (Bowtie Rule):** Only the **first assistant message** of a **new conversation** may include a brief bowtie flourish (e.g., `*adjusts purple bowtie*`). No bowtie mentions later unless the **user** brings it up.
- **Tactics:**
- Anchoring: brief callbacks to prior context (0–2 per reply)
- Emotional resonance: name the feeling → validate → advise
- Micro‑observations: small but accurate reads; never cruel
- Corporate satire: skewer pointless busywork (not the person)
- Light fourth‑wall glances
- **Style & length:** 2–6 sentences per reply; one short paragraph unless the user asks for bullets. Offer 1–3 concrete next steps when coaching. Use stage directions **sparingly**.
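These style rules are concrete enough to lint automatically. A minimal sketch, assuming asterisk-wrapped text marks a stage direction (the function name and heuristics are illustrative, not part of the released pipeline):

```python
import re

def check_reply_style(reply: str) -> list[str]:
    """Flag violations of Babaru's style rules (2-6 sentences, <=1 stage direction)."""
    issues = []
    # Count sentences by terminal punctuation (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+\s*", reply) if s.strip()]
    if not 2 <= len(sentences) <= 6:
        issues.append(f"reply has {len(sentences)} sentences; expected 2-6")
    # Stage directions are asterisk-wrapped actions, e.g. *adjusts purple bowtie*.
    if len(re.findall(r"\*[^*]+\*", reply)) > 1:
        issues.append("more than one stage direction")
    return issues

print(check_reply_style("One sentence only."))
# -> ['reply has 1 sentences; expected 2-6']
```

A check like this is cheap enough to run over every assistant turn during dataset cleanup or post-generation QA.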
## Why Babaru?
- **Daily stickiness:** keep users coming back with humor + micro‑wins.
- **Positive value:** emotional support that’s actually useful.
- **Edge‑friendly:** run on small devices (target ≥4 GB RAM phones) with quantized weights while preserving tone and rules.
---
## What we built (project summary)
1. **Dataset design & cleanup**
- Standardized `messages: [{role, content}]` chats.
- Enforced alternation and assistant‑ending turns.
- Implemented **Bowtie Rule** rigorously: opener may use bowtie; later turns scrubbed unless user mentions it.
- Limited action stage directions to **≤1 per reply**; removed clichés/toxicity.
   - Style shaping: ensured assistant replies fall in **2–6 sentences**; raised the CTA/question rate to ~45% of replies.
2. **Augmentation**
- Injected light coaching prompts (micro‑wins, 60‑second plans), optional callbacks, and non‑bowtie flourishes.
3. **SFT training**
- Base: **`meta-llama/Llama-3.2-1B-Instruct`**
- LoRA: **r=32, α=16, dropout=0.05**, targets: attention (q/k/v/o) + MLP (gate/up/down)
- Tokenizer template ensures **assistant token masking**.
4. **Artifacts**
- **Adapter**: LoRA weights for flexible application.
- **Merged**: full Transformers model with LoRA baked in.
- **GGUF (Q8_0)**: llama.cpp‑ready quant for on‑device inference.
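The cleanup invariants from step 1 can be checked mechanically. A hedged sketch (`validate_chat` is illustrative; the actual pipeline code is not published in this repo) of the alternation, assistant-ending, and Bowtie Rule checks:

```python
def validate_chat(messages: list[dict]) -> list[str]:
    """Check one `messages` chat against the dataset invariants described above."""
    errors = []
    turns = [m for m in messages if m["role"] in ("user", "assistant")]
    # Turns must strictly alternate user/assistant and end on an assistant reply.
    for prev, cur in zip(turns, turns[1:]):
        if prev["role"] == cur["role"]:
            errors.append("roles do not alternate")
            break
    if turns and turns[-1]["role"] != "assistant":
        errors.append("chat does not end with an assistant turn")
    # Bowtie Rule: only the first assistant message may mention the bowtie,
    # unless a user message brought it up first.
    user_said_bowtie = False
    seen_first_assistant = False
    for m in turns:
        if m["role"] == "user":
            user_said_bowtie = user_said_bowtie or "bowtie" in m["content"].lower()
        else:
            if seen_first_assistant and not user_said_bowtie and "bowtie" in m["content"].lower():
                errors.append("bowtie mention after the opener")
            seen_first_assistant = True
    return errors
```

Chats that return a non-empty error list were either repaired (e.g. bowtie scrubbing) or dropped.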
---
## System prompt (recommended)
Use a short, deploy‑style system prompt so runtime behavior matches training:
```text
You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply of a new conversation.
```
> Long framework prompts are great for training variety, but at runtime prefer this concise version for minimal context cost.
---
## Repository layout (on HF)
```
StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct/
├─ adapter/ # LoRA adapter (r=32, α=16)
├─ merged/ # Full merged Transformers model
└─ gguf/ # GGUF quantizations (Q8_0)
```
---
## Usage
### A) Apply **LoRA adapter** on the base model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
BASE = "meta-llama/Llama-3.2-1B-Instruct"
ADAPTER = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct/adapter"
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER).to(device).eval()
system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role": "system", "content": system},
    {"role": "user", "content": "I procrastinated all day—help."},
]
# return_dict=True gives input_ids + attention_mask, so **enc works with generate().
enc = tok.apply_chat_template(msgs, add_generation_prompt=True,
                              return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True,
                     temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))
```
### B) Use the **merged** model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
MERGED = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct/merged"
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
tok = AutoTokenizer.from_pretrained(MERGED, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MERGED, torch_dtype="auto").to(device).eval()
system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Can you give me a 60-second plan to restart?"},
]
# return_dict=True gives input_ids + attention_mask, so **enc works with generate().
enc = tok.apply_chat_template(msgs, add_generation_prompt=True,
                              return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True,
                     temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))
```
### C) Run **GGUF (Q8_0)** with llama.cpp
1) Build llama.cpp (once). Recent releases build with CMake and ship the CLI as `llama-cli` (older versions built a `main` binary via `make`):
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```
2) Run the quantized model:
```bash
./build/bin/llama-cli -m gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf \
  -p "Hello, Babaru." -n 200 --temp 0.7 --top-p 0.9
```
### Optional: Python with llama‑cpp‑python
```bash
pip install --upgrade llama-cpp-python
```
```python
from llama_cpp import Llama

# Use the chat-completion API so llama.cpp applies the Llama 3 chat template
# stored in the GGUF metadata, instead of hand-writing special tokens.
llm = Llama(model_path="gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf", n_ctx=1024)
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Babaru—warm, concise, playful snark (never mean). Bowtie flourish only on the first reply."},
        {"role": "user", "content": "I doomscrolled and now I feel fried—help."},
    ],
    max_tokens=200, temperature=0.7, top_p=0.9,
)
print(resp["choices"][0]["message"]["content"].strip())
```
---
## System Requirements & Dependencies
### Desktop/Laptop (for Transformers)
- **CPU/GPU:** Any modern CPU; GPU (CUDA/Metal) recommended
- **RAM/VRAM:** 6–8 GB RAM for smooth use; VRAM ≥4 GB helps
- **Python:** 3.10+
- **Packages:**
- `transformers >= 4.46.0`
- `torch >= 2.2` (CUDA or Metal build)
- `peft`, `accelerate`, `huggingface_hub`, `datasets` (for dataset ops)
### On‑device / Smartphones (goal: ≥4 GB RAM)
- Use **GGUF Q8_0** in `llama.cpp` or `llama-cpp-python`.
- **Memory guidance (approx):**
- **Model weights:** ~1–1.2 GB for 1B @ Q8_0
- **Runtime overhead + KV cache:** depends on context length; keep **batch=1**, **ctx ≤512–1024** for 4 GB devices.
- **Android:** build `llama.cpp` for arm64 or use a prebuilt mobile app that supports GGUF; run with low context (e.g., `--ctx-size 512`) and 2–4 threads.
- **iOS:** use Metal builds or apps that bundle `llama.cpp`; prefer low context; background apps may reduce available RAM.
> Tip: if 4 GB devices struggle at Q8_0, consider Q6_K or Q5_K_M for extra headroom at a small quality trade‑off.
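To see why a low context fits the 4 GB budget, the f16 KV-cache cost can be estimated from the architecture. A back-of-the-envelope sketch, assuming Llama 3.2 1B's published config (16 layers, 8 KV heads via grouped-query attention, head dim 64); verify these against the checkpoint's `config.json`:

```python
def kv_cache_bytes(ctx: int, n_layers: int = 16, n_kv_heads: int = 8,
                   head_dim: int = 64, bytes_per_elem: int = 2) -> int:
    """Approximate f16 KV-cache size: K and V tensors per layer per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx

for ctx in (512, 1024, 2048):
    print(f"ctx={ctx}: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
# ctx=512: 16 MiB / ctx=1024: 32 MiB / ctx=2048: 64 MiB
```

At these sizes the KV cache is small next to the ~1–1.2 GB of Q8_0 weights, so the 4 GB budget is dominated by weights plus runtime overhead and whatever the OS keeps resident.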
### llama.cpp toolchain
- `cmake`, `make`, a C/C++ toolchain (Xcode/clang on macOS, MSYS2/MinGW or WSL on Windows, NDK/clang on Android)
---
## Reproduce (training outline)
- Base: `meta-llama/Llama-3.2-1B-Instruct`
- LoRA: r=32, α=16, dropout=0.05; targets: q/k/v/o + gate/up/down
- Optim: cosine schedule, warmup 10%, LR 2e‑4, epochs 4, grad accumulation for small VRAM
- Tokenization: chat template with assistant token masking; labels = assistant tokens only
- Post‑train: merge LoRA → base; convert to GGUF; quantize to **Q8_0**
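The "labels = assistant tokens only" step can be shown with a toy example (token IDs are fabricated; real code derives per-token roles from the chat template's offsets): labels at non-assistant positions are set to `-100`, the index PyTorch's cross-entropy loss ignores.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_labels(token_ids: list[int], roles: list[str]) -> list[int]:
    """Keep labels only where the token belongs to an assistant turn."""
    assert len(token_ids) == len(roles)
    return [tid if role == "assistant" else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]

# Toy example: 3 system/user tokens followed by 2 assistant tokens.
ids = [101, 7, 9, 42, 43]
roles = ["system", "user", "user", "assistant", "assistant"]
print(mask_labels(ids, roles))  # [-100, -100, -100, 42, 43]
```

This way gradient only flows through Babaru's replies, not through the system prompt or the user's words.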
---
## License & Usage Notes
- The base model is under Meta’s Llama 3.x license—accept and comply with its terms.
- This repo provides SFT weights/derivatives; ensure your deployment respects both the base license and any platform/app store constraints.
---
## Roadmap
- Light micro‑pack to increase optional tactics (anchoring/micro‑obs) from ~1% → 3–5% without overuse
- Conversational memory adapter (per‑user preference capture)
- Smaller quant variants (Q6_K/Q5_K_M) for ultra‑low‑RAM phones
---
**Babaru’s promise:** tough love with a wink, micro‑wins over monologues, and zero bowtie spam after the opener.
**Contacts**
- Developer: **Steven Lansangan** (AI Software Engineer)
- Email: **[email protected]**