# Babaru — SFT on Llama‑3.2‑1B‑Instruct

**Repo:** `StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct`

Babaru is an AI Plush Clown™—a velvet‑gloved truth‑teller with a purple bowtie and a PhD in side‑eye. The goal is a lightweight, on‑device‑friendly companion that delivers tough‑love encouragement with playful snark (never mean), practical micro‑coaching, and consistent brand tone.

---

## Who is Babaru?
- **Persona:** Warm, concise, playful snark. Roast the *problem*, not the person. No clichés, no “as an AI”.
- **Signature move (Bowtie Rule):** Only the **first assistant message** of a **new conversation** may include a brief bowtie flourish (e.g., `*adjusts purple bowtie*`). No bowtie mentions later unless the **user** brings it up (a minimal check is sketched after this list).
- **Tactics:**
  - Anchoring: brief callbacks to prior context (0–2 per reply)
  - Emotional resonance: name the feeling → validate → advise
  - Micro‑observations: small but accurate reads; never cruel
  - Corporate satire: skewer pointless busywork (not the person)
  - Light fourth‑wall glances
- **Style & length:** 2–6 sentences per reply; one short paragraph unless the user asks for bullets. Offer 1–3 concrete next steps when coaching. Use stage directions **sparingly**.
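The Bowtie Rule is deterministic, so it can be checked mechanically during dataset cleanup or post-generation filtering. A minimal sketch (the helper below is illustrative, not part of the repo):

```python
import re

BOWTIE_RE = re.compile(r"bow\s*tie", re.IGNORECASE)

def bowtie_allowed(messages: list[dict], reply_index: int) -> bool:
    """May the assistant reply at `reply_index` include a bowtie flourish?

    Allowed only on the first assistant message of a conversation, or once the
    user has mentioned the bowtie earlier in the chat.
    """
    assistant_seen = 0
    for msg in messages[:reply_index]:
        if msg["role"] == "assistant":
            assistant_seen += 1
        elif msg["role"] == "user" and BOWTIE_RE.search(msg["content"]):
            return True  # the user brought it up, so a callback is fine
    return assistant_seen == 0  # opener only
```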

## Why Babaru?
- **Daily stickiness:** keep users coming back with humor + micro‑wins.
- **Positive value:** emotional support that’s actually useful.
- **Edge‑friendly:** run on small devices (target ≥4 GB RAM phones) with quantized weights while preserving tone and rules.

---

## What we built (project summary)
1. **Dataset design & cleanup**
   - Standardized `messages: [{role, content}]` chats.
   - Enforced alternation and assistant‑ending turns.
   - Implemented **Bowtie Rule** rigorously: opener may use bowtie; later turns scrubbed unless user mentions it.
   - Limited action stage directions to **≤1 per reply**; removed clichés/toxicity.
   - Style shaping: ensured assistant replies fall in **2–6 sentences**; raised the call‑to‑action/question rate to roughly 45% of replies (a validation sketch follows this list).
2. **Augmentation**
   - Injected light coaching prompts (micro‑wins, 60‑second plans), optional callbacks, and non‑bowtie flourishes.
3. **SFT training**
   - Base: **`meta-llama/Llama-3.2-1B-Instruct`**
   - LoRA: **r=32, α=16, dropout=0.05**, targets: attention (q/k/v/o) + MLP (gate/up/down)
   - Chat template applied with **assistant‑token masking** (loss computed only on assistant tokens).
4. **Artifacts**
   - **Adapter**: LoRA weights for flexible application.
   - **Merged**: full Transformers model with LoRA baked in.
   - **GGUF (Q8_0)**: llama.cpp‑ready quant for on‑device inference.
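Returning to the dataset cleanup in step 1, here is a rough illustration of the structural checks (field names and thresholds are assumptions about the pipeline, not its actual code):

```python
import re

def sentence_count(text: str) -> int:
    # Crude splitter; good enough for a 2-6 sentence sanity check.
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def validate_chat(messages: list[dict]) -> list[str]:
    """Return rule violations for one `messages: [{role, content}]` chat."""
    problems = []
    roles = [m["role"] for m in messages if m["role"] != "system"]
    if any(a == b for a, b in zip(roles, roles[1:])):
        problems.append("turns do not alternate user/assistant")
    if not roles or roles[-1] != "assistant":
        problems.append("chat does not end with an assistant turn")
    for m in messages:
        if m["role"] != "assistant":
            continue
        if not 2 <= sentence_count(m["content"]) <= 6:
            problems.append("assistant reply outside 2-6 sentences")
        if m["content"].count("*") // 2 > 1:  # *action* stage directions
            problems.append("more than one stage direction in a reply")
    return problems
```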

---

## System prompt (recommended)
Use a short, deployment‑style system prompt so behavior matches training:

```text
You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply of a new conversation.
```

> Long framework prompts are great for training variety, but at runtime prefer this concise version for minimal context cost.

---

## Repository layout (on HF)
```
StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct/
├─ adapter/         # LoRA adapter (r=32, α=16)
├─ merged/          # Full merged Transformers model
└─ gguf/            # GGUF quantizations (Q8_0)
```
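Each folder can be pulled individually; for example, only the GGUF file is needed for an on‑device build. A minimal sketch with `huggingface_hub`, assuming the layout above:

```python
from huggingface_hub import snapshot_download

# Download only the gguf/ folder of the repo.
local_dir = snapshot_download(
    repo_id="StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct",
    allow_patterns=["gguf/*"],
)
print(local_dir)  # contains gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf
```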

---

## Usage

### A) Apply **LoRA adapter** on the base model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "meta-llama/Llama-3.2-1B-Instruct"
ADAPTER_REPO = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct"  # LoRA weights live in the adapter/ subfolder

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base, ADAPTER_REPO, subfolder="adapter").to(device).eval()

system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role":"system","content":system},
    {"role":"user","content":"I procrastinated all day—help."}
]
enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

### B) Use the **merged** model (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MERGED_REPO = "StevenArtificial/Babaru-SFT-Llama-3.2-1B-Instruct"  # merged model lives in the merged/ subfolder

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

tok = AutoTokenizer.from_pretrained(MERGED_REPO, subfolder="merged", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(MERGED_REPO, subfolder="merged", torch_dtype="auto").to(device).eval()

system = "You are Babaru—warm, concise, playful snark (never mean). Roast the problem, not the person. Bowtie flourish only on the first reply."
msgs = [
    {"role":"system","content":system},
    {"role":"user","content":"Can you give me a 60-second plan to restart?"},
]
enc = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(device)
out = model.generate(**enc, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

### C) Run **GGUF (Q8_0)** with llama.cpp
1) Build llama.cpp tools (once):
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```
2) Run the quantized model (the CLI binary is `llama-cli`; older builds named it `main`):
```bash
./build/bin/llama-cli -m gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf \
       -p "Hello, Babaru." -n 200 --temp 0.7 --top-p 0.9
```

### Optional: Python with llama‑cpp‑python
```bash
pip install --upgrade llama-cpp-python
```
```python
from llama_cpp import Llama

# Keep the context small for low-RAM devices; adjust n_ctx as needed.
llm = Llama(model_path="gguf/Babaru-SFT-Llama-3.2-1B-Instruct-Q8_0.gguf", n_ctx=1024)

# create_chat_completion formats the conversation with the model's chat template,
# so the Llama 3 header tokens don't need to be written by hand.
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are Babaru—warm, concise, playful snark (never mean). Bowtie flourish only on the first reply."},
        {"role": "user", "content": "I doomscrolled and now I feel fried—help."},
    ],
    max_tokens=200, temperature=0.7, top_p=0.9,
)
print(resp["choices"][0]["message"]["content"].strip())
```

---

## System Requirements & Dependencies

### Desktop/Laptop (for Transformers)
- **CPU/GPU:** Any modern CPU; GPU (CUDA/Metal) recommended
- **RAM/VRAM:** 6–8 GB RAM for smooth use; VRAM ≥4 GB helps
- **Python:** 3.10+
- **Packages:**
  - `transformers >= 4.46.0`
  - `torch >= 2.2` (CUDA or Metal build)
  - `peft`, `accelerate`, `huggingface_hub`, `datasets` (for dataset ops)

### On‑device / Smartphones (goal: ≥4 GB RAM)
- Use **GGUF Q8_0** in `llama.cpp` or `llama-cpp-python`.
- **Memory guidance (approx):**
  - **Model weights:** ~1–1.2 GB for 1B @ Q8_0
  - **Runtime overhead + KV cache:** depends on context length; keep **batch=1**, **ctx ≤512–1024** for 4 GB devices (see the estimate after this list).
- **Android:** build `llama.cpp` for arm64 or use a prebuilt mobile app that supports GGUF; run with low context (e.g., `--ctx-size 512`) and 2–4 threads.
- **iOS:** use Metal builds or apps that bundle `llama.cpp`; prefer low context; background apps may reduce available RAM.
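
For a feel of where the 4 GB budget goes, the KV cache can be estimated from the model config. The figures below (16 layers, 8 KV heads, head dim 64, fp16 cache) are assumed values for Llama 3.2 1B, used only for illustration:

```python
# Back-of-the-envelope KV-cache estimate (assumed Llama 3.2 1B config values).
layers, kv_heads, head_dim, bytes_per_val = 16, 8, 64, 2  # fp16 K/V entries
ctx = 1024  # context length in tokens

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
kv_cache_mb = kv_bytes_per_token * ctx / 1e6
print(f"~{kv_bytes_per_token // 1024} KiB per token, ~{kv_cache_mb:.0f} MB at ctx={ctx}")
# Weights (~1-1.2 GB at Q8_0) + KV cache + runtime overhead stay well under 4 GB.
```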

> Tip: if 4 GB devices struggle at Q8_0, consider Q6_K or Q5_K_M for extra headroom at a small quality trade‑off.

### llama.cpp toolchain
- `cmake`, `make`, a C/C++ toolchain (Xcode/clang on macOS, MSYS2/MinGW or WSL on Windows, NDK/clang on Android)

---

## Reproduce (training outline)
- Base: `meta-llama/Llama-3.2-1B-Instruct`
- LoRA: r=32, α=16, dropout=0.05; targets: q/k/v/o + gate/up/down
- Optim: cosine schedule, warmup 10%, LR 2e‑4, epochs 4, grad accumulation for small VRAM
- Tokenization: chat template with assistant token masking; labels = assistant tokens only
- Post‑train: merge LoRA → base; convert to GGUF; quantize to **Q8_0**
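
The outline above maps onto standard `peft`/`transformers` settings. A minimal configuration sketch (dataset loading, label masking, and the trainer loop are omitted; batch size and accumulation values are placeholders):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="babaru-sft",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    num_train_epochs=4,
    per_device_train_batch_size=1,   # placeholder; pair with accumulation
    gradient_accumulation_steps=16,  # placeholder for small-VRAM setups
)
# For the labels, copy input_ids and set every non-assistant token to -100 so
# the loss covers assistant tokens only, as described above.
```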

---

## License & Usage Notes
- The base model is under Meta’s Llama 3.x license—accept and comply with its terms.
- This repo provides SFT weights/derivatives; ensure your deployment respects both the base license and any platform/app store constraints.

---

## Roadmap
- Light micro‑pack to increase optional tactics (anchoring, micro‑observations) from roughly 1% of replies to 3–5%, without overuse
- Conversational memory adapter (per‑user preference capture)
- Smaller quant variants (Q6_K/Q5_K_M) for ultra‑low‑RAM phones

---

**Babaru’s promise:** tough love with a wink, micro‑wins over monologues, and zero bowtie spam after the opener.

**Contacts**

- Developer: **Steven Lansangan** (AI Software Engineer)
- Email: **[email protected]**