George-API committed on
Commit a57357b · verified · 1 Parent(s): 44dd860

Upload folder using huggingface_hub

.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ .env
2
+ *.pyc
3
+ __pycache__
README.md CHANGED
@@ -1,12 +1,192 @@
1
- ---
2
- title: Phi4training
3
- emoji: 🌖
4
- colorFrom: blue
5
- colorTo: yellow
6
- sdk: gradio
7
- sdk_version: 5.20.1
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Phase 1: Domain Adaptation (Unsupervised)
2
+
3
+ This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).
4
+
5
+ ## Overview
6
+
7
+ Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
8
+
9
+ ## Files
10
+
11
+ ### Core Training Files
12
+ - `run_transformers_training.py`: Main script for domain adaptation
13
+ - `transformers_config.json`: Model and training parameters
14
+ - `hardware_config.json`: Hardware-specific optimizations
15
+ - `dataset_config.json`: Dataset loading and processing settings
16
+ - `requirements.txt`: Required Python packages
17
+
18
+ ### Analysis & Utilities
19
+ - `check_tokenization.py`: Script to analyze token distributions
20
+ - `update_space.py`: Hugging Face Space update utility
21
+ - `.env`: Environment variables (API tokens, etc.)
22
+
23
+ ## Setup
24
+
25
+ 1. **Environment Setup**:
26
+ ```bash
27
+ python -m venv venv
28
+ source venv/bin/activate # or `venv\Scripts\activate` on Windows
29
+ pip install -r requirements.txt
30
+ ```
31
+
32
+ 2. **Environment Variables**:
33
+ Create `.env` file with:
34
+ ```
35
+ HF_TOKEN=your_token_here
36
+ ```
37
+
38
+ 3. **Verify Setup**:
39
+ ```bash
40
+ python check_tokenization.py  # Analyzes token distributions in the training data
41
+ ```
42
+
43
+ ## How It Works
44
+
45
+ 1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
46
+ 2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
47
+ 3. **Efficient Training**: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
48
+ 4. **Checkpointing**: Saves regular checkpoints and pushes to Hub
49
+ 5. **Monitoring**: Logs detailed metrics and statistics during training
50
+ 6. **Model Publishing**: Pushes the trained model to Hugging Face Hub
51
+
52
+ ## Key Features
53
+
54
+ ### Memory-Efficient Training
55
+
56
+ The training setup is optimized for A10G GPUs:
57
+ - Uses pre-quantized 4-bit model (no additional quantization needed)
58
+ - Gradient checkpointing for memory efficiency
59
+ - Flash attention for faster training
60
+ - bfloat16 mixed precision training
61
+ - Optimized batch sizes for maximum throughput
62
+
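+ Below is a minimal sketch of how these optimizations are typically enabled at load time. The values mirror `transformers_config.json`; the full logic lives in `run_transformers_training.py`.
+
+ ```python
+ # Sketch: load the pre-quantized 4-bit model with the memory optimizations above.
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/phi-4-unsloth-bnb-4bit",
+     max_seq_length=2048,
+     load_in_4bit=True,   # model ships pre-quantized; no extra quantization step
+     dtype=None,          # let Unsloth pick bfloat16 on A10G
+ )
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=32,
+     lora_alpha=16,
+     lora_dropout=0.05,
+     use_gradient_checkpointing=True,  # trades compute for memory
+ )
+ ```
+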
63
+ ### Sequential Processing
64
+
65
+ The training script ensures that chunks from the same research paper are processed together by:
66
+ - Sorting the dataset by ID
67
+ - Using a SequentialSampler to maintain order
68
+ - Processing chunks sequentially (average 1,673 tokens per chunk)
69
+
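+ A minimal sketch of the ordering logic (the `collate_fn` here is a placeholder; the real script plugs in `SimpleDataCollator`):
+
+ ```python
+ from datasets import load_dataset
+ from torch.utils.data import DataLoader, SequentialSampler
+
+ # Sort so chunks from the same paper stay adjacent, then iterate in that order.
+ dataset = load_dataset("George-API/cognitive-data", split="train").sort("id")
+ loader = DataLoader(
+     dataset,
+     batch_size=16,
+     sampler=SequentialSampler(dataset),  # sequential, no shuffling
+     collate_fn=lambda batch: batch,      # placeholder for SimpleDataCollator
+ )
+ ```
+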
70
+ ### Data Collator
71
+
72
+ The `SimpleDataCollator` class:
73
+ - Preserves pre-tokenized data format
74
+ - Processes each entry independently
75
+ - Provides detailed logging of processing statistics
76
+ - Handles errors gracefully
77
+
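+ The labelling and padding convention it follows can be sketched with a small helper (illustrative only; the full collator also applies the phi chat format and logs statistics):
+
+ ```python
+ def pad_example(input_ids, attention_mask, max_length, pad_token_id):
+     """Pad one pre-tokenized example the way SimpleDataCollator does."""
+     labels = input_ids.copy()                  # causal LM: labels mirror the inputs
+     pad_len = max_length - len(input_ids)
+     return {
+         "input_ids": input_ids + [pad_token_id] * pad_len,
+         "attention_mask": attention_mask + [0] * pad_len,
+         "labels": labels + [-100] * pad_len,   # -100 positions are ignored by the loss
+     }
+ ```
+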
78
+ ### Checkpointing
79
+
80
+ The training process saves checkpoints:
81
+ - Every 200 steps
82
+ - Pushes to Hub on every save
83
+ - Maintains up to 5 recent checkpoints
84
+ - Automatically resumes from the latest checkpoint if interrupted
85
+
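+ In `TrainingArguments` terms this corresponds roughly to the following sketch (values taken from `transformers_config.json`):
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./results",
+     save_strategy="steps",
+     save_steps=200,             # checkpoint every 200 steps
+     save_total_limit=5,         # keep only the 5 most recent checkpoints
+     push_to_hub=True,
+     hub_strategy="every_save",  # push each saved checkpoint to the Hub
+ )
+ # trainer.train(resume_from_checkpoint=True) resumes from the latest checkpoint.
+ ```
+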
86
+ ## Hardware Requirements
87
+
88
+ This training setup is optimized for:
89
+ - 2x NVIDIA A10G GPUs (24GB VRAM each)
90
+ - 92GB System RAM
91
+ - CUDA 11.8 or higher
92
+
93
+ Memory breakdown per GPU:
94
+ - Model (4-bit): ~3.5GB
95
+ - Optimizer states: ~1GB
96
+ - Batch memory: ~2GB
97
+ - Peak usage: 18-20GB
98
+ - Safe headroom: 4-6GB
99
+
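+ A quick way to confirm that the GPUs visible to PyTorch match these requirements (the same check `app.py` performs):
+
+ ```python
+ import torch
+
+ for i in range(torch.cuda.device_count()):
+     props = torch.cuda.get_device_properties(i)
+     print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
+ ```
+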
100
+ ## Configuration
101
+
102
+ Key parameters in `transformers_config.json`:
103
+
104
+ - `model_name`: unsloth/phi-4-unsloth-bnb-4bit
105
+ - `learning_rate`: 2e-5
106
+ - `num_train_epochs`: 3
107
+ - `per_device_train_batch_size`: 16
108
+ - `gradient_accumulation_steps`: 4
109
+ - `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
110
+ - `max_seq_length`: 2048
111
+ - `lr_scheduler_type`: "cosine"
112
+ - `warmup_ratio`: 0.03
113
+ - `neftune_noise_alpha`: 5
114
+
115
+ The configuration is optimized for:
116
+ - Maximum memory efficiency with pre-quantized model
117
+ - Stable training with cosine learning rate schedule
118
+ - Effective gradient updates with accumulation
119
+ - Regular checkpointing and Hub updates
120
+
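+ As a sanity check, the effective batch size can be recomputed straight from the config file (sketch; the training script merges these values with `hardware_config.json` at startup):
+
+ ```python
+ import json
+
+ with open("transformers_config.json") as f:
+     training = json.load(f)["training"]
+
+ effective = (training["per_device_train_batch_size"]
+              * training["gradient_accumulation_steps"]
+              * 2)  # 2x A10G GPUs
+ print(effective)  # 128
+ ```
+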
121
+ ## Running Domain Adaptation
122
+
123
+ To start domain adaptation:
124
+
125
+ ```bash
126
+ python run_transformers_training.py
127
+ ```
128
+
129
+ The script will:
130
+ 1. Load the pre-quantized model and dataset
131
+ 2. Apply optimized training parameters
132
+ 3. Process the data sequentially
133
+ 4. Train the model for 3 epochs
134
+ 5. Save and push checkpoints to Hub regularly
135
+
136
+ ## Using the Model
137
+
138
+ After training, you can use the domain-adapted model:
139
+
140
+ ```python
141
+ from transformers import AutoModelForCausalLM, AutoTokenizer
142
+
143
+ # Load the domain-adapted model
144
+ model_name = "George-API/phi-4-research-assistant"
145
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
146
+ model = AutoModelForCausalLM.from_pretrained(model_name,
147
+ device_map="auto",
148
+ torch_dtype="bfloat16")
149
+
150
+ # Generate text
151
+ input_text = "The hippocampus is involved in"
152
+ inputs = tokenizer(input_text, return_tensors="pt")
153
+ outputs = model.generate(**inputs, max_length=100)
154
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
155
+ ```
156
+
157
+ ## Chat Format Example
158
+
159
+ Phi-4 works best with its native chat template:
160
+
161
+ ```python
162
+ from transformers import pipeline
163
+
164
+ generator = pipeline(
165
+ "text-generation",
166
+ model="George-API/phi-4-research-assistant",
167
+ model_kwargs={"torch_dtype": "bfloat16"},
168
+ device_map="auto",
169
+ )
170
+
171
+ messages = [
172
+ {"role": "system", "content": "You are an expert in cognitive science."},
173
+ {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
174
+ ]
175
+
176
+ outputs = generator(messages, max_new_tokens=256)
177
+ print(outputs[0]["generated_text"])
178
+ ```
179
+
180
+ ## Expected Outcomes
181
+
182
+ After domain adaptation, the model should:
183
+ - Have a better understanding of cognitive science terminology
184
+ - Show improved performance on domain-specific tasks
185
+ - Be ready for supervised fine-tuning in Phase 2
186
+
187
+ ## Next Steps
188
+
189
+ After completing domain adaptation:
190
+ 1. Evaluate the model's performance on cognitive science texts
191
+ 2. Proceed to Phase 2 (Supervised Fine-Tuning)
192
+ 3. Use TensorBoard to analyze training metrics
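+
+ For step 3, TensorBoard can be pointed at the training output directory (assuming the default `./results` from the config):
+
+ ```bash
+ tensorboard --logdir ./results
+ ```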
app.py ADDED
@@ -0,0 +1,162 @@
1
+ import gradio as gr
2
+ import os
3
+ import subprocess
4
+ import sys
5
+ import json
6
+ import re
7
+ from threading import Thread
8
+ import datetime
9
+ import torch
10
+ import threading
11
+
12
+ def load_env_variables():
13
+ """Load environment variables from system or .env file."""
14
+ if os.environ.get("SPACE_ID"):
15
+ print("Running in Hugging Face Space")
16
+ if "/" in os.environ.get("SPACE_ID", ""):
17
+ username = os.environ.get("SPACE_ID").split("/")[0]
18
+ os.environ["HF_USERNAME"] = username
19
+ print(f"Set HF_USERNAME from SPACE_ID: {username}")
20
+ else:
21
+ try:
22
+ from dotenv import load_dotenv
23
+ env_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".env")
24
+ if os.path.exists(env_path):
25
+ load_dotenv(env_path)
26
+ print(f"Loaded environment variables from {env_path}")
27
+ except ImportError:
28
+ print("python-dotenv not installed, skipping .env loading")
29
+
30
+ def check_environment():
31
+ """Check the environment for GPU availability and other requirements."""
32
+ env_info = {
33
+ "System": {
34
+ "Platform": sys.platform,
35
+ "Python Version": sys.version.split()[0]
36
+ },
37
+ "GPU": {
38
+ "CUDA Available": torch.cuda.is_available(),
39
+ "Device Count": torch.cuda.device_count() if torch.cuda.is_available() else 0
40
+ },
41
+ "Environment Variables": {
42
+ "HF_TOKEN": bool(os.environ.get("HF_TOKEN")),
43
+ "HF_USERNAME": bool(os.environ.get("HF_USERNAME")),
44
+ "HF_SPACE_NAME": bool(os.environ.get("HF_SPACE_NAME"))
45
+ }
46
+ }
47
+
48
+ if torch.cuda.is_available():
49
+ env_info["GPU"]["Device Name"] = torch.cuda.get_device_name(0)
50
+ env_info["GPU"]["Memory (GB)"] = round(torch.cuda.get_device_properties(0).total_memory / (1024**3), 2)
51
+
52
+ return env_info
53
+
54
+ def run_training_process():
55
+ """Run the training process using the configuration files."""
56
+ try:
57
+ current_dir = os.path.dirname(os.path.abspath(__file__))
58
+ training_script = os.path.join(current_dir, "run_transformers_training.py")
59
+
60
+ # Start the training process
61
+ process = subprocess.Popen(
62
+ [sys.executable, training_script],
63
+ stdout=subprocess.PIPE,
64
+ stderr=subprocess.STDOUT,
65
+ text=True,
66
+ bufsize=1
67
+ )
68
+
69
+ # Process the output line by line
70
+ for line in process.stdout:
71
+ print(line.strip())
72
+
73
+ process.wait()
74
+ return process.returncode
75
+ except Exception as e:
76
+ print(f"Error in training process: {e}")
77
+ return 1
78
+
79
+ def start_training(learning_rate, num_train_epochs, per_device_train_batch_size,
80
+ gradient_accumulation_steps):
81
+ """Start the training process with the specified parameters."""
82
+ try:
83
+ load_env_variables()
84
+ current_dir = os.path.dirname(os.path.abspath(__file__))
85
+
86
+ # Load and update transformers config
87
+ with open(os.path.join(current_dir, "transformers_config.json"), "r") as f:
88
+ config = json.load(f)
89
+
90
+ # Update training parameters
91
+ config["training"].update({
92
+ "num_train_epochs": num_train_epochs,
93
+ "learning_rate": learning_rate,
94
+ "per_device_train_batch_size": per_device_train_batch_size,
95
+ "gradient_accumulation_steps": gradient_accumulation_steps
96
+ })
97
+
98
+ # Update hub settings if username is available
99
+ if os.environ.get("HF_USERNAME"):
100
+ config["huggingface_hub"].update({
101
+ "hub_model_id": f"{os.environ['HF_USERNAME']}/Phi4-Cognitive-Science"
102
+ })
103
+
104
+ # Save updated config
105
+ with open(os.path.join(current_dir, "transformers_config.json"), "w") as f:
106
+ json.dump(config, f, indent=4)
107
+
108
+ # Start training in a separate thread
109
+ thread = threading.Thread(target=run_training_process)
110
+ thread.daemon = True
111
+ thread.start()
112
+
113
+ return "Training started! Check the Hugging Face Space logs for progress."
114
+ except Exception as e:
115
+ return f"Error starting training: {str(e)}"
116
+
117
+ with gr.Blocks(title="Phi-4 Training Interface") as demo:
118
+ gr.Markdown("# Phi-4 Unsupervised Training for Cognitive Science")
119
+
120
+ with gr.Tab("Training"):
121
+ with gr.Row():
122
+ with gr.Column():
123
+ gr.Markdown("## Model Configuration")
124
+ gr.Markdown("**Model**: unsloth/phi-4-unsloth-bnb-4bit")
125
+ gr.Markdown("**Dataset**: George-API/cognitive-data")
126
+
127
+ gr.Markdown("## Training Parameters")
128
+ learning_rate = gr.Slider(minimum=1e-6, maximum=1e-4, value=2e-5, step=1e-6,
129
+ label="Learning Rate")
130
+ num_train_epochs = gr.Slider(minimum=1, maximum=5, value=3, step=1,
131
+ label="Number of Epochs")
132
+ per_device_train_batch_size = gr.Slider(minimum=4, maximum=24, value=12, step=4,
133
+ label="Per Device Train Batch Size (Unsloth Optimized)")
134
+ gradient_accumulation_steps = gr.Slider(minimum=1, maximum=8, value=4, step=1,
135
+ label="Gradient Accumulation Steps")
136
+
137
+ start_btn = gr.Button("Start Training", variant="primary")
138
+ training_output = gr.Textbox(label="Training Output", interactive=False)
139
+
140
+ with gr.Tab("Environment"):
141
+ with gr.Row():
142
+ with gr.Column():
143
+ gr.Markdown("## Environment Information")
144
+ env_info = gr.JSON(label="Environment Info")
145
+ check_env_btn = gr.Button("Check Environment")
146
+
147
+ # Set up event handlers
148
+ start_btn.click(
149
+ fn=start_training,
150
+ inputs=[learning_rate, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps],
151
+ outputs=training_output
152
+ )
153
+
154
+ check_env_btn.click(
155
+ fn=check_environment,
156
+ inputs=[],
157
+ outputs=env_info
158
+ )
159
+
160
+ if __name__ == "__main__":
161
+ load_env_variables()
162
+ demo.launch()
check_tokenization.py ADDED
@@ -0,0 +1,103 @@
1
+ #!/usr/bin/env python
2
+
3
+ import json
4
+ from transformers import AutoTokenizer
5
+ import numpy as np
6
+ from tqdm import tqdm
7
+ import matplotlib.pyplot as plt
8
+
9
+ def load_tokenizers():
10
+ """Load both tokenizers."""
11
+ print("Loading tokenizers...")
12
+ phi_tokenizer = AutoTokenizer.from_pretrained(
13
+ "unsloth/phi-4-unsloth-bnb-4bit",
14
+ trust_remote_code=True
15
+ )
16
+ deepseek_tokenizer = AutoTokenizer.from_pretrained(
17
+ "deepseek-ai/deepseek-llama-7b-base",
18
+ trust_remote_code=True
19
+ )
20
+ return phi_tokenizer, deepseek_tokenizer
21
+
22
+ def analyze_token_counts(jsonl_path, phi_tokenizer, deepseek_tokenizer, sample_size=100):
23
+ """Analyze token count differences between tokenizers."""
24
+ token_counts = {
25
+ 'phi': [],
26
+ 'deepseek': [],
27
+ 'differences': []
28
+ }
29
+
30
+ print(f"Analyzing token counts from {jsonl_path}")
31
+ with open(jsonl_path, 'r', encoding='utf-8') as f:
32
+ data = [json.loads(line) for line in f]
33
+
34
+ # Take a random sample if sample_size specified
35
+ if sample_size and sample_size < len(data):
36
+ data = np.random.choice(data, sample_size, replace=False)
37
+
38
+ for item in tqdm(data, desc="Processing entries"):
39
+ text = item.get('text', '') or item.get('content', '')
40
+
41
+ # Get token counts
42
+ phi_tokens = len(phi_tokenizer.encode(text))
43
+ deepseek_tokens = len(deepseek_tokenizer.encode(text))
44
+
45
+ token_counts['phi'].append(phi_tokens)
46
+ token_counts['deepseek'].append(deepseek_tokens)
47
+ token_counts['differences'].append(phi_tokens - deepseek_tokens)
48
+
49
+ return token_counts
50
+
51
+ def plot_comparison(token_counts):
52
+ """Create visualization of token count differences."""
53
+ plt.figure(figsize=(12, 6))
54
+
55
+ # Plot token count distributions
56
+ plt.subplot(1, 2, 1)
57
+ plt.hist([token_counts['phi'], token_counts['deepseek']],
58
+ label=['Phi-4', 'DeepSeek'], alpha=0.6)
59
+ plt.title('Token Count Distribution')
60
+ plt.xlabel('Number of Tokens')
61
+ plt.ylabel('Frequency')
62
+ plt.legend()
63
+
64
+ # Plot differences
65
+ plt.subplot(1, 2, 2)
66
+ plt.hist(token_counts['differences'], bins=30)
67
+ plt.title('Token Count Differences\n(Phi-4 minus DeepSeek)')
68
+ plt.xlabel('Difference in Tokens')
69
+ plt.ylabel('Frequency')
70
+
71
+ plt.tight_layout()
72
+ plt.savefig('tokenization_analysis.png')
73
+ print("Saved visualization to tokenization_analysis.png")
74
+
75
+ def main():
76
+ # Load tokenizers
77
+ phi_tokenizer, deepseek_tokenizer = load_tokenizers()
78
+
79
+ # Analyze token counts
80
+ token_counts = analyze_token_counts(
81
+ "../../../../data_processing/data/training_data.jsonl",
82
+ phi_tokenizer,
83
+ deepseek_tokenizer
84
+ )
85
+
86
+ # Calculate statistics
87
+ phi_mean = np.mean(token_counts['phi'])
88
+ deepseek_mean = np.mean(token_counts['deepseek'])
89
+ diff_mean = np.mean(token_counts['differences'])
90
+ diff_std = np.std(token_counts['differences'])
91
+
92
+ print("\nAnalysis Results:")
93
+ print(f"Phi-4 average tokens: {phi_mean:.1f}")
94
+ print(f"DeepSeek average tokens: {deepseek_mean:.1f}")
95
+ print(f"Average difference: {diff_mean:.1f} ± {diff_std:.1f}")
96
+ print(f"Max Phi-4 tokens: {max(token_counts['phi'])}")
97
+ print(f"Max DeepSeek tokens: {max(token_counts['deepseek'])}")
98
+
99
+ # Create visualization
100
+ plot_comparison(token_counts)
101
+
102
+ if __name__ == "__main__":
103
+ main()
dataset_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "dataset": {
3
+ "name": "George-API/cognitive-data",
4
+ "split": "train",
5
+ "column_mapping": {
6
+ "text": "conversations",
7
+ "id": "id"
8
+ },
9
+ "processing": {
10
+ "sort_by_id": true,
11
+ "maintain_paper_order": true,
12
+ "max_seq_length": 2048
13
+ }
14
+ },
15
+ "data_formatting": {
16
+ "chat_template": "phi",
17
+ "roles": {
18
+ "system": "System: {content}\n\n",
19
+ "human": "Human: {content}\n\n",
20
+ "assistant": "Assistant: {content}\n\n"
21
+ },
22
+ "metadata_handling": {
23
+ "include_paper_id": true,
24
+ "include_chunk_number": true,
25
+ "metadata_format": "Paper ID: {paper_id} | Chunk: {chunk_number}"
26
+ }
27
+ },
28
+ "data_loading": {
29
+ "batch_size": 16,
30
+ "shuffle": false,
31
+ "drop_last": false,
32
+ "num_workers": 2,
33
+ "pin_memory": false
34
+ },
35
+ "validation": {
36
+ "log_samples": 3,
37
+ "log_interval": 50,
38
+ "metrics": ["processed", "skipped", "avg_tokens", "unique_papers"]
39
+ }
40
+ }
hardware_config.json ADDED
@@ -0,0 +1,49 @@
1
+ {
2
+ "hardware_name": "2xA10G",
3
+ "specs": {
4
+ "gpu_count": 2,
5
+ "gpu_type": "A10G",
6
+ "vram_per_gpu": 24,
7
+ "total_vram": 48,
8
+ "vcpu_count": 24,
9
+ "ram": 92
10
+ },
11
+ "training_optimizations": {
12
+ "per_device_batch_size": 16,
13
+ "gradient_accumulation_steps": 4,
14
+ "effective_batch_size": 128,
15
+ "memory_optimizations": {
16
+ "use_gradient_checkpointing": true,
17
+ "pin_memory": true,
18
+ "num_workers": 2
19
+ },
20
+ "distributed_settings": {
21
+ "device_map": "auto",
22
+ "ddp_find_unused_parameters": false
23
+ }
24
+ },
25
+ "memory_breakdown": {
26
+ "model_size": "~3.5GB (pre-quantized 4-bit)",
27
+ "optimizer_states": "~1GB",
28
+ "batch_memory_per_gpu": "~2GB",
29
+ "peak_memory_estimate": "18-20GB",
30
+ "safe_headroom": "4-6GB"
31
+ },
32
+ "compute_environment": "A10G_CLOUD",
33
+ "distributed_type": "DATA_PARALLEL",
34
+ "mixed_precision": "bf16",
35
+ "num_gpus": 2,
36
+ "training_parameters": {
37
+ "per_device_train_batch_size": 16,
38
+ "gradient_accumulation_steps": 4,
39
+ "dataloader_num_workers": 2,
40
+ "dataloader_pin_memory": true,
41
+ "gradient_checkpointing": true,
42
+ "max_grad_norm": 1.0
43
+ },
44
+ "memory_optimization": {
45
+ "offload_to_cpu": false,
46
+ "use_flash_attention": true,
47
+ "use_gradient_checkpointing": true
48
+ }
49
+ }
requirements.txt ADDED
@@ -0,0 +1,20 @@
1
+ accelerate>=0.27.0
2
+ bitsandbytes>=0.41.0
3
+ datasets>=2.15.0
4
+ filelock>=3.13.1
5
+ gradio>=5.17.0
6
+ huggingface-hub>=0.19.0
7
+ matplotlib>=3.7.0
8
+ numpy>=1.24.0
9
+ packaging>=23.0
10
+ psutil>=5.9.0
11
+ python-dotenv>=1.0.0
12
+ pyyaml>=6.0.1
13
+ regex>=2023.0.0
14
+ requests>=2.31.0
15
+ safetensors>=0.4.1
16
+ tensorboard>=2.15.0
17
+ torch>=2.0.0
18
+ tqdm>=4.65.0
19
+ transformers>=4.36.0
20
+ typing-extensions>=4.8.0
run_transformers_training.py ADDED
@@ -0,0 +1,615 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+
4
+ import os
5
+ import sys
6
+ import json
7
+ import argparse
8
+ import logging
9
+ from datetime import datetime
10
+
11
+ import torch
12
+ from datasets import load_dataset
13
+ from transformers import (
14
+ AutoModelForCausalLM,
15
+ AutoTokenizer,
16
+ TrainingArguments,
17
+ Trainer,
18
+ TrainerCallback,
19
+ set_seed,
20
+ BitsAndBytesConfig
21
+ )
22
+
23
+ # Configure logging
24
+ logging.basicConfig(
25
+ level=logging.INFO,
26
+ format="%(asctime)s - %(levelname)s - %(message)s",
27
+ handlers=[logging.StreamHandler(sys.stdout)]
28
+ )
29
+ logger = logging.getLogger(__name__)
30
+
31
+ # Check for BitsAndBytes
32
+ try:
33
+ from transformers import BitsAndBytesConfig
34
+ bitsandbytes_available = True
35
+ except ImportError:
36
+ bitsandbytes_available = False
37
+ logger.warning("BitsAndBytes not available. 4-bit quantization will not be used.")
38
+
39
+ # Check for PEFT
40
+ try:
41
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
42
+ peft_available = True
43
+ except ImportError:
44
+ peft_available = False
45
+ logger.warning("PEFT not available. Parameter-efficient fine-tuning will not be used.")
46
+
47
+ # Import Unsloth
48
+ try:
49
+ from unsloth import FastLanguageModel
50
+ from unsloth.chat_templates import get_chat_template
51
+ unsloth_available = True
52
+ except ImportError:
53
+ unsloth_available = False
54
+ logger.warning("Unsloth not available. Please install with: pip install unsloth")
55
+
56
+ def load_env_variables():
57
+ """Load environment variables from system, .env file, or Hugging Face Space variables."""
58
+ # Check if we're running in a Hugging Face Space
59
+ if os.environ.get("SPACE_ID"):
60
+ logging.info("Running in Hugging Face Space")
61
+
62
+ # Log the presence of variables (without revealing values)
63
+ logging.info(f"HF_TOKEN available: {bool(os.environ.get('HF_TOKEN'))}")
64
+ logging.info(f"HF_USERNAME available: {bool(os.environ.get('HF_USERNAME'))}")
65
+
66
+ # If username is not set, try to extract from SPACE_ID
67
+ if not os.environ.get("HF_USERNAME") and "/" in os.environ.get("SPACE_ID", ""):
68
+ username = os.environ.get("SPACE_ID").split("/")[0]
69
+ os.environ["HF_USERNAME"] = username
70
+ logging.info(f"Set HF_USERNAME from SPACE_ID: {username}")
71
+ else:
72
+ # Try to load from .env file if not in a Space
73
+ try:
74
+ from dotenv import load_dotenv
75
+ # Updated path to .env file in the new directory structure
76
+ env_path = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), "shared", ".env")
77
+ if os.path.exists(env_path):
78
+ load_dotenv(env_path)
79
+ logging.info(f"Loaded environment variables from {env_path}")
80
+ logging.info(f"HF_TOKEN loaded from .env file: {bool(os.environ.get('HF_TOKEN'))}")
81
+ logging.info(f"HF_USERNAME loaded from .env file: {bool(os.environ.get('HF_USERNAME'))}")
82
+ logging.info(f"HF_SPACE_NAME loaded from .env file: {bool(os.environ.get('HF_SPACE_NAME'))}")
83
+ else:
84
+ logging.warning(f"No .env file found at {env_path}")
85
+ except ImportError:
86
+ logging.warning("python-dotenv not installed, not loading from .env file")
87
+
88
+ if not os.environ.get("HF_USERNAME"):
89
+ logger.warning("HF_USERNAME is not set. Using default username.")
90
+
91
+ if not os.environ.get("HF_SPACE_NAME"):
92
+ logger.warning("HF_SPACE_NAME is not set. Using default space name.")
93
+
94
+ # Set HF_TOKEN for huggingface_hub
95
+ if os.environ.get("HF_TOKEN"):
96
+ os.environ["HUGGING_FACE_HUB_TOKEN"] = os.environ.get("HF_TOKEN")
97
+
98
+ def load_configs(base_path):
99
+ """Load all configuration files."""
100
+ configs = {}
101
+
102
+ # List of config files to load
103
+ config_files = [
104
+ "transformers_config.json",
105
+ "hardware_config.json",
106
+ "dataset_config.json"
107
+ ]
108
+
109
+ for config_file in config_files:
110
+ file_path = os.path.join(base_path, config_file)
111
+ try:
112
+ with open(file_path, "r") as f:
113
+ config_name = config_file.replace("_config.json", "")
114
+ configs[config_name] = json.load(f)
115
+ logger.info(f"Loaded {config_name} configuration from {file_path}")
116
+ except Exception as e:
117
+ logger.error(f"Error loading {config_file}: {e}")
118
+ raise
119
+
120
+ return configs
121
+
122
+ def parse_args():
123
+ parser = argparse.ArgumentParser(description="Fine-tune a language model on a text dataset")
124
+ parser.add_argument("--config_dir", type=str, default=".", help="Directory containing configuration files")
125
+ return parser.parse_args()
126
+
127
+ def load_model_and_tokenizer(config):
128
+ """Load model and tokenizer with proper error handling and optimizations."""
129
+ try:
130
+ if config.get("use_unsloth", False) and unsloth_available:
131
+ logger.info("Using Unsloth optimizations")
132
+ model, tokenizer = FastLanguageModel.from_pretrained(
133
+ model_name=config.get("model_name"),
134
+ max_seq_length=config.get("max_seq_length", 2048),
135
+ dtype=None, # Let Unsloth choose optimal dtype
136
+ load_in_4bit=config.get("load_in_4bit", True),
137
+ device_map="auto",
138
+ )
139
+
140
+ # Apply Unsloth's training optimizations with config parameters
141
+ model = FastLanguageModel.get_peft_model(
142
+ model,
143
+ r=config.get("unsloth_r", 32),
144
+ target_modules=config.get("unsloth_target_modules",
145
+ ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]),
146
+ lora_alpha=config.get("unsloth_alpha", 16),
147
+ lora_dropout=config.get("unsloth_dropout", 0.05),
148
+ bias="none",
149
+ use_gradient_checkpointing=config.get("gradient_checkpointing", True),
150
+ random_state=config.get("seed", 42),
151
+ )
152
+ logger.info("Unsloth optimizations applied successfully")
153
+ else:
154
+ if config.get("use_unsloth", False):
155
+ logger.warning("Unsloth requested but not available. Falling back to standard training.")
156
+
157
+ # Standard quantization setup
158
+ quantization_config = None
159
+ if config.get("load_in_4bit", False) and bitsandbytes_available:
160
+ logger.info("Using 4-bit quantization")
161
+ quantization_config = BitsAndBytesConfig(
162
+ load_in_4bit=True,
163
+ bnb_4bit_quant_type="nf4",
164
+ bnb_4bit_compute_dtype=torch.float16,
165
+ bnb_4bit_use_double_quant=True
166
+ )
167
+
168
+ # Load model with standard settings
169
+ model = AutoModelForCausalLM.from_pretrained(
170
+ config.get("model_name"),
171
+ quantization_config=quantization_config,
172
+ device_map="auto",
173
+ trust_remote_code=config.get("trust_remote_code", True),
174
+ use_cache=not config.get("gradient_checkpointing", True)
175
+ )
176
+
177
+ # Load tokenizer
178
+ tokenizer = AutoTokenizer.from_pretrained(
179
+ config.get("model_name"),
180
+ use_fast=config.get("use_fast_tokenizer", True),
181
+ trust_remote_code=config.get("trust_remote_code", True)
182
+ )
183
+
184
+ # Enable gradient checkpointing if requested
185
+ if config.get("gradient_checkpointing", True) and hasattr(model, "gradient_checkpointing_enable"):
186
+ model.gradient_checkpointing_enable(use_reentrant=False)
187
+ logger.info("Gradient checkpointing enabled")
188
+
189
+ # Set up tokenizer settings
190
+ if config.get("chat_template"):
191
+ if unsloth_available and config.get("use_unsloth", False):
192
+ chat_template = get_chat_template("phi")
193
+ tokenizer.chat_template = chat_template
194
+ else:
195
+ tokenizer.chat_template = config.get("chat_template")
196
+ logger.info(f"Set chat template to {config.get('chat_template')}")
197
+
198
+ # Ensure proper token settings
199
+ if tokenizer.pad_token_id is None:
200
+ tokenizer.pad_token_id = tokenizer.eos_token_id
201
+ logger.info(f"Set pad_token_id to eos_token_id: {tokenizer.pad_token_id}")
202
+
203
+ return model, tokenizer
204
+
205
+ except Exception as e:
206
+ logger.error(f"Error in model/tokenizer loading: {str(e)}")
207
+ raise
208
+
209
+ def load_dataset_with_mapping(dataset_config):
210
+ """Load and prepare dataset with proper column mapping."""
211
+ try:
212
+ # Load dataset
213
+ dataset = load_dataset(
214
+ dataset_config["dataset"]["name"],
215
+ split=dataset_config["dataset"]["split"]
216
+ )
217
+ logger.info(f"Dataset loaded successfully with {len(dataset)} examples")
218
+
219
+ # Apply column mapping if specified
220
+ if "column_mapping" in dataset_config["dataset"]:
221
+ mapping = dataset_config["dataset"]["column_mapping"]
222
+ dataset = dataset.rename_columns({v: k for k, v in mapping.items()})
223
+ logger.info(f"Applied column mapping: {mapping}")
224
+
225
+ # Sort dataset if required
226
+ if dataset_config["dataset"]["processing"]["sort_by_id"]:
227
+ logger.info("Sorting dataset by ID to maintain paper chunk order")
228
+ dataset = dataset.sort("id")
229
+
230
+ # Log first few IDs to verify sorting
231
+ sample_ids = [example["id"] for example in dataset.select(range(min(5, len(dataset))))]
232
+ logger.info(f"First few IDs after sorting: {sample_ids}")
233
+
234
+ return dataset
235
+
236
+ except Exception as e:
237
+ logger.error(f"Error loading dataset: {str(e)}")
238
+ raise
239
+
240
+ def main():
241
+ # Set up logging
242
+ logger.info("Starting training process")
243
+
244
+ # Parse arguments
245
+ args = parse_args()
246
+
247
+ # Load environment variables
248
+ load_env_variables()
249
+
250
+ # Load all configurations
251
+ try:
252
+ configs = load_configs(args.config_dir)
253
+ logger.info("All configurations loaded successfully")
254
+
255
+ # Extract specific configs
256
+ model_config = configs["transformers"]
257
+ hardware_config = configs["hardware"]
258
+ dataset_config = configs["dataset"]
259
+
260
+ # Apply hardware-specific settings
261
+ per_device_batch_size = hardware_config["training_optimizations"]["per_device_batch_size"]
262
+ gradient_accumulation = hardware_config["training_optimizations"]["gradient_accumulation_steps"]
263
+
264
+ # Update model config with hardware settings
265
+ model_config["training"].update({
266
+ "per_device_train_batch_size": per_device_batch_size,
267
+ "gradient_accumulation_steps": gradient_accumulation,
268
+ "gradient_checkpointing": hardware_config["training_optimizations"]["memory_optimizations"]["use_gradient_checkpointing"]
269
+ })
270
+
271
+ except Exception as e:
272
+ logger.error(f"Error loading configurations: {e}")
273
+ return 1
274
+
275
+ # Set random seed for reproducibility
276
+ seed = model_config.get("seed", 42)
277
+ set_seed(seed)
278
+ logger.info(f"Set random seed to {seed}")
279
+
280
+ # Check if we're running in a Hugging Face Space
281
+ if os.environ.get("SPACE_ID") and not os.environ.get("HF_USERNAME"):
282
+ # Extract username from SPACE_ID
283
+ username = os.environ.get("SPACE_ID").split("/")[0]
284
+ logger.info(f"Extracted username from SPACE_ID: {username}")
285
+
286
+ # Set hub_model_id if not already set and push_to_hub is enabled
287
+ if model_config.get("push_to_hub", False) and not model_config.get("hub_model_id"):
288
+ model_name = model_config.get("model_name", "").split("/")[-1]
289
+ model_config["hub_model_id"] = f"{username}/finetuned-{model_name}"
290
+ logger.info(f"Set hub_model_id to {model_config['hub_model_id']}")
291
+
292
+ # Load model and tokenizer
293
+ logger.info(f"Loading model: {model_config.get('model_name')}")
294
+
295
+ try:
296
+ model, tokenizer = load_model_and_tokenizer(model_config)
297
+ logger.info("Model and tokenizer loaded successfully")
298
+
299
+ # Prepare model for k-bit training if using PEFT
300
+ if model_config.get("use_peft", False) and peft_available:
301
+ logger.info("Preparing model for parameter-efficient fine-tuning")
302
+ try:
303
+ model = prepare_model_for_kbit_training(model)
304
+
305
+ # Get target modules
306
+ target_modules = model_config.get("target_modules", ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"])
307
+
308
+ # Create LoRA config
309
+ lora_config = LoraConfig(
310
+ r=model_config.get("lora_r", 16),
311
+ lora_alpha=model_config.get("lora_alpha", 32),
312
+ lora_dropout=model_config.get("lora_dropout", 0.05),
313
+ bias="none",
314
+ task_type="CAUSAL_LM",
315
+ target_modules=target_modules
316
+ )
317
+
318
+ # Apply LoRA to model
319
+ model = get_peft_model(model, lora_config)
320
+ logger.info(f"Applied LoRA with r={model_config.get('lora_r', 16)}, alpha={model_config.get('lora_alpha', 32)}")
321
+ except Exception as e:
322
+ logger.error(f"Error setting up PEFT: {e}")
323
+ return 1
324
+
325
+ # Load dataset with proper mapping
326
+ try:
327
+ dataset = load_dataset_with_mapping(dataset_config)
328
+ logger.info("Dataset loaded and prepared successfully")
329
+ except Exception as e:
330
+ logger.error(f"Error loading dataset: {e}")
331
+ return 1
332
+
333
+ # Simple data collator that processes each entry independently
334
+ class SimpleDataCollator:
335
+ def __init__(self, tokenizer):
336
+ self.tokenizer = tokenizer
337
+ self.stats = {"processed": 0, "skipped": 0, "total_tokens": 0}
338
+ self.pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
339
+ self.prompt_counter = 0
340
+ self.paper_counters = {}
341
+ logger.info("SimpleDataCollator initialized - using phi-4 chat format")
342
+
343
+ def format_phi_chat(self, messages):
344
+ """Format messages according to phi-4's chat template."""
345
+ formatted_chat = ""
346
+ for message in messages:
347
+ # Extract role and content
348
+ if isinstance(message, dict):
349
+ role = message.get("role", "").lower()
350
+ content = message.get("content", "")
351
+ else:
352
+ role = getattr(message, "role", "").lower()
353
+ content = getattr(message, "content", "")
354
+
355
+ # Format based on role
356
+ if role == "human" or role == "user":
357
+ formatted_chat += f"Human: {content}\n\n"
358
+ elif role == "assistant":
359
+ formatted_chat += f"Assistant: {content}\n\n"
360
+ elif role == "system":
361
+ # For system messages, we prepend them with a special format
362
+ formatted_chat = f"System: {content}\n\n" + formatted_chat
363
+ else:
364
+ logger.warning(f"Unknown role '{role}' - treating as system message")
365
+ formatted_chat += f"System: {content}\n\n"
366
+
367
+ return formatted_chat.strip()
368
+
369
+ def __call__(self, features):
370
+ batch = {"input_ids": [], "attention_mask": [], "labels": []}
371
+
372
+ for example in features:
373
+ try:
374
+ # Get ID and conversation fields
375
+ paper_id = example.get("id", "") if isinstance(example, dict) else getattr(example, "id", "")
376
+ conversation = example.get("conversations", []) if isinstance(example, dict) else getattr(example, "conversations", [])
377
+
378
+ if not conversation:
379
+ self.stats["skipped"] += 1
380
+ continue
381
+
382
+ # Increment counters
383
+ self.prompt_counter += 1
384
+ if paper_id not in self.paper_counters:
385
+ self.paper_counters[paper_id] = 0
386
+ self.paper_counters[paper_id] += 1
387
+
388
+ # Add metadata as system message
389
+ metadata = {
390
+ "role": "system",
391
+ "content": f"Paper ID: {paper_id} | Chunk: {self.paper_counters[paper_id]}"
392
+ }
393
+
394
+ # Format the conversation using phi-4's chat template
395
+ formatted_content = self.format_phi_chat([metadata] + conversation)
396
+
397
+ # Tokenize with the model's chat template
398
+ inputs = self.tokenizer(
399
+ formatted_content,
400
+ add_special_tokens=True,
401
+ truncation=True,
402
+ max_length=model_config.get("max_seq_length", 2048),
403
+ return_tensors=None, # Return list instead of tensors
404
+ )
405
+
406
+ input_ids = inputs["input_ids"]
407
+ attention_mask = inputs["attention_mask"]
408
+
409
+ if len(input_ids) > 0:
410
+ # For causal language modeling, labels are the same as inputs
411
+ labels = input_ids.copy()
412
+
413
+ batch["input_ids"].append(input_ids)
414
+ batch["attention_mask"].append(attention_mask)
415
+ batch["labels"].append(labels)
416
+
417
+ self.stats["processed"] += 1
418
+ self.stats["total_tokens"] += len(input_ids)
419
+
420
+ # Debug logging for first few examples
421
+ if self.stats["processed"] <= 3:
422
+ logger.info(f"Example {self.stats['processed']} format:")
423
+ logger.info(f"Paper ID: {paper_id} | Chunk: {self.paper_counters[paper_id]}")
424
+ logger.info(f"Token count: {len(input_ids)}")
425
+ logger.info(f"Content preview:\n{formatted_content[:500]}...")
426
+ else:
427
+ self.stats["skipped"] += 1
428
+
429
+ except Exception as e:
430
+ logger.warning(f"Error processing example: {str(e)[:100]}...")
431
+ self.stats["skipped"] += 1
432
+ continue
433
+
434
+ # Handle empty batches
435
+ if not batch["input_ids"]:
436
+ logger.warning("Empty batch, returning dummy tensors")
437
+ return {
438
+ "input_ids": torch.zeros((1, 1), dtype=torch.long),
439
+ "attention_mask": torch.zeros((1, 1), dtype=torch.long),
440
+ "labels": torch.zeros((1, 1), dtype=torch.long)
441
+ }
442
+
443
+ # Pad the batch
444
+ max_length = max(len(ids) for ids in batch["input_ids"])
445
+
446
+ for i in range(len(batch["input_ids"])):
447
+ padding_length = max_length - len(batch["input_ids"][i])
448
+ if padding_length > 0:
449
+ batch["input_ids"][i].extend([self.pad_token_id] * padding_length)
450
+ batch["attention_mask"][i].extend([0] * padding_length)
451
+ batch["labels"][i].extend([-100] * padding_length) # Don't compute loss on padding
452
+
453
+ # Convert to tensors
454
+ batch = {k: torch.tensor(v) for k, v in batch.items()}
455
+
456
+ # Log stats periodically
457
+ if self.stats["processed"] % 100 == 0 and self.stats["processed"] > 0:
458
+ logger.info(f"Data collator stats: processed={self.stats['processed']}, "
459
+ f"skipped={self.stats['skipped']}, "
460
+ f"avg_tokens={self.stats['total_tokens']/self.stats['processed']:.1f}, "
461
+ f"unique_papers={len(self.paper_counters)}")
462
+
463
+ return batch
464
+
465
+ # Create data collator
466
+ data_collator = SimpleDataCollator(tokenizer)
467
+
468
+ # Simple logging callback
469
+ class LoggingCallback(TrainerCallback):
470
+ def __init__(self):
471
+ self.last_log_time = datetime.now()
472
+ self.training_start_time = datetime.now()
473
+
474
+ def on_step_end(self, args, state, control, **kwargs):
475
+ # Log every 50 steps or every 5 minutes, whichever comes first
476
+ current_time = datetime.now()
477
+ time_diff = (current_time - self.last_log_time).total_seconds()
478
+ elapsed_time = (current_time - self.training_start_time).total_seconds() / 60 # in minutes
479
+
480
+ if state.global_step % 50 == 0 or time_diff > 300: # 300 seconds = 5 minutes
481
+ loss = state.log_history[-1].get('loss', 'N/A') if state.log_history else 'N/A'
482
+ lr = state.log_history[-1].get('learning_rate', 'N/A') if state.log_history else 'N/A'
483
+
484
+ if isinstance(loss, float):
485
+ loss_str = f"{loss:.4f}"
486
+ else:
487
+ loss_str = str(loss)
488
+
489
+ if isinstance(lr, float):
490
+ lr_str = f"{lr:.8f}"
491
+ else:
492
+ lr_str = str(lr)
493
+
494
+ logger.info(f"Step: {state.global_step} | Loss: {loss_str} | LR: {lr_str} | Elapsed: {elapsed_time:.2f} min")
495
+ self.last_log_time = current_time
496
+
497
+ # Set up training arguments
498
+ logger.info("Setting up training arguments")
499
+ training_args = TrainingArguments(
500
+ output_dir=model_config.get("output_dir", "./results"),
501
+ num_train_epochs=model_config.get("num_train_epochs", 3),
502
+ per_device_train_batch_size=model_config.get("per_device_train_batch_size", 4), # Use config value, can be > 1
503
+ gradient_accumulation_steps=model_config.get("gradient_accumulation_steps", 8),
504
+ learning_rate=model_config.get("learning_rate", 5e-5),
505
+ weight_decay=model_config.get("weight_decay", 0.01),
506
+ warmup_ratio=model_config.get("warmup_ratio", 0.1),
507
+ lr_scheduler_type=model_config.get("lr_scheduler_type", "cosine"),
508
+ logging_steps=model_config.get("logging_steps", 10),
509
+ save_strategy=model_config.get("save_strategy", "steps"), # Updated to use steps by default
510
+ save_steps=model_config.get("save_steps", 100), # Save every 100 steps by default
511
+ save_total_limit=model_config.get("save_total_limit", 3), # Keep last 3 checkpoints
512
+ fp16=model_config.get("fp16", True),
513
+ bf16=model_config.get("bf16", False),
514
+ max_grad_norm=model_config.get("max_grad_norm", 1.0),
515
+ push_to_hub=model_config.get("push_to_hub", False),
516
+ hub_model_id=model_config.get("hub_model_id", None),
517
+ hub_token=os.environ.get("HF_TOKEN", None),
518
+ report_to="tensorboard",
519
+ remove_unused_columns=False, # Keep the conversations column
520
+ gradient_checkpointing=model_config.get("gradient_checkpointing", True), # Enable gradient checkpointing
521
+ dataloader_pin_memory=False, # Reduce memory usage
522
+ optim=model_config.get("optim", "adamw_torch"),
523
+ ddp_find_unused_parameters=False, # Improve distributed training efficiency
524
+ dataloader_drop_last=False, # Process all examples
525
+ dataloader_num_workers=0, # Sequential data loading
526
+ )
527
+
528
+ # Create a sequential sampler to ensure dataset is processed in order
529
+ logger.info("Creating sequential sampler to maintain dataset order")
530
+
531
+ # Create trainer with callback
532
+ logger.info("Creating trainer")
533
+
534
+ # Check if we should resume from checkpoint
535
+ resume_from_checkpoint = False
536
+ output_dir = model_config.get("output_dir", "./results")
537
+ if os.path.exists(output_dir):
538
+ checkpoints = [folder for folder in os.listdir(output_dir) if folder.startswith("checkpoint-")]
539
+ if checkpoints:
540
+ latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
541
+ resume_from_checkpoint = os.path.join(output_dir, latest_checkpoint)
542
+ logger.info(f"Found checkpoint: {resume_from_checkpoint}. Training will resume from this point.")
543
+
544
+ trainer = Trainer(
545
+ model=model,
546
+ args=training_args,
547
+ train_dataset=dataset,
548
+ data_collator=data_collator,
549
+ callbacks=[LoggingCallback()]
550
+ )
551
+
552
+ # Override the default data loader to disable shuffling
553
+ # This is necessary because TrainingArguments doesn't have a direct shuffle parameter
554
+ def get_train_dataloader_no_shuffle():
555
+ """Create a train DataLoader with shuffling disabled."""
556
+ logger.info("Creating train dataloader with sequential sampler (no shuffling)")
557
+
558
+ # Create a sequential sampler to ensure dataset is processed in order
559
+ train_sampler = torch.utils.data.SequentialSampler(dataset)
560
+
561
+ return torch.utils.data.DataLoader(
562
+ dataset,
563
+ batch_size=training_args.per_device_train_batch_size,
564
+ sampler=train_sampler, # Use sequential sampler instead of shuffle parameter
565
+ collate_fn=data_collator,
566
+ drop_last=False,
567
+ num_workers=0,
568
+ pin_memory=False
569
+ )
570
+
571
+ # Replace the default data loader with our non-shuffling version
572
+ trainer.get_train_dataloader = get_train_dataloader_no_shuffle
573
+
574
+ # Start training
575
+ logger.info("Starting training")
576
+ logger.info(f"Processing with batch size = {training_args.per_device_train_batch_size}, each entry processed independently")
577
+
578
+ # Create a lock file to indicate training is in progress
579
+ lock_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "TRAINING_IN_PROGRESS.lock")
580
+ with open(lock_file, "w") as f:
581
+ f.write(f"Training started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
582
+ f.write(f"Expected completion: After {training_args.num_train_epochs} epochs\n")
583
+ f.write("DO NOT UPDATE OR RESTART THIS SPACE UNTIL TRAINING COMPLETES\n")
584
+ logger.info(f"Created lock file: {lock_file}")
585
+
586
+ try:
587
+ trainer.train(resume_from_checkpoint=resume_from_checkpoint)
588
+ logger.info("Training completed successfully")
589
+
590
+ # Save model
591
+ if model_config.get("push_to_hub", False):
592
+ logger.info(f"Pushing model to hub: {model_config.get('hub_model_id')}")
593
+ trainer.push_to_hub()
594
+ logger.info("Model pushed to hub successfully")
595
+ else:
596
+ logger.info(f"Saving model to {model_config.get('output_dir', './results')}")
597
+ trainer.save_model()
598
+ logger.info("Model saved successfully")
599
+ except Exception as e:
600
+ logger.error(f"Training failed with error: {str(e)}")
601
+ raise
602
+ finally:
603
+ # Remove the lock file when training completes or fails
604
+ if os.path.exists(lock_file):
605
+ os.remove(lock_file)
606
+ logger.info(f"Removed lock file: {lock_file}")
607
+
608
+ return 0
609
+
610
+ except Exception as e:
611
+ logger.error(f"Error in main training loop: {str(e)}")
612
+ return 1
613
+
614
+ if __name__ == "__main__":
615
+ sys.exit(main())
transformers_config.json ADDED
@@ -0,0 +1,75 @@
1
+ {
2
+ "model": {
3
+ "name": "unsloth/phi-4-unsloth-bnb-4bit",
4
+ "trust_remote_code": true,
5
+ "use_fast_tokenizer": true
6
+ },
7
+
8
+ "tokenizer": {
9
+ "chat_template": "phi",
10
+ "max_seq_length": 2048,
11
+ "padding_side": "right",
12
+ "add_eos_token": true
13
+ },
14
+
15
+ "training": {
16
+ "per_device_train_batch_size": 16,
17
+ "gradient_accumulation_steps": 4,
18
+ "learning_rate": 2e-5,
19
+ "num_train_epochs": 3,
20
+ "max_steps": -1,
21
+ "logging_steps": 10,
22
+ "save_steps": 200,
23
+ "save_total_limit": 5,
24
+ "push_to_hub": true,
25
+ "hub_strategy": "every_save",
26
+ "gradient_checkpointing": true,
27
+ "optim": "adamw_torch",
28
+ "lr_scheduler_type": "cosine",
29
+ "warmup_ratio": 0.03,
30
+ "weight_decay": 0.01,
31
+ "max_grad_norm": 1.0,
32
+ "neftune_noise_alpha": 5
33
+ },
34
+
35
+ "checkpointing": {
36
+ "output_dir": "./results",
37
+ "save_strategy": "steps",
38
+ "save_steps": 100,
39
+ "save_total_limit": 3,
40
+ "hub_strategy": "every_save"
41
+ },
42
+
43
+ "unsloth": {
44
+ "enabled": true,
45
+ "r": 32,
46
+ "alpha": 16,
47
+ "dropout": 0.05,
48
+ "target_modules": [
49
+ "q_proj",
50
+ "k_proj",
51
+ "v_proj",
52
+ "o_proj",
53
+ "gate_proj",
54
+ "up_proj",
55
+ "down_proj"
56
+ ]
57
+ },
58
+
59
+ "logging": {
60
+ "logging_steps": 50,
61
+ "log_level": "info"
62
+ },
63
+
64
+ "huggingface_hub": {
65
+ "push_to_hub": true,
66
+ "hub_model_id": "phi-4-research-assistant",
67
+ "hub_private_repo": true
68
+ },
69
+
70
+ "model_name_or_path": "unsloth/phi-4-unsloth-bnb-4bit",
71
+ "model_revision": "main",
72
+ "use_flash_attention": true,
73
+ "torch_dtype": "bfloat16",
74
+ "bf16": true
75
+ }
update_space.py ADDED
@@ -0,0 +1,219 @@
1
+ #!/usr/bin/env python
2
+
3
+ """
4
+ Quick script to update your Hugging Face Space for phi-4-unsloth-bnb-4bit training.
5
+ This script handles the specific requirements for the 4-bit quantized Phi-4 model training,
6
+ including proper configuration and dependency management.
7
+ """
8
+
9
+ import os
10
+ import sys
11
+ import json
12
+ import subprocess
13
+ import argparse
14
+ import logging
15
+ from pathlib import Path
16
+ from huggingface_hub import HfApi, login
17
+ import getpass
18
+
19
+ # Configure logging
20
+ logging.basicConfig(
21
+ level=logging.INFO,
22
+ format="%(asctime)s - %(levelname)s - %(message)s",
23
+ handlers=[logging.StreamHandler(sys.stdout)]
24
+ )
25
+ logger = logging.getLogger(__name__)
26
+
27
+ def load_env_variables():
28
+ """Load environment variables from system or .env file."""
29
+ # Check if we're running in a Hugging Face Space
30
+ if os.environ.get("SPACE_ID"):
31
+ logger.info("Running in Hugging Face Space")
32
+ if "/" in os.environ.get("SPACE_ID", ""):
33
+ username = os.environ.get("SPACE_ID").split("/")[0]
34
+ os.environ["HF_USERNAME"] = username
35
+ logger.info(f"Set HF_USERNAME from SPACE_ID: {username}")
36
+ else:
37
+ try:
38
+ from dotenv import load_dotenv
39
+ env_path = Path(__file__).parent.parent / ".env"
40
+ if env_path.exists():
41
+ load_dotenv(env_path)
42
+ logger.info(f"Loaded environment variables from {env_path}")
43
+ else:
44
+ logger.warning(f"No .env file found at {env_path}")
45
+ except ImportError:
46
+ logger.warning("python-dotenv not installed, skipping .env loading")
47
+
48
+ # Verify required variables
49
+ required_vars = {
50
+ "HF_TOKEN": os.environ.get("HF_TOKEN"),
51
+ "HF_USERNAME": os.environ.get("HF_USERNAME"),
52
+ "HF_SPACE_NAME": os.environ.get("HF_SPACE_NAME", "phi4-cognitive-training")
53
+ }
54
+
55
+ missing_vars = [k for k, v in required_vars.items() if not v]
56
+ if missing_vars:
57
+ raise ValueError(f"Missing required environment variables: {', '.join(missing_vars)}")
58
+
59
+ return required_vars
60
+
61
+ def verify_configs():
62
+ """Verify that all necessary configuration files exist and are valid."""
63
+ current_dir = Path(__file__).parent
64
+ required_files = [
65
+ "transformers_config.json",
66
+ "hardware_config.json",
67
+ "dataset_config.json",
68
+ "requirements.txt",
69
+ "run_transformers_training.py"
70
+ ]
71
+
72
+ missing_files = []
73
+ for file in required_files:
74
+ if not (current_dir / file).exists():
75
+ missing_files.append(file)
76
+
77
+ if missing_files:
78
+ raise FileNotFoundError(f"Missing required files: {', '.join(missing_files)}")
79
+
80
+ # Verify JSON configs
81
+ json_files = [f for f in required_files if f.endswith('.json')]
82
+ for json_file in json_files:
83
+ try:
84
+ with open(current_dir / json_file) as f:
85
+ json.load(f)
86
+ logger.info(f"Verified {json_file} is valid JSON")
87
+ except json.JSONDecodeError as e:
88
+ raise ValueError(f"Invalid JSON in {json_file}: {e}")
89
+
90
+ def update_requirements():
91
+ """Update requirements.txt with necessary packages."""
92
+ current_dir = Path(__file__).parent
93
+ req_path = current_dir / "requirements.txt"
94
+
95
+ required_packages = {
96
+ "torch>=2.0.0",
97
+ "transformers>=4.36.0",
98
+ "accelerate>=0.27.0",
99
+ "bitsandbytes>=0.41.0",
100
+ "tensorboard>=2.15.0",
101
+ "gradio>=5.17.0",
102
+ "huggingface-hub>=0.19.0",
103
+ "datasets>=2.15.0"
104
+ }
105
+
106
+ # Read existing requirements
107
+ existing_requirements = set()
108
+ if req_path.exists():
109
+ with open(req_path) as f:
110
+ existing_requirements = {line.strip() for line in f if line.strip()}
111
+
112
+ # Add new requirements
113
+ updated_requirements = existing_requirements.union(required_packages)
114
+
115
+ # Write updated requirements
116
+ with open(req_path, 'w') as f:
117
+ for req in sorted(updated_requirements):
118
+ f.write(f"{req}\n")
119
+
120
+ logger.info("Updated requirements.txt with necessary packages")
121
+
122
+ def create_space(username, space_name):
123
+ """Create or get a Hugging Face Space."""
124
+ try:
125
+ api = HfApi()
126
+ space_id = f"{username}/{space_name}"
127
+ logger.info(f"Checking Space {space_id}...")
128
+
129
+ try:
130
+ space_info = api.space_info(repo_id=space_id)
131
+ logger.info(f"Space {space_id} exists")
132
+ return space_info
133
+ except Exception:
134
+ logger.info(f"Creating new Space {space_id}...")
135
+
136
+ space_info = api.create_repo(
137
+ repo_id=space_id,
138
+ repo_type="space",
139
+ space_sdk="gradio",
140
+ private=False
141
+ )
142
+ logger.info(f"Space {space_id} created successfully")
143
+ return space_info
144
+ except Exception as e:
145
+ raise RuntimeError(f"Error with Space {username}/{space_name}: {e}")
146
+
147
+ def main():
148
+ parser = argparse.ArgumentParser(description='Update Hugging Face Space for Phi-4 training')
149
+ parser.add_argument('--space_name', type=str, help='Space name (default: from env)')
150
+ parser.add_argument('--force', action='store_true', help='Skip confirmation')
151
+ args = parser.parse_args()
152
+
153
+ if not args.force:
154
+ print("\n" + "!"*80)
155
+ print("WARNING: Updating the Space will INTERRUPT any ongoing training!")
156
+ print("Make sure all checkpoints are saved before proceeding.")
157
+ print("!"*80 + "\n")
158
+
159
+ confirm = input("Type 'update' to confirm: ")
160
+ if confirm.lower() != 'update':
161
+ logger.info("Update cancelled")
162
+ return False
163
+
164
+ try:
165
+ # Load environment variables
166
+ env_vars = load_env_variables()
167
+ logger.info(f"Environment variables loaded: USERNAME={env_vars['HF_USERNAME']}, SPACE_NAME={env_vars['HF_SPACE_NAME']}")
168
+
169
+ # Verify configurations
170
+ verify_configs()
171
+ logger.info("All configuration files verified successfully")
172
+
173
+ # Update requirements
174
+ update_requirements()
175
+ logger.info("Requirements updated successfully")
176
+
177
+ # Get space name
178
+ space_name = args.space_name or env_vars["HF_SPACE_NAME"]
179
+ logger.info(f"Using space name: {space_name}")
180
+
181
+ # Login to Hugging Face
182
+ logger.info("Logging in to Hugging Face...")
183
+ login(token=env_vars["HF_TOKEN"])
184
+ logger.info("Successfully logged in to Hugging Face")
185
+
186
+ # Create/get space
187
+ space_info = create_space(env_vars["HF_USERNAME"], space_name)
188
+ logger.info(f"Space info: {space_info}")
189
+
190
+ # Upload files
191
+ current_dir = Path(__file__).parent
192
+ logger.info(f"Uploading files from {current_dir} to Space {env_vars['HF_USERNAME']}/{space_name}...")
193
+
194
+ # Create .gitignore
195
+ with open(current_dir / ".gitignore", "w") as f:
196
+ f.write(".env\n*.pyc\n__pycache__\n")
197
+ logger.info("Created .gitignore file")
198
+
199
+ api = HfApi()
200
+ api.upload_folder(
201
+ folder_path=str(current_dir),
202
+ repo_id=f"{env_vars['HF_USERNAME']}/{space_name}",
203
+ repo_type="space",
204
+ ignore_patterns=[".env", "*.pyc", "__pycache__", "TRAINING_IN_PROGRESS.lock"]
205
+ )
206
+
207
+ logger.info(f"Files uploaded successfully")
208
+ space_url = f"https://huggingface.co/spaces/{env_vars['HF_USERNAME']}/{space_name}"
209
+ logger.info(f"Space URL: {space_url}")
210
+ print(f"\nSpace created successfully! You can view it at:\n{space_url}")
211
+ return True
212
+
213
+ except Exception as e:
214
+ logger.error(f"Error updating Space: {str(e)}")
215
+ return False
216
+
217
+ if __name__ == "__main__":
218
+ success = main()
219
+ sys.exit(0 if success else 1)