BharatAI RS_1: Transformer-Based Language Model
Overview
BharatAI RS_1 is a transformer-based language model designed for text generation. This repository contains the necessary components to train, fine-tune, and perform inference with BharatAI.
Installation
Before running the model, install the required dependencies:
pip install torch transformers datasets sentencepiece evaluate accelerate zstandard
File Structure
- tokenizer.py - Defines the SentencePiece tokenizer.
- model.py - Contains the BharatAI transformer architecture.
- train.py - Script for training the model.
- inference.py - Script for generating text using the trained model.
- model.bin - Pre-generated model file.
- tokenizer.model - Pre-generated tokenizer file.
Tokenizer
The tokenizer is based on SentencePiece and has been pre-generated. If you wish to train a new tokenizer, use:
import sentencepiece as spm
spm.SentencePieceTrainer.train(input='data.txt', model_prefix='tokenizer', vocab_size=1000)
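Here input='data.txt' is your plain-text training corpus and vocab_size=1000 sets the size of the subword vocabulary. Training writes tokenizer.model (and tokenizer.vocab) to the working directory. As a quick sanity check, the trained tokenizer can be round-tripped like this (the sample sentence is illustrative):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
ids = sp.encode("BharatAI is a transformer-based language model.")
print(ids)             # list of integer token IDs
print(sp.decode(ids))  # decodes back to the original text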
Model Architecture
The BharatAI RS_1 model consists of multiple transformer blocks with self-attention mechanisms. It includes the following components (a minimal sketch follows the list):
- Multi-head self-attention
- Feedforward layers
- Layer normalization
- Positional embeddings
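The exact implementation lives in model.py. For orientation, here is a minimal, self-contained sketch of how these pieces typically fit together in a single block; the pre-norm ordering, GELU activation, and 4x feedforward expansion are common conventions assumed here, not a guarantee of what model.py does:

import torch
import torch.nn as nn

class Block(nn.Module):
    # One pre-norm transformer block: self-attention then feedforward,
    # each wrapped in a residual connection.
    def __init__(self, n_embd=768, n_head=12, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: True entries mark future positions that must not be attended to.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.ffwd(self.ln2(x))
        return x

Positional embeddings are added to the token embeddings once at the input, before the first block, so they do not appear inside the block itself.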
Model Hyperparameters
The model uses the following default hyperparameters:
batch_size = 64
block_size = 256
max_iters = 250
learning_rate = 3e-4
eval_iters = 150
n_embd = 768
n_head = 12
n_layer = 12
dropout = 0.2
These can be adjusted in train.py or model.py as needed.
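For a sense of scale, these defaults land close to GPT-2-small. A back-of-the-envelope parameter count, assuming a standard 4x feedforward expansion and the vocab_size of 1000 from the tokenizer example (both are assumptions; the real numbers depend on model.py):

n_embd, n_layer, vocab_size, block_size = 768, 12, 1000, 256

attn_params = 4 * n_embd ** 2           # Q, K, V and output projections
ffwd_params = 8 * n_embd ** 2           # two linear layers with 4x expansion
per_layer = attn_params + ffwd_params   # ~7.1M per block, biases ignored

total = n_layer * per_layer + (vocab_size + block_size) * n_embd
print(f"~{total / 1e6:.0f}M parameters")  # prints ~86M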
Training the Model
Important: the model is untrained by default.
Users must train it before using it for text generation. To do so, run:
python train.py
This script loads the dataset, tokenizes text, and trains the transformer model from scratch.
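The exact pipeline is defined in train.py. As a rough sketch of the core loop (random tokens stand in for the real dataset here, and model is assumed to be the BharatAI module from model.py, returning a (logits, loss) pair):

import torch

batch_size, block_size = 64, 256
max_iters, learning_rate = 250, 3e-4
data = torch.randint(0, 1000, (10_000,))  # placeholder token stream

def get_batch():
    # Sample random windows; targets are the inputs shifted one token right.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for step in range(max_iters):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)  # cross-entropy against shifted targets
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

torch.save(model, "model.bin")  # the file inference.py loads later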
Pre-Generated Model & Tokenizer
- A pre-generated model (model.bin) and tokenizer (tokenizer.model) are included in the repository.
- If you wish to use them, load them without retraining:
import torch
# On PyTorch 2.6+, torch.load defaults to weights_only=True; loading a fully
# pickled module requires weights_only=False (only use with files you trust).
model = torch.load("model.bin", weights_only=False)
model.eval()  # switch to inference mode (disables dropout)
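The matching tokenizer loads through SentencePiece, exactly as in the tokenizer section:

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")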
Generating Text
After training the model (or loading the pre-generated one), you can generate text with:
python inference.py --input "Your prompt here"
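Internally, autoregressive generation encodes the prompt, then repeatedly samples the next token and appends it to the context. A minimal sketch, assuming the same (logits, loss) model interface as above and a loaded SentencePiece processor sp (the helper name generate and its defaults are illustrative):

import torch

@torch.no_grad()
def generate(model, sp, prompt, max_new_tokens=100, block_size=256):
    idx = torch.tensor([sp.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]       # crop to the context window
        logits, _ = model(idx_cond)           # assumed (logits, loss) return
        probs = torch.softmax(logits[:, -1, :], dim=-1)  # last position only
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return sp.decode(idx[0].tolist())

print(generate(model, sp, "Your prompt here"))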
Notes
- The model is untrained by default, so train it (or load the pre-generated model.bin) before running inference.
- Modify hyperparameters in train.py to optimize performance.