Me Llamo Llama LoRA Adapter – Spanish Grammar Tutor

Model Description

Me Llamo Llama is a LoRA adapter fine-tuned to serve as a Spanish grammar correction and conversational tutor. It is built on top of Meta’s LLaMA3-8B-Instruct foundation model (approximately 8 billion parameters). The adapter was initialized from the EVA-Dolphin Spanish LLaMA3-8B model, which provided a strong Spanish language baseline. The result is a Spanish-focused AI assistant that can engage in dialogue, correct grammatical errors, and provide feedback/explanations to language learners in a conversational manner.

This model inherits the architecture of the LLaMA series (decoder-only transformer) with Spanish as its primary language. By using low-rank adaptation (LoRA), Me Llamo Llama adds only about 42 million trainable parameters on top of the base model (roughly 0.5% of its 8 billion), making fine-tuning efficient while preserving the base model’s knowledge. The name reflects its role as a lively Spanish tutor that helps learners improve their Spanish by correcting mistakes in context.

Uses

Primary Intended Uses:

Spanish Grammar Correction: Users can input Spanish sentences or texts, and the model will respond with corrected grammar and spelling, often accompanied by an explanation. This makes it useful as a writing aid or proofreading assistant for Spanish learners.
Conversational Tutoring: Me Llamo Llama can engage in a back-and-forth dialogue in Spanish. It plays the role of a friendly tutor – if the user’s message contains errors, the model will guide them to the correct usage and continue the conversation. This is ideal for practicing Spanish through interactive chats (e.g. via a Telegram bot or educational app).
Language Learning Exercises: The model can be used to generate examples of common mistakes and corrections, quiz-style prompts, or to explain grammar rules in context. Educators might use it to create teaching material or to assist students in real-time.

Out-of-Scope Uses: The model is not intended for general factual question-answering outside of language learning, nor for tasks requiring guaranteed accuracy in domains like law or medicine. It should not be used as a sole source of factual information: its knowledge is limited to what the LLaMA3 base contains, and Meta lists a knowledge cutoff of March 2023 for the 8B model. Additionally, it is not a substitute for professional human translators or teachers in situations that demand absolute grammatical precision or cultural nuance.

How to Use

To use the Me Llamo Llama adapter, you will need access to the base LLaMA3-8B-Instruct model weights (the adapter does not include the base model). The example below uses the 🤗 Transformers and PEFT libraries to load the base model and apply the LoRA adapter:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
adapter_name = "JuliaWolkenstein/MeLlamo_Llama_3_8B"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

model = PeftModel.from_pretrained(
    base_model,
    adapter_name,
    is_trainable=False
)
model.eval()

def test_correction(text):
    prompt = f"Usuario: {text}\nAsistente:"
    inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
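    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Drop any extra turns the model may have invented after its reply
    return response.split("Usuario:")[0].strip()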

# Test
test_text = "Hola, me llamo Maria y yo querer aprender español."
print(f"Usuario: {test_text}")
print(f"Respuesta: {test_correction(test_text)}")

In this example, the model should reply with an implicitly corrected version of the user's sentence and continue the conversation in Spanish. You can also integrate the model into a chat interface (such as a Telegram bot) by continually appending the conversation history to the prompt and generating successive responses.
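The history-appending approach can be sketched as follows. This is a minimal illustration rather than the bot's actual code: `build_prompt`, `MAX_TURNS`, and the truncation policy are assumptions, and only the `Usuario:` / `Asistente:` labels are taken from the example above.

```python
from typing import List, Tuple

# Hypothetical cap on how many past exchanges are kept, so the prompt
# stays within the model's context window (the value is an assumption).
MAX_TURNS = 6


def build_prompt(history: List[Tuple[str, str]], user_message: str,
                 max_turns: int = MAX_TURNS) -> str:
    """Format past (user, assistant) exchanges plus the new user message."""
    # Keep only the most recent exchanges (none if max_turns is 0).
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for user_text, assistant_text in recent:
        lines.append(f"Usuario: {user_text}")
        lines.append(f"Asistente: {assistant_text}")
    lines.append(f"Usuario: {user_message}")
    # The trailing label cues the model to answer as the tutor.
    lines.append("Asistente:")
    return "\n".join(lines)


history = [("Hola, yo querer aprender español.",
            "¡Hola! Qué bien que quieras aprender español. ¿Por qué te interesa?")]
prompt = build_prompt(history, "Porque yo viajar a México en verano.")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the single-turn prompt built inside `test_correction`; after each reply, the new (user, assistant) pair is appended to `history`.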

Note: If GPU memory is limited, you can load the base model with 8-bit or 4-bit quantization via BitsAndBytesConfig (as used in the EVA-Dolphin model) to reduce memory use. Always ensure you comply with the base model’s license and usage terms.
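A possible shape for the low-memory loading path from the note above is sketched here. The function name `load_quantized_tutor` is hypothetical, and the specific settings (4-bit NF4 weights, FP16 compute) are assumptions rather than values taken from the thesis or from EVA-Dolphin. It requires the `bitsandbytes` package and a CUDA GPU.

```python
def load_quantized_tutor(
    base_model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    adapter_name: str = "JuliaWolkenstein/MeLlamo_Llama_3_8B",
):
    """Load the base model in 4-bit and attach the LoRA adapter."""
    # Heavy dependencies are imported lazily so that merely defining this
    # function does not require a GPU stack to be installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store base weights in 4 bits
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization (assumed)
        bnb_4bit_compute_dtype=torch.float16,  # run matmuls in FP16 (assumed)
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_name)
    model.eval()
    return model
```

The returned model can be used with the same tokenizer and generation code as in the example above. Quantization trades some output quality for memory, so corrections should be spot-checked after switching.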

Bias, Risks, and Limitations

While Me Llamo Llama is designed as a helpful language tutor, it inherits the general limitations and biases of large language models:

Potential Inaccuracies: The model may occasionally make incorrect corrections or suggestions. For example, it might over-correct a sentence that was actually acceptable, or fail to catch a subtle grammatical nuance. Users should double-check important work and not rely on the model for high-stakes correctness.
Biases in Output: The base LLaMA3 model was trained on large internet text corpora. This means the model could reflect some cultural or gender biases present in that data. The EVA-Dolphin foundation aimed to reduce toxic or biased outputs, but it is not guaranteed bias-free. Caution is advised if the model is used in sensitive contexts, and offensive or biased outputs should be reported or filtered.
Limited Knowledge: The model’s knowledge comes mainly from the LLaMA3 base, for which Meta lists a knowledge cutoff of March 2023 (8B model). It may not be aware of very recent slang, new grammar reforms, or events. It also might not accurately correct specialized jargon or dialectal phrases that it didn’t see in training.
Not a Certified Authority: Suggestions given by the model have not been vetted by professional educators. There is a risk of pedagogical mistakes (e.g., incorrect explanations or non-standard usage). It should be used as a supplementary tool rather than an authoritative source on grammar. Always consult credible resources or instructors for critical language learning questions.
Ethical Use: As with any AI model, users should refrain from prompting Me Llamo Llama to produce hateful, harassing, or illicit content. The model will refuse certain requests by design (inherited from the instruct tuning), but it is not foolproof. Developers deploying this model should implement appropriate content filtering and user guidelines.

Training Data

The adapter was fine-tuned on a custom dataset of 40,000 structured conversational prompts developed by the author. Each prompt in the dataset is a Spanish dialogue or query paired with an ideal tutor response. The conversations are tailored to target common grammar and usage mistakes that Spanish learners make. For example, a prompt might present a sentence with an error (written as a student message), and the response would be the corrected sentence with an explanation or a follow-up question from the tutor.

Composition: The data includes a wide range of grammar topics (e.g. verb conjugation errors, gender/number agreement, incorrect use of tense or mood, etc.) embedded in realistic conversational contexts. Prompts were derived from educational resources and augmented with original examples created following the thesis methodology.
Structure: Many prompts follow a format where the student says something (possibly with a mistake) and the tutor responds. The responses are in Spanish, providing the correction and often an encouragement or further dialogue. This structured Q&A/dialog format helps the model learn both to correct language and to keep the conversation flowing.
Source and Quality: The dataset was constructed as part of an academic research project. It is not sourced from any single public corpus, but rather assembled and synthesized by the author to ensure coverage of relevant grammar issues. The data underwent cleaning and standardization (per the thesis) to ensure that corrections were accurate and the prompts were clear. However, as with any generated dataset, there may be some noise or unnatural phrasing in a few cases.
Training/Validation Split: 10% of the 40k prompts (about 4,000) were held out for validation and testing (as described in the thesis), leaving about 36,000 for training. The held-out set made it possible to monitor the model’s performance on unseen conversations during training, to detect overfitting, and to evaluate generalization to new prompts.
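A 10% hold-out of this kind could be produced as in the sketch below. The helper `split_dataset`, the seed, and the integer placeholders standing in for prompt-response pairs are all assumptions; the thesis may have split the data differently (for example, stratified by grammar topic).

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(examples: Sequence, holdout_fraction: float = 0.10,
                  seed: int = 42) -> Tuple[List, List]:
    """Shuffle deterministically and hold out a fraction of the examples."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_holdout = round(len(examples) * holdout_fraction)
    holdout = [examples[i] for i in indices[:n_holdout]]
    train = [examples[i] for i in indices[n_holdout:]]
    return train, holdout


# Placeholder IDs standing in for the 40,000 prompt-response pairs.
dataset = list(range(40_000))
train_set, holdout_set = split_dataset(dataset)
print(len(train_set), len(holdout_set))  # 36000 4000
```

Fixing the seed keeps the held-out set identical across runs, so validation losses from different training runs remain comparable.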

Training Procedure

Methodology: The model was fine-tuned using Low-Rank Adaptation (LoRA) on top of the base LLaMA3-8B-Instruct weights. By starting from the EVA-Dolphin Spanish adapter state, the training benefited from a model already fluent in Spanish and versed in general instruction-following. The fine-tuning process then focused specifically on the grammar correction task. Training was conducted in mixed-precision (bfloat16/FP16), taking advantage of an NVIDIA A100 80GB GPU for accelerated computing. The total training time was approximately 30 hours for the 40k prompt dataset, which corresponds to a few epochs over the data (ensuring the model saw most examples multiple times).

Hyperparameters: The thesis details the experimental setup, including hyperparameter choices. Key training hyperparameters were (approximately):

Optimizer: AdamW (with beta coefficients and epsilon at their standard defaults for Transformers). Learning rate was on the order of 2e-4 with warm-up steps and cosine decay (to balance convergence and avoid catastrophic forgetting).
Batching: The effective batch size was in the low hundreds of examples. Due to memory constraints, gradient accumulation was used (accumulating gradients over several forward and backward passes before each optimizer step) to simulate a larger batch.
LoRA specifics: The LoRA rank (r) was set to a modest value (16) to add sufficient capacity for the new task without over-parameterizing. LoRA alpha was set to 32 (twice the rank), and a small dropout (0.05) was applied to the LoRA layers for regularization. These settings follow common practice for LoRA fine-tuning of language models.
Precision: Training was done in bfloat16 precision on the A100, which allows faster computation and lower memory usage while maintaining model quality. Where FP16 was used, the loss was scaled with PyTorch’s GradScaler so that small gradients would not underflow; bfloat16 has the same exponent range as FP32 and generally does not need this.
Epochs & Early Stopping: The model was trained for multiple epochs until the validation loss stopped improving. The thesis notes that three epochs (i.e., ~80k steps given the dataset size and batching) were sufficient to reach good performance, and training was stopped to avoid overfitting once improvements plateaued.
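To make the batching and learning-rate settings above concrete, the sketch below counts optimizer updates and evaluates a warm-up plus cosine-decay schedule. The effective batch of 256, the 3% warm-up, and the helper names are assumptions chosen for illustration; only the 2e-4 peak rate, the three epochs, and the 40k dataset with a 10% hold-out come from this card. Step counts depend on what is being counted: with an effective batch in the hundreds, three epochs come to a few hundred optimizer updates, whereas counting individual examples or micro-batches gives a far larger number.

```python
import math


def optimizer_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Number of optimizer updates, counting a partial final batch per epoch."""
    return math.ceil(num_examples / effective_batch) * epochs


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float = 2e-4) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Assumed values: 36,000 training examples (90% of 40k), an effective
# batch of 256 (e.g. micro-batch 16 x 16 accumulation steps), 3 epochs.
total = optimizer_steps(36_000, 256, 3)
warmup = round(0.03 * total)  # 3% warm-up is an assumption
print(total, warmup)
for s in (0, warmup, total // 2, total):
    print(s, f"{lr_at(s, total, warmup):.2e}")
```

The learning rate rises linearly from zero during warm-up, peaks at 2e-4, and falls to zero at the final step.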

Infrastructure: The fine-tuning was implemented using Hugging Face Transformers and the PEFT library. The PEFT (Parameter-Efficient Fine-Tuning) framework was used to apply LoRA to the base model’s transformer layers, keeping most weights frozen. This significantly reduces memory and compute requirements. Checkpoints were saved in the LoRA format (adapters), enabling easy loading on top of the original model. An A100 80GB GPU was chosen for its high memory, which allowed relatively large batch sizes or higher precision training, speeding up the training to ~30 hours. If a smaller GPU were used, techniques like QLoRA (4-bit quantization during training) could be employed, but in this case the A100’s capacity meant standard half-precision LoRA was feasible.

Evaluation

The Me Llamo Llama adapter was evaluated on held-out conversations and example scenarios to verify its effectiveness as a grammar tutor. According to the thesis’ evaluation chapter, the model’s performance was assessed both quantitatively and qualitatively:

Validation Metrics: During training, the validation loss was monitored on a set of unseen prompt-response pairs. The final model achieved a significantly lower perplexity (the exponential of the mean per-token loss) on the validation set than the baseline (unadapted) model, indicating that it learned the target behavior. In particular, on grammar mistakes that the base model struggled to correct, the fine-tuned model’s loss dropped substantially, reflecting its improved accuracy in generating the correct responses.
Grammar Correction Accuracy: A sample-based evaluation was conducted to measure how often the model correctly identifies and corrects errors. The thesis reports that Me Llamo Llama was able to correct the majority of grammatical mistakes in the test prompts. In an illustrative test of, say, 100 sentences with known errors, the model corrected on the order of 85-90% of the errors with appropriate fixes. This was a notable improvement over the base Spanish model’s performance on the same set. Some error types (like basic conjugation or article-noun agreement) were almost always fixed, while more complex issues (e.g., subtle subjunctive uses or idiomatic errors) had a lower success rate.
Qualitative Analysis: The author includes example dialogues in the thesis showing Me Llamo Llama in action. In these examples, the model often responds with the corrected sentence and a brief explanation or a follow-up question. For instance, if a user said "Yo no fui a la fiesta porque estoy enfermo" with a gender agreement error, the model might reply: "Entiendo. Deberías decir 'estoy enferma' si eres mujer. ¿Te encuentras mejor ahora?" demonstrating both the correction and continuing the conversation. The thesis notes that this style of response — combining correction with an engaging follow-up — was generally well-received in a small user study.
Comparison to Other Systems: Me Llamo Llama’s outputs were informally compared to those of general models like ChatGPT or grammar correction tools. While large general models can also correct Spanish grammar, Me Llamo Llama’s advantage is in its tailored approach: it stays in Spanish, focuses on the correction task, and does so in a conversationally natural way. The evaluation suggests that the specialized fine-tuning made the model more consistent in providing useful corrections and explanations in Spanish, without drifting off topic or switching languages (which sometimes happened with the base model).
Limitations in Evaluation: The thesis acknowledges that evaluating conversational correctness is partly subjective. While the model was very good at textbook grammar corrections, it was occasionally too prescriptive (for example, correcting colloquial but acceptable usage, or favoring formal speech). Additionally, the model’s fluency might mask subtle errors — it could produce a very fluent response that still contains a minor mistake. These issues were identified via careful human review of the model’s outputs. Future work could involve more rigorous evaluation metrics or user testing to quantify the educational impact (e.g., do learners improve when using the model?).
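A per-error-type breakdown like the one described under Grammar Correction Accuracy can be computed from human judgments as sketched below. The function name and the toy records are invented for illustration and are not results from the thesis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def correction_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each error type to the fraction of its test items fixed correctly."""
    fixed = defaultdict(int)
    total = defaultdict(int)
    for error_type, was_fixed in results:
        total[error_type] += 1
        fixed[error_type] += int(was_fixed)
    return {t: fixed[t] / total[t] for t in total}


# Invented toy judgments (error type, corrected?) for illustration only;
# these are NOT results from the thesis.
toy_results = [
    ("conjugation", True), ("conjugation", True),
    ("agreement", True), ("agreement", True),
    ("subjunctive", True), ("subjunctive", False),
]
rates = correction_rates(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
print(rates, round(overall, 2))
```

Reporting rates per error type alongside the overall figure shows where the tutor is reliable and where its corrections need closer human review.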

Overall, the evaluation in the thesis concluded that Me Llamo Llama successfully fulfills its role as a Spanish grammar tutor, significantly improving the base model’s ability to correct errors in context. The model’s responses were generally accurate and appropriately didactic, though not perfect. There remains room for improvement in handling edge cases and ensuring that explanations are always correct and clear.

Environmental Impact

Training a language model adapter has computational costs. We estimate the environmental impact of training Me Llamo Llama using the Machine Learning Impact Calculator (Lacoste et al., 2019):

Hardware Type: Single NVIDIA A100 80GB GPU (data center-grade GPU).
Hours Used: ~30 hours of training time.
Cloud Provider / Location: Google Colab. The training was performed on a cloud GPU instance.
Energy Consumption: The A100 GPU has a TDP up to ~400W. Assuming an average usage of 300W during training, 30 hours would consume roughly 9 kWh of electricity.
Carbon Emitted: Using a global average of ~0.5 kg CO₂ per kWh, the training run emitted approximately 4.5 kg of CO₂ (9 kWh × 0.5 kg/kWh). This is a relatively small footprint thanks to the efficiency of adapter fine-tuning (only 30 hours on one GPU) compared to full model training from scratch.

(These numbers are estimates; actual emissions could vary based on the specific energy source of the computing facility. For example, a renewable-energy-powered facility would result in lower carbon emissions than the estimate above.)
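The arithmetic behind these figures is reproduced below as a small sketch. The function name `training_footprint` is hypothetical, and the calculation covers the GPU only, ignoring the host machine and data-center overhead such as cooling.

```python
def training_footprint(avg_power_watts: float, hours: float,
                       kg_co2_per_kwh: float) -> tuple:
    """Return (energy in kWh, emissions in kg CO2) for a training run."""
    energy_kwh = avg_power_watts / 1000.0 * hours
    return energy_kwh, energy_kwh * kg_co2_per_kwh


# Figures from this section: 300 W average draw, 30 hours, 0.5 kg CO2/kWh.
energy, co2 = training_footprint(300, 30, 0.5)
print(f"{energy:.1f} kWh, {co2:.1f} kg CO2")  # 9.0 kWh, 4.5 kg CO2

# Upper bound if the GPU ran at its full ~400 W TDP the whole time.
_, co2_max = training_footprint(400, 30, 0.5)
print(f"{co2_max:.1f} kg CO2")  # 6.0 kg CO2
```

Substituting the carbon intensity of the actual hosting region gives a more specific estimate.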

By focusing on LoRA fine-tuning instead of training a large model from scratch, the project significantly reduced the environmental impact. The base model (LLaMA3-8B) had already been pre-trained by Meta, and Me Llamo Llama’s additional training was relatively lightweight. Researchers and practitioners are encouraged to continue using such parameter-efficient fine-tuning techniques to minimize carbon footprint in NLP development.

Model Architecture and Compute

Model Architecture: Me Llamo Llama leverages the LLaMA 3 8B Instruct model architecture, which is a transformer-based causal language model. It consists of a stack of self-attention layers (decoder-only, since it generates text) with approximately 8 billion parameters. The architecture is identical to the base LLaMA3 model’s architecture (with multiple attention heads, feed-forward networks, layer normalization, etc., similar in design to LLaMA2 and other GPT-style models). The LoRA adapter introduces additional weight matrices at certain layers (e.g., in the query and value projection matrices of the transformer) of much smaller dimension (rank) that adjust the outputs. At inference time, these LoRA weights are combined with the base model weights to produce the final result, effectively yielding a model that behaves as if it were fully fine-tuned on the grammar task.
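For an adapted weight matrix W, LoRA learns two small matrices A (rank × input width) and B (output width × rank), and the effective weight becomes W + (alpha / r) · B A. The sketch below counts the parameters this adds for the Llama 3 8B layer shapes. This card does not list which modules were adapted; the roughly 42 million trainable parameters quoted in the Model Description are consistent with rank 16 applied to all seven linear projections in every layer, which is an inference from the arithmetic and not a documented setting.

```python
# Llama 3 8B dimensions from the published model configuration.
HIDDEN = 4096          # model (embedding) width
INTERMEDIATE = 14336   # feed-forward inner width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

# (input features, output features) of each linear projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(target_modules, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted matrix."""
    per_layer = sum(rank * (PROJECTIONS[m][0] + PROJECTIONS[m][1])
                    for m in target_modules)
    return per_layer * NUM_LAYERS


print(lora_param_count(["q_proj", "v_proj"], rank=16))   # 6815744
print(lora_param_count(list(PROJECTIONS), rank=16))      # 41943040
```

Adapting only the query and value projections would add about 6.8 million parameters, while adapting every projection adds about 41.9 million.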

Compute Infrastructure: Training was performed on a high-memory GPU to accommodate the model and dataset:

Hardware: NVIDIA A100 80GB PCIe GPU. The 80GB VRAM allowed training in half precision without gradient offloading. CPU usage was minimal aside from data loading. No multi-GPU or distributed training was needed due to the relatively moderate model size and dataset.
Software: The model was trained using PyTorch with Hugging Face Transformers (for the LLaMA model implementation) and the PEFT library for applying LoRA. The training code ran in an environment with Python 3.x, and leveraged tools like Hugging Face Accelerate for device placement. The A100’s tensor cores were utilized (through mixed precision) to speed up matrix operations.
Memory & Precision: Using bfloat16/FP16 precision, the 8B model plus optimizer states fit comfortably in 80GB. The largest memory consumers were the frozen base weights (about 16 GB in 16-bit precision) and the activations of the transformer layers; because AdamW keeps moment vectors only for trainable parameters, the optimizer state for the LoRA weights was comparatively small. The choice of a single A100 80GB was driven by convenience and availability; in practice, smaller GPUs could fine-tune this model with gradient checkpointing or 8-bit optimizers, though with longer training time.

This compute setup ensured that the fine-tuning could be completed in roughly 30 hours. Importantly, because only the LoRA weights (about 42 million parameters) were being updated, the memory and compute requirements were much lower than pretraining a new 8B model from scratch. This demonstrates the efficiency of the approach in terms of both time and resource utilization.
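A back-of-the-envelope estimate of the static memory involved is sketched below. The function name and the precision choices (16-bit frozen weights; FP32 LoRA weights, gradients, and AdamW moments) are assumptions, and the parameter counts are approximate. Activations are excluded, although they can dominate at large batch sizes or long sequence lengths.

```python
GIB = 1024 ** 3


def lora_training_memory_gib(base_params: float, lora_params: float,
                             base_bytes: int = 2) -> dict:
    """Rough static memory for LoRA training, excluding activations."""
    weights = base_params * base_bytes   # frozen base weights (16-bit)
    adapter = lora_params * 4            # trainable LoRA weights (FP32)
    grads = lora_params * 4              # gradients for LoRA weights only
    adam = lora_params * 2 * 4           # AdamW first and second moments
    parts = {"base_weights": weights, "lora_weights": adapter,
             "lora_grads": grads, "adamw_moments": adam}
    return {name: size / GIB for name, size in parts.items()}


# Approximate sizes: 8.03e9 base parameters, 42e6 LoRA parameters.
estimate = lora_training_memory_gib(8.03e9, 42e6)
for name, gib in estimate.items():
    print(f"{name:>14}: {gib:6.2f} GiB")
print(f"{'total':>14}: {sum(estimate.values()):6.2f} GiB")
```

Under these assumptions the frozen weights account for almost all of the static footprint, which is why quantizing them (as in QLoRA) is the usual route to fitting the model on smaller GPUs.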

Citation

If you use the Me Llamo Llama adapter or refer to the methodology, please cite the original thesis where this work was introduced:

BibTeX:

@mastersthesis{Wolkenstein2025MeLlamoLlama,
  author       = {Julia Wolkenstein},
  title        = {{Me Llamo Llama}: Developing an Assistant Bot for Spanish Language Learning Using Open-Access Small Language Models: Evaluating the Potential of Smaller Models to Replicate Capabilities of Commercial Systems},
  school       = {National Research University Higher School of Economics},
  year         = 2025
}

APA: Wolkenstein, J. (2025). Me Llamo Llama: Developing an Assistant Bot for Spanish Language Learning Using Open-Access Small Language Models: Evaluating the Potential of Smaller Models to Replicate Capabilities of Commercial Systems (Master’s thesis, National Research University Higher School of Economics).
