Library: transformers

Author: yarenty
Model type: Llama 3.2 (fine-tuned)
Task: Instruction-following, code Q/A, DataFusion expert assistant
License: Apache 2.0
Visibility: Public


Llama 3.2 DataFusion Instruct

This model is a fine-tuned version of meta-llama/Llama-3.2-3B-Instruct, specialized for the Apache Arrow DataFusion ecosystem. It's designed to be a helpful assistant for developers, answering technical questions, generating code, and explaining concepts related to DataFusion, Arrow.rs, Ballista, and the broader Rust data engineering landscape.

GGUF Version: For quantized, low-resource deployment, you can find the GGUF version here.
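
If you want to try the GGUF build locally, llama-cpp-python is one common runtime. The snippet below is a minimal sketch, not part of the original card: the GGUF file name, quantization level, and context size are assumptions, so substitute the actual file you download.

from llama_cpp import Llama

# Hypothetical file name; replace with the GGUF file you actually downloaded.
llm = Llama(model_path="llama32-datafusion-instruct.Q4_K_M.gguf", n_ctx=4096)

prompt = "### Instruction:\nHow do I register a Parquet file in DataFusion?\n\n### Response:"
out = llm(prompt, max_tokens=256, stop=["### Instruction:"])
print(out["choices"][0]["text"])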

Model Description

This model was fine-tuned on a curated dataset of high-quality question-answer pairs and instruction-following examples sourced from the official DataFusion documentation, source code, mailing lists, and community discussions.

  • Model Type: Instruction-following Large Language Model (LLM)
  • Base Model: meta-llama/Llama-3.2-3B-Instruct
  • Parameters: ~3.21B (F32, Safetensors)
  • Primary Use: Developer assistant for the DataFusion ecosystem.

Prompt Template

To get the best results, format your prompts using the following instruction template.

### Instruction:
{Your question or instruction here}

### Response:

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yarenty/llama32-datafusion-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model was trained with a specific instruction template.
# For optimal performance, your prompt should follow this structure.
prompt_template = """### Instruction:
How do I register a Parquet file in DataFusion?

### Response:"""

inputs = tokenizer(prompt_template, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id)

# Decode the output, skipping special tokens and the prompt
prompt_length = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
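
As an alternative to loading the model and tokenizer manually, the same prompt can be run through the high-level pipeline API. This is a convenience sketch, not part of the original card, and assumes the default generation settings are acceptable.

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="yarenty/llama32-datafusion-instruct",
    device_map="auto",
)

prompt = "### Instruction:\nHow do I register a Parquet file in DataFusion?\n\n### Response:"
# return_full_text=False strips the prompt from the generated output.
result = generator(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])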

Training Procedure

  • Hardware: Trained on 1x NVIDIA A100 GPU.
  • Training Script: Custom script using trl.SFTTrainer (see the sketch after this list).
  • Key Hyperparameters:
    • Epochs: 3
    • Learning Rate: 2e-5
    • Batch Size: 4
  • Dataset: A curated dataset of ~5,000 high-quality QA pairs and instructions related to DataFusion. Data was cleaned and deduplicated as per the notes in pitfalls.md.
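
The exact training script is not published; the following is a minimal sketch of an equivalent trl.SFTTrainer setup using the hyperparameters listed above. The dataset file name and text field are assumptions for illustration only.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical dataset file; the curated QA data is not distributed with this card.
dataset = load_dataset("json", data_files="datafusion_instruct.jsonl", split="train")

config = SFTConfig(
    output_dir="llama32-datafusion-instruct",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    dataset_text_field="text",  # each record holds a full "### Instruction ... ### Response ..." string
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-3B-Instruct",
    args=config,
    train_dataset=dataset,
)
trainer.train()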

Intended Use & Limitations

  • Intended Use: This model is intended for developers and data engineers working with DataFusion. It can be used for code generation, debugging assistance, and learning the library. It can also serve as a strong base for further fine-tuning on more specialized data.
  • Limitations: The model's knowledge is limited to the data it was trained on. It may produce inaccurate or outdated information for rapidly evolving parts of the library. It is not a substitute for official documentation or expert human review.

Citation

If you find this model useful in your work, please cite:

@misc{yarenty_2025_llama32_datafusion_instruct,
  author = {yarenty},
  title = {Llama 3.2 DataFusion Instruct},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/yarenty/llama32-datafusion-instruct}}
}

Contact

For questions or feedback, please open an issue on the Hugging Face repository or the source GitHub repository.
