Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 Vision Instruct fine-tuned on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight.

Model Details

Model Description

  • Developed by: pdufour
  • Model type: Vision Language Model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

How to Get Started with the Model

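from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit precision (requires bitsandbytes), then attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "pdufour/Llama-3.2-11B-Vision-WebSight")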
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

inputs = processor(text="Generate code for a web page that looks exactly like this. <|image|>", images=Image.open("fashion.jpg"), return_tensors="pt").to(model.device)

with torch.no_grad():
    # Pass every processor output (pixel values and masks), not only input_ids, so the model actually sees the image.
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
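
The decoded string typically begins with the prompt text, followed by the generated markup. The following is a minimal sketch of pulling out just the HTML so it can be saved and opened in a browser. It assumes the model emits a complete HTML document, as the WebSight targets do. `extract_html` and the `sample` string are hypothetical and not part of this model's repository.

```python
import re

def extract_html(generated_text: str) -> str:
    """Return the first HTML document found in the text, or the stripped text if none is found."""
    pattern = r"(<!DOCTYPE html.*?</html>|<html.*?</html>)"
    match = re.search(pattern, generated_text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else generated_text.strip()

# Hypothetical decoded output: prompt text followed by generated markup.
sample = (
    "Generate code for a web page that looks exactly like this.\n"
    "<!DOCTYPE html><html><body><h1>Shop</h1></body></html>"
)
html = extract_html(sample)
print(html)
```

In practice, `sample` would be replaced by the string returned from `tokenizer.decode(...)` above.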

Training Details

Training Data

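10k samples from HuggingFaceM4/WebSight (https://huggingface.co/datasets/HuggingFaceM4/WebSight), a synthetic dataset of web page screenshots paired with their HTML source code, used for instruction tuning.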

Training Procedure

Training Hyperparameters

  • Training regime: Fine-tuning with LoRA
  • Learning rate: 0.0002
  • Batch size: 10
  • Gradient accumulation steps: 1
  • Number of epochs: 3.0
  • Optimizer: adamw_torch_fused
  • LR scheduler type: constant
  • Weight decay: 0.0
  • FP16 Training: False
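
For reproduction, the values listed above can be collected in a dictionary keyed by the corresponding `transformers.TrainingArguments` field names. This is a sketch only: the card does not say which trainer was used, and treating "Batch size" as the per-device batch size is an assumption.

```python
# Hyperparameters from the list above, keyed by TrainingArguments field names.
# Assumption: "Batch size: 10" refers to the per-device batch size.
training_args = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 10,
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 3.0,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "constant",
    "weight_decay": 0.0,
    "fp16": False,
}

NUM_GPUS = 1  # from the Compute Infrastructure section of this card

# Samples consumed per optimizer step.
effective_batch_size = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
    * NUM_GPUS
)
print(effective_batch_size)
```

The dictionary could then be unpacked into `TrainingArguments(output_dir=..., **training_args)`, with the output directory chosen by the user.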

Speeds, Sizes, Times

  • Training Duration: Unknown
  • Number of Trainable Parameters: Unknown
  • Model Size: 0.08 GB

Evaluation

Metrics

Results

  • epoch: 0.9000
  • grad_norm: 0.2568
  • learning_rate: 0.0002
  • loss: 0.0791
  • step: 900
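
The logged step and epoch are consistent with the stated dataset size and batch size, as the arithmetic below shows. It is a sketch that assumes all 10k samples were used for training and that one step processes one batch of 10. The card does not state whether training continued past this logged step.

```python
import math

NUM_SAMPLES = 10_000  # "10k samples" from WebSight (assumed to be all training data)
BATCH_SIZE = 10
GRAD_ACCUM = 1
NUM_EPOCHS = 3.0

# Optimizer steps needed to see every sample once.
steps_per_epoch = math.ceil(NUM_SAMPLES / (BATCH_SIZE * GRAD_ACCUM))
# Steps in the full configured run.
total_steps = int(steps_per_epoch * NUM_EPOCHS)
# Fraction of an epoch completed at the logged step.
epoch_at_step_900 = 900 / steps_per_epoch

print(steps_per_epoch, total_steps, epoch_at_step_900)
```

Under these assumptions, step 900 falls at epoch 0.9, matching the values reported above, and a full 3-epoch run would take 3000 steps.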

Technical Specifications

Model Architecture and Objective

LoRA-tuned Vision-Language Model based on Llama architecture.
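
As background, LoRA keeps each adapted base weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) · B · A. The pure-Python sketch below illustrates that update and why the adapter is small. The rank, alpha, and matrix dimensions are illustrative only; this card does not state the values used for this model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight seen at inference time."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds for one d_out x d_in matrix."""
    return r * (d_in + d_out)

# Toy example with d_out=2, d_in=3, r=1 (illustrative values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[0.5], [0.0]]      # d_out x r
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
print(W_eff)

# For a hypothetical 4096 x 4096 projection at rank 8:
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Because only A and B are trained and saved, the adapter file is a small fraction of the size of the base model.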

Compute Infrastructure

  • Hardware Type: GPU
  • Number of GPUs: 1

Software

  • Framework versions:
    • PEFT 0.13.2
    • PyTorch 2.5.0+cu121

Model Card Contact

For questions about this model, please file an issue on the GitHub repository.
