
FLAN-T5 Base CoreML Models (HIGH QUALITY VERSION)

This repository contains high-quality CoreML versions of Google's FLAN-T5 Base model, optimized for production use on Apple devices (macOS/iOS) with preserved model quality and proper attention mechanisms.

⚠️ Important Update - Quality Preserved

Version 3.0: This repository now contains quality-preserved models that maintain the original PyTorch model's output quality. Previous versions suffered from significant quality degradation due to precision loss and architectural modifications. This has been completely resolved using proper conversion techniques.

Model Details

  • Base Model: google/flan-t5-base
  • Architecture: T5 (Text-to-Text Transfer Transformer)
  • Model Size:
    • FP32 (Quality): Encoder 430MB, Decoder 647MB = 1.1GB total
    • INT8 (Mobile): Encoder 108MB, Decoder 164MB = 272MB total (4x smaller)
  • Framework: CoreML (.mlpackage format)
  • Precision: FP32 for maximum quality preservation
  • Deployment Target: iOS 15+ / macOS 12+
  • Max Sequence Length: 512 tokens (original model dimensions preserved)

Files

Model Files

High-Quality Models (FP32)

  • flan_t5_base_encoder_quality.mlpackage - T5 Encoder component (512 tokens, FP32, 430MB)
  • flan_t5_base_decoder_quality.mlpackage - T5 Decoder component (512 tokens, FP32, 647MB)

Quantized Models (INT8) - Recommended for Mobile

  • flan_t5_base_encoder_int8.mlpackage - T5 Encoder component (512 tokens, INT8, 108MB)
  • flan_t5_base_decoder_int8.mlpackage - T5 Decoder component (512 tokens, INT8, 164MB)

Tokenizer Files

  • tokenizer.json - Fast tokenizer configuration
  • tokenizer_config.json - Tokenizer metadata and settings
  • special_tokens_map.json - Special token mappings
  • spiece.model - SentencePiece model for tokenization

Model Architecture

FLAN-T5 is an encoder-decoder transformer model that has been converted into two separate CoreML models with preserved quality and attention mechanisms:

Encoder

  • Input: input_ids (shape: [1, 512], dtype: int32), attention_mask (shape: [1, 512], dtype: int32)
  • Output: hidden_states (shape: [1, 512, 768], dtype: float32)

Decoder

  • Inputs:
    • decoder_input_ids (shape: [1, 512], dtype: int32)
    • encoder_hidden_states (shape: [1, 512, 768], dtype: float32)
    • decoder_attention_mask (shape: [1, 512], dtype: int32)
    • encoder_attention_mask (shape: [1, 512], dtype: int32)
  • Output: logits (shape: [1, 512, 32128], dtype: float32)
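
These interfaces can be verified locally, since coremltools exposes each model's declared inputs and outputs. A minimal sketch (assuming the packages sit in ./models, as in the download commands below):

import coremltools as ct

# Print the decoder's declared interface; the names and shapes should
# match the lists above.
decoder = ct.models.MLModel("./models/flan_t5_base_decoder_quality.mlpackage")
desc = decoder.get_spec().description
for feature in desc.input:
    print("input: ", feature.name)
for feature in desc.output:
    print("output:", feature.name)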

✅ Verified Quality Features

  • ✅ High Output Quality: Produces sensible, coherent text outputs matching the PyTorch baseline
  • ✅ Proper Translations: French/German translations work correctly
  • ✅ Multiple Tasks: Translation, summarization, and question answering are all functional
  • ✅ Preserved Precision: FP32 precision maintains model accuracy
  • ✅ Original Architecture: 512-token sequences preserve full model capabilities
  • ✅ Production Ready: Suitable for real-world applications
  • ✅ Mobile Optimized: INT8 quantized versions for deployment on iOS devices

🔄 Model Variants

Choose the right model for your use case:

Model Type    Size   Use Case                       Quality    Memory
FP32 Quality  1.1GB  Server/Desktop apps, Research  Highest    High
INT8 Mobile   272MB  iOS/Mobile apps, Production    Very Good  Low

Recommendations:

  • iOS/Mobile Apps: Use INT8 models for better performance and lower memory usage
  • Server/Desktop: Use FP32 models for maximum quality
  • Development/Testing: Start with INT8, upgrade to FP32 if needed
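
If you need a size/quality trade-off other than the two shipped variants, weight-only quantization can be reproduced with coremltools' optimize API. The sketch below is the standard coremltools recipe for linear INT8 weight quantization, not necessarily the exact script used to produce the INT8 files in this repository:

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Linear (symmetric) INT8 weight quantization of the FP32 encoder.
model = ct.models.MLModel("./models/flan_t5_base_encoder_quality.mlpackage")
op_config = OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
config = OptimizationConfig(global_config=op_config)
quantized = linear_quantize_weights(model, config=config)
quantized.save("./models/flan_t5_base_encoder_int8_local.mlpackage")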

Usage

Download Models

# Download complete repository
huggingface-cli download mazhewitt/flan-t5-base-coreml --local-dir ./models

# Download specific models (choose quality vs mobile-optimized)
# High-quality FP32 models (each .mlpackage is a directory, so pull it with --include)
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_encoder_quality.mlpackage/*" --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_decoder_quality.mlpackage/*" --local-dir ./models

# Mobile-optimized INT8 models (recommended for iOS/mobile apps)
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_encoder_int8.mlpackage/*" --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_decoder_int8.mlpackage/*" --local-dir ./models

Python Usage with Greedy Text Generation

import coremltools as ct
import numpy as np
from transformers import T5Tokenizer

# Load models and tokenizer (paths assume the download commands above,
# which place everything under ./models)
# Option 1: High-quality FP32 models (1.1GB)
encoder = ct.models.MLModel("./models/flan_t5_base_encoder_quality.mlpackage")
decoder = ct.models.MLModel("./models/flan_t5_base_decoder_quality.mlpackage")

# Option 2: Mobile-optimized INT8 models (272MB) - Recommended for iOS apps
# encoder = ct.models.MLModel("./models/flan_t5_base_encoder_int8.mlpackage")
# decoder = ct.models.MLModel("./models/flan_t5_base_decoder_int8.mlpackage")

tokenizer = T5Tokenizer.from_pretrained("./models")

# Example: Translation with high-quality generation
input_text = "translate English to French: Hello world"
inputs = tokenizer(input_text, return_tensors="np", padding="max_length", 
                  truncation=True, max_length=512)

# Run encoder
encoder_output = encoder.predict({
    "input_ids": inputs["input_ids"].astype(np.int32),
    "attention_mask": inputs["attention_mask"].astype(np.int32)
})
hidden_states = encoder_output["hidden_states"]

# Greedy decoding: feed the growing prefix back into the decoder each step
generated_tokens = [tokenizer.pad_token_id]  # T5 uses the pad token as the decoder start token
max_new_tokens = 10
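# Note: each step below re-runs the full 512-position decoder from scratch
# (this export has no KV cache), so every iteration is a full forward pass.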

for _ in range(max_new_tokens):
    # Prepare decoder input
    decoder_ids = np.zeros((1, 512), dtype=np.int32)
    decoder_mask = np.zeros((1, 512), dtype=np.int32)
    
    for i, token in enumerate(generated_tokens):
        decoder_ids[0, i] = token
        decoder_mask[0, i] = 1
    
    # Run decoder
    decoder_output = decoder.predict({
        "decoder_input_ids": decoder_ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": decoder_mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32)
    })
    
    # Get next token: the logits at the last filled position predict
    # the token that follows it
    last_pos = len(generated_tokens) - 1
    logits = decoder_output["logits"]
    next_token = np.argmax(logits[0, last_pos, :])
    
    # Stop if EOS token
    if next_token == tokenizer.eos_token_id:
        break
        
    generated_tokens.append(int(next_token))

# Decode result (skip initial pad token)
result = tokenizer.decode(generated_tokens[1:], skip_special_tokens=True)
print(f"Translation: {result}")

Swift/iOS Usage

import CoreML

// Load models. Note: Xcode compiles .mlpackage resources to .mlmodelc at
// build time, so look the compiled model up at runtime.
// Option 1: High-quality FP32 models
guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_quality", withExtension: "mlmodelc"),
      let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_quality", withExtension: "mlmodelc") else {
    fatalError("Models not found")
}

// Option 2: Mobile-optimized INT8 models (recommended for iOS apps)
// guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_int8", withExtension: "mlmodelc"),
//       let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_int8", withExtension: "mlmodelc") else {
//     fatalError("Models not found")
// }

let encoderModel = try MLModel(contentsOf: encoderURL)
let decoderModel = try MLModel(contentsOf: decoderURL)

// Example inference (similar pattern to Python but with MLMultiArray)
// Note: You'll need to implement tokenization in Swift or use a bridging approach

Model Capabilities

FLAN-T5 has been instruction-tuned and can perform various text-to-text tasks:

  • Text Summarization: "summarize: [text]"
  • Translation: "translate English to French: [text]"
  • Question Answering: "answer the question: [question] context: [context]"
  • General Instructions: Direct natural language instructions
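
These prompt formats can be exercised end to end by wrapping the greedy loop from the Python usage section in a small helper. The function below is a hypothetical convenience wrapper, not part of the repository; it assumes encoder, decoder, and tokenizer are already loaded as shown above:

import numpy as np

def generate(prompt, max_new_tokens=32):
    # Encode the prompt once; the encoder output is reused at every step.
    inputs = tokenizer(prompt, return_tensors="np", padding="max_length",
                       truncation=True, max_length=512)
    encoder_mask = inputs["attention_mask"].astype(np.int32)
    enc = encoder.predict({
        "input_ids": inputs["input_ids"].astype(np.int32),
        "attention_mask": encoder_mask,
    })
    tokens = [tokenizer.pad_token_id]  # T5 decoder start token
    for _ in range(max_new_tokens):
        decoder_ids = np.zeros((1, 512), dtype=np.int32)
        decoder_mask = np.zeros((1, 512), dtype=np.int32)
        decoder_ids[0, :len(tokens)] = tokens
        decoder_mask[0, :len(tokens)] = 1
        out = decoder.predict({
            "decoder_input_ids": decoder_ids,
            "encoder_hidden_states": enc["hidden_states"],
            "decoder_attention_mask": decoder_mask,
            "encoder_attention_mask": encoder_mask,
        })
        next_token = int(np.argmax(out["logits"][0, len(tokens) - 1, :]))
        if next_token == tokenizer.eos_token_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:], skip_special_tokens=True)

print(generate("translate English to German: Good morning"))
print(generate("summarize: FLAN-T5 is an instruction-tuned encoder-decoder model ..."))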

Performance Considerations

  • Memory:
    • FP32 Models: ~1.1GB total (maximum quality)
    • INT8 Models: ~272MB total (4x smaller, mobile-optimized)
  • Precision: FP32 for quality, INT8 for mobile deployment
  • Sequence Length: Maximum 512 tokens (full original capacity)
  • Device Compatibility: Apple Neural Engine, GPU, or CPU depending on availability
  • Generation Speed: Optimized for real-time text generation on mobile devices
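
Which compute unit is actually used can be influenced at load time through coremltools' standard compute_units option; a minimal sketch:

import coremltools as ct

# Prefer CPU + Neural Engine; CoreML falls back per-layer when an op is
# unsupported on the ANE. Alternatives: ALL, CPU_ONLY, CPU_AND_GPU.
encoder = ct.models.MLModel(
    "./models/flan_t5_base_encoder_int8.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)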

Technical Notes

Quality Improvements in Version 3.0

  1. Output Quality: Models now produce sensible, coherent outputs matching the PyTorch baseline
  2. Precision Preservation: FP32 precision prevents quality degradation from quantization
  3. Full Architecture: 512-token sequences preserve complete model capabilities
  4. Minimal Modification: Conversion process preserves original model behavior patterns

Conversion Details

  • Source Framework: PyTorch/Transformers
  • Conversion Tool: CoreML Tools 8.3.0
  • Date: July 2025
  • Torch Version: 2.7.1 (with compatibility warnings handled)
  • Approach: Quality-focused conversion with minimal architectural changes
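
The exact conversion script is not published in this repository. For orientation, a hypothetical reconstruction of a quality-focused encoder conversion is sketched below; the EncoderWrapper class and the static (1, 512) input shapes are assumptions chosen to match the model details above:

import coremltools as ct
import numpy as np
import torch
from transformers import T5EncoderModel

class EncoderWrapper(torch.nn.Module):
    # Expose a trace-friendly forward that returns only the hidden states.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state

encoder = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()
example = (torch.zeros(1, 512, dtype=torch.int32),
           torch.ones(1, 512, dtype=torch.int32))
traced = torch.jit.trace(EncoderWrapper(encoder), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
            ct.TensorType(name="attention_mask", shape=(1, 512), dtype=np.int32)],
    outputs=[ct.TensorType(name="hidden_states")],
    compute_precision=ct.precision.FLOAT32,  # FP32 to preserve quality
    minimum_deployment_target=ct.target.iOS15,
)
mlmodel.save("flan_t5_base_encoder_quality.mlpackage")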

Testing and Verification

The models have been thoroughly tested to ensure:

  • ✅ High output quality matching the PyTorch baseline
  • ✅ Proper translations and text generation
  • ✅ Multiple task types function correctly
  • ✅ Consistent behavior across iOS/macOS platforms
  • ✅ No quality degradation from the conversion process

Troubleshooting

Common Issues

  1. Shape Mismatches: Ensure you're using max_length=512 for both encoder and decoder inputs
  2. Token Generation: Always start with tokenizer.pad_token_id for decoder input
  3. Memory: The FP32 pair requires ~1.1GB of memory for inference; the INT8 pair needs ~272MB
  4. Quality: If output seems degraded, verify you're using the latest quality-preserved models

Verification Test

To verify that causal attention survived the conversion, different decoder contexts should produce different next-token predictions:

# This should produce DIFFERENT results (proving causal attention works)
context_1 = [tokenizer.pad_token_id, 1000]  # Different token at position 1
context_2 = [tokenizer.pad_token_id, 2000]  # Different token at position 1
# Running decoder with these contexts should give different predictions
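
A runnable version of this check, assuming decoder, hidden_states, inputs, and tokenizer from the Python usage example above:

import numpy as np

def last_position_logits(tokens):
    # Build a padded decoder context and return the logits at the last
    # filled position (the prediction for the next token).
    ids = np.zeros((1, 512), dtype=np.int32)
    mask = np.zeros((1, 512), dtype=np.int32)
    ids[0, :len(tokens)] = tokens
    mask[0, :len(tokens)] = 1
    out = decoder.predict({
        "decoder_input_ids": ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32),
    })
    return out["logits"][0, len(tokens) - 1, :]

logits_1 = last_position_logits([tokenizer.pad_token_id, 1000])
logits_2 = last_position_logits([tokenizer.pad_token_id, 2000])
print("Contexts differ:", not np.allclose(logits_1, logits_2))  # expect True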

License

This model follows the same license as the original FLAN-T5 model. Please refer to the original model card for licensing details.

Citation

If you use these models, please cite the original FLAN-T5 paper:

@article{chung2022scaling,
  title={Scaling Instruction-Finetuned Language Models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}

Issues and Support

For issues specific to these CoreML conversions, please open an issue in this repository. For general FLAN-T5 questions, refer to the original model repository.


Version 3.0 - Quality Preserved ✅
High-quality models ready for production use in iOS/macOS applications
