
FLAN-T5 Base CoreML Models (HIGH QUALITY VERSION)

This repository contains high-quality CoreML versions of Google's FLAN-T5 Base model, optimized for production use on Apple devices (macOS/iOS) with preserved model quality and proper attention mechanisms.

⚠️ Important Update - Quality Preserved

Version 3.0: This repository now contains quality-preserved models that maintain the original PyTorch model's output quality. Previous versions suffered from significant quality degradation due to precision loss and architectural modifications. This has been completely resolved using proper conversion techniques.

Model Details

  • Base Model: google/flan-t5-base
  • Architecture: T5 (Text-to-Text Transfer Transformer)
  • Model Size:
    • FP32 (Quality): Encoder 430MB, Decoder 647MB = 1.1GB total
    • INT8 (Mobile): Encoder 108MB, Decoder 164MB = 272MB total (4x smaller)
  • Framework: CoreML (.mlpackage format)
  • Precision: FP32 for maximum quality preservation
  • Deployment Target: iOS 15+ / macOS 12+
  • Max Sequence Length: 512 tokens (original model dimensions preserved)

Files

Model Files

High-Quality Models (FP32)

  • flan_t5_base_encoder_quality.mlpackage - T5 Encoder component (512 tokens, FP32, 430MB)
  • flan_t5_base_decoder_quality.mlpackage - T5 Decoder component (512 tokens, FP32, 647MB)

Quantized Models (INT8) - Recommended for Mobile

  • flan_t5_base_encoder_int8.mlpackage - T5 Encoder component (512 tokens, INT8, 108MB)
  • flan_t5_base_decoder_int8.mlpackage - T5 Decoder component (512 tokens, INT8, 164MB)

Tokenizer Files

  • tokenizer.json - Fast tokenizer configuration
  • tokenizer_config.json - Tokenizer metadata and settings
  • special_tokens_map.json - Special token mappings
  • spiece.model - SentencePiece model for tokenization

Model Architecture

FLAN-T5 is an encoder-decoder transformer model that has been converted into two separate CoreML models with preserved quality and attention mechanisms:

Encoder

  • Input: input_ids (shape: [1, 512], dtype: int32), attention_mask (shape: [1, 512], dtype: int32)
  • Output: hidden_states (shape: [1, 512, 768], dtype: float32)

Decoder

  • Inputs:
    • decoder_input_ids (shape: [1, 512], dtype: int32)
    • encoder_hidden_states (shape: [1, 512, 768], dtype: float32)
    • decoder_attention_mask (shape: [1, 512], dtype: int32)
    • encoder_attention_mask (shape: [1, 512], dtype: int32)
  • Output: logits (shape: [1, 512, 32128], dtype: float32)
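
These interfaces can be verified locally, since coremltools exposes each model's declared inputs and outputs. A minimal sketch (assuming the packages sit in ./models, as in the download commands below):

import coremltools as ct

# Print the decoder's declared interface; the names and shapes should
# match the lists above.
decoder = ct.models.MLModel("./models/flan_t5_base_decoder_quality.mlpackage")
desc = decoder.get_spec().description
for feature in desc.input:
    print("input: ", feature.name)
for feature in desc.output:
    print("output:", feature.name)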

✅ Verified Quality Features

  • ✅ High Output Quality: Produces sensible, coherent text outputs matching the PyTorch baseline
  • ✅ Proper Translations: French/German translations work correctly
  • ✅ Multiple Tasks: Translation, summarization, and question answering are all functional
  • ✅ Preserved Precision: FP32 precision maintains model accuracy
  • ✅ Original Architecture: 512-token sequences preserve full model capabilities
  • ✅ Production Ready: Suitable for real-world applications
  • ✅ Mobile Optimized: INT8 quantized versions for deployment on iOS devices

🔄 Model Variants

Choose the right model for your use case:

Model Type    Size   Use Case                       Quality    Memory
FP32 Quality  1.1GB  Server/Desktop apps, Research  Highest    High
INT8 Mobile   272MB  iOS/Mobile apps, Production    Very Good  Low

Recommendations:

  • iOS/Mobile Apps: Use INT8 models for better performance and lower memory usage
  • Server/Desktop: Use FP32 models for maximum quality
  • Development/Testing: Start with INT8, upgrade to FP32 if needed
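
If you need a size/quality trade-off other than the two shipped variants, weight-only quantization can be reproduced with coremltools' optimize API. The sketch below is the standard coremltools recipe for linear INT8 weight quantization, not necessarily the exact script used to produce the INT8 files in this repository:

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Linear (symmetric) INT8 weight quantization of the FP32 encoder.
model = ct.models.MLModel("./models/flan_t5_base_encoder_quality.mlpackage")
op_config = OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
config = OptimizationConfig(global_config=op_config)
quantized = linear_quantize_weights(model, config=config)
quantized.save("./models/flan_t5_base_encoder_int8_local.mlpackage")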

Usage

Download Models

# Download complete repository
huggingface-cli download mazhewitt/flan-t5-base-coreml --local-dir ./models

# Download specific models (choose quality vs mobile-optimized)
# High-quality FP32 models (each .mlpackage is a directory, so pull it with --include)
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_encoder_quality.mlpackage/*" --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_decoder_quality.mlpackage/*" --local-dir ./models

# Mobile-optimized INT8 models (recommended for iOS/mobile apps)
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_encoder_int8.mlpackage/*" --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml --include "flan_t5_base_decoder_int8.mlpackage/*" --local-dir ./models

Python Usage with Greedy Text Generation

import coremltools as ct
import numpy as np
from transformers import T5Tokenizer

# Load models and tokenizer (paths assume the download commands above,
# which place everything under ./models)
# Option 1: High-quality FP32 models (1.1GB)
encoder = ct.models.MLModel("./models/flan_t5_base_encoder_quality.mlpackage")
decoder = ct.models.MLModel("./models/flan_t5_base_decoder_quality.mlpackage")

# Option 2: Mobile-optimized INT8 models (272MB) - Recommended for iOS apps
# encoder = ct.models.MLModel("./models/flan_t5_base_encoder_int8.mlpackage")
# decoder = ct.models.MLModel("./models/flan_t5_base_decoder_int8.mlpackage")

tokenizer = T5Tokenizer.from_pretrained("./models")

# Example: Translation with high-quality generation
input_text = "translate English to French: Hello world"
inputs = tokenizer(input_text, return_tensors="np", padding="max_length", 
                  truncation=True, max_length=512)

# Run encoder
encoder_output = encoder.predict({
    "input_ids": inputs["input_ids"].astype(np.int32),
    "attention_mask": inputs["attention_mask"].astype(np.int32)
})
hidden_states = encoder_output["hidden_states"]

# Greedy decoding: feed the growing prefix back into the decoder each step
generated_tokens = [tokenizer.pad_token_id]  # T5 uses the pad token as the decoder start token
max_new_tokens = 10
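# Note: each step below re-runs the full 512-position decoder from scratch
# (this export has no KV cache), so every iteration is a full forward pass.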

for _ in range(max_new_tokens):
    # Prepare decoder input
    decoder_ids = np.zeros((1, 512), dtype=np.int32)
    decoder_mask = np.zeros((1, 512), dtype=np.int32)
    
    for i, token in enumerate(generated_tokens):
        decoder_ids[0, i] = token
        decoder_mask[0, i] = 1
    
    # Run decoder
    decoder_output = decoder.predict({
        "decoder_input_ids": decoder_ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": decoder_mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32)
    })
    
    # Get next token: the logits at the last filled position predict
    # the token that follows it
    last_pos = len(generated_tokens) - 1
    logits = decoder_output["logits"]
    next_token = np.argmax(logits[0, last_pos, :])
    
    # Stop if EOS token
    if next_token == tokenizer.eos_token_id:
        break
        
    generated_tokens.append(int(next_token))

# Decode result (skip initial pad token)
result = tokenizer.decode(generated_tokens[1:], skip_special_tokens=True)
print(f"Translation: {result}")

Swift/iOS Usage

import CoreML

// Load models. Note: Xcode compiles .mlpackage resources to .mlmodelc at
// build time, so look the compiled model up at runtime.
// Option 1: High-quality FP32 models
guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_quality", withExtension: "mlmodelc"),
      let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_quality", withExtension: "mlmodelc") else {
    fatalError("Models not found")
}

// Option 2: Mobile-optimized INT8 models (recommended for iOS apps)
// guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_int8", withExtension: "mlmodelc"),
//       let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_int8", withExtension: "mlmodelc") else {
//     fatalError("Models not found")
// }

let encoderModel = try MLModel(contentsOf: encoderURL)
let decoderModel = try MLModel(contentsOf: decoderURL)

// Example inference (similar pattern to Python but with MLMultiArray)
// Note: You'll need to implement tokenization in Swift or use a bridging approach

Model Capabilities

FLAN-T5 has been instruction-tuned and can perform various text-to-text tasks:

  • Text Summarization: "summarize: [text]"
  • Translation: "translate English to French: [text]"
  • Question Answering: "answer the question: [question] context: [context]"
  • General Instructions: Direct natural language instructions
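
These prompt formats can be exercised end to end by wrapping the greedy loop from the Python usage section in a small helper. The function below is a hypothetical convenience wrapper, not part of the repository; it assumes encoder, decoder, and tokenizer are already loaded as shown above:

import numpy as np

def generate(prompt, max_new_tokens=32):
    # Encode the prompt once; the encoder output is reused at every step.
    inputs = tokenizer(prompt, return_tensors="np", padding="max_length",
                       truncation=True, max_length=512)
    encoder_mask = inputs["attention_mask"].astype(np.int32)
    enc = encoder.predict({
        "input_ids": inputs["input_ids"].astype(np.int32),
        "attention_mask": encoder_mask,
    })
    tokens = [tokenizer.pad_token_id]  # T5 decoder start token
    for _ in range(max_new_tokens):
        decoder_ids = np.zeros((1, 512), dtype=np.int32)
        decoder_mask = np.zeros((1, 512), dtype=np.int32)
        decoder_ids[0, :len(tokens)] = tokens
        decoder_mask[0, :len(tokens)] = 1
        out = decoder.predict({
            "decoder_input_ids": decoder_ids,
            "encoder_hidden_states": enc["hidden_states"],
            "decoder_attention_mask": decoder_mask,
            "encoder_attention_mask": encoder_mask,
        })
        next_token = int(np.argmax(out["logits"][0, len(tokens) - 1, :]))
        if next_token == tokenizer.eos_token_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:], skip_special_tokens=True)

print(generate("translate English to German: Good morning"))
print(generate("summarize: FLAN-T5 is an instruction-tuned encoder-decoder model ..."))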

Performance Considerations

  • Memory:
    • FP32 Models: ~1.1GB total (maximum quality)
    • INT8 Models: ~272MB total (4x smaller, mobile-optimized)
  • Precision: FP32 for quality, INT8 for mobile deployment
  • Sequence Length: Maximum 512 tokens (full original capacity)
  • Device Compatibility: Apple Neural Engine, GPU, or CPU depending on availability
  • Generation Speed: Optimized for real-time text generation on mobile devices
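
Which compute unit is actually used can be influenced at load time through coremltools' standard compute_units option; a minimal sketch:

import coremltools as ct

# Prefer CPU + Neural Engine; CoreML falls back per-layer when an op is
# unsupported on the ANE. Alternatives: ALL, CPU_ONLY, CPU_AND_GPU.
encoder = ct.models.MLModel(
    "./models/flan_t5_base_encoder_int8.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)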

Technical Notes

Quality Improvements in Version 3.0

  1. Output Quality: Models now produce sensible, coherent outputs matching the PyTorch baseline
  2. Precision Preservation: FP32 precision prevents quality degradation from quantization
  3. Full Architecture: 512-token sequences preserve complete model capabilities
  4. Minimal Modification: Conversion process preserves original model behavior patterns

Conversion Details

  • Source Framework: PyTorch/Transformers
  • Conversion Tool: CoreML Tools 8.3.0
  • Date: July 2025
  • Torch Version: 2.7.1 (with compatibility warnings handled)
  • Approach: Quality-focused conversion with minimal architectural changes
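
The exact conversion script is not published in this repository. For orientation, a hypothetical reconstruction of a quality-focused encoder conversion is sketched below; the EncoderWrapper class and the static (1, 512) input shapes are assumptions chosen to match the model details above:

import coremltools as ct
import numpy as np
import torch
from transformers import T5EncoderModel

class EncoderWrapper(torch.nn.Module):
    # Expose a trace-friendly forward that returns only the hidden states.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state

encoder = T5EncoderModel.from_pretrained("google/flan-t5-base").eval()
example = (torch.zeros(1, 512, dtype=torch.int32),
           torch.ones(1, 512, dtype=torch.int32))
traced = torch.jit.trace(EncoderWrapper(encoder), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
            ct.TensorType(name="attention_mask", shape=(1, 512), dtype=np.int32)],
    outputs=[ct.TensorType(name="hidden_states")],
    compute_precision=ct.precision.FLOAT32,  # FP32 to preserve quality
    minimum_deployment_target=ct.target.iOS15,
)
mlmodel.save("flan_t5_base_encoder_quality.mlpackage")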

Testing and Verification

The models have been thoroughly tested to ensure:

  • ✅ High output quality matching the PyTorch baseline
  • ✅ Proper translations and text generation
  • ✅ Multiple task types function correctly
  • ✅ Consistent behavior across iOS/macOS platforms
  • ✅ No quality degradation from the conversion process

Troubleshooting

Common Issues

  1. Shape Mismatches: Ensure you're using max_length=512 for both encoder and decoder inputs
  2. Token Generation: Always start with tokenizer.pad_token_id for decoder input
  3. Memory: The FP32 pair requires ~1.1GB of memory for inference; the INT8 pair needs ~272MB
  4. Quality: If output seems degraded, verify you're using the latest quality-preserved models

Verification Test

To verify that causal attention survived the conversion, different decoder contexts should produce different next-token predictions:

# This should produce DIFFERENT results (proving causal attention works)
context_1 = [tokenizer.pad_token_id, 1000]  # Different token at position 1
context_2 = [tokenizer.pad_token_id, 2000]  # Different token at position 1
# Running decoder with these contexts should give different predictions
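
A runnable version of this check, assuming decoder, hidden_states, inputs, and tokenizer from the Python usage example above:

import numpy as np

def last_position_logits(tokens):
    # Build a padded decoder context and return the logits at the last
    # filled position (the prediction for the next token).
    ids = np.zeros((1, 512), dtype=np.int32)
    mask = np.zeros((1, 512), dtype=np.int32)
    ids[0, :len(tokens)] = tokens
    mask[0, :len(tokens)] = 1
    out = decoder.predict({
        "decoder_input_ids": ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32),
    })
    return out["logits"][0, len(tokens) - 1, :]

logits_1 = last_position_logits([tokenizer.pad_token_id, 1000])
logits_2 = last_position_logits([tokenizer.pad_token_id, 2000])
print("Contexts differ:", not np.allclose(logits_1, logits_2))  # expect True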

License

This model follows the same license as the original FLAN-T5 model. Please refer to the original model card for licensing details.

Citation

If you use these models, please cite the original FLAN-T5 paper:

@article{chung2022scaling,
  title={Scaling Instruction-Finetuned Language Models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}

Issues and Support

For issues specific to these CoreML conversions, please open an issue in this repository. For general FLAN-T5 questions, refer to the original model repository.


Version 3.0 - Quality Preserved ✅
High-quality models ready for production use in iOS/macOS applications
