FLAN-T5 Base CoreML Models (HIGH QUALITY VERSION)
This repository contains high-quality CoreML versions of Google's FLAN-T5 Base model, optimized for production use on Apple devices (macOS/iOS) with preserved model quality and proper attention mechanisms.
⚠️ Important Update - Quality Preserved
Version 3.0: This repository now contains quality-preserved models that maintain the original PyTorch model's output quality. Previous versions suffered from significant quality degradation due to precision loss and architectural modifications. This has been completely resolved using proper conversion techniques.
Model Details
- Base Model: google/flan-t5-base
- Architecture: T5 (Text-to-Text Transfer Transformer)
- Model Size:
  - FP32 (Quality): Encoder 430MB + Decoder 647MB = 1.1GB total
  - INT8 (Mobile): Encoder 108MB + Decoder 164MB = 272MB total (4x smaller)
- Framework: CoreML (.mlpackage format)
- Precision: FP32 (quality variant) and INT8 (mobile variant)
- Deployment Target: iOS 15+ / macOS 12+
- Max Sequence Length: 512 tokens (original model dimensions preserved)
Files
Model Files
High-Quality Models (FP32)
- `flan_t5_base_encoder_quality.mlpackage` - T5 Encoder component (512 tokens, FP32, 430MB)
- `flan_t5_base_decoder_quality.mlpackage` - T5 Decoder component (512 tokens, FP32, 647MB)
Quantized Models (INT8) - Recommended for Mobile
- `flan_t5_base_encoder_int8.mlpackage` - T5 Encoder component (512 tokens, INT8, 108MB)
- `flan_t5_base_decoder_int8.mlpackage` - T5 Decoder component (512 tokens, INT8, 164MB)
Tokenizer Files
- `tokenizer.json` - Fast tokenizer configuration
- `tokenizer_config.json` - Tokenizer metadata and settings
- `special_tokens_map.json` - Special token mappings
- `spiece.model` - SentencePiece model for tokenization
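Either tokenizer flavor can be loaded from this directory with transformers. A minimal sketch, assuming the files above were downloaded to `./models`:

```python
from transformers import T5Tokenizer, T5TokenizerFast

slow = T5Tokenizer.from_pretrained("./models")      # uses spiece.model (SentencePiece)
fast = T5TokenizerFast.from_pretrained("./models")  # uses tokenizer.json
print(fast("translate English to French: Hello world")["input_ids"])
```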
Model Architecture
FLAN-T5 is an encoder-decoder transformer model that has been converted into two separate CoreML models with preserved quality and attention mechanisms:
Encoder
- Inputs:
  - `input_ids` (shape: [1, 512], dtype: int32)
  - `attention_mask` (shape: [1, 512], dtype: int32)
- Output:
  - `hidden_states` (shape: [1, 512, 768], dtype: float32)
Decoder
- Inputs:
  - `decoder_input_ids` (shape: [1, 512], dtype: int32)
  - `encoder_hidden_states` (shape: [1, 512, 768], dtype: float32)
  - `decoder_attention_mask` (shape: [1, 512], dtype: int32)
  - `encoder_attention_mask` (shape: [1, 512], dtype: int32)
- Output:
  - `logits` (shape: [1, 512, 32128], dtype: float32)
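To confirm these interfaces on a given machine, the model descriptions can be inspected with coremltools. A minimal sketch, assuming the packages were downloaded to `./models`:

```python
import coremltools as ct

# Paths assume the repository was downloaded to ./models
encoder = ct.models.MLModel("models/flan_t5_base_encoder_quality.mlpackage")
decoder = ct.models.MLModel("models/flan_t5_base_decoder_quality.mlpackage")
for model in (encoder, decoder):
    desc = model.get_spec().description
    print([inp.name for inp in desc.input], "->", [out.name for out in desc.output])
```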
✅ Verified Quality Features
- ✅ High Output Quality: Produces sensible, coherent text outputs matching the PyTorch baseline
- ✅ Proper Translations: French/German translations work correctly
- ✅ Multiple Tasks: Translation, summarization, and question answering are all functional
- ✅ Preserved Precision: FP32 precision maintains model accuracy
- ✅ Original Architecture: 512-token sequences preserve full model capabilities
- ✅ Production Ready: Suitable for real-world applications
- ✅ Mobile Optimized: INT8-quantized versions for deployment on iOS devices
Model Variants
Choose the right model for your use case:
| Model Type | Size | Use Case | Quality | Memory |
|---|---|---|---|---|
| FP32 Quality | 1.1GB | Server/Desktop apps, Research | Highest | High |
| INT8 Mobile | 272MB | iOS/Mobile apps, Production | Very Good | Low |
Recommendations:
- iOS/Mobile Apps: Use INT8 models for better performance and lower memory usage
- Server/Desktop: Use FP32 models for maximum quality
- Development/Testing: Start with INT8, upgrade to FP32 if needed
Usage
Download Models
```bash
# Download complete repository
huggingface-cli download mazhewitt/flan-t5-base-coreml --local-dir ./models

# Download specific models (choose quality vs mobile-optimized)
# High-quality FP32 models
huggingface-cli download mazhewitt/flan-t5-base-coreml flan_t5_base_encoder_quality.mlpackage --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml flan_t5_base_decoder_quality.mlpackage --local-dir ./models

# Mobile-optimized INT8 models (recommended for iOS/mobile apps)
huggingface-cli download mazhewitt/flan-t5-base-coreml flan_t5_base_encoder_int8.mlpackage --local-dir ./models
huggingface-cli download mazhewitt/flan-t5-base-coreml flan_t5_base_decoder_int8.mlpackage --local-dir ./models
```
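If you prefer to stay in Python, the same download can be done with `huggingface_hub` (a minimal sketch):

```python
from huggingface_hub import snapshot_download

# Fetches the model packages and tokenizer files into ./models
snapshot_download(repo_id="mazhewitt/flan-t5-base-coreml", local_dir="./models")
```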
Python Usage with Working Text Generation
```python
import coremltools as ct
import numpy as np
from transformers import T5Tokenizer

# Load models and tokenizer (paths assume the repository was downloaded to ./models)
# Option 1: High-quality FP32 models (1.1GB)
encoder = ct.models.MLModel("models/flan_t5_base_encoder_quality.mlpackage")
decoder = ct.models.MLModel("models/flan_t5_base_decoder_quality.mlpackage")

# Option 2: Mobile-optimized INT8 models (272MB) - Recommended for iOS apps
# encoder = ct.models.MLModel("models/flan_t5_base_encoder_int8.mlpackage")
# decoder = ct.models.MLModel("models/flan_t5_base_decoder_int8.mlpackage")

tokenizer = T5Tokenizer.from_pretrained("./models")

# Example: Translation with high-quality generation
input_text = "translate English to French: Hello world"
inputs = tokenizer(input_text, return_tensors="np", padding="max_length",
                   truncation=True, max_length=512)

# Run encoder
encoder_output = encoder.predict({
    "input_ids": inputs["input_ids"].astype(np.int32),
    "attention_mask": inputs["attention_mask"].astype(np.int32)
})
hidden_states = encoder_output["hidden_states"]

# Greedy generation (T5 decoding starts from the pad token)
generated_tokens = [tokenizer.pad_token_id]
max_new_tokens = 10

for _ in range(max_new_tokens):
    # Prepare fixed-size decoder inputs, filling only the generated positions
    decoder_ids = np.zeros((1, 512), dtype=np.int32)
    decoder_mask = np.zeros((1, 512), dtype=np.int32)
    decoder_ids[0, :len(generated_tokens)] = generated_tokens
    decoder_mask[0, :len(generated_tokens)] = 1

    # Run decoder
    decoder_output = decoder.predict({
        "decoder_input_ids": decoder_ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": decoder_mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32)
    })

    # The logits at the last filled position predict the next token
    logits = decoder_output["logits"]
    next_token = int(np.argmax(logits[0, len(generated_tokens) - 1, :]))

    # Stop if EOS token
    if next_token == tokenizer.eos_token_id:
        break
    generated_tokens.append(next_token)

# Decode result (skip initial pad token)
result = tokenizer.decode(generated_tokens[1:], skip_special_tokens=True)
print(f"Translation: {result}")
```
Swift/iOS Usage
```swift
import CoreML

// Xcode compiles .mlpackage files added to a target into .mlmodelc bundles,
// so compiled models are looked up at runtime with the "mlmodelc" extension.
// Option 1: High-quality FP32 models
guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_quality", withExtension: "mlmodelc"),
      let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_quality", withExtension: "mlmodelc") else {
    fatalError("Models not found")
}

// Option 2: Mobile-optimized INT8 models (recommended for iOS apps)
// guard let encoderURL = Bundle.main.url(forResource: "flan_t5_base_encoder_int8", withExtension: "mlmodelc"),
//       let decoderURL = Bundle.main.url(forResource: "flan_t5_base_decoder_int8", withExtension: "mlmodelc") else {
//     fatalError("Models not found")
// }

let encoderModel = try MLModel(contentsOf: encoderURL)
let decoderModel = try MLModel(contentsOf: decoderURL)

// Example inference (similar pattern to Python but with MLMultiArray)
// Note: You'll need to implement tokenization in Swift or use a bridging approach
```
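As a starting point, here is a hedged sketch of driving the encoder with `MLMultiArray` inputs; `runEncoder` and `tokenIds` are hypothetical names, and tokenization is assumed to happen elsewhere:

```swift
import CoreML

// Runs the encoder on a padded 512-token window.
// tokenIds holds the already-tokenized prompt (tokenization not shown).
func runEncoder(model: MLModel, tokenIds: [Int32]) throws -> MLMultiArray {
    let inputIds = try MLMultiArray(shape: [1, 512], dataType: .int32)
    let attentionMask = try MLMultiArray(shape: [1, 512], dataType: .int32)
    for i in 0..<512 {
        inputIds[i] = NSNumber(value: i < tokenIds.count ? tokenIds[i] : 0)
        attentionMask[i] = NSNumber(value: i < tokenIds.count ? 1 : 0)
    }
    let provider = try MLDictionaryFeatureProvider(dictionary: [
        "input_ids": inputIds,
        "attention_mask": attentionMask,
    ])
    let output = try model.prediction(from: provider)
    // hidden_states: [1, 512, 768] float32
    return output.featureValue(for: "hidden_states")!.multiArrayValue!
}
```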
Model Capabilities
FLAN-T5 has been instruction-tuned and can perform various text-to-text tasks:
- Text Summarization: "summarize: [text]"
- Translation: "translate English to French: [text]"
- Question Answering: "answer the question: [question] context: [context]"
- General Instructions: Direct natural language instructions
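All of these prompt formats can be driven through the same greedy loop shown in the Python example. A minimal sketch, assuming `encoder`, `decoder`, and `tokenizer` are already loaded as above; the `generate` helper is a hypothetical wrapper, not part of the repository:

```python
import numpy as np

def generate(prompt, max_new_tokens=32):
    """Greedy decoding; assumes `encoder`, `decoder`, and `tokenizer`
    are already loaded as in the Python example above."""
    enc = tokenizer(prompt, return_tensors="np", padding="max_length",
                    truncation=True, max_length=512)
    hidden = encoder.predict({
        "input_ids": enc["input_ids"].astype(np.int32),
        "attention_mask": enc["attention_mask"].astype(np.int32),
    })["hidden_states"]
    tokens = [tokenizer.pad_token_id]
    for _ in range(max_new_tokens):
        ids = np.zeros((1, 512), dtype=np.int32)
        mask = np.zeros((1, 512), dtype=np.int32)
        ids[0, :len(tokens)] = tokens
        mask[0, :len(tokens)] = 1
        logits = decoder.predict({
            "decoder_input_ids": ids,
            "encoder_hidden_states": hidden,
            "decoder_attention_mask": mask,
            "encoder_attention_mask": enc["attention_mask"].astype(np.int32),
        })["logits"]
        next_token = int(np.argmax(logits[0, len(tokens) - 1, :]))
        if next_token == tokenizer.eos_token_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:], skip_special_tokens=True)

print(generate("summarize: CoreML lets apps run ML models on-device."))
print(generate("translate English to German: How are you?"))
print(generate("answer the question: What is the capital of France? context: Paris is the capital of France."))
```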
Performance Considerations
- Memory:
  - FP32 Models: ~1.1GB total (maximum quality)
  - INT8 Models: ~272MB total (4x smaller, mobile-optimized)
- Precision: FP32 for quality, INT8 for mobile deployment
- Sequence Length: Maximum 512 tokens (full original capacity)
- Device Compatibility: Apple Neural Engine, GPU, or CPU depending on availability
- Generation Speed: Optimized for real-time text generation on mobile devices
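When benchmarking, it can help to pin the compute unit explicitly rather than letting CoreML choose. A minimal sketch using coremltools' `compute_units` option (the INT8 encoder path is just an example):

```python
import coremltools as ct

# Let CoreML choose among Neural Engine, GPU, and CPU (ALL, the default),
# or pin a unit when benchmarking; CPU_ONLY gives a useful baseline.
encoder = ct.models.MLModel(
    "models/flan_t5_base_encoder_int8.mlpackage",
    compute_units=ct.ComputeUnit.CPU_ONLY,
)
```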
Technical Notes
Quality Improvements in Version 3.0
- Output Quality: Models now produce sensible, coherent outputs matching PyTorch baseline
- Precision Preservation: FP32 precision prevents quality degradation from quantization
- Full Architecture: 512-token sequences preserve complete model capabilities
- Minimal Modification: Conversion process preserves original model behavior patterns
Conversion Details
- Source Framework: PyTorch/Transformers
- Conversion Tool: CoreML Tools 8.3.0
- Date: July 2025
- Torch Version: 2.7.1 (with compatibility warnings handled)
- Approach: Quality-focused conversion with minimal architectural changes
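The exact conversion script is not part of this repository, but a quality-focused conversion of the encoder along these lines can be sketched with `torch.jit.trace` plus `ct.convert`; the wrapper class, variable names, and quantization settings below are illustrative assumptions, not the script actually used:

```python
import numpy as np
import torch
import coremltools as ct
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").eval()

class EncoderWrapper(torch.nn.Module):
    """Exposes only the encoder's final hidden states for tracing."""
    def __init__(self, t5):
        super().__init__()
        self.encoder = t5.encoder
    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state

example = (torch.zeros(1, 512, dtype=torch.long),
           torch.ones(1, 512, dtype=torch.long))
traced = torch.jit.trace(EncoderWrapper(model), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32),
            ct.TensorType(name="attention_mask", shape=(1, 512), dtype=np.int32)],
    outputs=[ct.TensorType(name="hidden_states")],
    minimum_deployment_target=ct.target.iOS15,
    compute_precision=ct.precision.FLOAT32,  # preserve quality
)
mlmodel.save("flan_t5_base_encoder_quality.mlpackage")

# INT8 variant via weight-only linear quantization (coremltools >= 7):
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights)
int8_config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric"))
linear_quantize_weights(mlmodel, config=int8_config).save(
    "flan_t5_base_encoder_int8.mlpackage")
```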
Testing and Verification
The models have been thoroughly tested to ensure:
- ✅ High output quality matching the PyTorch baseline
- ✅ Proper translations and text generation
- ✅ Multiple task types function correctly
- ✅ Consistent behavior across iOS/macOS platforms
- ✅ No quality degradation from the conversion process
Troubleshooting
Common Issues
- Shape Mismatches: Ensure you're using `max_length=512` for both encoder and decoder inputs
- Token Generation: Always start with `tokenizer.pad_token_id` for decoder input
- Memory: The FP32 models require ~1.1GB of total memory for inference (the INT8 models ~272MB)
- Quality: If output seems degraded, verify you're using the latest quality-preserved models
Verification Test
To verify the models work correctly, different decoder contexts should produce different outputs:
```python
# These should produce DIFFERENT results (proving causal attention works)
context_1 = [tokenizer.pad_token_id, 1000]  # Different token at position 1
context_2 = [tokenizer.pad_token_id, 2000]  # Different token at position 1
# Running the decoder with these contexts should give different predictions
```
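A hedged, runnable version of this check, assuming `decoder`, `hidden_states`, `inputs`, and `tokenizer` are in scope from the Python example above (`decoder_logits` is a hypothetical helper):

```python
import numpy as np

def decoder_logits(context):
    """Returns the next-token logits after feeding `context` to the decoder."""
    ids = np.zeros((1, 512), dtype=np.int32)
    mask = np.zeros((1, 512), dtype=np.int32)
    ids[0, :len(context)] = context
    mask[0, :len(context)] = 1
    out = decoder.predict({
        "decoder_input_ids": ids,
        "encoder_hidden_states": hidden_states,
        "decoder_attention_mask": mask,
        "encoder_attention_mask": inputs["attention_mask"].astype(np.int32),
    })
    return out["logits"][0, len(context) - 1, :]

l1 = decoder_logits([tokenizer.pad_token_id, 1000])
l2 = decoder_logits([tokenizer.pad_token_id, 2000])
assert not np.allclose(l1, l2), "decoder ignores its context!"
```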
License
This model follows the same license as the original FLAN-T5 model. Please refer to the original model card for licensing details.
Citation
If you use these models, please cite the original FLAN-T5 paper:
```bibtex
@article{chung2022scaling,
  title={Scaling instruction-finetuned language models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}
```
Issues and Support
For issues specific to these CoreML conversions, please open an issue in this repository. For general FLAN-T5 questions, refer to the original model repository.
Version 3.0 - Quality Preserved ✅
High-quality models ready for production use in iOS/macOS applications