Phi-3 Medium-128K-Instruct ONNX CPU models

This repository hosts the optimized versions of Phi-3-medium-128k-instruct to accelerate inference with ONNX Runtime for your CPU.

Phi-3 Medium is a 14B parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and the filtered publicly available websites data, with a focus on high-quality and reasoning dense properties. The model belongs to the Phi-3 family with the medium version in two variants: 4K and 128K, which are the context lengths (in tokens) that they can support.

The base model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for the instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, Phi-3-Medium-128K-Instruct showcased a robust and state-of-the-art performance among models of the same-size and next-size-up.

Optimized variants of the Phi-3 Medium models are published here in ONNX format and run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, and Linux, with the precision best suited to each of these targets.

ONNX Models

Here are some of the optimized configurations we have added:

ONNX model for INT4 CPU: ONNX model for CPUs using int4 quantization via RTN.

How do you know which is the best ONNX model for you:

Are you on a Windows machine with GPU?
- I don't know → Review this guide to see whether you have a GPU in your Windows machine.
- Yes → Access the Hugging Face DirectML ONNX models and instructions at Phi-3-medium-128k-instruct-onnx-directml.
- No → Do you have a NVIDIA GPU?
  - I don't know → Review this guide to see whether you have a CUDA-capable GPU.
  - Yes → Access the Hugging Face CUDA ONNX models and instructions at Phi-3-medium-128k-instruct-onnx-cuda for NVIDIA GPUs.
  - No → Access the Hugging Face ONNX models for CPU devices and instructions at Phi-3-medium-128k-instruct-onnx-cpu.

How to Get Started with the Model

To support the Phi-3 models across a range of devices, platforms, and EP backends, we introduce a new API to wrap several aspects of generative AI inferencing. This API makes it easy to drag and drop LLMs straight into your app. To run the early version of these models with ONNX, follow the steps here. You can also test this with a chat app.

Hardware Supported

The models are tested on:

Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz

Minimum Configuration Required:

CPU machine with 16GB RAM

Model Description

Developed by: Microsoft
Model type: ONNX
Language(s) (NLP): Python, C, C++
License: MIT
Model Description: This is a conversion of the Phi-3 Medium-128K-Instruct model for ONNX Runtime inference.

Additional Details

Performance Metrics

The model runs at ~20 tokens/sec on a Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz.

Appendix

Model Card Contact

parinitarahi, kvaishnavi, natke

Contributors

Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Akshay Sonawane, Sheetal Arun Kadam, Rui Ren, Edward Chen, Scott McKay, Emma Ning, Natalie Kershaw, Parinita Rahi

microsoft
/

Phi-3-medium-128k-instruct-onnx-cpu