litert-community/Qwen2.5-1.5B-Instruct

@@ -11,8 +11,9 @@ tags:
 This model provides a few variants of
 [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) that are ready for
 deployment on Android using the
-[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
-[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
 ## Use the models
@@ -28,6 +29,15 @@ on Colab could be much worse than on a local device.*
 ### Android
 *   Download and install
     [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
 *   Follow the instructions in the app.
@@ -46,7 +56,7 @@ from the GitHub repository.
 ### Android
-Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill signatures enabled.
 <table border="1">
   <tr>
@@ -56,60 +66,69 @@ Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill
    <th style="text-align: left">Prefill (tokens/sec)</th>
    <th style="text-align: left">Decode (tokens/sec)</th>
    <th style="text-align: left">Time-to-first-token (sec)</th>
-   <th style="text-align: left">CPU Memory (RSS in MB)</th>
-   <th style="text-align: left">GPU Memory (RSS in MB)</th>
    <th style="text-align: left">Model size (MB)</th>
    <th></th>
   </tr>
   <tr>
-<td rowspan="5"><p style="text-align: left">CPU</p></td>
-<td rowspan="3"><p style="text-align: left">fp32 (baseline)</p></td>
 <td><p style="text-align: right">1280</p></td>
-<td><p style="text-align: right">27 tk/s</p></td>
-<td><p style="text-align: right">6 tk/s</p></td>
-<td><p style="text-align: right">9.88 s</p></td>
-<td><p style="text-align: right">6,144 MB</p></td>
-<td><p style="text-align: right"></p></td>
-<td><p style="text-align: right">5,895 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_f32_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
-<td rowspan="2"><p style="text-align: right">1280</p></td>
-<td><p style="text-align: right">106 tk/s</p></td>
-<td><p style="text-align: right">23 tk/s</p></td>
-<td><p style="text-align: right">2.74 s</p></td>
-<td><p style="text-align: right">1,820 MB</p></td>
-<td><p style="text-align: right"></p></td>
-<td><p style="text-align: right">1,523 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
-<td><p style="text-align: right">63 tk/s</p></td>
-<td><p style="text-align: right">20 tk/s</p></td>
-<td><p style="text-align: right">4.40 s</p></td>
-<td><p style="text-align: right">2,042 MB</p></td>
-<td><p style="text-align: right"></p></td>
-<td><p style="text-align: right">1,523 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
 </tr>
 <tr>
-<td rowspan="2"><p style="text-align: left">dynamic_int8</p></td>
-<td rowspan="2"><p style="text-align: right">1280</p></td>
-<td><p style="text-align: right">706 tk/s</p></td>
-<td><p style="text-align: right">24 tk/s</p></td>
-<td><p style="text-align: right">6.94 s</p></td>
-<td><p style="text-align: right">3,175 MB</p></td>
-<td><p style="text-align: right">1,504 MB</p></td>
-<td><p style="text-align: right">1,523 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
-<td><p style="text-align: right">417 tk/s</p></td>
-<td><p style="text-align: right">22 tk/s</p></td>
-<td><p style="text-align: right">7.93 s</p></td>
-<td><p style="text-align: right">3,176 MB</p></td>
-<td><p style="text-align: right">1,875 MB</p></td>
-<td><p style="text-align: right">1,523 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
 </tr>

 This model provides a few variants of
 [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) that are ready for
 deployment on Android using the
+[LiteRT (formerly TFLite) stack](https://ai.google.dev/edge/litert),
+[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).
 ## Use the models
 ### Android
+#### Edge Gallery App
+*   Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+*   Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+*   Follow the instructions in the app.
+#### LLM Inference API
 *   Download and install
     [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
 *   Follow the instructions in the app.
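
The `.task` files linked from the benchmark table share a naming pattern, so a variant's download link can be rebuilt from its quantization tag and the number in its file name. The following is a minimal sketch, assuming the pattern seen in those links (`multi-prefill-seq_<quant>_ekv<number>`) holds for each listed variant; `task_url` is a hypothetical helper, not part of any library.

```python
# Hypothetical helper: rebuilds the download links that appear in the benchmark table.
# Assumption: every variant follows the file-name pattern seen in those links.
REPO = "litert-community/Qwen2.5-1.5B-Instruct"
MODEL = "Qwen2.5-1.5B-Instruct"


def task_url(quant: str, ekv: int) -> str:
    """Return the resolve URL for one variant.

    quant: tag used in the file names ("f32" or "q8" in the table's links).
    ekv:   number embedded in the file names (1280 or 4096 in the table's links).
    """
    filename = f"{MODEL}_multi-prefill-seq_{quant}_ekv{ekv}.task"
    return f"https://huggingface.co/{REPO}/resolve/main/{filename}"


if __name__ == "__main__":
    # The three distinct files referenced by the table rows.
    for quant, ekv in [("f32", 1280), ("q8", 1280), ("q8", 4096)]:
        print(task_url(quant, ekv))
```

Only the three combinations above appear in the table; other combinations may not correspond to a published file.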
 ### Android
+Note that all benchmark stats are from a Samsung S25 Ultra with multiple prefill signatures enabled.
 <table border="1">
   <tr>
    <th style="text-align: left">Prefill (tokens/sec)</th>
    <th style="text-align: left">Decode (tokens/sec)</th>
    <th style="text-align: left">Time-to-first-token (sec)</th>
    <th style="text-align: left">Model size (MB)</th>
+   <th style="text-align: left">Peak RSS Memory (MB)</th>
+   <th style="text-align: left">GPU Memory (RSS in MB)</th>
    <th></th>
   </tr>
   <tr>
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">fp32 (baseline)</p></td>
 <td><p style="text-align: right">1280</p></td>
+<td><p style="text-align: right">49.50 tk/s</p></td>
+<td><p style="text-align: right">10 tk/s</p></td>
+<td><p style="text-align: right">21.25 s</p></td>
+<td><p style="text-align: right">6182 MB</p></td>
+<td><p style="text-align: right">6254 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_f32_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">1280</p></td>
+<td><p style="text-align: right">297.58 tk/s</p></td>
+<td><p style="text-align: right">34.25 tk/s</p></td>
+<td><p style="text-align: right">3.71 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1997 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">162.72 tk/s</p></td>
+<td><p style="text-align: right">26.06 tk/s</p></td>
+<td><p style="text-align: right">6.57 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">2216 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
 </tr>
 <tr>
+<td><p style="text-align: left">GPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">1280</p></td>
+<td><p style="text-align: right">1667.75 tk/s</p></td>
+<td><p style="text-align: right">30.88 tk/s</p></td>
+<td><p style="text-align: right">3.63 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1846 MB</p></td>
+<td><p style="text-align: right">1505 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
 </tr>
 <tr>
+<td><p style="text-align: left">GPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">933.45 tk/s</p></td>
+<td><p style="text-align: right">27.30 tk/s</p></td>
+<td><p style="text-align: right">4.77 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1869 MB</p></td>
+<td><p style="text-align: right">1505 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
 </tr>