Update README.md
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
 This model provides a few variants of
 [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) that are ready for
 deployment on Android using the
-[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert)
-[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference)
+[LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
+[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
+[LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).
 
 ## Use the models
 
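For orientation, the MediaPipe LLM Inference API referenced above consumes these `.task` bundles directly from Kotlin. The snippet below is a minimal sketch rather than part of this change: the model path, prompt, and `maxTokens` value are illustrative assumptions, and option names can shift between releases of the `com.google.mediapipe:tasks-genai` artifact, so defer to the LLM Inference API documentation linked above.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load one of the .task bundles from this repo and run a single prompt.
// The path below is only an example location; push or download the bundle there first.
fun runQwenOnce(context: Context): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task")
        .setMaxTokens(1280) // stay within the KV-cache size baked into the ekv1280 bundle
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    val answer = llm.generateResponse("Give me one sentence about on-device LLMs.")
    llm.close() // release the engine once you are done with it
    return answer
}
```

The benchmark table further down is the quickest way to decide which bundle (fp32 vs dynamic_int8, ekv1280 vs ekv4096) fits a given device's memory budget.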
@@ -28,6 +29,15 @@ on Colab could be much worse than on a local device.*
 
 ### Android
 
+#### Edge Gallery App
+* Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
+
+* Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
+
+* Follow the instructions in the app.
+
+#### LLM Inference API
+
 * Download and install
 [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
 * Follow the instructions in the app.
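The Gallery apk above streams tokens as they are produced; for the same behaviour in your own app, the LLM Inference API also exposes an asynchronous path. Again a hedged sketch rather than a drop-in snippet: the listener registration follows the pattern in Google's Android guide, but the exact signatures should be checked against the `tasks-genai` version you depend on, and the model path is an assumption.

```kotlin
import android.content.Context
import android.util.Log
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch of streaming generation: partial results arrive through a listener registered on the
// options, and generateResponseAsync() starts decoding without blocking the caller.
fun streamQwen(context: Context): LlmInference {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task")
        .setMaxTokens(4096) // the ekv4096 bundle; use 1280 with the ekv1280 bundles
        .setResultListener { partialResult, done ->
            // Append partialResult to the UI; `done` is true for the final chunk.
            Log.d("QwenStream", "partial=$partialResult done=$done")
        }
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    llm.generateResponseAsync("Summarize what LiteRT is in two sentences.")
    return llm // keep a reference so you can close() it later
}
```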
@@ -46,7 +56,7 @@ from the GitHub repository.
 
 ### Android
 
-Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill signatures enabled.
+Note that all benchmark stats are from a Samsung S25 Ultra and multiple prefill signatures enabled.
 
 <table border="1">
 <tr>
@@ -56,60 +66,69 @@ Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill
 <th style="text-align: left">Prefill (tokens/sec)</th>
 <th style="text-align: left">Decode (tokens/sec)</th>
 <th style="text-align: left">Time-to-first-token (sec)</th>
-<th style="text-align: left">CPU Memory (RSS in MB)</th>
-<th style="text-align: left">GPU Memory (RSS in MB)</th>
 <th style="text-align: left">Model size (MB)</th>
+<th style="text-align: left">Peak RSS Memory (MB)</th>
+<th style="text-align: left">GPU Memory (RSS in MB)</th>
 <th></th>
 </tr>
 <tr>
-<td
-<td
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">fp32 (baseline)</p></td>
 <td><p style="text-align: right">1280</p></td>
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right"
-<td><p style="text-align: right">
+<td><p style="text-align: right">49.50</p></td>
+<td><p style="text-align: right">10 tk/s</p></td>
+<td><p style="text-align: right">21.25 s</p></td>
+<td><p style="text-align: right">6182 MB</p></td>
+<td><p style="text-align: right">6254 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_f32_ekv1280.task">🔗</a></p></td>
 </tr>
 <tr>
-<td
-<td><p style="text-align:
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right"
-<td><p style="text-align: right">
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">1280</p></td>
+<td><p style="text-align: right">297.58</p></td>
+<td><p style="text-align: right">34.25 tk/s</p></td>
+<td><p style="text-align: right">3.71 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1997 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">🔗</a></p></td>
 </tr>
 <tr>
-<td><p style="text-align:
-<td><p style="text-align:
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right"
-<td><p style="text-align: right">
+<td><p style="text-align: left">CPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">162.72 tk/s</p></td>
+<td><p style="text-align: right">26.06 tk/s</p></td>
+<td><p style="text-align: right">6.57 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">2216 MB</p></td>
+<td><p style="text-align: right">N/A</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">🔗</a></p></td>
 </tr>
 <tr>
-<td
-<td
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">3
-<td><p style="text-align: right">
-<td><p style="text-align: right">
+<td><p style="text-align: left">GPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">1280</p></td>
+<td><p style="text-align: right">1667.75 tk/s</p></td>
+<td><p style="text-align: right">30.88 tk/s</p></td>
+<td><p style="text-align: right">3.63 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1846 MB</p></td>
+<td><p style="text-align: right">1505 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">🔗</a></p></td>
 </tr>
 <tr>
-<td><p style="text-align:
-<td><p style="text-align:
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
-<td><p style="text-align: right">
+<td><p style="text-align: left">GPU</p></td>
+<td><p style="text-align: left">dynamic_int8</p></td>
+<td><p style="text-align: right">4096</p></td>
+<td><p style="text-align: right">933.45 tk/s</p></td>
+<td><p style="text-align: right">27.30 tk/s</p></td>
+<td><p style="text-align: right">4.77 s</p></td>
+<td><p style="text-align: right">1598 MB</p></td>
+<td><p style="text-align: right">1869 MB</p></td>
+<td><p style="text-align: right">1505 MB</p></td>
 <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">🔗</a></p></td>
 </tr>
 