fengwuyao committed on
Commit 8c784d7 · verified · 1 Parent(s): 4526a9f

Update README.md

Files changed (1)
  1. README.md +59 -40
README.md CHANGED
@@ -11,8 +11,9 @@ tags:
11
  This model provides a few variants of
12
  [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) that are ready for
13
  deployment on Android using the
14
- [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
15
- [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
 
16
 
17
  ## Use the models
18
 
@@ -28,6 +29,15 @@ on Colab could be much worse than on a local device.*
28
 
29
  ### Android
30
 
 
 
 
 
 
 
 
 
 
31
  * Download and install
32
  [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
33
  * Follow the instructions in the app.
@@ -46,7 +56,7 @@ from the GitHub repository.
46
 
47
  ### Android
48
 
49
- Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill signatures enabled.
50
 
51
  <table border="1">
52
  <tr>
@@ -56,60 +66,69 @@ Note that all benchmark stats are from a Samsung S24 Ultra and multiple prefill
56
  <th style="text-align: left">Prefill (tokens/sec)</th>
57
  <th style="text-align: left">Decode (tokens/sec)</th>
58
  <th style="text-align: left">Time-to-first-token (sec)</th>
59
- <th style="text-align: left">CPU Memory (RSS in MB)</th>
60
- <th style="text-align: left">GPU Memory (RSS in MB)</th>
61
  <th style="text-align: left">Model size (MB)</th>
 
 
62
  <th></th>
63
  </tr>
64
  <tr>
65
- <td rowspan="5"><p style="text-align: left">CPU</p></td>
66
- <td rowspan="3"><p style="text-align: left">fp32 (baseline)</p></td>
67
  <td><p style="text-align: right">1280</p></td>
68
- <td><p style="text-align: right">27 tk/s</p></td>
69
- <td><p style="text-align: right">6 tk/s</p></td>
70
- <td><p style="text-align: right">9.88 s</p></td>
71
- <td><p style="text-align: right">6,144 MB</p></td>
72
- <td><p style="text-align: right"></p></td>
73
- <td><p style="text-align: right">5,895 MB</p></td>
74
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_f32_ekv1280.task">&#128279;</a></p></td>
75
  </tr>
76
  <tr>
77
- <td rowspan="2"><p style="text-align: right">1280</p></td>
78
- <td><p style="text-align: right">106 tk/s</p></td>
79
- <td><p style="text-align: right">23 tk/s</p></td>
80
- <td><p style="text-align: right">2.74 s</p></td>
81
- <td><p style="text-align: right">1,820 MB</p></td>
82
- <td><p style="text-align: right"></p></td>
83
- <td><p style="text-align: right">1,523 MB</p></td>
 
 
84
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
85
  </tr>
86
  <tr>
87
- <td><p style="text-align: right">63 tk/s</p></td>
88
- <td><p style="text-align: right">20 tk/s</p></td>
89
- <td><p style="text-align: right">4.40 s</p></td>
90
- <td><p style="text-align: right">2,042 MB</p></td>
91
- <td><p style="text-align: right"></p></td>
92
- <td><p style="text-align: right">1,523 MB</p></td>
 
 
 
93
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
94
  </tr>
95
  <tr>
96
- <td rowspan="2"><p style="text-align: left">dynamic_int8</p></td>
97
- <td rowspan="2"><p style="text-align: right">1280</p></td>
98
- <td><p style="text-align: right">706 tk/s</p></td>
99
- <td><p style="text-align: right">24 tk/s</p></td>
100
- <td><p style="text-align: right">6.94 s</p></td>
101
- <td><p style="text-align: right">3,175 MB</p></td>
102
- <td><p style="text-align: right">1,504 MB</p></td>
103
- <td><p style="text-align: right">1,523 MB</p></td>
 
104
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
105
  </tr>
106
  <tr>
107
- <td><p style="text-align: right">417 tk/s</p></td>
108
- <td><p style="text-align: right">22 tk/s</p></td>
109
- <td><p style="text-align: right">7.93 s</p></td>
110
- <td><p style="text-align: right">3,176 MB</p></td>
111
- <td><p style="text-align: right">1,875 MB</p></td>
112
- <td><p style="text-align: right">1,523 MB</p></td>
 
 
 
113
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
114
  </tr>
115
 
 
11
  This model provides a few variants of
12
  [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) that are ready for
13
  deployment on Android using the
14
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert),
15
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference) and
16
+ [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM).
17
 
18
  ## Use the models
19
 
 
29
 
30
  ### Android
31
 
32
+ #### Edge Gallery App
33
+ * Download or build the [app](https://github.com/google-ai-edge/gallery?tab=readme-ov-file#-get-started-in-minutes) from GitHub.
34
+
35
+ * Install the [app](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) from Google Play.
36
+
37
+ * Follow the instructions in the app.
38
+
39
+ #### LLM Inference API
40
+
41
  * Download and install
42
  [the apk](https://github.com/google-ai-edge/gallery/releases/latest/download/ai-edge-gallery.apk).
43
  * Follow the instructions in the app.
 
56
 
57
  ### Android
58
 
59
+ Note that all benchmark stats are from a Samsung S25 Ultra and multiple prefill signatures enabled.
60
 
61
  <table border="1">
62
  <tr>
 
66
  <th style="text-align: left">Prefill (tokens/sec)</th>
67
  <th style="text-align: left">Decode (tokens/sec)</th>
68
  <th style="text-align: left">Time-to-first-token (sec)</th>
 
 
69
  <th style="text-align: left">Model size (MB)</th>
70
+ <th style="text-align: left">Peak RSS Memory (MB)</th>
71
+ <th style="text-align: left">GPU Memory (RSS in MB)</th>
72
  <th></th>
73
  </tr>
74
  <tr>
75
+ <td><p style="text-align: left">CPU</p></td>
76
+ <td><p style="text-align: left">fp32 (baseline)</p></td>
77
  <td><p style="text-align: right">1280</p></td>
78
+ <td><p style="text-align: right">49.50</p></td>
79
+ <td><p style="text-align: right">10 tk/s</p></td>
80
+ <td><p style="text-align: right">21.25 s</p></td>
81
+ <td><p style="text-align: right">6182 MB</p></td>
82
+ <td><p style="text-align: right">6254 MB</p></td>
83
+ <td><p style="text-align: right">N/A</p></td>
84
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_f32_ekv1280.task">&#128279;</a></p></td>
85
  </tr>
86
  <tr>
87
+ <td><p style="text-align: left">CPU</p></td>
88
+ <td><p style="text-align: left">dynamic_int8</p></td>
89
+ <td><p style="text-align: right">1280</p></td>
90
+ <td><p style="text-align: right">297.58</p></td>
91
+ <td><p style="text-align: right">34.25 tk/s</p></td>
92
+ <td><p style="text-align: right">3.71 s</p></td>
93
+ <td><p style="text-align: right">1598 MB</p></td>
94
+ <td><p style="text-align: right">1997 MB</p></td>
95
+ <td><p style="text-align: right">N/A</p></td>
96
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
97
  </tr>
98
  <tr>
99
+ <td><p style="text-align: left">CPU</p></td>
100
+ <td><p style="text-align: left">dynamic_int8</p></td>
101
+ <td><p style="text-align: right">4096</p></td>
102
+ <td><p style="text-align: right">162.72 tk/s</p></td>
103
+ <td><p style="text-align: right">26.06 tk/s</p></td>
104
+ <td><p style="text-align: right">6.57 s</p></td>
105
+ <td><p style="text-align: right">1598 MB</p></td>
106
+ <td><p style="text-align: right">2216 MB</p></td>
107
+ <td><p style="text-align: right">N/A</p></td>
108
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
109
  </tr>
110
  <tr>
111
+ <td><p style="text-align: left">GPU</p></td>
112
+ <td><p style="text-align: left">dynamic_int8</p></td>
113
+ <td><p style="text-align: right">1280</p></td>
114
+ <td><p style="text-align: right">1667.75 tk/s</p></td>
115
+ <td><p style="text-align: right">30.88 tk/s</p></td>
116
+ <td><p style="text-align: right">3.63 s</p></td>
117
+ <td><p style="text-align: right">1598 MB</p></td>
118
+ <td><p style="text-align: right">1846 MB</p></td>
119
+ <td><p style="text-align: right">1505 MB</p></td>
120
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv1280.task">&#128279;</a></p></td>
121
  </tr>
122
  <tr>
123
+ <td><p style="text-align: left">GPU</p></td>
124
+ <td><p style="text-align: left">dynamic_int8</p></td>
125
+ <td><p style="text-align: right">4096</p></td>
126
+ <td><p style="text-align: right">933.45 tk/s</p></td>
127
+ <td><p style="text-align: right">27.30 tk/s</p></td>
128
+ <td><p style="text-align: right">4.77 s</p></td>
129
+ <td><p style="text-align: right">1598 MB</p></td>
130
+ <td><p style="text-align: right">1869 MB</p></td>
131
+ <td><p style="text-align: right">1505 MB</p></td>
132
  <td><p style="text-align: left"><a style="text-decoration: none" href="https://huggingface.co/litert-community/Qwen2.5-1.5B-Instruct/resolve/main/Qwen2.5-1.5B-Instruct_multi-prefill-seq_q8_ekv4096.task">&#128279;</a></p></td>
133
  </tr>
134
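
The download links in the benchmark rows above appear to share one naming pattern: the model name, `multi-prefill-seq`, a precision tag (`f32` or `q8`), and the KV cache size (`ekv1280` or `ekv4096`). The following is a minimal Python sketch that rebuilds those links, assuming the pattern holds for every variant in the repository. The names `task_filename`, `task_url`, and `VARIANTS` are hypothetical and exist only for this illustration.

```python
# Rebuild the .task download links that appear in the benchmark table.
# Assumption: every bundle follows the filename pattern seen in this diff.

REPO_ID = "litert-community/Qwen2.5-1.5B-Instruct"
MODEL_NAME = "Qwen2.5-1.5B-Instruct"

# (precision tag, KV cache size) pairs that occur in the table links.
VARIANTS = [("f32", 1280), ("q8", 1280), ("q8", 4096)]


def task_filename(quant: str, ekv: int) -> str:
    """Return the bundle filename for a precision tag and KV cache size."""
    return f"{MODEL_NAME}_multi-prefill-seq_{quant}_ekv{ekv}.task"


def task_url(quant: str, ekv: int) -> str:
    """Return the Hugging Face 'resolve/main' URL for a bundle."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{task_filename(quant, ekv)}"


if __name__ == "__main__":
    for quant, ekv in VARIANTS:
        print(task_url(quant, ekv))
```

The same filenames could be passed to `hf_hub_download(repo_id=..., filename=...)` from the `huggingface_hub` package to fetch a bundle into the local cache, which is likely more convenient than a raw URL when the download needs to be resumed or cached.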