Commit b81bc93
Parent(s): d7024c3

Update README.md (#1)

- Update README.md (94d995fe79a3074f97628ebb8412b3e284af80d0)
- Update README.md (b1b51d00fd1c0a3780fa225ae34872e95f0eb38c)

README.md CHANGED
@@ -7,11 +7,39 @@ sdk: static
 pinned: false
 ---
 
-Text-Generation-Inference is
-
+Text-Generation-Inference is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:
+
+- Tensor Parallelism and custom CUDA kernels
+- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
+- Quantization with bitsandbytes or GPTQ
+- Continuous batching of incoming requests for increased total throughput
+- Accelerated weight loading (start-up time) with safetensors
+- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
+- Watermarking with *A Watermark for Large Language Models*
+- Stop sequences, log probabilities
+- Token streaming using Server-Sent Events (SSE)
 
 <img width="300px" src="https://huggingface.co/spaces/text-generation-inference/README/resolve/main/architecture.jpg" />
 
+## Currently optimized architectures
+
+- [BLOOM](https://huggingface.co/bigscience/bloom)
+- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
+- [Galactica](https://huggingface.co/facebook/galactica-120b)
+- [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b)
+- [Llama](https://github.com/facebookresearch/llama)
+- [OPT](https://huggingface.co/facebook/opt-66b)
+- [SantaCoder](https://huggingface.co/bigcode/santacoder)
+- [StarCoder](https://huggingface.co/bigcode/starcoder)
+- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
+- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
+
 ## Check out the source code 👉
 - the server backend: https://github.com/huggingface/text-generation-inference
-- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+
+## Check out examples
+
+- [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm)
+- [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm)
+
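The feature list added by this commit mentions logits warpers (temperature scaling, top-k, repetition penalty) and stop sequences. As a rough illustration of how such parameters are passed to a running TGI server's REST API, here is a minimal sketch; the base URL, prompt, and parameter values are assumptions for illustration, not part of the commit:

```python
# A minimal sketch (not part of this commit): querying a running TGI server's
# /generate endpoint. The URL, prompt, and parameter values are assumptions;
# point TGI_URL at your own deployment.
import requests

TGI_URL = "http://127.0.0.1:8080"  # hypothetical local TGI deployment

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        # Logits warpers named in the README's feature list:
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1,
        # A stop sequence ends generation early when it is produced:
        "stop": ["\n\n"],
        "max_new_tokens": 64,
    },
}

resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```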
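The list also advertises token streaming using Server-Sent Events. A companion sketch, under the same assumed deployment, reads the /generate_stream event stream and prints tokens as they arrive; the event shape follows TGI's documented format but should be treated as illustrative rather than authoritative:

```python
# A sketch of token streaming over Server-Sent Events, assuming the same
# hypothetical deployment as above. /generate_stream emits one "data: {...}"
# event per generated token; the final event also carries "generated_text".
import json
import requests

TGI_URL = "http://127.0.0.1:8080"  # hypothetical local TGI deployment

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

with requests.post(f"{TGI_URL}/generate_stream", json=payload,
                   stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data:"):
            continue  # skip blank keep-alive lines between events
        event = json.loads(raw[len(b"data:"):])
        # Each event carries one token; print it as soon as it arrives.
        print(event["token"]["text"], end="", flush=True)
print()
```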