Commit b81bc93
Parent(s): d7024c3

Update README.md (#1)

- Update README.md (94d995fe79a3074f97628ebb8412b3e284af80d0)
- Update README.md (b1b51d00fd1c0a3780fa225ae34872e95f0eb38c)

README.md CHANGED
@@ -7,11 +7,39 @@ sdk: static
 pinned: false
 ---
 
-Text-Generation-Inference is
-
+Text-Generation-Inference is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:
+
+- Tensor Parallelism and custom CUDA kernels
+- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
+- Quantization with bitsandbytes or GPTQ
+- Continuous batching of incoming requests for increased total throughput
+- Accelerated weight loading (start-up time) with safetensors
+- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
+- Watermarking with *A Watermark for Large Language Models*
+- Stop sequences, log probabilities
+- Token streaming using Server-Sent Events (SSE)
 
 <img width="300px" src="https://huggingface.co/spaces/text-generation-inference/README/resolve/main/architecture.jpg" />
 
+## Currently optimized architectures
+
+- [BLOOM](https://huggingface.co/bigscience/bloom)
+- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
+- [Galactica](https://huggingface.co/facebook/galactica-120b)
+- [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b)
+- [Llama](https://github.com/facebookresearch/llama)
+- [OPT](https://huggingface.co/facebook/opt-66b)
+- [SantaCoder](https://huggingface.co/bigcode/santacoder)
+- [StarCoder](https://huggingface.co/bigcode/starcoder)
+- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
+- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
+
 ## Check out the source code 👉
 - the server backend: https://github.com/huggingface/text-generation-inference
-- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+- the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+
+## Check out examples
+
+- [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm)
+- [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm)
+
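The feature list added by this commit mentions logits warpers (temperature scaling, top-k, repetition penalty) and stop sequences. As a rough illustration of how such parameters are passed to a running TGI server's REST API, here is a minimal sketch; the base URL, prompt, and parameter values are assumptions for illustration, not part of the commit:

```python
# A minimal sketch (not part of this commit): querying a running TGI server's
# /generate endpoint. The URL, prompt, and parameter values are assumptions;
# point TGI_URL at your own deployment.
import requests

TGI_URL = "http://127.0.0.1:8080"  # hypothetical local TGI deployment

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        # Logits warpers named in the README's feature list:
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1,
        # A stop sequence ends generation early when it is produced:
        "stop": ["\n\n"],
        "max_new_tokens": 64,
    },
}

resp = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```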
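The list also advertises token streaming using Server-Sent Events. A companion sketch, under the same assumed deployment, reads the /generate_stream event stream and prints tokens as they arrive; the event shape follows TGI's documented format but should be treated as illustrative rather than authoritative:

```python
# A sketch of token streaming over Server-Sent Events, assuming the same
# hypothetical deployment as above. /generate_stream emits one "data: {...}"
# event per generated token; the final event also carries "generated_text".
import json
import requests

TGI_URL = "http://127.0.0.1:8080"  # hypothetical local TGI deployment

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}

with requests.post(f"{TGI_URL}/generate_stream", json=payload,
                   stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data:"):
            continue  # skip blank keep-alive lines between events
        event = json.loads(raw[len(b"data:"):])
        # Each event carries one token; print it as soon as it arrives.
        print(event["token"]["text"], end="", flush=True)
print()
```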