shuyuej/Public-Shared-LoRA-for-Llama-3.3-70B-Instruct-GPTQ

shuyuej committed on Dec 21, 2024

Commit 293cb52 (verified) · 1 parent: 4b24cb3

Update README.md

README.md +131 -0

@@ -5,3 +5,134 @@ license: apache-2.0
 # The Public-shared LoRA Adapter for shuyuej/Llama-3.3-70B-Instruct-GPTQ Model
 This is the publicly shared LoRA adapter for the `shuyuej/Llama-3.3-70B-Instruct-GPTQ` model.<br>
 Please check our GPTQ-quantized model [https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ).

+# 🔥 Real-world deployment
+For real-world deployment, please refer to the vLLM [Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) documentation. We provide a deployment script [here](https://github.com/vkola-lab/PodGPT/blob/main/scripts/deployment.py).
+> [!NOTE]
+> The vLLM version we are using is `0.6.2`. Please check [this version](https://github.com/vllm-project/vllm/releases/tag/v0.6.2).
+vLLM can be deployed as a server that implements the OpenAI API protocol, so it can serve as a drop-in replacement for applications that use the OpenAI API. By default, the server starts at `http://localhost:8000`.
+```shell
+vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
+    --quantization gptq \
+    --trust-remote-code \
+    --dtype float16 \
+    --max-model-len 4096 \
+    --distributed-executor-backend mp \
+    --pipeline-parallel-size 4 \
+    --api-key token-abc123
+```
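Once the server is up, a quick way to confirm it is reachable is to list the models it serves. The sketch below assumes the default address `http://localhost:8000` and the `token-abc123` key from the command above; the helper names (`models_url`, `auth_headers`, `list_models`) are illustrative and not part of vLLM.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Return the OpenAI-style model-listing endpoint for a server root URL."""
    return base_url.rstrip("/") + "/v1/models"


def auth_headers(api_key: str) -> dict:
    """Build the bearer-token header expected when the server runs with --api-key."""
    return {"Authorization": f"Bearer {api_key}"}


def list_models(base_url: str = "http://localhost:8000",
                api_key: str = "token-abc123") -> list:
    """Query the running server and return the IDs of the models it serves."""
    request = urllib.request.Request(models_url(base_url), headers=auth_headers(api_key))
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [model["id"] for model in payload["data"]]
```

With the server running, `print(list_models())` should include `shuyuej/Llama-3.3-70B-Instruct-GPTQ`; a connection error usually means the server has not finished loading the model yet.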
+To change the engine arguments, see the [vLLM Engine Arguments documentation](https://docs.vllm.ai/en/latest/usage/engine_args.html).
+To deploy your LoRA adapter, refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/lora.html#serving-lora-adapters), which gives step-by-step instructions for serving LoRA adapters. In the command below, `--lora-modules adapter=checkpoint-18640` registers the adapter stored at `checkpoint-18640` under the name `adapter`.
+```shell
+vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
+    --quantization gptq \
+    --trust-remote-code \
+    --dtype float16 \
+    --max-model-len 4096 \
+    --distributed-executor-backend mp \
+    --pipeline-parallel-size 4 \
+    --api-key token-abc123 \
+    --enable-lora \
+    --lora-modules adapter=checkpoint-18640
+```
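With `--lora-modules adapter=checkpoint-18640`, a request selects the adapter through the `model` field by using the registered name `adapter`; passing the base model name instead queries the base model without the adapter. The sketch below builds the keyword arguments for a chat completion call. `build_chat_request` is a hypothetical helper, and the sampling values are illustrative defaults rather than required settings.

```python
def build_chat_request(user_message: str,
                       model: str = "adapter",
                       system_prompt: str = "") -> dict:
    """Assemble keyword arguments for an OpenAI-compatible chat completion call.

    `model` is the served name: "adapter" selects the LoRA adapter registered via
    --lora-modules, while the base model name selects the base model.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {
        "model": model,
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 0.2,
    }


# Request routed to the LoRA adapter.
lora_request = build_chat_request("What are the differences between DNA and RNA?")
# Same question routed to the base model for comparison.
base_request = build_chat_request(
    "What are the differences between DNA and RNA?",
    model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
)
```

Either dictionary can be unpacked into the `openai` client, for example `await client.chat.completions.create(**lora_request)` with an `AsyncOpenAI` client pointed at the server.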
+Because the server is compatible with the OpenAI API, you can query it with the official `openai` Python package, for example:
+```python
+#!/usr/bin/env python
+# coding=utf-8
+import time
+import asyncio
+from openai import AsyncOpenAI
+# Our system prompt
+SYSTEM_PROMPT = (
+    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
+    "specializing in science, technology, engineering, mathematics, and medicine "
+    "(STEMM)-related research and education, powered by podcast audio.\n"
+    "I provide information based on established scientific knowledge but must not offer "
+    "personal medical advice or present myself as a licensed medical professional.\n"
+    "I will maintain a consistently professional and informative tone, avoiding humor, "
+    "sarcasm, and pop culture references.\n"
+    "I will prioritize factual accuracy and clarity while ensuring my responses are "
+    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
+    "My responses are for informational purposes only and should not be considered a "
+    "substitute for professional consultation."
+)
+# Initialize the AsyncOpenAI client
+client = AsyncOpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="token-abc123",
+)
+async def main(message):
+    """
+    Streaming responses with async usage and "await" with each API call:
+    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
+    :param message: The user query
+    """
+    start_time = time.time()
+    stream = await client.chat.completions.create(
+        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
+        messages=[
+            {
+                "role": "system",
+                "content": SYSTEM_PROMPT,
+            },
+            {
+                "role": "user",
+                "content": message,
+            }
+        ],
+        max_tokens=2048,
+        temperature=0.2,
+        top_p=1,
+        stream=True,
+        extra_body={
+            "ignore_eos": False,
+            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
+            "stop_token_ids": [128001, 128008, 128009],
+        },
+    )
+    print(f"The user's query is\n {message}\n  ")
+    print("The model's response is\n")
+    async for chunk in stream:
+        print(chunk.choices[0].delta.content or "", end="")
+    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
+    print("=" * 100)
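+async def run_all(prompts):
+    """Query the server for each prompt in turn, all on a single event loop."""
+    for message in prompts:
+        await main(message=message)
+        print("\n\n")
+if __name__ == "__main__":
+    # Some sample user queries
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+        "Can you tell me more about Bruce Lee?",
+        "What are the differences between DNA and RNA?",
+        "What is dementia and Alzheimer's disease?",
+        "Tell me the differences between Alzheimer's disease and dementia"
+    ]
+    # Run all queries on one event loop: the module-level AsyncOpenAI client keeps
+    # connections tied to the loop that first used it, so calling asyncio.run()
+    # once per prompt can fail after the first loop has been closed.
+    asyncio.run(run_all(prompts))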
+```
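When the response is needed as a string rather than printed as it arrives, the streamed deltas can be accumulated instead. The sketch below is a hypothetical helper (`collect_stream` is not part of the `openai` package) that works on any async iterator yielding chunks shaped like the ones in the script above; the fake stream exists only so the example runs without a server.

```python
import asyncio
from types import SimpleNamespace


async def collect_stream(stream) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    async for chunk in stream:
        # Skip chunks without choices and deltas whose content is None.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


async def fake_stream(pieces):
    """Stand-in for a server stream: yields chunk-like objects for testing."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


text = asyncio.run(collect_stream(fake_stream(["Paris", None, " is", " the capital."])))
print(text)
```

Against a live server, the same helper can be applied to the object returned by `client.chat.completions.create(..., stream=True)` from inside an async function.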
+<details>
+    <summary>Here is a demo of the real-world model inference and deployment</summary>
+    <p align="center">
+        <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"> <img src="figures/inference_demo.gif"></a>
+    </p>
+</details>