---
license: apache-2.0
---

# The Publicly Shared LoRA Adapter for the shuyuej/Llama-3.3-70B-Instruct-GPTQ Model

This is the publicly shared LoRA adapter for the `shuyuej/Llama-3.3-70B-Instruct-GPTQ` model.<br>
Please check our GPTQ-quantized model at [https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ](https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ).

# 🔥 Real-world deployment
For real-world deployment, please refer to the vLLM [Distributed Inference and Serving](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [OpenAI-Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) documentation. We provide a deployment script [here](https://github.com/vkola-lab/PodGPT/blob/main/scripts/deployment.py).

> [!NOTE]
> The vLLM version we are using is `0.6.2`. Please check [this version](https://github.com/vllm-project/vllm/releases/tag/v0.6.2).
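
If you want to match this environment, a pinned install is a reasonable starting point (a minimal sketch, assuming a CUDA-ready Python environment; adjust for your platform):
```shell
# Pin vLLM to the version used in this model card.
pip install vllm==0.6.2
```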
14
+
15
+ vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API. By default, it starts the server at `http://localhost:8000`.
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123
```
Please check [here](https://docs.vllm.ai/en/latest/usage/engine_args.html) if you want to change the engine arguments.
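
Before wiring in an application, you can sanity-check that the server is up by listing the served models (assuming the address and API key from the command above):
```shell
# The OpenAI-compatible server exposes the served model under /v1/models.
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123"
```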

If you would like to deploy your LoRA adapter, please refer to the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/lora.html#serving-lora-adapters) for a detailed guide.
It provides step-by-step instructions on how to serve LoRA adapters effectively in a vLLM environment.
```shell
vllm serve shuyuej/Llama-3.3-70B-Instruct-GPTQ \
    --quantization gptq \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --distributed-executor-backend mp \
    --pipeline-parallel-size 4 \
    --api-key token-abc123 \
    --enable-lora \
    --lora-modules adapter=checkpoint-18640
```
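
With `--enable-lora`, the adapter is served as an additional model named after the key passed to `--lora-modules` (here, `adapter`), while the base model remains addressable under its original name. A minimal sketch of routing a request to the adapter, assuming the server above:
```shell
# "model": "adapter" targets the LoRA adapter registered via --lora-modules.
curl http://localhost:8000/v1/chat/completions \
    -H "Authorization: Bearer token-abc123" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "adapter",
          "messages": [{"role": "user", "content": "What are the differences between DNA and RNA?"}],
          "max_tokens": 256
        }'
```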

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application built on the OpenAI API.
For example, another way to query the server is via the `openai` Python package:
```python
#!/usr/bin/env python
# coding=utf-8

import time
import asyncio

from openai import AsyncOpenAI

# Our system prompt
SYSTEM_PROMPT = (
    "I am PodGPT, a large language model developed by the Kolachalama Lab in Boston, "
    "specializing in science, technology, engineering, mathematics, and medicine "
    "(STEMM)-related research and education, powered by podcast audio.\n"
    "I provide information based on established scientific knowledge but must not offer "
    "personal medical advice or present myself as a licensed medical professional.\n"
    "I will maintain a consistently professional and informative tone, avoiding humor, "
    "sarcasm, and pop culture references.\n"
    "I will prioritize factual accuracy and clarity while ensuring my responses are "
    "educational and non-harmful, adhering to the principle of 'do no harm'.\n"
    "My responses are for informational purposes only and should not be considered a "
    "substitute for professional consultation."
)

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)


async def main(message):
    """
    Streaming responses with async usage and "await" with each API call:
    Reference: https://github.com/openai/openai-python?tab=readme-ov-file#streaming-responses
    :param message: The user query
    """
    start_time = time.time()
    stream = await client.chat.completions.create(
        model="shuyuej/Llama-3.3-70B-Instruct-GPTQ",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": message,
            },
        ],
        max_tokens=2048,
        temperature=0.2,
        top_p=1,
        stream=True,
        extra_body={
            "ignore_eos": False,
            # https://huggingface.co/shuyuej/Llama-3.3-70B-Instruct-GPTQ/blob/main/config.json#L10-L14
            "stop_token_ids": [128001, 128008, 128009],
        },
    )

    print(f"The user's query is\n {message}\n")
    print("The model's response is\n")
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
    print(f"\nInference time: {time.time() - start_time:.2f} seconds\n")
    print("=" * 100)


if __name__ == "__main__":
    # Some random user queries
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
        "Can you tell me more about Bruce Lee?",
        "What are the differences between DNA and RNA?",
        "What is dementia and Alzheimer's disease?",
        "Tell me the differences between Alzheimer's disease and dementia",
    ]

    # Conduct model inference
    for message in prompts:
        asyncio.run(main(message=message))
        print("\n\n")
```

<details>
<summary>Here is a demo of the real-world model inference and deployment</summary>
<p align="center">
    <a href="https://www.medrxiv.org/content/10.1101/2024.07.11.24310304v2"> <img src="figures/inference_demo.gif"></a>
</p>
</details>