dstack: manage clusters of on-prem servers for AI workloads with ease
If you don't know what dstack is yet, please refer to this post and the official documentation to get a basic understanding of dstack. In simple terms, dstack is a computing resource management toolkit with a primary focus on AI development, training, and deployment.
In the beginning, dstack offered a great way to manage and control multiple machines from various cloud services including GCP (Google Cloud Platform), AWS (Amazon Web Services), Microsoft Azure, OCI (Oracle Cloud Infrastructure), Lambda Labs, RunPod, Vast.ai, DataCrunch, and CUDO, with support for CPUs, NVIDIA GPUs, AMD GPUs, and TPUs. This makes a lot of sense because you can find the resources (machines) that best suit your requirements (specs, cost, and so on), and then control machines from different sources in a uniform way.
Starting from the release of version 0.18.7, dstack has evolved to manage not only cloud resources but also on-prem resources via the ssh-fleet feature. The best part of this feature is that you don't need to know anything about Kubernetes or Slurm, and it works with minimal dependencies on top of (almost) plain Docker. Here are some of the advantages of ssh-fleet:
- Easy setup (no Kubernetes, no Slurm)
  - Setting up Kubernetes or Slurm requires a lot of prior knowledge, plus a substantial engineering effort to actually install, run, and maintain them. With dstack's ssh-fleet, there is almost nothing new you need to learn beyond what you already have, such as installed CUDA and Docker.
- Gather scattered local machines into clusters
  - Not all organizations have dedicated on-prem computing infrastructure; many labs manage their own computing resources per project. These days, however, we are dealing with larger and larger machine learning models such as large language models (LLMs), which often require multi-node collaboration. With dstack's ssh-fleet, you can simply manage multiple machines as a cluster and then assign jobs with single-node or multi-node setups.
- Centralized management across cloud and on-prem resources
  - Machine learning is all about running lots of experiments to find the best model for your problem, which means multiple experiments should run in parallel; otherwise, you spend too much time waiting. With dstack, you can assign more experiments to cloud resources while keeping your on-prem resources busy.
Now, let's go through a basic tutorial on how to set up your own ssh-fleet with dstack.
Prerequisites for ssh-fleet
On the remote server side
- Install Docker
- Docker provides the containerization technology that dstack relies on to encapsulate your applications and their dependencies. It ensures consistency and reproducibility across different environments.
- Follow the official Docker installation instructions for your Linux distribution. This typically involves adding Docker's repository and then using your package manager (apt, yum, etc.) to install the docker-ce package. After a successful installation, you can verify that Docker is up and running with the following command; it should print a "Hello from Docker!" message in the terminal:
$ sudo docker run hello-world
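If Docker is not installed yet, one common shortcut on many Linux distributions is Docker's convenience script. The following is only a sketch; for production machines, prefer the distribution-specific steps from the official documentation:
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh get-docker.sh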
- Install CUDA Toolkit >= 12.1
- If you plan to use NVIDIA GPUs for your AI workloads, the CUDA Toolkit is essential. It provides the necessary libraries and tools for your applications to utilize the GPU's processing power. dstack requires CUDA 12.1 or higher for compatibility and to leverage the latest features.
- Download the CUDA Toolkit installer from NVIDIA's website and follow the installation instructions. Make sure to choose the correct version for your Linux distribution and system architecture.
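Once installed, you can quickly check that the driver and toolkit meet the >= 12.1 requirement. This is just a sanity-check sketch; the exact output depends on your machine:
$ nvidia-smi       # reports the driver version and the highest CUDA version it supports
$ nvcc --version   # reports the version of the installed CUDA toolkit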
- Install NVIDIA Container Toolkit
- The NVIDIA Container Toolkit allows Docker containers to access and utilize NVIDIA GPUs. This is crucial for running GPU-accelerated AI workloads within dstack.
- Again, refer to NVIDIA's official documentation. You'll typically need to add NVIDIA's container toolkit repository and then install the nvidia-container-toolkit package using your package manager.
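For example, on an apt-based system the setup typically boils down to something like the following sketch (based on NVIDIA's documentation; the repository setup step and the exact CUDA image tag may differ on your system):
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
$ sudo docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi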
If you are an AMD GPU user, install the AMD-specific drivers by following the release notes instead of steps 2 and 3.
- sudo visudo for username ALL=(ALL) NOPASSWD: ALL
- This configuration allows the dstack server to execute commands on the remote server without requiring a password. This is necessary for dstack to automatically manage containers and resources on your behalf. It is worth noting that this grants significant privileges to the specified user. Ensure this user is dedicated to dstack operations and apply appropriate security measures.
- Open the /etc/sudoers file using sudo visudo with the command below. Add the line username ALL=(ALL) NOPASSWD: ALL, replacing username with the actual username you'll use to connect to the remote server. After this, that user can run any command via sudo without being prompted for a password:
$ sudo visudo
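For reference, the entry appended inside the editor is the single line described above; replace username with the account that dstack will use:
username ALL=(ALL) NOPASSWD: ALL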
On the local side
- Generate id_rsa
- SSH keys provide a secure way to authenticate with your remote servers without needing to enter a password each time. dstack uses these keys to establish secure connections to your on-prem machines for automated cluster management.
- Use the ssh-keygen command on your local machine to generate an SSH key pair as below. This will create a private key (id_rsa) and a public key (id_rsa.pub):
$ ssh-keygen -t rsa
- ssh-copy-id
- This step allows your local machine to automatically authenticate with the remote server using the SSH key pair, simplifying the connection process and enabling dstack to manage the remote server without manual intervention.
- Run ssh-copy-id username@remote_host on your local machine as below, replacing username and remote_host with the appropriate values. This command copies your public key to the remote server's authorized_keys file. After this, you can ssh directly into the remote server without being prompted for a password:
$ ssh-copy-id username@remote_host
Install dstack and register ssh fleets on the local side
- install dstack
Use pip to install dstack and all its optional dependencies. You don't need to specify [all] if you only want to use dstack for managing on-prem clusters, but [all] is helpful when you want to manage both on-prem and cloud resources simultaneously:
$ pip install "dstack[all]"
- run dstack server
Start the dstack server on your local machine. The dstack server is the core component that manages your resources, schedules jobs, and handles communication between your local machine and your compute resources (both cloud and on-prem):
$ dstack server
- write fleet.dstack.yml
Define a YAML file for your ssh-fleet, something like the one below. There are a number of configurations you can set (see dstack's official API documentation), but the essentials are shown below. Follow the Prerequisites for ssh-fleet section of this blog post for every server that you want to include in the ssh-fleet cluster. For instance, the YAML file below shows that I have registered 4 servers (2 with 3x RTX 6000 Ada, 2 with 2x A6000). Also note that it points to the id_rsa file that we generated in the On the local side section above:
type: fleet
# The name is optional, if not specified, generated randomly
name: my-ssh-fleet
# Ensure instances are interconnected
placement: cluster
# The user, private SSH key, and hostnames of the on-prem servers
ssh_config:
  user: username
  identity_file: ~/.ssh/id_rsa
  hosts:
    - xxx.xxx.171.224
    - xxx.xxx.171.225
    - xxx.xxx.164.172
    - xxx.xxx.165.51
Note that placement: cluster ensures the instances (servers) are interconnected, i.e., share the same network. If the listed instances do not share the same network, the ssh-fleet provisioning will fail. If they do and placement: cluster is set, you can run multi-node jobs such as distributed AI model training.
- apply fleet.dstack.yml
Tell dstack to read the fleet.dstack.yml file and create the ssh-fleet based on your configuration. dstack will attempt to connect to each of the specified hosts using the provided SSH credentials.
$ dstack apply -f fleet.dstack.yml
List the available fleets in your dstack setup. You should see my-ssh-fleet listed with details about the connected instances (servers), their resources (CPU, memory, GPU, disk), and their current status:
$ dstack fleet
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
my-ssh-fleet 1 ssh (remote) 32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk) $0.0 idle 2 weeks ago
2 ssh (remote) 32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk) $0.0 idle 2 weeks ago
3 ssh (remote) 64xCPU, 693GB, 2xA6000 (48GB), 1683.6GB (disk) $0.0 idle 2 weeks ago
4 ssh (remote) 64xCPU, 693GB, 2xA6000 (48GB), 1683.6GB (disk) $0.0 idle 2 weeks ago
Also, in the terminal where you run the dstack server, you should see logs similar to the ones below, indicating that dstack has successfully found and established connections with the listed servers:
[08:24:07] INFO dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-0...
INFO dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.171.224
[08:24:13] INFO dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-1...
INFO dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.171.225
[08:24:17] INFO dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-2...
[08:24:18] INFO dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.164.172
[08:24:23] INFO dstack._internal.server.background.tasks.process_instances:190 Adding ssh instance my-ssh-fleet-3...
INFO dstack._internal.server.background.tasks.process_instances:325 Connected to user xxx.xxx.165.51
[08:24:41] INFO dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-0 (xxx.xxx.171.224) was successfully added
[08:24:42] INFO dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-3 (xxx.xxx.165.51) was successfully added
[08:24:45] INFO dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-1 (xxx.xxx.171.225) was successfully added
[08:24:57] INFO dstack._internal.server.background.tasks.process_instances:245 The instance my-ssh-fleet-2 (xxx.xxx.164.172) was successfully added
- write task.dstack.yml
To test things out, I have written a simple YAML file for a dstack task as below, defining an LLM fine-tuning job with Hugging Face's Alignment Handbook framework. Note that I have requested 2 nodes, each with 3x RTX 6000 Ada GPUs:
type: task
nodes: 2
python: "3.11"
nvcc: true
env:
  - HUGGING_FACE_HUB_TOKEN
  - WANDB_API_KEY
  - ACCELERATE_LOG_LEVEL=info
commands:
  - cd alignment-handbook
  - python -m pip install .
  - python -m pip install flash-attn --no-build-isolation
  - pip install wandb
  - pip install huggingface-hub==0.24.7
  - accelerate launch
      --config_file recipes/accelerate_configs/multi_gpu.yaml
      --main_process_ip=$DSTACK_MASTER_NODE_IP
      --main_process_port=8008
      --machine_rank=$DSTACK_NODE_RANK
      --num_processes=$DSTACK_GPUS_NUM
      --num_machines=$DSTACK_NODES_NUM
      scripts/run_sft.py
      recipes/custom.yaml
ports:
  - 50002
resources:
  gpu:
    name: rtx6000ada
    memory: 48GB
    count: 3
  shm_size: 24GB
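The recipes/custom.yaml referenced above is my own Alignment Handbook recipe and is not reproduced here. Purely as a hypothetical illustration, an SFT recipe for the Alignment Handbook typically contains fields along these lines (the model, dataset, and hyperparameter values below are placeholders, not the ones I actually used):
# hypothetical recipes/custom.yaml -- placeholder values for illustration only
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16
dataset_mixer:
  HuggingFaceH4/ultrachat_200k: 1.0
dataset_splits:
  - train_sft
  - test_sft
max_seq_length: 2048
per_device_train_batch_size: 8
gradient_accumulation_steps: 2
learning_rate: 2.0e-05
num_train_epochs: 1
bf16: true
gradient_checkpointing: true
output_dir: data/custom-sft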
dstack lets users define three different types of jobs. A dev environment lets you provision a remote machine with your code, dependencies, and resources, and access it with your desktop IDE. A task allows you to schedule a job or run a web app; it lets you configure dependencies, resources, ports, and more, and tasks can be distributed and run on clusters. A service allows you to deploy a web app or a model as a scalable endpoint; it lets you configure dependencies, resources, authorization, auto-scaling rules, etc.
Note that service is not supported for on-prem environments since it requires a gateway in the current dstack version (0.18.17), but this requirement is expected to be lifted in a future release.
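For instance, a minimal dev environment configuration looks roughly like the following sketch, based on dstack's documented dev-environment type; the Python version, IDE, and resource values here are placeholders:
type: dev-environment
python: "3.11"
ide: vscode
resources:
  gpu: 48GB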
- apply task.dstack.yml
Apply the previously written task.dstack.yml with the dstack apply -f command as below. It will then show the registered target servers on which the job can be provisioned. When you enter y at the prompt, the fine-tuning job will be launched:
$ dstack apply -f task.dstack.yml
Configuration train.dstack.yml
Project main
User admin
Pool default-pool
Min resources 2..xCPU, 8GB.., 2xGPU (48GB), 100GB.. (disk)
Max price -
Max duration 72h
Spot policy on-demand
Retry policy no
Creation policy reuse-or-create
Termination policy destroy-after-idle
Termination idle time 5m
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 ssh remote instance 32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk) no $0 idle
2 ssh remote instance 32xCPU, 503GB, 3xRTX6000Ada (48GB), 1555.1GB (disk) no $0 idle
(BONUS) Register other cloud services at the same time
Now we have registered on-prem servers as a cluster with dstack's ssh-fleet. However, you may want to benefit from cloud services at the same time. For instance, this is particularly useful if you have multiple fine-tuning experiments to run: you can assign some experiments to the on-prem cluster while assigning others to a cloud service. This significantly reduces the overall time spent while making the most of the resources you are paying for.
To do this, simply follow dstack's official documentation on server/config.yml to add your favorite cloud services. For instance, a GCP backend can be added with application default credentials via the gcloud CLI toolkit, or with fine-grained control using service account credentials.
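As a sketch, a GCP backend using application default credentials would be declared in ~/.dstack/server/config.yml roughly as follows (the project ID below is a placeholder; check dstack's backend configuration reference for the exact fields):
projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-gcp-project
        creds:
          type: default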
After registering both the on-prem cluster and a cloud service as backends, the dstack apply command tries to find appropriate instances from the cloud services by default. Append the --backend remote option when you want to provision jobs on the on-prem cluster:
# to target cloud service
$ dstack apply -f task.dstack.yml
# to target on-prem cluster
$ dstack apply -f task.dstack.yml --backend remote
Concluding thoughts
dstack's ssh-fleet feature offers a streamlined approach to managing on-prem clusters for AI workloads. By simplifying setup and centralizing control, dstack empowers AI practitioners to efficiently leverage their on-prem resources, whether for large-scale model training or for running multiple experiments. The ability to seamlessly integrate with cloud services further enhances flexibility and scalability, making dstack a valuable tool in modern AI development.
As dstack continues to evolve, we can anticipate even more powerful features and broader support for various hardware and software configurations. This continuous development promises to further solidify dstack's position as a versatile and indispensable tool for managing AI infrastructure, both on-prem and in the cloud.