Learning about Distributed Inference with DeepSpeed ZeRO-3 and Docker Compose
- Michael Gwilt
- Machine Learning, DevOps
- October 7, 2024
Today, we’re going to test out DeepSpeed ZeRO-3 with Docker Compose. Perhaps in a future blog post, I’ll cover DeepSpeed-FastGen or how to deploy this on a real multi-node/multi-GPU cluster. I also aim to compare this method against multi-node inference with vLLM. If you’re setting up a local cluster, consider checking out high-bandwidth networking with InfiniBand; it’s surprisingly affordable.
What is DeepSpeed?
DeepSpeed ZeRO-3 (stage 3 of the Zero Redundancy Optimizer) is an optimization technique developed by Microsoft that enables efficient large-scale model training and inference. It’s particularly useful for distributing large language models across multiple GPUs or nodes. Here’s a brief overview of its key features:
Memory Efficiency: ZeRO-3 partitions model parameters, gradients, and optimizer states across GPUs, significantly reducing memory requirements per device.
Computation Efficiency: It keeps the computational granularity and communication efficiency of data-parallel training while removing its memory redundancy.
Scalability: ZeRO-3 allows for training and inference of models that are much larger than what can fit on a single GPU.
Communication Optimization: It implements efficient communication algorithms to minimize data transfer between devices.
Easy Integration: DeepSpeed can be integrated with existing PyTorch models with minimal code changes.
While ZeRO-3 shines in multi-GPU setups, even on a single GPU it can help with memory management (especially when combined with CPU offload), potentially allowing you to work with larger models or batch sizes than would otherwise be possible. In our single-GPU setup, we won’t see performance improvements, but we’ll gain valuable experience in configuring and deploying a distributed inference system.
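To make the integration point concrete, here’s a minimal, hypothetical sketch (not part of this project) of wrapping an ordinary PyTorch module for ZeRO-3 with DeepSpeed. It assumes a CUDA GPU and that the script is started with the deepspeed launcher, e.g. deepspeed --num_gpus=1 sketch.py:
import torch
import deepspeed

# A stand-in model; any torch.nn.Module can be wrapped the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 2),
)

# Minimal inline ZeRO-3 config; the real project keeps this in deepspeed_config.json.
ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize returns an engine that owns parameter partitioning
# and device placement, so the forward pass looks like plain PyTorch.
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=None,
)

with torch.no_grad():
    dummy = torch.randn(1, 512).half().to(model_engine.device)
    print(model_engine(dummy))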
Project Structure
FancyInference/
├── deepspeed_config.json
├── docker-compose.yml
├── Dockerfile
├── entrypoint.sh
├── hostfile.txt
├── inference.py
└── requirements.txt
Configuring our Hosts
Here’s our Dockerfile. We use entrypoint.sh because I’d like to use the DeepSpeed launcher, and I think it’s cleaner to have this in its own file instead of a long entrypoint string.
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    && rm -rf /var/lib/apt/lists/*
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
RUN pip install --upgrade pip
RUN pip install deepspeed transformers
WORKDIR /app
COPY inference.py .
COPY deepspeed_config.json .
COPY hostfile.txt .
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
Our docker-compose.yml will set up 3 hosts.
services:
  worker-1:
    build: .
    container_name: worker-1 # TAKE NOTE OF THE NAME HERE, AS THIS WILL BE OUR HOSTNAME IN HOSTFILE.TXT
    environment:
      RANK: 0
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0 # this is device 0, my only GPU. For multiple GPUs, use comma-separated device IDs
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

  worker-2:
    build: .
    container_name: worker-2
    environment:
      RANK: 1
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

  worker-3:
    build: .
    container_name: worker-3
    environment:
      RANK: 2
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

networks:
  distributed-net:
    driver: bridge
Here’s entrypoint.sh. We’ll let our test rely on docker-compose’s internal DNS.
#!/bin/bash
export RANK=${RANK:-0}
export WORLD_SIZE=${WORLD_SIZE:-3}
export MASTER_ADDR=${MASTER_ADDR:-worker-1}
export MASTER_PORT=${MASTER_PORT:-12345}
export NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-0}
# --no_ssh is here so that our nodes can talk to each other easily, without SSH.
# --num_gpus=1 because my machine only has 1 GPU; this should be the number of GPUs per node.
deepspeed --hostfile=hostfile.txt \
    --no_ssh \
    --num_gpus=1 \
    --num_nodes=$WORLD_SIZE \
    --node_rank=$RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    inference.py
Next, let’s set up our hostfile.txt. This file defines the hosts that will participate in our distributed inference setup. Here’s how we configure it:
- We’re setting up 3 workers, each with 1 GPU.
- The format is hostname slots=number_of_GPUs.
- In our case, we use worker-1, worker-2, and worker-3 as hostnames. These correspond to the container names in our docker-compose file.
- Docker conveniently uses these container names as hostnames, making our setup easier.
- If you have more GPUs per worker, you can adjust the slots value. For example, if worker-1 had 2 GPUs, you’d write worker-1 slots=2.
Here’s what our hostfile.txt looks like:
worker-1 slots=1
worker-2 slots=1
worker-3 slots=1
Lastly, we define our configuration file for DeepSpeed in deepspeed_config.json:
{
  "train_batch_size": 3,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3
  }
}
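A quick aside on these numbers: DeepSpeed enforces that train_batch_size equals train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs in the job, so 1 × 1 × 3 = 3 for our three single-GPU workers. Here’s a small, purely illustrative Python check of that identity (this snippet isn’t part of the project, just a sketch):
import json

# Sketch: verify the identity DeepSpeed checks at initialization time.
with open("deepspeed_config.json") as f:
    cfg = json.load(f)

world_size = 3  # one GPU in each of our three compose workers
expected = (
    cfg["train_micro_batch_size_per_gpu"]
    * cfg["gradient_accumulation_steps"]
    * world_size
)
assert cfg["train_batch_size"] == expected, "DeepSpeed would reject this config"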
Inference
First, we define our package requirements in requirements.txt (torch already ships with the PyTorch base image, which is why the Dockerfile installs only deepspeed and transformers):
torch
transformers
deepspeed
In inference.py, we use a small sentiment model and pass in a few statements to test with.
import os
import torch
import deepspeed
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def main():
    # Join the process group; rank and world size come from the launcher/environment.
    deepspeed.init_distributed()
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Wrap the model in a DeepSpeed engine using our ZeRO-3 config.
    ds_config = "deepspeed_config.json"
    model_engine, _, _, _ = deepspeed.initialize(
        model=model,
        config=ds_config,
        model_parameters=None
    )

    statements = [
        "I absolutely love this movie, it's a masterpiece!",
        "The service at this restaurant was terrible and the food was bland.",
        "I'm feeling quite neutral about the whole situation.",
        "This new gadget is amazing, it has exceeded all my expectations!",
        "I'm really disappointed with the outcome of the game."
    ]
    labels = ["Negative", "Positive"]

    for input_text in statements:
        inputs = tokenizer(input_text, return_tensors="pt")
        inputs = {key: value.to(model_engine.device) for key, value in inputs.items()}

        with torch.no_grad():
            outputs = model_engine(**inputs)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        # Only rank 0 prints the predicted label; the other workers just confirm completion.
        if rank == 0:
            print(f"Input: {input_text}")
            print(f"Prediction: {labels[predictions.item()]}")
        else:
            print(f"Worker {rank}: Inference completed for input: {input_text}")


if __name__ == "__main__":
    main()
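As an optional variant (not something the project requires), the per-statement loop in inference.py could be replaced with a single padded batch, since the tokenizer and model_engine above handle batched inputs as well:
# Optional variant of the loop above: run all statements in one padded batch.
inputs = tokenizer(statements, return_tensors="pt", padding=True, truncation=True)
inputs = {key: value.to(model_engine.device) for key, value in inputs.items()}

with torch.no_grad():
    logits = model_engine(**inputs).logits

if rank == 0:
    for text, pred in zip(statements, torch.argmax(logits, dim=-1).tolist()):
        print(f"Input: {text}")
        print(f"Prediction: {labels[pred]}")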
To test, navigate to your project folder and run:
docker compose build
docker compose up
After all of your workers initialize and the model is downloaded, you should see each worker process our input statements, with worker-1 (rank 0) printing the prediction for each.