Learning about Distributed Inference with DeepSpeed ZeRO-3 and Docker Compose
- Michael Gwilt
- Machine Learning, DevOps
- October 7, 2024
Today, we’re going to test out DeepSpeed ZeRO-3 with Docker Compose. Perhaps in a future blog post, I’ll cover DeepSpeed-FastGen or how to deploy this on a real multi-node/multi-GPU cluster. I also aim to compare this method against multi-node inference with vLLM. If you’re setting up a local cluster, consider checking out high-bandwidth networking with InfiniBand; it’s surprisingly affordable.
What is DeepSpeed?
DeepSpeed ZeRO-3 (stage 3 of the Zero Redundancy Optimizer) is an optimization technique developed by Microsoft that enables efficient large-scale model training and inference. It’s particularly useful for distributing large language models across multiple GPUs or nodes. Here’s a brief overview of its key features:
Memory Efficiency: ZeRO-3 partitions model parameters, gradients, and optimizer states across GPUs, significantly reducing memory requirements per device.
Computation Efficiency: It keeps the computational granularity and communication efficiency of data-parallel training while removing its memory redundancy.
Scalability: ZeRO-3 allows for training and inference of models that are much larger than what can fit on a single GPU.
Communication Optimization: It implements efficient communication algorithms to minimize data transfer between devices.
Easy Integration: DeepSpeed can be integrated with existing PyTorch models with minimal code changes.
While ZeRO-3 shines in multi-GPU setups, even on a single GPU it can help with memory management (especially when combined with CPU offload), potentially allowing you to work with larger models or batch sizes than would otherwise be possible. In our single-GPU setup, we won’t see performance improvements, but we’ll gain valuable experience in configuring and deploying a distributed inference system.
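To make the integration point concrete, here’s a minimal, hypothetical sketch (not part of this project) of wrapping an ordinary PyTorch module for ZeRO-3 with DeepSpeed. It assumes a CUDA GPU and that the script is started with the deepspeed launcher, e.g. deepspeed --num_gpus=1 sketch.py:
import torch
import deepspeed

# A stand-in model; any torch.nn.Module can be wrapped the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 2),
)

# Minimal inline ZeRO-3 config; the real project keeps this in deepspeed_config.json.
ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize returns an engine that owns parameter partitioning
# and device placement, so the forward pass looks like plain PyTorch.
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=None,
)

with torch.no_grad():
    dummy = torch.randn(1, 512).half().to(model_engine.device)
    print(model_engine(dummy))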
Project Structure
FancyInference/
├── deepspeed_config.json
├── docker-compose.yml
├── Dockerfile
├── entrypoint.sh
├── hostfile.txt
├── inference.py
└── requirements.txt
Configuring our Hosts
Here’s our Dockerfile. We use entrypoint.sh because I’d like to use the DeepSpeed launcher, and I think it’s cleaner to have this in its own file instead of a long entrypoint string.
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
RUN apt-get update && apt-get install -y \
    git \
    wget \
    curl \
    && rm -rf /var/lib/apt/lists/*
ENV PYTHONUNBUFFERED=1
ENV DEBIAN_FRONTEND=noninteractive
RUN pip install --upgrade pip
RUN pip install deepspeed transformers
WORKDIR /app
COPY inference.py .
COPY deepspeed_config.json .
COPY hostfile.txt .
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]
Our docker-compose.yml will set up 3 hosts.
services:
  worker-1:
    build: .
    container_name: worker-1 # TAKE NOTE OF THE NAME HERE, AS THIS WILL BE OUR HOSTNAME IN HOSTFILE.TXT
    environment:
      RANK: 0
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0 # this is device 0, my only GPU. For multiple GPUs, use comma-separated device IDs
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

  worker-2:
    build: .
    container_name: worker-2
    environment:
      RANK: 1
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

  worker-3:
    build: .
    container_name: worker-3
    environment:
      RANK: 2
      WORLD_SIZE: 3
      MASTER_ADDR: worker-1
      MASTER_PORT: 12345
      NVIDIA_VISIBLE_DEVICES: 0
    networks:
      - distributed-net
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    runtime: nvidia

networks:
  distributed-net:
    driver: bridge
Here’s entrypoint.sh. We’ll let our test rely on docker-compose’s internal DNS.
#!/bin/bash
export RANK=${RANK:-0}
export WORLD_SIZE=${WORLD_SIZE:-3}
export MASTER_ADDR=${MASTER_ADDR:-worker-1}
export MASTER_PORT=${MASTER_PORT:-12345}
export NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-0}
# --no_ssh is here so that our nodes can talk to each other easily, without SSH.
# --num_gpus=1 because my machine only has 1 GPU; this should be the number of GPUs per node.
deepspeed --hostfile=hostfile.txt \
    --no_ssh \
    --num_gpus=1 \
    --num_nodes=$WORLD_SIZE \
    --node_rank=$RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    inference.py
Next, let’s set up our hostfile.txt. This file defines the hosts that will participate in our distributed inference setup. Here’s how we configure it:
- We’re setting up 3 workers, each with 1 GPU.
- The format is hostname slots=number_of_GPUs.
- In our case, we use worker-1, worker-2, and worker-3 as hostnames. These correspond to the container names in our docker-compose file.
- Docker conveniently uses these container names as hostnames, making our setup easier.
- If you have more GPUs per worker, you can adjust the slots value. For example, if worker-1 had 2 GPUs, you’d write worker-1 slots=2.
Here’s what our hostfile.txt looks like:
worker-1 slots=1
worker-2 slots=1
worker-3 slots=1
Lastly, we define our configuration file for DeepSpeed in deepspeed_config.json:
{
  "train_batch_size": 3,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3
  }
}
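A quick aside on these numbers: DeepSpeed enforces that train_batch_size equals train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs in the job, so 1 × 1 × 3 = 3 for our three single-GPU workers. Here’s a small, purely illustrative Python check of that identity (this snippet isn’t part of the project, just a sketch):
import json

# Sketch: verify the identity DeepSpeed checks at initialization time.
with open("deepspeed_config.json") as f:
    cfg = json.load(f)

world_size = 3  # one GPU in each of our three compose workers
expected = (
    cfg["train_micro_batch_size_per_gpu"]
    * cfg["gradient_accumulation_steps"]
    * world_size
)
assert cfg["train_batch_size"] == expected, "DeepSpeed would reject this config"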
Inference
First, we define our package requirements in requirements.txt (torch already ships with the PyTorch base image, which is why the Dockerfile installs only deepspeed and transformers):
torch
transformers
deepspeed
In inference.py, we use a small sentiment model and pass in a few statements to test with.
import os
import torch
import deepspeed
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def main():
    # Join the process group; rank and world size come from the launcher/environment.
    deepspeed.init_distributed()
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Wrap the model in a DeepSpeed engine using our ZeRO-3 config.
    ds_config = "deepspeed_config.json"
    model_engine, _, _, _ = deepspeed.initialize(
        model=model,
        config=ds_config,
        model_parameters=None
    )

    statements = [
        "I absolutely love this movie, it's a masterpiece!",
        "The service at this restaurant was terrible and the food was bland.",
        "I'm feeling quite neutral about the whole situation.",
        "This new gadget is amazing, it has exceeded all my expectations!",
        "I'm really disappointed with the outcome of the game."
    ]
    labels = ["Negative", "Positive"]

    for input_text in statements:
        inputs = tokenizer(input_text, return_tensors="pt")
        inputs = {key: value.to(model_engine.device) for key, value in inputs.items()}

        with torch.no_grad():
            outputs = model_engine(**inputs)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        # Only rank 0 prints the predicted label; the other workers just confirm completion.
        if rank == 0:
            print(f"Input: {input_text}")
            print(f"Prediction: {labels[predictions.item()]}")
        else:
            print(f"Worker {rank}: Inference completed for input: {input_text}")


if __name__ == "__main__":
    main()
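As an optional variant (not something the project requires), the per-statement loop in inference.py could be replaced with a single padded batch, since the tokenizer and model_engine above handle batched inputs as well:
# Optional variant of the loop above: run all statements in one padded batch.
inputs = tokenizer(statements, return_tensors="pt", padding=True, truncation=True)
inputs = {key: value.to(model_engine.device) for key, value in inputs.items()}

with torch.no_grad():
    logits = model_engine(**inputs).logits

if rank == 0:
    for text, pred in zip(statements, torch.argmax(logits, dim=-1).tolist()):
        print(f"Input: {text}")
        print(f"Prediction: {labels[pred]}")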
To test, navigate to your project folder and run:
docker compose build
docker compose up
After all of your workers initialize and the model is downloaded, you should see each worker process our input statements, with worker-1 (rank 0) printing the prediction for each.