Overview
Checkpointing saves your model’s state during training, allowing you to:
- Resume training after interruptions (preemptions, crashes, maintenance)
- Recover from failures without losing hours of training progress
- Enable spot instance usage for cost savings with fault tolerance
Storing checkpoints in S3 provides durable, accessible storage that persists beyond the lifetime of your training nodes.
Prerequisites
Before implementing a checkpointing strategy, you must establish a secure and efficient connection between your training environment and Amazon S3.
Create an S3 Bucket
You need a dedicated S3 bucket for your model artifacts. Create it in the same region as your training instances to minimize data transfer costs and latency.
Via AWS Console
- Navigate to S3 in the AWS Console
- Click Create bucket
- Choose a unique bucket name (e.g., my-training-checkpoints)
- Select your preferred region
- Configure access settings as needed
Via AWS CLI
aws s3 mb s3://my-training-checkpoints --region us-east-1
Your training script requires s3:PutObject and s3:GetObject permissions on your S3 bucket.
Ensure your training environment has AWS credentials configured:
# Option 1: Environment variables
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
export AWS_DEFAULT_REGION=us-east-1
# Option 2: AWS credentials file (~/.aws/credentials)
aws configure
Checkpointing Methods
There are multiple strategies for integrating Amazon S3 into your training loop, each offering different trade-offs between performance, ease of implementation, and code intrusiveness.
AWS S3 Torch Connector
The official AWS connector for PyTorch provides native S3 integration with optimized streaming.
Installation
pip install s3torchconnector
Saving Checkpoints
from s3torchconnector import S3Checkpoint
import torch
# Initialize the checkpoint handler
checkpoint = S3Checkpoint(region="us-east-1")
# Save the model checkpoint directly to S3
checkpoint_path = "s3://my-training-checkpoints/model_epoch_10.pt"
with checkpoint.writer(checkpoint_path) as writer:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, writer)
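When checkpoints are written every few epochs, a consistent key scheme makes them easy to list and sort later. A minimal sketch (the `checkpoint_uri` helper is hypothetical, not part of s3torchconnector):

```python
def checkpoint_uri(bucket: str, run: str, epoch: int) -> str:
    """Build a consistent S3 URI for a checkpoint (hypothetical helper)."""
    # Zero-pad epochs so lexicographic order in S3 listings matches numeric order.
    return f"s3://{bucket}/{run}/model_epoch_{epoch:04d}.pt"

print(checkpoint_uri("my-training-checkpoints", "experiment-1", 10))
# → s3://my-training-checkpoints/experiment-1/model_epoch_0010.pt
```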
Loading Checkpoints
from s3torchconnector import S3Checkpoint
import torch
checkpoint = S3Checkpoint(region="us-east-1")
checkpoint_path = "s3://my-training-checkpoints/model_epoch_10.pt"
with checkpoint.reader(checkpoint_path) as reader:
    checkpoint_data = torch.load(reader)
model.load_state_dict(checkpoint_data['model_state_dict'])
optimizer.load_state_dict(checkpoint_data['optimizer_state_dict'])
start_epoch = checkpoint_data['epoch'] + 1  # resume from the next epoch
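Resuming usually means finding the newest checkpoint rather than hard-coding an epoch. A sketch that picks the highest epoch from a list of object keys (the listing step itself, e.g. via boto3, is not shown; `latest_checkpoint` is a hypothetical helper):

```python
import re

def latest_checkpoint(keys):
    """Return the key with the highest epoch number, or None if none match.

    Assumes keys follow the model_epoch_<N>.pt pattern used above.
    """
    best, best_epoch = None, -1
    for key in keys:
        m = re.search(r"model_epoch_(\d+)\.pt$", key)
        if m and int(m.group(1)) > best_epoch:
            best_epoch, best = int(m.group(1)), key
    return best

keys = ["model_epoch_3.pt", "model_epoch_10.pt", "model_epoch_7.pt"]
print(latest_checkpoint(keys))  # → model_epoch_10.pt
```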
PyTorch Lightning
PyTorch Lightning uses fsspec to write checkpoints directly to S3 with minimal configuration.
Installation
pip install lightning s3fs
Basic S3 Checkpointing
from lightning.pytorch import Trainer

trainer = Trainer(
    # Lightning uses fsspec to write directly to S3
    default_root_dir="s3://my-training-checkpoints/experiment-1/",
    enable_checkpointing=True,
)
trainer.fit(model)
Resuming from S3 Checkpoint
from lightning.pytorch import Trainer

trainer = Trainer(
    default_root_dir="s3://my-training-checkpoints/experiment-1/",
    enable_checkpointing=True,
)
# Resume from a specific checkpoint
trainer.fit(
    model,
    ckpt_path="s3://my-training-checkpoints/experiment-1/checkpoints/last.ckpt",
)
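When a job may be restarted from scratch, it is convenient to pass ckpt_path only if a previous run left a last.ckpt behind. A minimal sketch of that decision (`resume_ckpt_path` is a hypothetical helper; the key listing is assumed to come from s3fs or boto3, not shown):

```python
def resume_ckpt_path(existing_keys, prefix):
    """Return the ckpt_path to pass to trainer.fit, or None for a fresh run."""
    last = f"{prefix}/checkpoints/last.ckpt"
    return last if last in existing_keys else None
```

With ckpt_path=None, trainer.fit simply starts a fresh run, so the same call works for both first launches and restarts.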
Mounting S3 as Filesystem (s3fs-fuse)
Mount your S3 bucket as a local directory, allowing any training code to checkpoint without modification.
Installation
# Ubuntu/Debian
sudo apt-get install s3fs
# Amazon Linux
sudo yum install s3fs-fuse
Setup
Create credentials file
echo ${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY} > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
Mount the bucket
mkdir -p /mnt/s3-checkpoints
s3fs my-training-checkpoints /mnt/s3-checkpoints \
    -o passwd_file=~/.passwd-s3fs \
    -o url=https://s3.us-east-1.amazonaws.com \
    -o use_path_request_style
Verify mount
df -h /mnt/s3-checkpoints
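An unmounted or failed s3fs directory silently behaves like ordinary local disk, so it can pay to fail fast before training starts. A standard-library sketch (`require_mount` is a hypothetical helper):

```python
import os

def require_mount(path: str) -> None:
    """Fail fast if the checkpoint directory is not an active mount point."""
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} is not mounted; refusing to write checkpoints")

# require_mount("/mnt/s3-checkpoints")  # call once before the training loop
```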
Usage in Training Code
import torch

# No code changes needed - save as if it were local storage
checkpoint_dir = "/mnt/s3-checkpoints/my-experiment"
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_epoch_{epoch}.pt")
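Because the bucket accumulates files indefinitely, a small retention policy keeps storage costs bounded. A sketch that keeps only the newest N checkpoints, assuming the checkpoint_epoch_<N>.pt naming above (`prune_checkpoints` is a hypothetical helper):

```python
import os
import re

def prune_checkpoints(checkpoint_dir: str, keep: int = 3) -> list:
    """Delete all but the `keep` highest-epoch checkpoints; return removed names."""
    def epoch_of(name):
        m = re.search(r"checkpoint_epoch_(\d+)\.pt$", name)
        return int(m.group(1)) if m else None

    ckpts = [f for f in os.listdir(checkpoint_dir) if epoch_of(f) is not None]
    ckpts.sort(key=epoch_of, reverse=True)  # newest first
    removed = ckpts[keep:]
    for name in removed:
        os.remove(os.path.join(checkpoint_dir, name))
    return removed
```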
Unmounting
fusermount -u /mnt/s3-checkpoints
Manual Sync with AWS CLI
Periodically sync local checkpoints to S3 using AWS CLI or rsync-style commands.
Installation
This approach needs no extra packages beyond the AWS CLI already used in the prerequisite steps above.
Sync After Each Checkpoint
import subprocess
import torch

def save_checkpoint_and_sync(model, optimizer, epoch, local_dir, s3_bucket):
    # Save locally first
    checkpoint_path = f"{local_dir}/checkpoint_epoch_{epoch}.pt"
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)
    # Sync to S3
    subprocess.run([
        "aws", "s3", "sync",
        local_dir,
        f"s3://{s3_bucket}/checkpoints/",
        "--exclude", "*",
        "--include", "*.pt"
    ], check=True)

# Usage in training loop
for epoch in range(start_epoch, num_epochs):
    train_one_epoch(model, dataloader, optimizer)
    save_checkpoint_and_sync(
        model, optimizer, epoch,
        local_dir="/tmp/checkpoints",
        s3_bucket="my-training-checkpoints"
    )
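Syncing after every epoch shells out frequently; syncing every N epochs (and always on the final one) is a common compromise. A sketch of that scheduling decision, with the sync invocation built as an argv list so it can be inspected before handing it to subprocess.run (`should_sync` and `sync_command` are hypothetical helpers):

```python
def should_sync(epoch: int, num_epochs: int, every: int = 5) -> bool:
    """Sync on every `every`-th epoch and always on the last one."""
    return epoch % every == 0 or epoch == num_epochs - 1

def sync_command(local_dir: str, s3_bucket: str) -> list:
    """The aws s3 sync invocation used above, as an argv list."""
    return [
        "aws", "s3", "sync", local_dir, f"s3://{s3_bucket}/checkpoints/",
        "--exclude", "*", "--include", "*.pt",
    ]
```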
Choosing the Right Method
Selecting a checkpointing strategy means balancing performance requirements against implementation effort. The comparison below weighs these trade-offs against your framework and infrastructure constraints so your training pipeline stays both robust and efficient.
| Method | Best For | Complexity | Performance |
|---|---|---|---|
| S3 Torch Connector | PyTorch users wanting native integration | Low | High (streaming) |
| PyTorch Lightning | Lightning-based training workflows | Very Low | High |
| s3fs-fuse | Existing code without modifications | Medium | Medium |
| AWS CLI Sync | Simple setups, full control | Low | Medium |
Recommendations
- PyTorch Lightning users: Use the built-in default_root_dir with an S3 path
- Vanilla PyTorch projects: Use AWS S3 Torch Connector for best performance
- Legacy code or multi-framework: Use s3fs-fuse mount
- Simple or infrequent checkpoints: Use AWS CLI sync