Overview
Checkpointing saves your model’s state during training, allowing you to:

- Resume training after interruptions (preemptions, crashes, maintenance)
- Recover from failures without losing hours of training progress
- Enable spot instance usage for cost savings with fault tolerance
Prerequisites
Before implementing a checkpointing strategy, you must establish a secure and efficient connection between your training environment and Amazon S3.

Create an S3 Bucket

You need a dedicated bucket for your model artifacts. It is highly recommended to create this bucket in the same region as your training instances to minimize data transfer costs and latency.
Via AWS Console
- Navigate to S3 in the AWS Console
- Click Create bucket
- Choose a unique bucket name (e.g., my-training-checkpoints)
- Select your preferred region
- Configure access settings as needed
Via AWS CLI
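The same bucket can be created from the command line. A minimal sketch (the bucket name and region below are placeholders — substitute your own):

```shell
# Create the bucket in the same region as your training instances
aws s3 mb s3://my-training-checkpoints --region us-east-1
```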
Configure AWS Credentials
Your training script requires permissions to call PutObject and GetObject on your S3 bucket.
Ensure your training environment has AWS credentials configured:
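Two common ways to provide credentials (the values below are placeholders):

```shell
# Option 1: interactive configuration (writes ~/.aws/credentials)
aws configure

# Option 2: environment variables, e.g. inside a job launch script
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
export AWS_DEFAULT_REGION=us-east-1
```

On EC2 or SageMaker, attaching an IAM role with S3 permissions to the instance avoids storing long-lived credentials altogether.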
Checkpointing Methods
There are multiple strategies for integrating Amazon S3 into your training loop, each offering different trade-offs between performance, ease of implementation, and code intrusiveness.

AWS S3 Torch Connector

The official AWS connector for PyTorch provides native S3 integration with optimized streaming.

Installation
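The connector is published on PyPI as s3torchconnector (package name current at the time of writing):

```shell
pip install s3torchconnector
```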
Saving Checkpoints
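A minimal save sketch using the connector’s S3Checkpoint interface; the model, bucket URI, and region are placeholders for your own:

```python
import torch
import torch.nn as nn
from s3torchconnector import S3Checkpoint

model = nn.Linear(10, 2)  # stand-in for your model
checkpoint = S3Checkpoint(region="us-east-1")

# Stream the state dict directly to S3 -- no intermediate local file
with checkpoint.writer("s3://my-training-checkpoints/epoch_10.pt") as writer:
    torch.save(model.state_dict(), writer)
```

Saving the optimizer state and epoch counter in the same dict (rather than only model weights) makes full training resumption possible.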
Loading Checkpoints
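Loading mirrors saving: the reader streams the object back from S3 (URI and region are placeholders):

```python
import torch
import torch.nn as nn
from s3torchconnector import S3Checkpoint

model = nn.Linear(10, 2)  # must match the architecture that was saved
checkpoint = S3Checkpoint(region="us-east-1")

with checkpoint.reader("s3://my-training-checkpoints/epoch_10.pt") as reader:
    state_dict = torch.load(reader)
model.load_state_dict(state_dict)
```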
PyTorch Lightning
PyTorch Lightning uses fsspec to write checkpoints directly to S3 with minimal configuration.

Installation
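Lightning delegates remote paths to fsspec, which needs the s3fs backend installed to handle s3:// URIs:

```shell
pip install lightning s3fs
```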
Basic S3 Checkpointing
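With s3fs installed, pointing the trainer’s root directory at an S3 URI is enough for checkpoints to land in the bucket. A sketch (bucket path is a placeholder; model and dataloader come from your own LightningModule setup):

```python
import lightning as L

# Checkpoints (and logs) are written straight to S3 via fsspec/s3fs
trainer = L.Trainer(
    default_root_dir="s3://my-training-checkpoints/runs",
    max_epochs=10,
)
# trainer.fit(model, train_dataloader)
```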
Resuming from S3 Checkpoint
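Resumption works by passing the checkpoint’s S3 URI as ckpt_path; Lightning restores model weights, optimizer state, and the epoch counter. The path below is a placeholder for an actual checkpoint in your bucket:

```python
import lightning as L

trainer = L.Trainer(default_root_dir="s3://my-training-checkpoints/runs")
trainer.fit(
    model,                # your LightningModule
    train_dataloader,
    ckpt_path="s3://my-training-checkpoints/runs/last.ckpt",  # placeholder path
)
```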
Mounting S3 as Filesystem (s3fs-fuse)
Mount your S3 bucket as a local directory, allowing any training code to checkpoint without modification.

Installation
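s3fs-fuse is available from the standard package repositories on most distributions:

```shell
# Debian/Ubuntu
sudo apt-get install -y s3fs
# Amazon Linux / RHEL (via EPEL)
# sudo yum install -y s3fs-fuse
```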
Setup
1. Create credentials file
2. Mount the bucket
3. Verify mount
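The three steps above can be sketched as follows (bucket name and mount point are placeholders; credentials come from your environment):

```shell
# 1. Create the credentials file s3fs expects (ACCESS_KEY_ID:SECRET_ACCESS_KEY)
echo "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs   # s3fs refuses world-readable credential files

# 2. Mount the bucket onto a local directory
mkdir -p /mnt/checkpoints
s3fs my-training-checkpoints /mnt/checkpoints -o passwd_file=${HOME}/.passwd-s3fs

# 3. Verify the mount by listing the bucket contents
ls /mnt/checkpoints
```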
Usage in Training Code
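Because the mount point behaves like a local directory, unmodified save/load calls write through to S3. A sketch with a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your model

# Ordinary filesystem paths -- s3fs transparently uploads to the bucket
torch.save(model.state_dict(), "/mnt/checkpoints/epoch_10.pt")
state_dict = torch.load("/mnt/checkpoints/epoch_10.pt")
model.load_state_dict(state_dict)
```

Note that FUSE mounts add per-operation overhead, which is why this method rates Medium on performance in the comparison below.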
Unmounting
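Unmount with the standard FUSE tooling when the mount is no longer needed:

```shell
fusermount -u /mnt/checkpoints
# or, if fusermount is unavailable:
# sudo umount /mnt/checkpoints
```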
Manual Sync with AWS CLI
Periodically sync local checkpoints to S3 using AWS CLI or rsync-style commands.

Installation
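One way to install the AWS CLI is via pip (AWS also publishes standalone installers):

```shell
pip install awscli
aws --version
```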
Sync After Each Checkpoint
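A small Python wrapper can invoke `aws s3 sync` from the training loop so that only new or changed checkpoint files are uploaded. This is a sketch; the directory and bucket paths are placeholders, and `build_sync_command` / `sync_checkpoints` are illustrative helper names, not part of any library:

```python
import subprocess

def build_sync_command(local_dir: str, s3_uri: str, delete: bool = False) -> list[str]:
    """Construct the `aws s3 sync` command for uploading checkpoints."""
    cmd = ["aws", "s3", "sync", local_dir, s3_uri]
    if delete:
        cmd.append("--delete")  # also remove remote files no longer present locally
    return cmd

def sync_checkpoints(local_dir: str, s3_uri: str) -> None:
    """Push new/changed checkpoint files to S3; raises on a failed sync."""
    subprocess.run(build_sync_command(local_dir, s3_uri), check=True)

# In the training loop, after torch.save(...) writes to ./checkpoints:
# sync_checkpoints("./checkpoints", "s3://my-training-checkpoints/run-001/")
```

Because `sync` only transfers changed files, calling it after every checkpoint costs little when most files are unchanged.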
Choosing the Right Method
Selecting the optimal checkpointing strategy involves balancing performance requirements with implementation effort. The following comparison helps you weigh these trade-offs against your specific framework and infrastructure constraints to ensure your training pipeline remains both robust and efficient.

| Method | Best For | Complexity | Performance |
|---|---|---|---|
| S3 Torch Connector | PyTorch users wanting native integration | Low | High (streaming) |
| PyTorch Lightning | Lightning-based training workflows | Very Low | High |
| s3fs-fuse | Existing code without modifications | Medium | Medium |
| AWS CLI Sync | Simple setups, full control | Low | Medium |
Recommendations
- PyTorch Lightning users: Use the built-in default_root_dir with an S3 path
- Vanilla PyTorch projects: Use AWS S3 Torch Connector for best performance
- Legacy code or multi-framework: Use s3fs-fuse mount
- Simple or infrequent checkpoints: Use AWS CLI sync
