@SumukhP-dev
Forked from JacobFV/README.md
Last active October 13, 2025 10:18
huggingface_to_s3

Hugging Face Repository to S3 Transfer Script

This Python script allows you to transfer files from a Hugging Face repository to an Amazon S3 bucket. It iterates over all the files in the specified repository, downloads them one at a time, and uploads them to the designated S3 bucket.

Prerequisites

Before running the script, ensure that you have the following:

  • An EC2 instance running the Amazon Linux AMI with io2 storage provisioned at 10,000 IOPS. Make sure the instance allows outbound HTTPS and inbound SSH.

OR

  • Python 3.x installed on your system
  • AWS account with access to S3
  • Hugging Face repository details (owner, repository name, branch)
  • S3 bucket name for storing the transferred files

Setup

  1. SSH into your instance, either by clicking Connect in the top right corner of the EC2 console or with a regular SSH client.

  2. Open nano and paste the script into a file named huggingface_to_s3.py.

  3. Install the required Python packages by running the following command:

    pip install boto3 requests
    
  4. Configure your AWS credentials using one of the following methods:

    • Assign an IAM role to your EC2 instance that allows writing to S3 (s3:PutObject, plus s3:ListBucket and s3:CreateBucket for the bucket check and creation)

    OR

    • Set up environment variables: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
    • Use an AWS credentials file (~/.aws/credentials)
    • Configure the AWS CLI using aws configure
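
Before starting a long transfer, it can help to check which of these credential sources is present on the machine. The sketch below is only an illustration: `detect_credential_sources` is a hypothetical helper, it looks only at the two environment variables and the default credentials file path listed above, and it neither validates the credentials nor detects an attached IAM role.

```python
import os
from pathlib import Path


def detect_credential_sources(env=None, home=None):
    """Return the names of credential sources that appear to be configured.

    Hypothetical helper: it checks for presence only, never for validity.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Both variables are needed for the environment-variable method.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # `aws configure` writes to this same file.
    if (home / ".aws" / "credentials").is_file():
        sources.append("credentials file")
    return sources
```

Calling `detect_credential_sources()` with no arguments inspects the current environment and home directory. An empty list does not necessarily mean the transfer will fail, since an IAM role attached to the instance is not visible to this check.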

Usage

To run the script, use the following command:

python huggingface_to_s3.py --repo-owner <repo_owner> --repo-name <repo_name> --branch <branch> --repo-type <datasets|models|spaces> --s3-bucket <s3_bucket_name>

Replace the placeholders with the appropriate values:

  • <repo_owner>: The owner of the Hugging Face repository.
  • <repo_name>: The name of the Hugging Face repository.
  • <branch>: The branch of the repository to transfer files from (default: "main").
  • <datasets|models|spaces>: The type of Hugging Face repository (default: "datasets").
  • <s3_bucket_name>: The name of the S3 bucket to store the transferred files.
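
To make the placeholders concrete, here is a small sketch of how they could map onto Hugging Face URLs. It assumes the `/api/<type>/<owner>/<name>/tree/<branch>` listing endpoint, and it assumes that model repositories are served from the site root while datasets and spaces keep their type prefix. `build_urls` is a hypothetical helper, and the owner and repository names in the example are made up.

```python
def build_urls(repo_type, repo_owner, repo_name, branch="main", file_path="README.md"):
    """Return (tree_url, file_url) for one file in a Hugging Face repository."""
    base = "https://huggingface.co"
    # Listing endpoint: one JSON entry per file or directory.
    tree_url = f"{base}/api/{repo_type}/{repo_owner}/{repo_name}/tree/{branch}"
    # Assumption: model download URLs have no "models/" prefix.
    prefix = "" if repo_type == "models" else f"{repo_type}/"
    file_url = f"{base}/{prefix}{repo_owner}/{repo_name}/resolve/{branch}/{file_path}"
    return tree_url, file_url


# Example with made-up names.
tree_url, file_url = build_urls("datasets", "some-owner", "some-dataset")
print(tree_url)
print(file_url)
```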

The script will create the S3 bucket if it doesn't already exist.
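
One detail worth knowing about bucket creation: outside `us-east-1`, S3 expects the target region to be passed as a `LocationConstraint`. The sketch below shows one way to build the keyword arguments for boto3's `create_bucket`. `create_bucket_kwargs` is a hypothetical helper, and the bucket and region names are placeholders.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3_client.create_bucket(**kwargs)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be given as a constraint.
    if region and region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage with boto3 (not executed here):
#   s3_client = boto3.client("s3")
#   s3_client.create_bucket(
#       **create_bucket_kwargs("my-bucket", s3_client.meta.region_name)
#   )
print(create_bucket_kwargs("my-bucket", "eu-west-1"))
```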

Advice

  • Ensure that you have the necessary permissions to access the Hugging Face repository and the S3 bucket.
  • Be cautious when transferring large repositories, as it may take a considerable amount of time and consume significant network bandwidth.
  • Monitor the script's output for any error messages or warnings during the transfer process.
  • Regularly review and clean up the S3 bucket to avoid unnecessary storage costs.
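
To get a feel for how large a transfer will be before starting it, the sizes in the repository listing can be summed. This sketch assumes each listing entry is a dictionary with `type`, `path`, and `size` keys, which is what the tree endpoint appears to return. The sample entries are made up for illustration.

```python
def total_size_bytes(entries):
    """Sum the sizes of all file entries, ignoring directories."""
    return sum(e.get("size", 0) for e in entries if e.get("type") == "file")


def human_readable(num_bytes):
    """Format a byte count using binary units."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Made-up listing for illustration.
sample = [
    {"type": "file", "path": "config.json", "size": 512},
    {"type": "directory", "path": "data", "size": 0},
    {"type": "file", "path": "data/train.parquet", "size": 1048576},
]
print(human_readable(total_size_bytes(sample)))
```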

Troubleshooting

  • If you encounter any issues related to AWS credentials or permissions, double-check your AWS configuration and ensure that you have the required permissions to access S3.
  • If the script fails to retrieve files from the Hugging Face repository, verify that the repository details (owner, name, branch) are correct and that you have the necessary access rights.
  • If you experience network-related issues, check your internet connection and ensure that you can access the Hugging Face API and AWS S3 endpoints.
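
The checks above can be roughly summarized as a mapping from HTTP status codes to a likely cause. This is a sketch under general HTTP conventions, not a description of how Hugging Face or S3 report every failure. `troubleshooting_hint` is a hypothetical helper.

```python
def troubleshooting_hint(status_code):
    """Map an HTTP status code to a rough troubleshooting hint."""
    if status_code in (401, 403):
        return "Access denied: check credentials, tokens, and permissions."
    if status_code == 404:
        return "Not found: check the repository owner, name, type, and branch."
    if status_code == 429:
        return "Rate limited: wait before retrying."
    if 500 <= status_code < 600:
        return "Server-side error: retry later."
    return "No specific hint for this status code."


for code in (403, 404, 503):
    print(code, "->", troubleshooting_hint(code))
```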

For more detailed information and advanced usage, please refer to the script's source code and the documentation of the respective libraries (boto3 and requests):

#!/usr/bin/env python3
"""Transfer every file in a Hugging Face repository to an S3 bucket."""
import argparse
import os
import sys

import boto3
import requests
from botocore.exceptions import ClientError

# ----------------------------
# Parse command-line arguments
# ----------------------------
parser = argparse.ArgumentParser(description="Transfer files from Hugging Face repository to S3")
parser.add_argument("--repo-owner", required=True, help="Owner of the Hugging Face repository")
parser.add_argument("--repo-name", required=True, help="Name of the Hugging Face repository")
parser.add_argument("--branch", default="main", help="Branch of the Hugging Face repository (default: main)")
parser.add_argument("--s3-bucket", required=True, help="Name of the S3 bucket")
parser.add_argument("--repo-type", default="datasets",
                    choices=["datasets", "models", "spaces"],
                    help="Type of Hugging Face repository (datasets/models/spaces)")
args = parser.parse_args()
repo_owner = args.repo_owner
repo_name = args.repo_name
branch = args.branch
s3_bucket_name = args.s3_bucket
repo_type = args.repo_type

# ----------------------------
# Set up AWS S3 client
# ----------------------------
s3_client = boto3.client("s3")
try:
    s3_client.head_bucket(Bucket=s3_bucket_name)
    print(f"S3 bucket '{s3_bucket_name}' already exists")
except ClientError as e:
    code = e.response.get("Error", {}).get("Code", "")
    if code == "404":
        # Outside us-east-1, S3 requires an explicit LocationConstraint.
        region = s3_client.meta.region_name
        if region and region != "us-east-1":
            s3_client.create_bucket(
                Bucket=s3_bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        else:
            s3_client.create_bucket(Bucket=s3_bucket_name)
        print(f"S3 bucket '{s3_bucket_name}' created successfully")
    elif code == "403":
        print(f"Access denied to bucket '{s3_bucket_name}'. Check IAM permissions.")
        sys.exit(1)
    else:
        raise

# ----------------------------
# Hugging Face API endpoint
# ----------------------------
api_url = f"https://huggingface.co/api/{repo_type}/{repo_owner}/{repo_name}/tree/{branch}"
# Model repositories are served from the site root; datasets and spaces keep their prefix.
url_prefix = "" if repo_type == "models" else f"{repo_type}/"


def download_file(file_path):
    """Download a file from the HF repo. Return its local path, or None on failure."""
    file_url = f"https://huggingface.co/{url_prefix}{repo_owner}/{repo_name}/resolve/{branch}/{file_path}"
    with requests.get(file_url, stream=True, timeout=60) as resp:
        if resp.status_code != 200:
            print(f"⚠️ Failed to download {file_path} ({resp.status_code})")
            return None
        local_path = os.path.basename(file_path)
        # Write in chunks so large files are never held in memory all at once.
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
    print(f"✅ Downloaded {file_path}")
    return local_path


# ----------------------------
# Main
# ----------------------------
def main():
    print(f"Fetching Hugging Face repo tree: {api_url}")
    try:
        # recursive=true asks the API to include files inside subdirectories.
        response = requests.get(api_url, params={"recursive": "true"}, timeout=15)
        # Check for 404 first; raise_for_status() would otherwise report it as a generic error.
        if response.status_code == 404:
            print(f"❌ Repository not found: {repo_owner}/{repo_name} ({repo_type})")
            sys.exit(1)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"❌ Error connecting to Hugging Face API:\n{e}")
        sys.exit(1)
    try:
        files = response.json()
    except ValueError:
        print("❌ Failed to parse JSON response from Hugging Face API.")
        sys.exit(1)
    if not isinstance(files, list) or not files:
        print("⚠️ No files found in the repository.")
        return
    for f in files:
        path = f.get("path")
        # Skip directory entries; only files can be downloaded.
        if not path or f.get("type") == "directory":
            continue
        local_file = download_file(path)
        if local_file:
            s3_key = path.replace("\\", "/")  # normalize path separators
            s3_client.upload_file(local_file, s3_bucket_name, s3_key)
            os.remove(local_file)
            print(f"📤 Uploaded to s3://{s3_bucket_name}/{s3_key}")
    print("✅ Transfer complete.")


if __name__ == "__main__":
    main()