Skip to content

Instantly share code, notes, and snippets.

@padeoe
Last active November 14, 2024 12:59
Show Gist options
  • Save padeoe/697678ab8e528b85a2a7bddafea1fa4f to your computer and use it in GitHub Desktop.
Save padeoe/697678ab8e528b85a2a7bddafea1fa4f to your computer and use it in GitHub Desktop.
CLI-Tool for download Huggingface models and datasets with aria2/wget+git

🤗Huggingface Model Downloader

Considering the lack of multi-threaded download support in the official huggingface-cli, and the inadequate error handling in hf_transfer, this command-line tool smartly utilizes wget or aria2 for LFS files and git clone for the rest.

Features

  • ⏯️ Resume from breakpoint: You can re-run it or Ctrl+C anytime.
  • 🚀 Multi-threaded Download: Utilize multiple threads to speed up the download process.
  • 🚫 File Exclusion: Use --exclude or --include to skip or specify files, save time for models with duplicate formats (e.g., *.bin or *.safetensors).
  • 🔐 Auth Support: For gated models that require Huggingface login, use --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: Set up with HF_ENDPOINT environment variable.
  • 🌍 Proxy Support: Set up with HTTPS_PROXY environment variable.
  • 📦 Simple: Only depend on git, aria2c/wget.

Usage

First, Download hfd.sh or clone this repo, and then grant execution permission to the script.

chmod a+x hfd.sh

you can create an alias for convenience

alias hfd="$PWD/hfd.sh"

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
  --exclude       (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
  include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model need login

Get huggingface token from https://huggingface.co/settings/tokens, then

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors

Download with aria2c and multiple threads:

hfd bigscience/bloom-560m

Output: During the download, the file URLs will be displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
display_help() {
cat << EOF
Usage:
hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
--exclude (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
exit 1
}
MODEL_ID=$1
shift
# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
INCLUDE_PATTERNS=()
EXCLUDE_PATTERNS=()
while [[ $# -gt 0 ]]; do
case $1 in
--include)
shift
while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
INCLUDE_PATTERNS+=("$1")
shift
done
;;
--exclude)
shift
while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
EXCLUDE_PATTERNS+=("$1")
shift
done
;;
--hf_username) HF_USERNAME="$2"; shift 2 ;;
--hf_token) HF_TOKEN="$2"; shift 2 ;;
--tool) TOOL="$2"; shift 2 ;;
-x) THREADS="$2"; shift 2 ;;
--dataset) DATASET=1; shift ;;
--local-dir) LOCAL_DIR="$2"; shift 2 ;;
*) shift ;;
esac
done
# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
if ! command -v $1 &>/dev/null; then
echo -e "${RED}$1 is not installed. Please install it first.${NC}"
exit 1
fi
}
# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
git config --global --add safe.directory "${PWD}"
printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
fi
}
[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs
[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
if [[ -z "$LOCAL_DIR" ]]; then
LOCAL_DIR="${MODEL_ID#*/}"
fi
if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"
if [ -d "$LOCAL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
fi
echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
ensure_ownership
while IFS= read -r file; do
truncate -s 0 "$file"
done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi
printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls
file_matches_include_patterns() {
local file="$1"
for pattern in "${INCLUDE_PATTERNS[@]}"; do
if [[ "$file" == $pattern ]]; then
return 0
fi
done
return 1
}
file_matches_exclude_patterns() {
local file="$1"
for pattern in "${EXCLUDE_PATTERNS[@]}"; do
if [[ "$file" == $pattern ]]; then
return 0
fi
done
return 1
}
while IFS= read -r file; do
url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
file_dir=$(dirname "$file")
mkdir -p "$file_dir"
if [[ "$TOOL" == "wget" ]]; then
download_cmd="wget -c \"$url\" -O \"$file\""
[[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
else
download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
[[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
fi
if [[ ${#INCLUDE_PATTERNS[@]} -gt 0 ]]; then
file_matches_include_patterns "$file" || { printf "# %s\n" "$download_cmd"; continue; }
fi
if [[ ${#EXCLUDE_PATTERNS[@]} -gt 0 ]]; then
file_matches_exclude_patterns "$file" && { printf "# %s\n" "$download_cmd"; continue; }
fi
printf "%s\n" "$download_cmd"
urls+=("$url|$file")
done <<< "$files"
for url_file in "${urls[@]}"; do
IFS='|' read -r url file <<< "$url_file"
printf "${YELLOW}Start downloading ${file}.\n${NC}"
file_dir=$(dirname "$file")
if [[ "$TOOL" == "wget" ]]; then
[[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
else
[[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
fi
[[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"
@RewindL
Copy link

RewindL commented Sep 26, 2024

有没有办法直接跳过已经下载好的文件?下数据集的时候有接近400个文件,每次中断重新下的时候都会从第一个开始request,这可能会导致中间又因为网络timeout中断。文件太多了也不太可能一个一个exclude

@padeoe
Copy link
Author

padeoe commented Sep 26, 2024

@zhang-ziang @RewindL 收到需求,我来修改下

@zbximo
Copy link

zbximo commented Oct 1, 2024

hfd gpt2 --exclude *.safetensors --tool wget
Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/gpt2 gpt2
Cloning into 'gpt2'...

为什么一直卡在这里,下载啥东西都是这样

@hl0737
Copy link

hl0737 commented Oct 9, 2024

@zhang-ziang @RewindL 收到需求,我来修改下

大佬,exclude好像失效了,麻烦帮忙看下,感谢您!

@padeoe
Copy link
Author

padeoe commented Oct 10, 2024

大佬,exclude好像失效了,麻烦帮忙看下,感谢您!

@hl0737 我测试--exclude是有效的呀:

$ ./hfd.sh openai-community/gpt2 --exclude *.safetensors --exclude 6* --exclude f* --exclude onnx/*
Downloading to gpt2
gpt2 exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd gpt2
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-8bits.tflite" -d "." -o "64-8bits.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-fp16.tflite" -d "." -o "64-fp16.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64.tflite" -d "." -o "64.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/model.safetensors" -d "." -o "model.safetensors"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot" -d "." -o "rust_model.ot"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5" -d "." -o "tf_model.h5"
Start downloading pytorch_model.bin.
[#1cf1e1 522MiB/522MiB(99%) CN:1 DL:35MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1cf1e1|OK  |    33MiB/s|./pytorch_model.bin

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin successfully.
Start downloading rust_model.ot.
[#688a39 669MiB/669MiB(99%) CN:1 DL:5.4MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
688a39|OK  |    25MiB/s|./rust_model.ot

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot successfully.
Start downloading tf_model.h5.
[#c448c2 469MiB/474MiB(98%) CN:4 DL:10MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
c448c2|OK  |    11MiB/s|./tf_model.h5

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5 successfully.
Download completed successfully.

如上,我设置了多个--exclude内容,所以只从 pytorch_model.bin 开始下载了。

@hl0737
Copy link

hl0737 commented Oct 10, 2024

大佬,exclude好像失效了,麻烦帮忙看下,感谢您!

@hl0737 我测试--exclude是有效的呀:

$ ./hfd.sh openai-community/gpt2 --exclude *.safetensors --exclude 6* --exclude f* --exclude onnx/*
Downloading to gpt2
gpt2 exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd gpt2
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-8bits.tflite" -d "." -o "64-8bits.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-fp16.tflite" -d "." -o "64-fp16.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64.tflite" -d "." -o "64.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/model.safetensors" -d "." -o "model.safetensors"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot" -d "." -o "rust_model.ot"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5" -d "." -o "tf_model.h5"
Start downloading pytorch_model.bin.
[#1cf1e1 522MiB/522MiB(99%) CN:1 DL:35MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1cf1e1|OK  |    33MiB/s|./pytorch_model.bin

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin successfully.
Start downloading rust_model.ot.
[#688a39 669MiB/669MiB(99%) CN:1 DL:5.4MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
688a39|OK  |    25MiB/s|./rust_model.ot

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot successfully.
Start downloading tf_model.h5.
[#c448c2 469MiB/474MiB(98%) CN:4 DL:10MiB]
下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
c448c2|OK  |    11MiB/s|./tf_model.h5

状态标识:
(OK): 下载完成。
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5 successfully.
Download completed successfully.

如上,我设置了多个--exclude内容,所以只从 pytorch_model.bin 开始下载了。

@padeoe 嗯嗯,我昨天试了下ok的,有一次试不行,可能是我的通配符写错了,比如不想下一个文件夹dir,只写了dir/,而不是dir/*,maybe,没有试

感谢您的回复哈!

PS:datasets下载数据集的问题麻烦大佬加急看下!,现在datasets已经更新到3的大版本了,但是 > 2.14.6的版本还是有问题,谢谢您!❤️❤️

@Yancy456
Copy link

Yancy456 commented Nov 5, 2024

太天才了我操,改天帮你写一个python版本的,这个bash版本实在不好维护

@Yancy456
Copy link

Yancy456 commented Nov 5, 2024

Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack git clone https://zhengxiao:[email protected]/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/ fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory. Git clone failed. 显示目录已存在,请问如何解决?

换个位置

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment