
@padeoe
Last active November 14, 2024 12:59
CLI tool to download Hugging Face models and datasets with aria2/wget + git

🤗Huggingface Model Downloader

Because the official huggingface-cli lacks multi-threaded download support and hf_transfer has inadequate error handling, this command-line tool uses wget or aria2 for LFS files and git clone for everything else.

Features

  • ⏯️ Resume from breakpoint: you can press Ctrl+C and re-run the command at any time to resume the download.
  • 🚀 Multi-threaded Download: uses multiple threads to speed up the download.
  • 🚫 File Exclusion: use --exclude or --include to skip or select files, saving time for models that ship weights in duplicate formats (e.g., *.bin and *.safetensors).
  • 🔐 Auth Support: for gated models that require a Hugging Face login, pass --hf_username and --hf_token to authenticate.
  • 🪞 Mirror Site Support: set the HF_ENDPOINT environment variable (see the example below).
  • 🌍 Proxy Support: set the HTTPS_PROXY environment variable.
  • 📦 Simple: depends only on git and aria2c/wget.
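
For example, to download through a mirror and an HTTP proxy, set the environment variables before running hfd (the mirror endpoint and proxy address below are illustrative; substitute your own):

export HF_ENDPOINT=https://hf-mirror.com
export HTTPS_PROXY=http://127.0.0.1:7890
hfd bigscience/bloom-560m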

Usage

First, download hfd.sh or clone this gist, then grant the script execute permission.
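
For example, one way to fetch the script is to clone the gist (an illustrative command; you can also simply save the raw hfd.sh file from this page):

git clone https://gist.github.com/697678ab8e528b85a2a7bddafea1fa4f.git hfd && cd hfd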

chmod a+x hfd.sh

You can create an alias for convenience:

alias hfd="$PWD/hfd.sh"

Usage Instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
  --exclude       (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
  include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login:

Get your Hugging Face token from https://huggingface.co/settings/tokens, then:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

Download a model and exclude certain files (e.g., .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors
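
You can also restrict the download to files matching a pattern with --include; for example, to fetch only the ONNX files (the onnx/* pattern here is illustrative):

hfd bigscience/bloom-560m --include onnx/*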

Download with aria2c and multiple threads:

hfd bigscience/bloom-560m -x 8

Output: During the download, the file URLs will be displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern1 include_pattern2 ...] [--exclude exclude_pattern1 exclude_pattern2 ...] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify string patterns to include files for downloading. Supports multiple patterns.
  --exclude       (Optional) Flag to specify string patterns to exclude files from downloading. Supports multiple patterns.
  include/exclude_pattern The patterns to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor *.txt', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
INCLUDE_PATTERNS=()
EXCLUDE_PATTERNS=()

while [[ $# -gt 0 ]]; do
    case $1 in
        --include)
            shift
            while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
                INCLUDE_PATTERNS+=("$1")
                shift
            done
            ;;
        --exclude)
            shift
            while [[ $# -gt 0 && ! $1 =~ ^-- ]]; do
                EXCLUDE_PATTERNS+=("$1")
                shift
            done
            ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
    GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
    ensure_ownership
    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

file_matches_include_patterns() {
    local file="$1"
    for pattern in "${INCLUDE_PATTERNS[@]}"; do
        if [[ "$file" == $pattern ]]; then
            return 0
        fi
    done
    return 1
}

file_matches_exclude_patterns() {
    local file="$1"
    for pattern in "${EXCLUDE_PATTERNS[@]}"; do
        if [[ "$file" == $pattern ]]; then
            return 0
        fi
    done
    return 1
}

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    if [[ ${#INCLUDE_PATTERNS[@]} -gt 0 ]]; then
        file_matches_include_patterns "$file" || { printf "# %s\n" "$download_cmd"; continue; }
    fi
    if [[ ${#EXCLUDE_PATTERNS[@]} -gt 0 ]]; then
        file_matches_exclude_patterns "$file" && { printf "# %s\n" "$download_cmd"; continue; }
    fi
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
@zodiacg commented Jul 4, 2024

Is it possible to specify multiple exclude or include patterns?

@Au3C2 commented Jul 4, 2024

Do I need to install aria2c beforehand, or am I using this incorrectly?

$ ./hfd.sh deepseek-ai/DeepSeek-V2-Chat --tool aria2c -x 4 
aria2c is not installed. Please install it first.                

apt install aria2

@yzf072 commented Jul 23, 2024

My download was interrupted halfway, and when I re-run it every file gets checked again. There are 1108 files, so the check is very slow. Is there any solution?

@xhx1022 commented Jul 24, 2024

Is there a way for hfd.sh to download into the ~/.cache directory?

@threegold116

How can it automatically retry when a download is interrupted?

@zhaoxin-web

What could be causing this?
Downloading to cinepile
cinepile exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd cinepile
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet" -d "data" -o "test-00000-of-00001.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00000-of-00003.parquet" -d "data" -o "train-00000-of-00003.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00001-of-00003.parquet" -d "data" -o "train-00001-of-00003.parquet"
aria2c --header="Authorization: Bearer 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" --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/train-00002-of-00003.parquet" -d "data" -o "train-00002-of-00003.parquet"
Start downloading data/test-00000-of-00001.parquet.
[#127d73 0B/0B CN:1 DL:0B]
08/01 11:36:59 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
Exception: [AbstractCommand.cc:403] errorCode=18 URI=https://cdn-lfs-us-1.hf-mirror.com/repos/27/86/27864067717b3f938d06d61f89fe8d38e30ab1e533a7f05f541f53d5abb17e44/eef38ba3fbf349b42bafe6dea4af6316bd6ff8e0a3e25701e0678bfdbf2ed274?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27test-00000-of-00001.parquet%3B+filename%3D%22test-00000-of-00001.parquet%22%3B&Expires=1722742619&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMjc0MjYxOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzI3Lzg2LzI3ODY0MDY3NzE3YjNmOTM4ZDA2ZDYxZjg5ZmU4ZDM4ZTMwYWIxZTUzM2E3ZjA1ZjU0MWY1M2Q1YWJiMTdlNDQvZWVmMzhiYTNmYmYzNDliNDJiYWZlNmRlYTRhZjYzMTZiZDZmZjhlMGEzZTI1NzAxZTA2NzhiZmRiZjJlZDI3ND9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=WvxqnAW2VUmMx3Smn4La9k41J-HrmTKMAt1Bd9zSQcSytx6W6-K3jtaXOSmvmNPxsfJyUUOCoWeDTn5TmXa7c2d2-eVXRIzdU3-J0CfNYUl3awBSljWpK2cBaIev7ZuSeOtPbWems1VZ4ZbbGsh0y5UTtdm4cNB9RTjPy7oLbUkAhV5g%7EbE%7EdQ1hSa2a7hvoSN1NVvUJ6GLPk11z11gx4t9w%7EsM7fsJZnyUyGkaZmhIkyGYLC4tdJ9SOmAMPf-ndOnP1woswKUDVpOPohpNd1Tue0%7Eext9nscpfQzxhxjGBNIDmc6AaBfxpErUeJKWNmH433Nc82Hclxt4whgbusQQ__&Key-Pair-Id=K24J24Z295AEI9
-> [RequestGroup.cc:760] errorCode=18 Download aborted.
-> [util.cc:1951] errNum=13 errorCode=18 Failed to make the directory data, cause: Permission denied

Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
127d73|ERR | 0B/s|data/test-00000-of-00001.parquet

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.

08/01 11:37:00 [ERROR] CUID#7 - Download aborted. URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
Exception: [AbstractCommand.cc:351] errorCode=24 URI=https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet
-> [HttpSkipResponseCommand.cc:215] errorCode=24 Authorization failed.

Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
f7b3d0|ERR | 0B/s|data/test-00000-of-00001.parquet

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/datasets/tomg-group-umd/cinepile/resolve/main/data/test-00000-of-00001.parquet.

@padeoe (Author) commented Aug 1, 2024

@zhaoxin-web It's a permissions issue; there doesn't seem to be write permission in that directory.

@zhaoxin-web

@zhaoxin-web It's a permissions issue; there doesn't seem to be write permission in that directory.

The previous problem is solved, but after running for a while it stopped. Even though I have switched to a new token, it still errors out. What could be the cause?

remote: Access to dataset mlfoundations/MINT-1T-PDF-CC-2023-40 is restricted. You must be authenticated to access it.
fatal: Authentication failed for 'https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40/'
Git pull failed.

@verigle commented Aug 6, 2024

How can it automatically retry when a download is interrupted?

Same question here.

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.
Failed to download https://hf-mirror.com/internlm/internlm2_5-7b-chat/resolve/main/model-00004-of-00008.safetensors.

@tianbuwei

Is it possible to specify multiple exclude or include patterns?

I have the same problem. Did you solve it?

@padeoe (Author) commented Aug 7, 2024

@tianbuwei @zodiacg 🎉 Excluding or including multiple files is now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

@Peng154 commented Aug 7, 2024

My download was interrupted halfway, and when I re-run it every file gets checked again. There are 1108 files, so the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder already contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

@Peng154 commented Aug 7, 2024

My download was interrupted halfway, and when I re-run it every file gets checked again. There are 1108 files, so the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder already contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

from pathlib import Path
from glob import glob
import re

data_dir_path = Path("../data/lotsa_data2")
exclude_dirs = []
include_dirs = []
for sud_dir in glob(str(data_dir_path / "*")):
    if Path(sud_dir).is_dir():
        print(sud_dir)
        # check the number of arrow files
        is_all_arrow_files_exist = False
        is_all_json_files_exist = False
        arrow_file_count = len(glob(str(Path(sud_dir) / "*.arrow")))
        if arrow_file_count !=0:
            print(glob(str(Path(sud_dir) / "*.arrow"))[0])
            total_arrow_file_count = int(re.match(r".+-([0-9]+).arrow",
                                                  glob(str(Path(sud_dir) / "*.arrow"))[0]
                                                  ).group(1))
        else:
            total_arrow_file_count = -1
        if arrow_file_count == total_arrow_file_count:
            print(f"all arrow files exist")
            is_all_arrow_files_exist = True
        
        # check the number of json files
        json_file_count = len(glob(str(Path(sud_dir) / "*.json")))
        if json_file_count == 2:
            print(f"all json files exist")
            is_all_json_files_exist = True
        
        if is_all_arrow_files_exist and is_all_json_files_exist:
            exclude_dirs.append(str(Path(sud_dir).name) + "/*")
            print(f"exclude {sud_dir}")
        else:
            include_dirs.append(str(Path(sud_dir).name) + "/*")
            print(f"include {sud_dir}")

print(" ".join(exclude_dirs))

You can refer to my approach.

@zhaoxin-web

This error appeared halfway through the download, and re-running still fails. Is there a fix?
remote: Access to dataset mlfoundations/MINT-1T-PDF-CC-2023-06 is restricted. You must be authenticated to access it.
fatal: Authentication failed for 'https://hf-mirror.com/datasets/mlfoundations/MINT-1T-PDF-CC-2023-06/'
Git pull failed.

@Peng154 commented Aug 8, 2024

My download was interrupted halfway, and when I re-run it every file gets checked again. There are 1108 files, so the check is very slow. Is there any solution?

I use Python to check whether each downloaded folder already contains all of its data. If a folder is complete, I add it to an exclude list; finally I print the whole list and pass it after hfd's --exclude parameter.

(quoting the Python snippet from my previous comment)

I later found this doesn't quite work... In the end I went back to huggingface_hub's snapshot_download function. The code is as follows:

from huggingface_hub import hf_hub_download, snapshot_download
from huggingface_hub import constants
constants._HF_DEFAULT_ENDPOINT = "https://hf-mirror.com"  # use mirror for faster download
try:
    import hf_transfer
    constants.HF_HUB_ENABLE_HF_TRANSFER = True  # enable hf_transfer for faster downloads
except ImportError:
    constants.HF_HUB_ENABLE_HF_TRANSFER = False
    
# download a single file
# hf_hub_download(repo_id=f"Salesforce/moirai-1.0-R-{SIZE}", local_dir=f"../pretrained_models/moirai-1.0-R-{SIZE}")
# snapshot_download(repo_id=f"Salesforce/moirai-1.0-R-{SIZE}",
#                   local_dir=f"../pretrained_models/moirai-1.0-R-{SIZE}")

# download the entire LOTSA dataset
while True:
    try:
        snapshot_download(repo_id="Salesforce/lotsa_data",
                        local_dir="../data/lotsa_data2",
                        repo_type="dataset",
                        max_workers=4)
        break
    except Exception as e:
        print(e)
        print("retrying...")
        continue

@DavinciEvans

I'd like to ask: if I want to download the model files into the Hugging Face cache, i.e. under ./cache/huggingface, how should I do that? From the code it looks like the weights are downloaded directly into the current directory.

@i-square

@tianbuwei @zodiacg 🎉 Excluding or including multiple files is now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

Hi, the wildcard exclusion syntax from the script examples, --exclude *.safetensors, never seems to work for me; it fails to match any files, and only full filenames work.

@T-Atlas commented Aug 16, 2024

@tianbuwei @zodiacg 🎉 Excluding or including multiple files is now supported. Usage example:

hfd facebook/opt-125m --tool wget --local-dir facebook/opt-125m --exclude flax_model.msgpack tf_model.h5

Hi, the wildcard exclusion syntax from the script examples, --exclude *.safetensors, never seems to work for me; it fails to match any files, and only full filenames work.

Same problem here.

@achristianson

Is there a way to specify the tag/branch/revision? Many repos store things like different quant levels as different branches in the repo. An example with huggingface-cli would be:

huggingface-cli download ${MODEL_ID} --revision ${MODEL_REVISION}

@haukzero

I used to be able to download normally with hfd. Why am I getting this error now? I tried re-downloading hfd, but that doesn't seem to help.

Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000
Executing debug command: curl -v https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
Output:
* Host hf-mirror.com:443 was resolved.
* IPv6: (none)
* IPv4: 153.121.57.40, 160.16.199.204, 133.242.169.68
*   Trying 153.121.57.40:443...
* Connected to hf-mirror.com (153.121.57.40) port 443
* schannel: disabled automatic use of client certificate
* using HTTP/1.x
> GET /gpt2/info/refs?service=git-upload-pack HTTP/1.1
> Host: hf-mirror.com
> User-Agent: curl/8.8.0
> Accept: */*
>
* Request completely sent off
* schannel: remote party requests renegotiation
* schannel: renegotiating SSL/TLS connection
* schannel: SSL/TLS connection renegotiated
< HTTP/1.1 200 OK
< Access-Control-Allow-Origin: https://hf-mirror.com
< Access-Control-Expose-Headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range
< Alt-Svc: h3=":443"; ma=2592000
< Content-Type: application/x-git-upload-pack-advertisement
< Cross-Origin-Opener-Policy: same-origin
< Date: Thu, 12 Sep 2024 05:17:56 GMT
< Referrer-Policy: strict-origin-when-cross-origin
< Server: hf-mirror
< Vary: Origin
< Via: 1.1 746d9b263e5f72ff5dc6d5120e20f00e.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: 5P8WslKSnkIpXLeQt4eXOpfG2c4uSa2-VsKpVcCoAv_otrQ4PtHvHg==
< X-Amz-Cf-Pop: NRT51-P2
< X-Cache: Miss from cloudfront
< X-Powered-By: huggingface-moon
< X-Request-Id: Root=1-66e27984-366b3db27298ceb702252a26
< Transfer-Encoding: chunked
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* client returned ERROR on write of 3561 bytes
* Failed reading the chunked-encoded stream
* Closing connection
* schannel: shutting down SSL/TLS connection with hf-mirror.com port 443

Git clone failed.

@zhang-ziang

Would it be possible to temporarily skip files that fail to download when downloading a dataset? I'm downloading a dataset with many files, and a single failed file seems to abort the entire process.

@RewindL commented Sep 26, 2024

Is there a way to skip files that have already been downloaded? The dataset has nearly 400 files, and every time the download is interrupted and restarted it sends requests starting from the first file again, which can then be interrupted by another network timeout partway through. With so many files it isn't practical to exclude them one by one.

@padeoe (Author) commented Sep 26, 2024

@zhang-ziang @RewindL Request noted; I'll make the change.

@zbximo commented Oct 1, 2024

hfd gpt2 --exclude *.safetensors --tool wget
Downloading to gpt2
Testing GIT_REFS_URL: https://hf-mirror.com/gpt2/info/refs?service=git-upload-pack
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/gpt2 gpt2
Cloning into 'gpt2'...

Why does it always get stuck here? It happens no matter what I download.

@hl0737 commented Oct 9, 2024

@zhang-ziang @RewindL Request noted; I'll make the change.

The --exclude option seems to have stopped working; could you please take a look? Thanks!

@padeoe (Author) commented Oct 10, 2024

The --exclude option seems to have stopped working; could you please take a look? Thanks!

@hl0737 In my testing, --exclude does work:

$ ./hfd.sh openai-community/gpt2 --exclude *.safetensors --exclude 6* --exclude f* --exclude onnx/*
Downloading to gpt2
gpt2 exists, Skip Clone.
Already up to date.

Start Downloading lfs files, bash script:
cd gpt2
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-8bits.tflite" -d "." -o "64-8bits.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64-fp16.tflite" -d "." -o "64-fp16.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/64.tflite" -d "." -o "64.tflite"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/model.safetensors" -d "." -o "model.safetensors"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
# aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot" -d "." -o "rust_model.ot"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5" -d "." -o "tf_model.h5"
Start downloading pytorch_model.bin.
[#1cf1e1 522MiB/522MiB(99%) CN:1 DL:35MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1cf1e1|OK  |    33MiB/s|./pytorch_model.bin

Status Legend:
(OK): download completed.
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/pytorch_model.bin successfully.
Start downloading rust_model.ot.
[#688a39 669MiB/669MiB(99%) CN:1 DL:5.4MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
688a39|OK  |    25MiB/s|./rust_model.ot

Status Legend:
(OK): download completed.
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/rust_model.ot successfully.
Start downloading tf_model.h5.
[#c448c2 469MiB/474MiB(98%) CN:4 DL:10MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
c448c2|OK  |    11MiB/s|./tf_model.h5

Status Legend:
(OK): download completed.
Downloaded https://hf-mirror.com/openai-community/gpt2/resolve/main/tf_model.h5 successfully.
Download completed successfully.

As shown above, I set several --exclude patterns, so the download only started from pytorch_model.bin.

@hl0737 commented Oct 10, 2024

(quoting @padeoe's reply above)

@padeoe Yes, I tried it yesterday and it worked. The one time it didn't was probably because I wrote the wildcard wrong, e.g. to skip a folder dir I wrote dir/ instead of dir/*. Maybe; I haven't tested that.

Thanks for your reply!

PS: Please take an urgent look at the issue with downloading datasets! The datasets library has now been updated to major version 3, but versions > 2.14.6 still have problems. Thank you! ❤️❤️

@Yancy456 commented Nov 5, 2024

Damn, this is genius. One of these days I'll write you a Python version; this bash version is really hard to maintain.

@Yancy456 commented Nov 5, 2024

Downloading to /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
Testing GIT_REFS_URL: https://hf-mirror.com/meta-llama/Llama-2-7b-chat-hf/info/refs?service=git-upload-pack
git clone https://zhengxiao:[email protected]/meta-llama/Llama-2-7b-chat-hf /home/zhengxiao/dataroot/models/Llama2_7b_chat_hf/
fatal: destination path '/home/zhengxiao/dataroot/models/Llama2_7b_chat_hf' already exists and is not an empty directory.
Git clone failed.
It says the destination directory already exists; how can I resolve this?

Use a different location.
