@ser1zw
Last active December 11, 2023 02:05
#!/bin/bash
set -euo pipefail
WAIT_SECONDS=1
USER_AGENT='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0'
# Print a timestamped INFO message.
logger_info() {
  echo "$(LANG=C date --rfc-3339=seconds) [INFO] ${1}"
}
if [ $# -ne 2 ]; then
  echo "Usage: ${0} url_list_file output_dir" >&2
  exit 1
fi
url_list_file="$1"
output_dir="$2"
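# Read one URL per line; blank lines are skipped.
while IFS= read -r url || [ -n "${url}" ]; do
  [ -n "${url}" ] || continue
  logger_info "${url}"
  # Keep going when a single download fails instead of letting `set -e` abort the run.
  wget --quiet --user-agent="${USER_AGENT}" --level=1 --page-requisites --convert-links --directory-prefix="${output_dir}" "${url}" \
    || echo "wget failed for ${url}" >&2
  sleep "${WAIT_SECONDS}"
done < "${url_list_file}"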
#!/bin/bash
set -euo pipefail
if [ $# -ne 2 ]; then
  echo "Usage: ${0} input_dir output_dir" >&2
  exit 1
fi
input_dir="$1"
output_dir="$2"
# mkdir -p is a no-op when the directory already exists.
mkdir -p "${output_dir}"
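# NUL-delimited find output keeps file names containing spaces intact.
while IFS= read -r -d '' f; do
  filename=$(basename "${f}")
  # Extract the article div, render it as plain text, drop blank lines,
  # and replace horizontal rules (runs of "━") with a blank line.
  xmllint --html --xpath '//div[@class="article"]' "${f}" 2>/dev/null \
    | w3m -T text/html -dump -cols 4096 2>/dev/null \
    | grep -vE '^\s*$' \
    | sed -E 's/━+/\n/g' \
    > "${output_dir}/${filename}" \
    || echo "Warning: extraction failed for ${f}" >&2
done < <(find "${input_dir}" -maxdepth 1 -type f -print0)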