@edwios
Created January 8, 2020 04:21
Script to preprocess text into dataset suitable for training with GPT2-Chinese and TF2-Chinese
#!/bin/bash
#
# Script to preprocess text into dataset suitable for training with GPT2-Chinese and TF2-Chinese
#
echo -n "Save result to [train.txt]: "
read -r OUTF
if [ "${OUTF}" == "" ]; then
OUTF="train.txt"
fi
if [ -f "${OUTF}" ]; then
echo -n "${OUTF} already exists, replace (y/N)? "
read -r ANS
if [ "${ANS}" == "y" ] || [ "${ANS}" == "Y" ]; then
rm "${OUTF}"
else
echo "Not replacing ${OUTF}, aborted."
exit 1
fi
fi
echo "Output will be saved to ${OUTF}"
JSONF=0
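# A ".json" suffix selects JSON output (a single array holding every document).
# The suffix is tested directly so paths with directories or spaces also work.
if [ "${OUTF%.json}" != "${OUTF}" ]; then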
echo "Using JSON format"
JSONF=1
fi
IFS=$'\n'
FL=`ls -1 *.txt`
TMPTXT="/tmp/tmptxt$$"
if [ ${JSONF} -eq 1 ]; then
echo "[" > "${OUTF}"
fi
for i in ${FL}; do
FN=$(basename "${i}" ".txt")
echo "Reading ${FN}"
if [ -f "${i}" ]; then
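# Drop control characters (newline is kept), then swap ASCII quotes, brackets and
# punctuation for full-width forms so that no raw " or \ can break the quoted output.
# The full-width characters are reconstructed: the published copy of this line had
# them flattened to ASCII. The two space rules are assumed to turn a full-width
# space into an ASCII space and to collapse repeated spaces. Needs a UTF-8 locale.
tr -d '\000-\011\013\014\015-\037' < "${i}" | sed -e "s/'/＂/g" -e 's/　/ /g' -e 's/["‘’“”]/＂/g' -e 'y/.,?!:;[](){}\\\//。，？！：；【】（）「」＼／/' -e 's/  */ /g' > "${TMPTXT}"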
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/\\n/g' "${TMPTXT}" | tr -d '\012' > "${TMPTXT}.0"
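if [ ${JSONF} -eq 1 ]; then
# JSON array element. SEP is unset (empty) before the first entry and "," afterwards,
# so the array never ends with a trailing comma, which would be invalid JSON.
echo -n "${SEP}\"" >> "${OUTF}"
else
echo -n '["' >> "${OUTF}"
fi
cat "${TMPTXT}.0" >> "${OUTF}"
if [ ${JSONF} -eq 1 ]; then
echo '"' >> "${OUTF}"
SEP=","
else
echo '"]' >> "${OUTF}"
fi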
fi
done
# Remove the temporary working files.
rm -f "${TMPTXT}" "${TMPTXT}.0"
unset IFS
if [ ${JSONF} -eq 1 ]; then
echo "]" >> "${OUTF}"
fi
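
The newline-folding step used inside the loop can be tried on its own. The following is a minimal sketch, not part of the original script: `fold_newlines` is a hypothetical helper name, and the sample text is made up for illustration. It assumes an input of at least two lines and a sed that accepts the `:a` / `N` / `$!ba` loop split across `-e` options, as GNU sed does.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): fold a multi-line
# stream into a single line, writing each line break as the two
# characters backslash and "n", then drop the final newline.
fold_newlines() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/\\n/g' | tr -d '\012'
}

# Made-up sample input with two lines.
sample=$(printf 'first line\nsecond line\n' | fold_newlines)
printf '%s\n' "${sample}"
# prints: first line\nsecond line
```

Folding each document onto one line this way is what lets the script place a whole file inside a single quoted string.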