@UncleCJ
Last active April 14, 2024 08:33
I frequently need to migrate my family tree from its main location in MyHeritage to the Gramps application. MyHeritage's GEDCOM export has a number of issues, such as inconsistent line endings and even (Swedish) multibyte UTF-8 characters broken across lines (within GEDCOM notes). Running the export through this filter loses less data. Finalize w…
#!/usr/bin/env bash
# See https://www.gramps-project.org/wiki/index.php/Gramps_and_GEDCOM
set -Eeuo pipefail
# Fail early if any required tool is missing
for COMMAND in dos2unix sed tr grep
do
    if ! command -v "$COMMAND" > /dev/null 2>&1
    then
        echo "$COMMAND command could not be found" >&2
        exit 1
    fi
done
# Use C/posix locale for commands to operate on otherwise unprintable characters - https://www.gnu.org/software/sed/manual/html_node/Locale-Considerations.html#Invalid-multibyte-characters
export LC_ALL=C
# Normalize the inconsistent line endings to LF before anything else (iconv or tr might also work)
dos2unix --to-stdout |
# The sed expressions below, in order:
#   1. Strip a (UTF-8?!) BOM from the first line, if present
#   2. Prefix every line not starting with a digit (the GEDCOM level) with '4 CONC ',
#      as the MyHeritage export does not seem to recognize these as line breaks
#   3. Do the same for empty lines, as they are likely a special case of the above
#   4. Tab-separate the source metadata fields
sed -e '1s/^\xEF\xBB\xBF//' \
    -e 's/^\([^[:digit:]]\)/4 CONC \1/' \
    -e 's/^$/4 CONC /' \
    -e 's/\(\w\)\?\(Land\|Bok\|Provins\|Plats\|Län\|Sida\|Församling\|Rad\)\(\w\)/\1\t\2\t\3/g' |
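# Use tr to temporarily turn the data into one continuous line, because the next fix operates across line breaks
# (CR is free to use as the placeholder since dos2unix has already removed every CR)
# Some Swedish multibyte UTF-8 characters (in notes?) have been broken across lines: when a line ends in the
# lead byte 0xC3 (as in å, ä, ö), move the continuation byte from the next line up so the character finishes
# on the first line... https://www.utf8-chartable.de
tr '\n' '\r' |
sed -e 's/\xC3\r\([[:digit:]] \w\+ \)\([\x80-\xbf]\)/\xC3\2\r\1/g' |
tr '\r' '\n' |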
# Throw away fields not supported by Gramps, mostly to decrease error noise
grep -vE '^(1 _UPD|1 DATE|1 _PROJECT_GUID|1 _EXPORTED_FROM_SITE_ID|2 _RTLSAVE|2 _FORMERNAME)'
# Let's see if NoteCleanupTool can handle the remaining issues: https://gramps-project.org/wiki/index.php/Addon:NoteCleanupTool
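The cross-line repair in the script is the least obvious step, so here is a minimal sketch of it in isolation. The function name `rejoin_split_utf8` and the sample record are made up for illustration and are not part of the gist. The sketch assumes GNU sed (for `\xHH`, `\w` and `\+`) and input that already uses LF line endings.

```shell
#!/usr/bin/env bash
# Sketch only: rejoin_split_utf8 is a hypothetical helper name, not part of the gist.
# Assumes GNU sed and input that already uses LF line endings.
rejoin_split_utf8() {
    # Swap LF for CR so sed sees the whole stream as one line
    tr '\n' '\r' |
    # If a line ends in the lead byte 0xC3 and the next line's text starts with a
    # continuation byte (0x80-0xBF), move that byte up to complete the character
    LC_ALL=C sed -e 's/\xC3\r\([[:digit:]] \w\+ \)\([\x80-\xbf]\)/\xC3\2\r\1/g' |
    # Restore the LF line endings
    tr '\r' '\n'
}

# Made-up sample: "Småland" with the two bytes of "å" (0xC3 0xA5) split across lines
printf '2 NOTE Sm\303\n4 CONC \245land\n' | rejoin_split_utf8
```

With the sample above, the first line should come out as `2 NOTE Små` and the second as `4 CONC land`. Like the script, the sketch only handles characters whose lead byte is 0xC3; a split in any other multibyte character passes through unchanged.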