Last active
April 14, 2024 08:33
I frequently need to migrate my family tree from its main location in MyHeritage to the Gramps application. The MyHeritage GEDCOM export has a number of issues, such as inconsistent line endings and even (Swedish) multibyte UTF-8 characters split across lines (within GEDCOM notes). Running the export through this filter loses less data. Finalize w…
#!/usr/bin/env bash
# See https://www.gramps-project.org/wiki/index.php/Gramps_and_GEDCOM
# Filter: reads the MyHeritage GEDCOM export on stdin and writes the cleaned file to stdout
set -Eeuo pipefail

# Fail early if a required tool is missing
for COMMAND in dos2unix sed tr grep
do
  if ! command -v "$COMMAND" &> /dev/null
  then
    echo "$COMMAND command could not be found" >&2
    exit 1
  fi
done

# Use the C/POSIX locale so the commands operate on raw bytes, including otherwise invalid multibyte sequences - https://www.gnu.org/software/sed/manual/html_node/Locale-Considerations.html#Invalid-multibyte-characters
export LC_ALL=C

# Normalize line endings first; dos2unix is used here (iconv or tr might also work)
dos2unix --to-stdout |
# Strip a leading UTF-8 BOM (EF BB BF) from the first line
# Prefix every line that does not start with a digit (the GEDCOM level number) with '4 CONC ', as these seem to be line breaks the MyHeritage export did not recognize
# ... and treat empty lines the same way, as they are likely a special case of the above
# Tab-separate the (Swedish) source metadata field names
sed -e '1s/^\xEF\xBB\xBF//' \
    -e 's/^\([^[:digit:]]\)/4 CONC \1/' \
    -e 's/^$/4 CONC /' \
    -e 's/\(\w\)\?\(Land\|Bok\|Provins\|Plats\|Län\|Sida\|Församling\|Rad\)\(\w\)/\1\t\2\t\3/g' |
# Use tr to temporarily turn the data into one continuous line, because the next sed must match across line breaks
# Some Swedish multibyte UTF-8 characters (in notes?) have been split across lines: the lead byte 0xC3 ends one line
# and the continuation byte follows the next line's tag. Move the continuation byte up so the character
# is completed on the first line - https://www.utf8-chartable.de
tr '\n' '\r' |
sed -e 's/\xC3\r\([[:digit:]] \w\+ \)\([\x80-\xbf]\)/\xC3\2\r\1/g' |
tr '\r' '\n' |
# Drop fields not supported by Gramps, mostly to reduce error noise
grep -vE '^(1 _UPD|1 DATE|1 _PROJECT_GUID|1 _EXPORTED_FROM_SITE_ID|2 _RTLSAVE|2 _FORMERNAME)'
# Let's see if NoteCleanupTool can handle the remaining issues: https://gramps-project.org/wiki/index.php/Addon:NoteCleanupTool
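
The split-character repair is the least obvious step of the filter, so here is a minimal sketch that isolates it on a two-line sample. The function name `rejoin_split_utf8` and the sample note text are hypothetical and not part of the script; the `tr`/`sed`/`tr` pipeline is copied from it and assumes GNU sed (for the `\x`, `\w` and `\+` extensions).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Operate on raw bytes, as the filter does.
export LC_ALL=C

# Hypothetical wrapper around the tr/sed/tr step of the filter.
rejoin_split_utf8() {
  # Join all lines with CR so one sed expression can match across line breaks.
  tr '\n' '\r' |
    # A lead byte 0xC3 at the end of a line, followed on the next line by
    # "<level> <tag> " and a continuation byte (0x80-0xBF): move the
    # continuation byte up behind the lead byte.
    sed -e 's/\xC3\r\([[:digit:]] \w\+ \)\([\x80-\xbf]\)/\xC3\2\r\1/g' |
    # Restore the original line breaks.
    tr '\r' '\n'
}

# Made-up sample: "ä" is 0xC3 0xA4 in UTF-8. Here 0xC3 ends the first line and
# 0xA4 follows the "2 CONC " tag on the second, so the 0xA4 byte should move
# up to the first line.
printf '1 NOTE Ing\xc3\n2 CONC \xa4r\n' | rejoin_split_utf8
```

The full filter is used the same way, as a stdin-to-stdout pipe, for example `./gedcom-filter.sh < export.ged > cleaned.ged`, where all three file names are placeholders.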