@kuzeko
Last active August 10, 2024 17:46
Import DBpedia 2020 into Neo4j v4 with Neosemantics

  1. Prerequisite: OpenJDK 11. If you run Ubuntu and have root access, you can use

    apt-get install default-jdk
    

    Otherwise, consider using Docker: https://hub.docker.com/_/openjdk

    As a third option (not recommended), you can install Java in user space, which requires adjusting your terminal configuration. A starting point is the section "Installing OpenJDK Manually" in https://dzone.com/articles/installing-openjdk-11-on-ubuntu-1804-for-real

  2. Get the Neo4j v4.1.X Community server, install and configure the Neosemantics plugin, and add the required index (a uniqueness constraint on Resource.uri):

    ./get-neo4j.sh
    
  3. Download the DBpedia files, decompress and clean them, and split them into parts ready to be imported:

    ./download-dbpedia.sh dbpedia_files.txt
    
  4. Load the data files.

    Notice 1: DBpedia contains malformed IRIs. I've done my best to exclude them, but some can still pass through. A better solution is needed.

    Notice 2: DBpedia has multi-valued properties with inconsistent types. At the moment, handleMultival: "OVERWRITE" could be an option.

    ./import-dbpedia.sh
    
  5. Check that the data was imported correctly:

    • Count nodes

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r:Resource) RETURN COUNT(r)"
      
    • Count edges

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN COUNT(l)"
      
    • Distinct relationship types

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "CALL db.relationshipTypes() YIELD relationshipType RETURN relationshipType"
      
    • Example node-edges

      ${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN r1, l, r2 LIMIT 20"
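
Regarding Notice 2 in step 4: before settling on handleMultival: "OVERWRITE", it can help to estimate how many properties actually carry more than one literal value in a given file. The following is a minimal sketch, not part of the original scripts. It assumes the cleaned files hold one triple per line, with subject, predicate, and object separated by spaces. The helper name count_multivalued and the sample triples are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): count the (subject, predicate)
# pairs that occur with more than one literal object in a one-triple-per-line file.
count_multivalued() {
  awk '
    # A literal object starts with a double quote; IRI objects start with "<".
    $3 ~ /^"/ { seen[$1 " " $2]++ }
    END {
      n = 0
      for (k in seen) if (seen[k] > 1) n++
      print n
    }
  ' "$1"
}

# Made-up sample data: resource A has two values for the same property.
sample=$(mktemp)
printf '%s\n' \
  '<http://dbpedia.org/resource/A> <http://dbpedia.org/property/name> "A1"@en .' \
  '<http://dbpedia.org/resource/A> <http://dbpedia.org/property/name> "A2"@en .' \
  '<http://dbpedia.org/resource/B> <http://dbpedia.org/property/name> "B"@en .' \
  '<http://dbpedia.org/resource/A> <http://dbpedia.org/ontology/spouse> <http://dbpedia.org/resource/B> .' \
  > "$sample"

count_multivalued "$sample"
```

On the sample this prints 1, because only the name property of resource A has two literal values. Note that exact duplicate triples would also be counted, so treat the result as an upper bound.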
      
#https://downloads.dbpedia.org/repo/dbpedia/generic/redirects/2020.10.01/redirects_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/redirects/2020.10.01/redirects_lang=en_transitive.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/revisions/2020.10.01/revisions_lang=en_ids.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/revisions/2020.10.01/revisions_lang=en_uris.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_articles.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_labels.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_skos.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/commons-sameas-links/2020.10.01/commons-sameas-links_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/geo-coordinates/2020.10.01/geo-coordinates_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/homepages/2020.10.01/homepages_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-property-definitions/2020.10.01/infobox-property-definitions_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/labels/2020.10.01/labels_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/wikipedia-links/2020.10.01/wikipedia-links_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/geo-coordinates-mappingbased/2020.10.01/geo-coordinates-mappingbased_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/instance-types/2020.10.01/instance-types_lang=en_specific.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/instance-types/2020.10.01/instance-types_lang=en_transitive.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-literals/2020.10.01/mappingbased-literals_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en_disjointDomain.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en_disjointRange.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/specific-mappingbased-properties/2020.10.01/specific-mappingbased-properties_lang=en.ttl.bz2
#!/bin/bash
# download-dbpedia.sh: download, clean, and split the DBpedia files listed in $1
set -e
export DATA_DIR="${PWD}/data"
export NEO4J_HOME="${PWD}/neo4j-server"
export NEO4J_IMPORT="${NEO4J_HOME}/import"
mkdir -p -v "${DATA_DIR}"
mkdir -p -v "${NEO4J_IMPORT}"
if [ "$#" -ne 1 ]; then
    echo "Illegal number of parameters."
    exit 1
fi
if [ -d "${DATA_DIR}" ]
then
    echo "Downloading files..."
    rm -v "${DATA_DIR}"/*.* || true
    while read -r line; do
        # Skip commented-out and empty lines in the URL list
        [[ "$line" =~ ^#.*$ || -z "$line" ]] && continue
        wget -P "${DATA_DIR}/" "$line"
        # Decompress, keeping the original .bz2 archive
        bzip2 -dk "${DATA_DIR}/${line##*/}"
        filename=$(basename -- "${DATA_DIR}/${line##*/}")
        filename="${filename%.*}"
        # Remove corrupted chars and lines
        iconv -f utf-8 -t ascii -c "${DATA_DIR}/${filename}" | grep -E '^<(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[A-Za-z0-9\+&@#/%?=~_|]>\W<' | grep -v 'xn--b1aew' > "${DATA_DIR}/clean-${filename}"
        rm -v "${DATA_DIR}/${filename}"
        # Split into parts of 5,000,000 lines each, written to the Neo4j import folder
        split -l 5000000 --numeric-suffixes "${DATA_DIR}/clean-${filename}" "${NEO4J_IMPORT}/part-${filename}"
    done < "$1"
    chmod -R 777 "${NEO4J_IMPORT}"
else
    echo "No destination folder ${DATA_DIR}"
fi
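
To see what the "Remove corrupted chars and lines" step in the download script keeps and drops, the two grep stages can be tried on a few sample lines. This is only a sketch: the function name clean_triples and the sample triples are made up for illustration, and the iconv stage is left out. The patterns are copied from the script.

```shell
#!/bin/bash
# Hypothetical wrapper around the two grep stages of the download script.
# Reads triples on stdin and writes the surviving lines to stdout.
clean_triples() {
  grep -E '^<(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[A-Za-z0-9\+&@#/%?=~_|]>\W<' \
    | grep -v 'xn--b1aew'
}

# Made-up sample lines: only the first one passes both stages.
sample_lines() {
  printf '%s\n' \
    '<http://dbpedia.org/resource/Rome> <http://www.w3.org/2000/01/rdf-schema#label> "Rome"@en .' \
    '<http://dbpedia.org/resource/Mercury_(planet)> <http://www.w3.org/2000/01/rdf-schema#label> "Mercury"@en .' \
    '<http://dbpedia.org/resource/Bad IRI> <http://www.w3.org/2000/01/rdf-schema#label> "Bad"@en .' \
    '<http://xn--b1aew.example/page> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/> .'
}

kept=$(sample_lines | clean_triples)
printf '%s\n' "$kept"
```

Only the Rome line survives. The line with a space in the subject IRI is rejected by the first pattern, and the last line is removed by the second stage. The Mercury_(planet) line is also dropped, because parentheses are not in the allowed character class. This shows that the filter removes some well-formed subjects too, which is in line with the remark that a better solution is needed.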
#!/bin/bash
# get-neo4j.sh: download Neo4j Community, install Neosemantics, and configure both
ulimit -n 65535
NEO4J_VERSION=4.1.4
rm -rf neo4j-server
# Quote the URL so the shell does not treat "?" as a glob character
wget "https://neo4j.com/artifact.php?name=neo4j-community-${NEO4J_VERSION}-unix.tar.gz" -O neo4j.tar.gz
tar xf neo4j.tar.gz
mv "neo4j-community-${NEO4J_VERSION}" neo4j-server
rm neo4j.tar.gz
export NEO4J_HOME="${PWD}/neo4j-server"
export NEO4J_DATA_DIR="${NEO4J_HOME}/data"
rm -rf "${NEO4J_DATA_DIR}"
# APOC_VERSION=4.1.0.2
# APOC_FILE=apoc-${APOC_VERSION}-core.jar
# There is a difference between `core` and `all`.
# In theory we don't need this download, since
# apoc-4.1.0.2-core.jar contains a subset of the functionality and will be bundled from Neo4j 4.1.1
#if [ ! -f ${NEO4J_HOME}/plugins/${APOC_FILE} ]
#then
#    echo "Downloading Neo4j APOC plugin..."
#    wget -P ${NEO4J_HOME}/plugins/ https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/${APOC_FILE}
#fi
# The APOC download above is disabled; the lines below only add APOC settings to neo4j.conf.
# Do we need the following?
echo "Installing Neo4j APOC plugin..."
echo 'dbms.security.procedures.unrestricted=apoc.*' >> "${NEO4J_HOME}/conf/neo4j.conf"
echo 'apoc.export.file.enabled=true' >> "${NEO4J_HOME}/conf/neo4j.conf"
echo 'apoc.import.file.use_neo4j_config=false' >> "${NEO4J_HOME}/conf/neo4j.conf"
NEOSEM_VERSION=4.1.0.1
NEOSEM_FILE=neosemantics-${NEOSEM_VERSION}.jar
if [ ! -f "${NEO4J_HOME}/plugins/${NEOSEM_FILE}" ]
then
    echo "Downloading Neo4j RDF plugin..."
    wget -P "${NEO4J_HOME}/plugins/" "https://github.com/neo4j-labs/neosemantics/releases/download/${NEOSEM_VERSION}/${NEOSEM_FILE}"
fi
echo "Installing Neo4j RDF plugin..."
echo 'dbms.unmanaged_extension_classes=n10s.endpoint=/rdf' >> "${NEO4J_HOME}/conf/neo4j.conf"
"${NEO4J_HOME}/bin/neo4j" start
sleep 10
"${NEO4J_HOME}/bin/neo4j-admin" set-initial-password admin
"${NEO4J_HOME}/bin/neo4j" restart
sleep 10
echo "Creating index"
# Uniqueness constraint on Resource.uri, then the Neosemantics graph configuration
"${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' "CREATE CONSTRAINT n10s_unique_uri ON (r:Resource) ASSERT r.uri IS UNIQUE;"
"${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' 'call n10s.graphconfig.init( { handleMultival: "OVERWRITE", handleVocabUris: "SHORTEN", keepLangTag: false, handleRDFTypes: "NODES" })'
echo "Neo4j log:"
tail -n 12 "${NEO4J_HOME}/logs/neo4j.log"
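
The fixed `sleep 10` calls in the setup script assume the server is ready within ten seconds, which may not hold on slower machines. One possible alternative, sketched here under that assumption, is to poll until a trivial query succeeds. The function wait_until is a hypothetical helper, not part of the original gist. The cypher-shell invocation appears only in a comment, since it needs a running server.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to <tries> times, sleeping <delay>
# seconds between attempts, and succeed as soon as the command succeeds.
wait_until() {
  tries="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Possible use in place of "sleep 10" (assumes the credentials used in the script):
#   wait_until 30 2 "${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' "RETURN 1;"

# Self-contained demonstration with commands that always succeed or always fail.
wait_until 3 0 true && echo "ready"
wait_until 2 0 false || echo "gave up"
```

The demonstration prints "ready" and then "gave up". With a real server, the retry count and delay would need tuning to the machine.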
#!/bin/bash
# import-dbpedia.sh: import every split part from the Neo4j import folder
# Note: here NEO4J_HOME is the working directory that contains neo4j-server/
export NEO4J_HOME="${PWD}"
export NEO4J_IMPORT="${NEO4J_HOME}/neo4j-server/import"
# NEO4J_DB_DIR is not used below
export NEO4J_DB_DIR="${NEO4J_HOME}/neo4j-server/data/databases/graph.db"
ulimit -n 65535
echo "Importing"
for file in "${NEO4J_IMPORT}"/*.ttl*; do
    # Extracting filename
    echo "$file"
    filename="$(basename "${file}")"
    echo "Importing $filename from ${NEO4J_HOME}"
    "${NEO4J_HOME}/neo4j-server/bin/cypher-shell" -u neo4j -p 'admin' "CALL n10s.rdf.import.fetch(\"file://${NEO4J_IMPORT}/$filename\",\"Turtle\");"
done
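
Since the import runs for hours, it can be useful to preview the statements the import loop would send before starting it. The sketch below builds the same CALL string for each part file without contacting Neo4j. The function import_statement and the temporary demo directory are hypothetical stand-ins for the real import folder.

```shell
#!/bin/bash
# Hypothetical helper: print the Cypher statement the import loop would
# send for one part file, without running it.
import_statement() {
  printf 'CALL n10s.rdf.import.fetch("file://%s/%s","Turtle");\n' "$1" "$2"
}

# Demo directory standing in for neo4j-server/import, holding two empty
# files named the way split names its output parts.
demo_dir=$(mktemp -d)
touch "${demo_dir}/part-labels_lang=en.ttl00" "${demo_dir}/part-labels_lang=en.ttl01"

for file in "${demo_dir}"/*.ttl*; do
  import_statement "${demo_dir}" "$(basename "${file}")"
done
```

This prints one statement per part file. Pointing demo_dir at the real import folder would list every statement of the actual run.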
@mh-augment4
Thanks for creating this. It worked perfectly. Download and import took about 3.5 hours. The numbers for the above files: 29,831,524 Resources and 115,653,828 relations.