-
Prerequisite: OpenJDK 11. If you run Ubuntu with root access, you can use
apt-get install default-jdk
Otherwise, consider using Docker: https://hub.docker.com/_/openjdk
A third option, not recommended, is to install Java in user space; you will then have to adjust your terminal configuration yourself. A starting point is the section "Installing OpenJDK Manually" in: https://dzone.com/articles/installing-openjdk-11-on-ubuntu-1804-for-real
-
Get the Neo4j v4.1.X Community server and install the Neosemantics plugin. The script also configures Neosemantics and adds the required index.
./get-neo4j.sh
-
Download the DBpedia files and uncompress them so they are ready to be imported
./download-dbpedia.sh dbpedia_files.txt
-
Load the data files
Notice 1: DBpedia contains malformed IRIs. I've done my best to exclude those, but some can still pass through. A better solution is needed.
Notice 2: DBpedia has multi-valued properties with inconsistent types. At the moment
handleMultival: "OVERWRITE"
could be an option.
./import-dbpedia.sh
-
Check that the data is all right:
-
Count nodes
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r:Resource) RETURN COUNT(r)"
-
Count edges
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN COUNT(l)"
-
Distinct relationship types
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "CALL db.relationshipTypes() YIELD relationshipType RETURN relationshipType"
-
Example node-edges
${NEO4J_HOME}/bin/cypher-shell -u neo4j -p 'admin' "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN r1, l, r2 LIMIT 20"
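The checks above all repeat the same cypher-shell invocation. A minimal sketch of a wrapper, assuming NEO4J_HOME points at the server directory and the password is still 'admin'; run_query, run_checks and the CYPHER_SHELL override are hypothetical names introduced here, not part of the gist:

```shell
# Hypothetical helper: run one Cypher query with the credentials used above.
# CYPHER_SHELL is an assumed override, handy for a dry run (CYPHER_SHELL=echo).
run_query() {
    local shell_bin="${CYPHER_SHELL:-${NEO4J_HOME}/bin/cypher-shell}"
    "${shell_bin}" -u neo4j -p 'admin' "$1"
}

# Run the three summary checks from the list above in one go.
run_checks() {
    run_query "MATCH (r:Resource) RETURN COUNT(r)"
    run_query "MATCH (r1:Resource)-[l]->(r2:Resource) RETURN COUNT(l)"
    run_query "CALL db.relationshipTypes() YIELD relationshipType RETURN relationshipType"
}
```

Calling run_checks after the import prints the node count, the edge count and the relationship types; with CYPHER_SHELL=echo it only prints the arguments that would be passed.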
-
Last active
August 10, 2024 17:46
Import DBpedia 2020 into Neo4j v4 with Neosemantics
dbpedia_files.txt
#https://downloads.dbpedia.org/repo/dbpedia/generic/redirects/2020.10.01/redirects_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/redirects/2020.10.01/redirects_lang=en_transitive.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/revisions/2020.10.01/revisions_lang=en_ids.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/revisions/2020.10.01/revisions_lang=en_uris.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_articles.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_labels.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/categories/2020.10.01/categories_lang=en_skos.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/commons-sameas-links/2020.10.01/commons-sameas-links_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/geo-coordinates/2020.10.01/geo-coordinates_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/generic/homepages/2020.10.01/homepages_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-property-definitions/2020.10.01/infobox-property-definitions_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/labels/2020.10.01/labels_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/generic/wikipedia-links/2020.10.01/wikipedia-links_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/geo-coordinates-mappingbased/2020.10.01/geo-coordinates-mappingbased_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/instance-types/2020.10.01/instance-types_lang=en_specific.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/instance-types/2020.10.01/instance-types_lang=en_transitive.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-literals/2020.10.01/mappingbased-literals_lang=en.ttl.bz2
https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en_disjointDomain.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/mappingbased-objects/2020.10.01/mappingbased-objects_lang=en_disjointRange.ttl.bz2
#https://downloads.dbpedia.org/repo/dbpedia/mappings/specific-mappingbased-properties/2020.10.01/specific-mappingbased-properties_lang=en.ttl.bz2
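In the list above, a leading '#' marks an entry that the download script skips. A small sketch of how the list can be inspected before a long download; active_urls and uncompressed_name are hypothetical helper names, and the parameter expansions are the ones the download script itself uses:

```shell
# Hypothetical helpers mirroring how the download script reads the list file.

# Print the entries that would be downloaded: lines not starting with '#'.
active_urls() {
    grep -v '^#' "$1"
}

# Name of the uncompressed file for one URL, using the same expansions
# as the script: ${line##*/} drops the path, ${name%.*} drops ".bz2".
uncompressed_name() {
    local name="${1##*/}"
    printf '%s\n' "${name%.*}"
}

uncompressed_name "https://downloads.dbpedia.org/repo/dbpedia/generic/labels/2020.10.01/labels_lang=en.ttl.bz2"
# prints: labels_lang=en.ttl
```

Running `active_urls dbpedia_files.txt` lists what will be fetched, which makes it easier to estimate download size before starting.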
download-dbpedia.sh
#!/bin/bash
set -e

export DATA_DIR="${PWD}/data"
export NEO4J_HOME="${PWD}/neo4j-server"
export NEO4J_IMPORT="${NEO4J_HOME}/import"

mkdir -p -v "${DATA_DIR}"
mkdir -p -v "${NEO4J_IMPORT}"

# Expect exactly one argument: the file listing the URLs to download
if [ "$#" -ne 1 ]; then
    echo "Illegal number of parameters."
    exit 1
fi

if [ -d "${DATA_DIR}" ]
then
    echo "Downloading files..."
    rm -v "${DATA_DIR}"/*.* || true
    while read -r line; do
        # Skip commented-out and empty entries
        [[ -z "$line" || "$line" =~ ^# ]] && continue
        wget -P "${DATA_DIR}/" "$line"
        bzip2 -dk "${DATA_DIR}/${line##*/}"
        filename=$(basename -- "${DATA_DIR}/${line##*/}")
        filename="${filename%.*}"
        # Remove corrupted chars and lines
        iconv -f utf-8 -t ascii -c "${DATA_DIR}/${filename}" | grep -E '^<(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[A-Za-z0-9\+&@#/%?=~_|]>\W<' | grep -v 'xn--b1aew' > "${DATA_DIR}/clean-${filename}"
        rm -v "${DATA_DIR}/${filename}"
        # Split into chunks of 5,000,000 lines inside the Neo4j import directory
        split -l 5000000 --numeric-suffixes "${DATA_DIR}/clean-${filename}" "${NEO4J_IMPORT}/part-${filename}"
    done < "$1"
    chmod -R 777 "${NEO4J_IMPORT}"
else
    echo "No destination folder ${DATA_DIR}"
fi
get-neo4j.sh
#!/bin/bash
ulimit -n 65535

NEO4J_VERSION=4.1.4

# Start from a clean server directory
rm -rf neo4j-server
# The URL is quoted so the shell does not treat '?' as a glob character
wget "https://neo4j.com/artifact.php?name=neo4j-community-${NEO4J_VERSION}-unix.tar.gz" -O neo4j.tar.gz
tar xf neo4j.tar.gz
mv "neo4j-community-${NEO4J_VERSION}" neo4j-server
rm neo4j.tar.gz

export NEO4J_HOME="${PWD}/neo4j-server"
export NEO4J_DATA_DIR="${NEO4J_HOME}/data"
rm -rf "${NEO4J_DATA_DIR}"

# APOC_VERSION=4.1.0.2
# APOC_FILE=apoc-${APOC_VERSION}-core.jar
# there is a difference between `core` and `all`
# In theory we don't need this, since
# apoc-4.1.0.2-core.jar contains a subset of the functionality and will be bundled from Neo4j 4.1.1
#if [ ! -f ${NEO4J_HOME}/plugins/${APOC_FILE} ]
#then
#    echo "Downloading Neo4j APOC plugin..."
#    wget -P ${NEO4J_HOME}/plugins/ https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/${APOC_FILE}
#fi

# Do we need the following?
echo "Installing Neo4j APOC plugin..."
echo 'dbms.security.procedures.unrestricted=apoc.*' >> "${NEO4J_HOME}/conf/neo4j.conf"
echo 'apoc.export.file.enabled=true' >> "${NEO4J_HOME}/conf/neo4j.conf"
echo 'apoc.import.file.use_neo4j_config=false' >> "${NEO4J_HOME}/conf/neo4j.conf"

NEOSEM_VERSION=4.1.0.1
NEOSEM_FILE=neosemantics-${NEOSEM_VERSION}.jar
if [ ! -f "${NEO4J_HOME}/plugins/${NEOSEM_FILE}" ]
then
    echo "Downloading Neo4j RDF plugin..."
    wget -P "${NEO4J_HOME}/plugins/" "https://github.com/neo4j-labs/neosemantics/releases/download/${NEOSEM_VERSION}/${NEOSEM_FILE}"
fi

echo "Installing Neo4j RDF plugin..."
echo 'dbms.unmanaged_extension_classes=n10s.endpoint=/rdf' >> "${NEO4J_HOME}/conf/neo4j.conf"

"${NEO4J_HOME}/bin/neo4j" start
sleep 10
"${NEO4J_HOME}/bin/neo4j-admin" set-initial-password admin
"${NEO4J_HOME}/bin/neo4j" restart
sleep 10

echo "Creating index"
# Uniqueness constraint on Resource.uri, then the Neosemantics graph configuration
"${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' "CREATE CONSTRAINT n10s_unique_uri ON (r:Resource) ASSERT r.uri IS UNIQUE;"
"${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' 'call n10s.graphconfig.init( { handleMultival: "OVERWRITE", handleVocabUris: "SHORTEN", keepLangTag: false, handleRDFTypes: "NODES" })'

echo "Neo4j log:"
tail -n 12 "${NEO4J_HOME}/logs/neo4j.log"
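The script waits a fixed 10 seconds after each start, which assumes the server is ready by then. One possible alternative, sketched here and not part of the gist, is to poll until a trivial query succeeds. wait_until is a hypothetical helper name, and the attempt count and delay in the usage comment are arbitrary:

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    local attempts="$1" delay="$2" i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        # Discard the command's output; only its exit status matters here
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Possible use in place of "sleep 10" once the password is set
# (assumes NEO4J_HOME and the 'admin' password from the script):
# wait_until 30 2 "${NEO4J_HOME}/bin/cypher-shell" -u neo4j -p 'admin' "RETURN 1"
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so the caller can decide whether to abort the setup.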
import-dbpedia.sh
#!/bin/bash
# Note: here NEO4J_HOME is the working directory; the server lives in ./neo4j-server
export NEO4J_HOME="${PWD}"
export NEO4J_IMPORT="${NEO4J_HOME}/neo4j-server/import"
export NEO4J_DB_DIR="${NEO4J_HOME}/neo4j-server/data/databases/graph.db"
ulimit -n 65535

echo "Importing"
# Import every chunk produced by download-dbpedia.sh, one call per file
for file in "${NEO4J_IMPORT}"/*.ttl*; do
    # Extracting filename
    echo "$file"
    filename="$(basename "${file}")"
    echo "Importing $filename from ${NEO4J_HOME}"
    "${NEO4J_HOME}/neo4j-server/bin/cypher-shell" -u neo4j -p 'admin' "CALL n10s.rdf.import.fetch(\"file://${NEO4J_IMPORT}/$filename\",\"Turtle\");"
done
Thanks for creating this. It worked perfectly. Download and import took about 3.5 hours. The numbers for the above files: 29,831,524 Resources and 115,653,828 relations.