Sam Zeitlin szeitlin

## Neo4j GraphGist - Enterprise Architectures: Real-time Graph Updates using Kafka Messaging

Neo4j Use Case: Low Latency Graph Analytics & OLTP - Update 1M Nodes in 90 secs with Kafka and Neo4j Bolt

Introduction

A recent Neo4j whitepaper describes how Monsanto is performing real-time updates on a 600M node Neo4j graph using Kafka to consume data extracted from a large Oracle Exadata instance.

This modern data architecture combines a fast, scalable messaging platform (Kafka) for low latency data provisioning and an enterprise graph database (Neo4j) for high performance, in-memory analytics & OLTP - creating new and powerful real-time graph analytics capabilities for your enterprise applications.
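
The flow described above (messages arrive from Kafka and are applied to the graph in bulk) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the architecture from the whitepaper: the `Item` label, the `id` key, the JSON message shape, and the `batch_updates` helper are all hypothetical. The sketch only prepares batched Cypher parameters and leaves out the Kafka consumer and the Bolt connection.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical Cypher statement: the label `Item` and the key `id` are
# illustrative assumptions, not taken from the whitepaper.
MERGE_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (n:Item {id: row.id}) "
    "SET n += row.props"
)


def batch_updates(
    messages: Iterable[bytes], batch_size: int = 1000
) -> Iterator[Tuple[str, Dict[str, List[dict]]]]:
    """Group raw JSON messages into (query, parameters) pairs.

    Each message is assumed to be a JSON object with an `id` field and an
    optional `props` object (an assumed message format).
    """
    rows: List[dict] = []
    for raw in messages:
        record = json.loads(raw)
        rows.append({"id": record["id"], "props": record.get("props", {})})
        if len(rows) >= batch_size:
            yield MERGE_QUERY, {"rows": rows}
            rows = []
    if rows:
        # Flush the final, possibly smaller, batch.
        yield MERGE_QUERY, {"rows": rows}


# Stand-in for messages read from a Kafka topic.
messages = [
    json.dumps({"id": i, "props": {"score": i * 2}}).encode("utf-8")
    for i in range(5)
]
batches = list(batch_updates(messages, batch_size=2))
```

Each `(query, params)` pair could then be handed to a Bolt driver call such as `session.run(query, params)`. Sending one `UNWIND` statement per batch rather than one statement per message is a common way to keep round trips low.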

Spark internals through code

Nothing gives you more detail about Spark internals than actually reading its source code. In addition, you get to learn many design techniques and improve your Scala coding skills. These are the random notes I make while reading the Spark code. The best way to comprehend the notes is to load the Spark code into an IDE, e.g. IntelliJ, and navigate the code on the side.

Genesis - creation of a spark cluster

The scripts for creating a Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you will see that the two scripts are very similar except for the value of the $CLASS variable. For start-master.sh, the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.

# NOTE: This exact class name is matched downstream by SparkSubmit.

Neo4j Tutorial

Fundamentals

Store any kind of data using the following graph concepts:

Node: Graph data records
Relationship: Connect nodes (has direction and a type)
Property: Stores data as key-value pairs on nodes and relationships
Label: Groups nodes (optional)
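
To make these concepts concrete, here is a minimal in-memory sketch in plain Python. It is not Neo4j's API or storage model. The `Node` and `Relationship` classes and the sample data (`Person`, `Company`, `WORKS_AT`) are hypothetical and only show how the four concepts relate to each other.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    """A graph data record with optional labels and key-value properties."""
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from `start` to `end`."""
    start: Node
    end: Node
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data.
alice = Node(labels={"Person"}, properties={"name": "Alice"})
acme = Node(labels={"Company"}, properties={"name": "Acme"})
works_at = Relationship(
    start=alice, end=acme, type="WORKS_AT", properties={"since": 2015}
)
```

The direction is carried by the `start` and `end` fields, and the relationship holds its own properties separately from those of the nodes it connects.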

	import boto3

	# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#service-resource

	ec2 = boto3.resource('ec2', aws_access_key_id='AWS_ACCESS_KEY_ID',
	                     aws_secret_access_key='AWS_SECRET_ACCESS_KEY',
	                     region_name='us-west-2')

	# create VPC
	vpc = ec2.create_vpc(CidrBlock='192.168.0.0/16')

	import boto3

	def role_arn_to_session(**args):
	    """
	    Usage :
	        session = role_arn_to_session(
	            RoleArn='arn:aws:iam::012345678901:role/example-role',
	            RoleSessionName='ExampleSessionName')
	        client = session.client('sqs')
	    """

	# For Windows users
	# Note: <> denotes changes to be made

	#Create a conda environment
	conda create --name <environment-name> python=<version:2.7/3.5>

	#To create a requirements.txt file:
	conda list #Gives you list of packages used for the environment

	conda list -e > requirements.txt #Save all the info about packages to your folder

	"""
	how to count as fast as possible
	(numbers from Python 3.5.2 on a Macbook Pro)
	YMMV, but these results are pretty stable for me, say +/- 0.1s on repeated runs
	"""

	from collections import Counter, defaultdict
	import random

	random_numbers = [random.randrange(10000) for _ in range(10000000)]
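
The imports above suggest a comparison between `Counter` and `defaultdict`. A minimal sketch of such a comparison follows, assuming three candidate approaches. The function names are hypothetical, the sample is much smaller than the 10,000,000 values above so that it runs quickly, and no timings are claimed here because they depend on the machine and Python version.

```python
from collections import Counter, defaultdict
import random
import timeit

# Smaller sample than the original list so the sketch finishes quickly.
sample = [random.randrange(100) for _ in range(100000)]


def count_with_counter(values):
    # Counter counts the items of the iterable in its constructor.
    return Counter(values)


def count_with_defaultdict(values):
    # Missing keys start at int() == 0, so no membership check is needed.
    counts = defaultdict(int)
    for value in values:
        counts[value] += 1
    return counts


def count_with_dict_get(values):
    # Plain dict, with .get supplying a default of 0 for unseen keys.
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


for func in (count_with_counter, count_with_defaultdict, count_with_dict_get):
    elapsed = timeit.timeit(lambda: func(sample), number=3)
    print("{}: {:.3f}s for 3 runs".format(func.__name__, elapsed))
```

All three functions return the same counts, so the only difference to measure is speed.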

	[core]
	# The home folder for airflow, default is ~/airflow
	airflow_home = /Users/p1nox/airflow

	# The folder where your airflow pipelines live, most likely a
	# subfolder in a code repository
	dags_folder = /Users/p1nox/airflow/dags

	# The folder where airflow should store its log files. This location
	base_log_folder = /Users/p1nox/airflow/logs

	#!/bin/bash

	echo "Getting list of Availability Zones"
	all_regions=$(aws ec2 describe-regions --output text --query 'Regions[*].[RegionName]' | sort)
	all_az=()

	while read -r region; do
	    az_per_region=$(aws ec2 describe-availability-zones --region "$region" --query 'AvailabilityZones[*].[ZoneName]' --output text | sort)

	    while read -r az; do