@habedi
habedi / yfcc100m_dataset_schema.sql
Last active June 24, 2022 06:54
Schema for a table that can hold the data from the YFCC100m dataset. The schema is compatible with MySQL and MariaDB.
-- Dataset is available from http://multimedia-commons.s3-website-us-west-2.amazonaws.com/?prefix=tools/etc/ in SQLite database format (the 'yfcc100m_dataset.sql' file)
SET NAMES utf8;
SET time_zone = '+00:00';
SET foreign_key_checks = 0;
SET sql_mode = 'NO_AUTO_VALUE_ON_ZERO';
DROP TABLE IF EXISTS `yfcc100m_dataset`;
CREATE TABLE `yfcc100m_dataset` (
`photoid` int NOT NULL,
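The preview above is truncated, so only the `photoid` column is visible (the full YFCC100m schema has many more columns). As a hedged sketch of working with a compatible table locally, without a MySQL server, Python's built-in sqlite3 module can stand in:

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL/MariaDB.
# Only the column shown in the snippet above is used; the real
# table has additional columns omitted here.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE yfcc100m_dataset (
        photoid INTEGER NOT NULL PRIMARY KEY
    )
    """
)
conn.execute("INSERT INTO yfcc100m_dataset (photoid) VALUES (?)", (12345,))
row = conn.execute("SELECT photoid FROM yfcc100m_dataset").fetchone()
print(row[0])  # 12345
```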
@habedi
habedi / MyNotebook.py
Created January 17, 2022 09:38
Example code for loading a CSV file as a DataFrame in Databricks Community Edition and saving it as a table
# Loading PySpark modules
from pyspark.sql import DataFrame
from pyspark.sql.types import *
#from pyspark.context import SparkContext
#from pyspark.sql.session import SparkSession
# sc = SparkContext('local')
# spark = SparkSession(sc)
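The preview stops at the imports, and Databricks-specific calls such as `spark.read.csv` cannot run outside a cluster. As a minimal stand-in for the same load-CSV-then-save-as-table flow, here is a standard-library sketch (the column names and CSV contents are hypothetical; sqlite3 plays the role of the saved table):

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents standing in for an uploaded file.
csv_text = "name,age\nAda,36\nAlan,41\n"

# Equivalent of reading with header=True: each row becomes a dict.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Save the rows as a table, analogous to df.write.saveAsTable(...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people (name, age) VALUES (:name, :age)", rows)
print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])  # 2
```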
@habedi
habedi / StackExchange DB schema.sql
Last active January 1, 2022 15:28
Schemas for some of the tables in the StackExchange database dumps (available here: https://archive.org/download/stackexchange); the schemas work with MySQL and MariaDB
-- 'xxx.stackexchange.com' is the name of the database
-- `xxx.stackexchange.com`.badges definition
CREATE TABLE `badges` (
`Id` int(11) NOT NULL,
`UserId` int(11) NOT NULL,
`Name` varchar(30) NOT NULL,
`Date` datetime NOT NULL,
`Class` int(11) NOT NULL,
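The `badges` definition above can be exercised locally without a MySQL server; here is a hedged SQLite adaptation using Python's sqlite3 module (the inserted row is hypothetical sample data, and SQLite's looser types stand in for `int(11)`, `varchar(30)`, and `datetime`):

```python
import sqlite3

# SQLite stand-in for the MySQL/MariaDB `badges` table defined above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE badges (
        Id INTEGER NOT NULL,
        UserId INTEGER NOT NULL,
        Name TEXT NOT NULL,   -- varchar(30) in the MySQL schema
        Date TEXT NOT NULL,   -- datetime in the MySQL schema
        Class INTEGER NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO badges VALUES (?, ?, ?, ?, ?)",
    (1, 42, "Autobiographer", "2022-01-01 00:00:00", 3),
)
print(conn.execute("SELECT Name FROM badges WHERE Id = 1").fetchone()[0])
# Autobiographer
```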
@habedi
habedi / load_data_to_neo4j.cyp
Last active September 2, 2024 16:01
A script with the commands to load CSV files into a Neo4j graph database. The data comes from the StackExchange data dump (https://archive.org/details/stackexchange) #cypher #neo4j #stackoverflow_data #csv #graphdb
// Loading the posts
// Note: with WITH HEADERS each row is a map keyed by column name, so positional
// indexing like row[0] returns null; load without headers for positional access.
LOAD CSV FROM 'file:///posts_all_csv.csv' AS row FIELDTERMINATOR '\t'
WITH toInteger(row[0]) AS postId, row[5] AS postBody, toInteger(row[3]) AS postScore
RETURN count(postId);
LOAD CSV FROM 'file:///posts_all_csv.csv' AS row FIELDTERMINATOR '\t'
WITH toInteger(row[0]) AS postId, toInteger(row[3]) AS postScore, row[5] AS postBody
MERGE (p:Post {postId: postId})
SET p.postBody = postBody, p.postScore = postScore
RETURN p;
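Outside Neo4j, the MERGE-then-SET upsert in the script above can be mimicked with a plain dict keyed by postId. This is only an illustrative sketch: the tab-separated sample rows below are hypothetical, and the field positions 0, 3, and 5 follow the script:

```python
import csv
import io

# Hypothetical tab-separated rows shaped like posts_all_csv.csv;
# the same postId appears twice, so the second row updates the first.
tsv_text = (
    "1\tq\tt\t10\tx\t<p>Hello</p>\n"
    "1\tq\tt\t12\tx\t<p>Hello, edited</p>\n"
)

posts = {}  # keyed by postId, like MERGE (p:Post {postId: postId})
for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
    post_id = int(row[0])
    posts.setdefault(post_id, {})            # MERGE: create the node once
    posts[post_id].update(                   # SET: overwrite its properties
        postScore=int(row[3]), postBody=row[5]
    )

print(posts[1]["postScore"])  # 12
```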
@habedi
habedi / pyspark-helloworld-app-graphframes.ipynb
Last active June 17, 2022 07:40
PySpark HelloWorld App + GraphFrames
@habedi
habedi / pyspark_helloworld-app.ipynb
Last active February 24, 2021 12:25
PySpark HelloWorld App
@habedi
habedi / big_list_of_english_stopwords
Created January 31, 2021 10:47
A large list of English stopwords (original source: https://gist.github.com/sebleier/554280)
a
about
above
after
again
against
ain
all
am
an
@habedi
habedi / start_single_worker_spark_cluster.sh
Created February 12, 2020 21:50
Starting a minimal single-worker Spark cluster
## run the following commands in BASH
start-master.sh
# go to http://localhost:8080 and check that the Spark master service has started
start-slave.sh spark://$(hostname):7077
# if the worker service started successfully, you should see the worker listed at http://localhost:8080 in the connected workers section
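The worker in the script above connects to `spark://$(hostname):7077`. A small Python equivalent of that shell expansion, useful for building the same master URL from another script (the hostname value will of course differ per machine):

```python
import socket

# Mirror of the shell expansion spark://$(hostname):7077 used by start-slave.sh.
master_url = f"spark://{socket.gethostname()}:7077"
print(master_url)
```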
@habedi
habedi / get_spark.sh
Last active February 12, 2020 21:18
Simple commands to download and extract Apache Spark's pre-built binaries from its website
## run the following commands in BASH
cd  # go back to your user's home directory
wget -c https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz  # this will download Spark
tar xvfz spark-2.4.5-bin-hadoop2.7.tgz  # this will extract the downloaded archive into the current directory
mv spark-2.4.5-bin-hadoop2.7 spark  # rename the extracted folder to "spark"
# append the JAVA_HOME and SPARK_HOME environment variables to the end of your BASH startup script
# we are assuming that JRE 8 is installed in "/usr/lib/jvm/java-1.8.0-openjdk-amd64"
cat >> .bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export SPARK_HOME="$HOME/spark"
export PATH="$SPARK_HOME/bin:$PATH"
EOF
@habedi
habedi / first_time_ubuntu_setup.sh
Last active February 12, 2020 21:03
Installing and setting up required packages and utilities in Ubuntu 18.04
## run the following commands in BASH
sudo apt-get update -y && sudo apt-get upgrade -y
# you may need to enter your password when you run a command with "sudo"
# if you are asked a yes/no question during the previous command, choose "Yes"
sudo apt-get install -y htop nload netcat emacs nano openjdk-8-jdk-headless python-pip python3-pip wget \
curl python-mode scala-mode-el
# be patient; it can take a while for all the packages to be downloaded and installed
sudo pip install pyspark
sudo pip3 install pyspark
# again, it may take a while for the PySpark packages to be downloaded for both Python 2 (now end-of-life) and Python 3