@pierdom
pierdom / rotate_axis_labels.py
Last active September 3, 2018 09:12
[Rotate axis labels in Seaborn/Matplotlib] Useful when labels are long (e.g., long names, full DD/MM/YYYY HH:mm dates, etc.) and don't fit on the axis. #python #matplotlib #visualization
import seaborn as sns
# pois_averages: DataFrame with columns "poi", "visitor_cnt" and "day_type"
g = sns.boxplot(x="poi", y="visitor_cnt", hue="day_type", data=pois_averages, palette="Set1")
for item in g.get_xticklabels():
    item.set_rotation(60)
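
If the boxplot was just drawn on the current axes, the same rotation can also be done with a single pyplot call (an equivalent alternative, not part of the original gist):

import matplotlib.pyplot as plt
plt.xticks(rotation=60)  # rotates the x tick labels of the current axes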
@pierdom
pierdom / local_spark_jupyter.md
Last active September 7, 2017 07:49
[PySpark and Jupyter single node] Quick and dirty notes for running single-node Spark and a Jupyter notebook with PySpark #python #spark #bigdata #sysadmin

Running single-node Spark and a Jupyter notebook with PySpark

Prerequisites:

  1. Java Development Kit (JDK): download at http://www.oracle.com/technetwork/java/javase/downloads/index.html
  2. Spark pre-built for Hadoop 2.6 (in case not already installed): download at http://spark.apache.org/downloads.html
  3. Jupyter notebook: sudo pip install jupyter (or sudo pip3 install jupyter for Python 3)

In this example, I have extracted the tar.gz file in /opt. Remember to use the correct path for the $SPARK_HOME environment variable (see the sketch below), depending on where you installed it.
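
The notes are cut off in this preview; what follows is a minimal sketch of the launch, assuming Spark pre-built for Hadoop 2.6 extracted in /opt (the exact directory name is a placeholder) and using Spark's standard PYSPARK_DRIVER_PYTHON hook:

# placeholder path: adjust to the version you extracted
export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

# make pyspark start Jupyter as the driver instead of the plain REPL
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# new notebooks will have sc already defined
pyspark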

@pierdom
pierdom / plot_precounted_hist.py
Last active August 3, 2023 11:56
[Plot histograms with pre-computed counters] Plot histograms with the Matplotlib hist function or the Seaborn distplot function from pre-counted values, using the 'weights' argument. Very useful for plotting distributions of values queried from a very large dataset, where it is impossible to retrieve and load in memory every element of the distribution individually.
#!/usr/bin/env python3
import matplotlib.pyplot as plt
import seaborn as sns
# dictionary with pre-counted bins
test = {1:1,2:1,3:1,4:2,5:3,6:5,7:4,8:2,9:1,10:1}
# with matplotlib
plt.hist(list(test.keys()), weights=list(test.values()))
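
The preview stops at the Matplotlib call; the Seaborn distplot variant mentioned in the description can be sketched the same way, using hist_kws to forward the weights to plt.hist (keys are the bin positions, values their counts):

# with seaborn (hist_kws is forwarded to plt.hist; kde disabled since it would ignore the weights)
sns.distplot(list(test.keys()), bins=10, kde=False, hist_kws={"weights": list(test.values())})
plt.show()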
@pierdom
pierdom / pandas_convert_time.py
Last active September 7, 2017 07:47
[From string to datetime in Pandas] Convert a column in a Pandas DataFrame from string (object) to datetime64 type. #python #datascience #pandas
df["time_column"] = pd.to_datetime(pd.Series(df["time_column"]))
@pierdom
pierdom / jupyter_code_toggler.py
Last active September 7, 2017 07:47
[Code toggle in Jupyter] Add the following code to a cell in a Jupyter notebook (with a Python kernel). It lets you toggle the code cells, leaving only markdown output and figures visible. Very convenient when exporting a notebook to HTML. #jupyter #python
from IPython.display import HTML
HTML('''<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Toggle code cells on/off">
</form>''')
@pierdom
pierdom / fix_python_rocket_icon_mac.md
Last active January 24, 2019 18:22
[Matplotlib and Jupyter on Mac: rocket icon] Avoid the Python rocket icon in the Mac dock when running Python scripts that use Matplotlib (very frequent when running Jupyter Notebook). The problem occurs because Python thinks that something interactive (e.g., a GUI) is going on, while we are simply plotting in-line in Jupyter. The fix below tells Matplotlib to use the non-interactive Agg backend.

Change import matplotlib.pyplot as plt to:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
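
The order matters: matplotlib.use("Agg") must run before the first import of matplotlib.pyplot, because in the Matplotlib versions this note targets, once pyplot has loaded an interactive backend the rocket icon has already appeared and selecting another backend has no effect.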
@pierdom
pierdom / git_ignore_files.md
Last active September 7, 2017 07:46
[Ignore files in git] Remove and ignore specific file names (e.g., .DS_Store files on Macs) from git repositories #git #sysadmin

To remove all '.DS_Store' files from an existing git repo:

find . -name .DS_Store -print0 | xargs -0 git rm --ignore-unmatch

To permanently ignore '.DS_Store' files in all git repos:

echo .DS_Store >> ~/.gitignore_global
git config --global core.excludesfile ~/.gitignore_global
@pierdom
pierdom / jupyter_pyspark.md
Last active January 28, 2020 21:35
[Run Jupyter with Pyspark integration] This how-to assumes that pyspark is installed and correctly configured to access the cluster (or the stand-alone configuration). Jupyter and other Python packages are executed in a virtualenv. #python #spark #bigdata #sysadmin

Jupyter + Pyspark how-to

Version 1.0 2016/11/14
Pierdomenico Fiadino | [email protected]

Synopsis

Install Jupyter Notebook in a dedicated Python virtualenv and integrate it with the Spark shell pyspark on a cluster client (for this example, we will use the tourism-lab node).

This how-to assumes that we have SSH access to the machine and that pyspark is already installed and configured (try executing it and check whether the variables sc, HiveContext and sqlContext are already defined). A sketch of the setup follows.
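
The gist preview ends here; below is a minimal sketch of the steps it describes, with the virtualenv path and notebook port as made-up placeholders:

# create and activate a dedicated virtualenv (path is a placeholder)
virtualenv ~/venvs/jupyter
source ~/venvs/jupyter/bin/activate
pip install jupyter

# make pyspark start Jupyter as its driver instead of the plain REPL
export PYSPARK_DRIVER_PYTHON=$(which jupyter)
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888"

# notebooks opened from here get sc / sqlContext predefined
pyspark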

@pierdom
pierdom / yarn_scheduler_notes.md
Last active September 6, 2017 12:36
[Yarn Capacity Scheduler configs] and how to change it to Fair Scheduler. #bigdata #yarn #sysadmin

My Capacity Scheduler configs

Field: yarn.resourcemanager.scheduler.class Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

The capacity scheduler is configured with 3 queues (default 75%, batch1 25%, batch2 0%), one of which (batch2) is currently unused. Here is the config:

yarn.scheduler.capacity.root.queues=batch1,batch2,default
yarn.scheduler.capacity.root.default.user-limit-factor=1
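
The preview stops after the first two properties; based on the percentages stated above, the per-queue capacities would look roughly like this (property names follow the standard yarn.scheduler.capacity.<queue-path>.capacity scheme):

yarn.scheduler.capacity.root.default.capacity=75
yarn.scheduler.capacity.root.batch1.capacity=25
yarn.scheduler.capacity.root.batch2.capacity=0

Switching to the Fair Scheduler, as the title mentions, amounts to pointing yarn.resourcemanager.scheduler.class at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler instead.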
@pierdom
pierdom / hive_optimization.md
Last active May 28, 2020 12:29
[Hive performance tuning]. Source: http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ #bigdata #hive #sysadmin

Five ways to tune Hive performance

1. Use Tez

set hive.execution.engine=tez;

2. Store tables as ORC
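
The listing is cut off here; as an illustrative sketch, an existing table can be rewritten in ORC format with a CTAS statement (the table names are made up):

CREATE TABLE visits_orc STORED AS ORC AS SELECT * FROM visits;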