@zouzias
zouzias / gist:6db3cb72f7e35f5a4c8d267151ed176e
Created December 15, 2017 10:57
One-hot encoding issue with sklearn and Pipeline (solved)
See the comment at https://github.com/jpmml/jpmml-sklearn/issues/38 (vruusmann commented on Apr 19):
To apply one-hot encoding to string columns, use the sklearn.preprocessing.LabelBinarizer transformer class. For a column with three or more distinct values, it produces the same indicator columns as a LabelEncoder followed by a OneHotEncoder.
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    ("country_name", LabelBinarizer())
])
The OneHotEncoder transformation makes sense if your input data contains categorical integer columns.
Currently, sklearn_pandas.DataFrameMapper is unable to apply [LabelEncoder(), OneHotEncoder()] on a string column due to the above "matrix transpose" problem. You could additionally open an issue with the sklearn_pandas project, and ask for their opinion about it.
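To make the two-step view concrete, here is a minimal pure-Python sketch of what label binarization computes for a string column with three or more distinct values. It does not call scikit-learn; the helper name `one_hot` and the sample country names are hypothetical, chosen only for illustration.

```python
def one_hot(values):
    """Return (classes, rows): the sorted distinct labels and one indicator row per value."""
    # Step 1 (label encoding): map each distinct label to a column index.
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}

    # Step 2 (one-hot encoding): one indicator column per label.
    rows = []
    for value in values:
        row = [0] * len(classes)
        row[index[value]] = 1
        rows.append(row)
    return classes, rows


# Hypothetical sample column.
classes, rows = one_hot(["Greece", "Estonia", "Greece", "Peru"])
print(classes)
for row in rows:
    print(row)
```

Each input value becomes a row with a single 1 in the column of its label, which is the shape of output the quoted comment describes.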
zouzias / PyUtils.cpp
Created October 6, 2017 14:20 — forked from rjzak/PyUtils.cpp
Convert between Python list/tuples and C++ vectors
#include <Python.h> // Must be first
#include <vector>
#include <stdexcept>
#include "PyUtils.h"
using namespace std;
// =====
// LISTS
// =====
zouzias / spark_shell.scala
Last active October 5, 2017 14:51
LuceneRDD: Example using Quora Question Pairs Dataset
import org.apache.spark.sql.Row
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._
val df = spark.read.parquet("spark-lucenerdd/quora_duplicate_questions.parquet")
// Build a Lucene query from question1: strip non-alphanumerics, keep tokens longer than 3 characters, AND them together
val linker = { r: Row =>
  val tokens = r.getString(r.fieldIndex("question1")).split(" ").map(_.replaceAll("[^a-zA-Z0-9]", "")).filter(_.length > 3).mkString(" AND ")
  if (tokens.nonEmpty) s"question1:(${tokens})" else "*:*"
}
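The query-building logic of the linker can be tried without Spark. The following is a rough Python sketch of the same string transformation, not part of LuceneRDD; the function name `build_query` and the sample questions are hypothetical.

```python
import re


def build_query(question, field="question1"):
    """Build a conjunctive Lucene-style query string from the words of `question`."""
    # Strip every non-alphanumeric character from each whitespace-separated word.
    tokens = [re.sub(r"[^a-zA-Z0-9]", "", word) for word in question.split(" ")]
    # Keep only tokens longer than 3 characters.
    tokens = [token for token in tokens if len(token) > 3]
    if not tokens:
        return "*:*"  # match-all fallback when no token survives the filter
    joined = " AND ".join(tokens)
    return f"{field}:({joined})"


print(build_query("What is the best way to learn Scala?"))
print(build_query("Is it ok?"))
```

Short words are dropped before the tokens are joined, so a question made only of short words falls back to the match-all query `*:*`.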
zouzias / setup.py
Created October 4, 2017 19:57 — forked from hryk/setup.py
an example for using boost::unordered_map in Cython.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [
Extension("test_um",
["test_um.pyx"],
include_dirs=["/usr/local/include/boost"],
library_dirs=["/usr/local/lib"],
language="c++")
zouzias / sparkDataFrameZipWithIndex.scala
Last active October 28, 2021 14:25
Spark DataFrame zipWithIndex
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType, LongType}
val df = sc.parallelize(Seq((1.0, 2.0), (0.0, -1.0), (3.0, 4.0), (6.0, -2.3))).toDF("x", "y")
// Append "rowid" column of type Long
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
// Zip on RDD level
val rddWithId = df.rdd.zipWithIndex
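As an illustration of what `zipWithIndex` does at the RDD level, here is a small Python sketch that needs no Spark. It models an RDD as a list of partitions and assigns ids by partition order first, then by position within each partition. The function name `zip_with_index` and the two-partition layout are assumptions made for the example.

```python
def zip_with_index(partitions):
    """Append a running 0-based row id to every row, partition by partition."""
    indexed = []
    next_id = 0
    for partition in partitions:
        part = []
        for row in partition:
            part.append(row + (next_id,))  # row is a tuple; the id becomes its last field
            next_id += 1
        indexed.append(part)
    return indexed


# Hypothetical layout: the four sample rows split over two partitions.
partitions = [[(1.0, 2.0), (0.0, -1.0)], [(3.0, 4.0), (6.0, -2.3)]]
result = zip_with_index(partitions)
for part in result:
    print(part)
```

The ids are contiguous and depend on how the rows are partitioned, which is why the extended schema declares `rowid` as a non-nullable `LongType` column.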
zouzias / geometryPolygonsToWKT.scala
Created August 15, 2017 14:43
Convert from GeoJSON geometry.coordinates property to WKT (Polygons)
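// Uses only the first (exterior) ring; each position is written as "first last", i.e. "x y" for 2D coordinates
val geometryToWKT = udf((a: Seq[Seq[Seq[Double]]]) => "POLYGON ((" + a.head.map(x => x.head + " " + x.last).mkString(", ") + "))")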
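The same conversion can be checked outside Spark. This is a minimal Python sketch of the transformation under the same simplification as the UDF (only the first ring is used, so holes are ignored). The function name `polygon_to_wkt` and the sample square are hypothetical.

```python
def polygon_to_wkt(coordinates):
    """Convert GeoJSON Polygon `coordinates` to a WKT string, using the exterior ring only."""
    exterior = coordinates[0]  # the first ring is the exterior boundary
    # Write each position as "first last", separated by commas.
    points = ", ".join(f"{point[0]} {point[-1]}" for point in exterior)
    return f"POLYGON (({points}))"


# Hypothetical sample: a closed unit square.
square = [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]]
wkt = polygon_to_wkt(square)
print(wkt)
```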
zouzias / jupyter_virtualenv_notebooks.txt
Last active August 14, 2017 09:18
How to setup virtualenv and Jupyter Notebooks
1. Install kernelspec for python3
First check that it is not already installed:
jupyter kernelspec list
jupyter kernelspec install /usr/local/Cellar/python3/3.6.1/bin/
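For context, a kernelspec is a directory containing a `kernel.json` file, and a directory like that is what `jupyter kernelspec install` takes as its argument. The sketch below writes such a file for a virtualenv interpreter. It assumes `ipykernel` is installed in that virtualenv; the helper name `write_kernelspec`, the interpreter path, and the display name are placeholders, not values from the original notes.

```python
import json
import tempfile
from pathlib import Path


def write_kernelspec(spec_dir, python_path, display_name):
    """Write a kernel.json that launches ipykernel with the given interpreter."""
    spec = {
        # "{connection_file}" is filled in by Jupyter when the kernel starts.
        "argv": [python_path, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }
    spec_dir = Path(spec_dir)
    spec_dir.mkdir(parents=True, exist_ok=True)
    kernel_json = spec_dir / "kernel.json"
    kernel_json.write_text(json.dumps(spec, indent=2))
    return kernel_json


# Placeholder values: point python_path at the virtualenv's interpreter.
spec_dir = Path(tempfile.mkdtemp()) / "myenv"
kernel_json = write_kernelspec(spec_dir, "/path/to/venv/bin/python", "Python 3 (myenv)")
print(kernel_json.read_text())
```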
zouzias / tmux-cheatsheet.markdown
Created June 12, 2017 11:15 — forked from MohamedAlaa/tmux-cheatsheet.markdown
tmux shortcuts & cheatsheet

start new:

tmux

start new with session name:

tmux new -s myname
zouzias / gist:44f214f922cc63c026e079d753e436be
Created May 18, 2017 15:51
Install Elasticsearch 5.x on Ubuntu
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install apt-transport-https
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
sudo apt-get update && sudo apt-get install elasticsearch
sudo update-rc.d elasticsearch defaults 95 10
zouzias / anupotaksia.txt
Last active November 24, 2021 12:35
Greek Army
31. Who is declared a draft evader, and how is draft evasion interrupted?
Under the Law on the Conscription of Greeks, draft evaders are those who, following a general or special call-up for enlistment in the Armed Forces, do not enlist at the enlistment units on the specified dates or within the specified deadlines, without a lawful reason for remaining outside the Armed Forces (deferment or exemption). Draft evasion is interrupted in the following cases:
a. When the draft evader completes his forty-fifth (45th) year of age.
b. Upon enlistment in the Armed Forces.
c. Upon arrest for draft evasion.
d. When the draft evader presents himself to any military authority in order to interrupt his draft evasion.