- If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8 relative to 8-byte (64-bit) integers.
- Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
- Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions will remain the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the
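As a rough illustration of the sizing advice above, the partition count can be estimated from an assumed output size and the ~128MB target. The helper name, the expansion factor, and the byte counts below are hypothetical; in practice the estimate would come from measuring the actual data.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128MB target from the notes above


def estimate_num_partitions(input_bytes, expansion_factor=1.0,
                            target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: partitions needed so that each one holds at most
    roughly `target_bytes` after the data grows by `expansion_factor`."""
    output_bytes = input_bytes * expansion_factor
    # Round up: erring towards more partitions keeps each one under the target.
    return max(1, math.ceil(output_bytes / target_bytes))


# Assumed example: 10 GiB of input that a flatMap-style step expands 20x.
input_bytes = 10 * 1024 ** 3
before = estimate_num_partitions(input_bytes)
after = estimate_num_partitions(input_bytes, expansion_factor=20)
print(before, after)
```

The second number is the kind of value one might pass to a repartition call on the flatMap output, before the memory-heavy op runs.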
# delete all containers
docker rm $(docker ps -a -q)
# delete images without tags
docker rmi $(docker images | grep '^<none>' | awk '{print $3}')
# clean up volumes
docker volume prune
package com.github.jongwook
import net.recommenders.rival.core.DataModel
import net.recommenders.rival.evaluation.metric.ranking.NDCG
import net.recommenders.rival.evaluation.metric.ranking.NDCG.TYPE
import org.apache.spark.SparkConf
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success, Random, Try}
import tensorflow as tf
from random import randint, seed
seed(42)
current_x = tf.placeholder(tf.float32)
x = tf.Variable(2.1, name='x', dtype=tf.float32)
log_x = tf.log(x)
result = current_x * tf.square(log_x)
import tensorflow as tf
x_1 = tf.placeholder(tf.float32)
x_2 = tf.placeholder(tf.float32)
x_3 = tf.placeholder(tf.float32)
x = tf.Variable(2, name='x', dtype=tf.float32)
log_x = tf.log(x)
result = (x_1 + x_2 + x_3) * tf.square(log_x)
import tensorflow as tf
x = tf.Variable(2, name='x', dtype=tf.float32)
log_x = tf.log(x)
log_x_squared = tf.square(log_x)
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(log_x_squared)
init = tf.initialize_all_variables()
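The snippet above sets up gradient descent on (log x)^2 with a learning rate of 0.5, starting from x = 2. As a sketch of what that iteration does, here is the same update written in plain Python with a hand-derived gradient, 2 * ln(x) / x. The function names and the step count are assumptions made for illustration, not part of the original snippet.

```python
import math


def grad_log_squared(x):
    """Derivative of f(x) = (ln x)^2, i.e. 2 * ln(x) / x."""
    return 2.0 * math.log(x) / x


def minimize_log_squared(x0=2.0, learning_rate=0.5, steps=20):
    """Plain gradient descent on f(x) = (ln x)^2, starting from x0."""
    x = x0
    for _ in range(steps):
        # Move against the gradient, scaled by the learning rate.
        x -= learning_rate * grad_log_squared(x)
    return x


x_final = minimize_log_squared()
print(x_final)
```

The minimum of (ln x)^2 is at x = 1, where ln x = 0, so the iterate should settle there.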
The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.
All machine learning algorithms have three components:
- Representation for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the hypothesis space. If a function is not in the hypothesis space, it cannot be learnt.
- The evaluation function tells how good a machine learning model is.
- Optimisation is the method used to search for the best-scoring model.
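The three components can be made concrete with a deliberately tiny learner. This is only a sketch: the dataset, the candidate thresholds, and the function names are invented for illustration and do not come from the paper.

```python
# Toy 1-D dataset (made up for illustration): feature value and binary label.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

# Representation: the hypothesis space is the set of threshold classifiers
# "predict 1 if x > t", for t in a fixed candidate list.
thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]


def predict(threshold, x):
    return 1 if x > threshold else 0


# Evaluation: accuracy of one hypothesis on the data.
def accuracy(threshold, examples):
    correct = sum(1 for x, y in examples if predict(threshold, x) == y)
    return correct / len(examples)


# Optimisation: exhaustive search over the hypothesis space.
best_threshold = max(thresholds, key=lambda t: accuracy(t, data))
print(best_threshold, accuracy(best_threshold, data))
```

A threshold such as 1.75 would separate this data equally well, but it is not in the candidate list, so this learner can never return it. That is the sense in which a function outside the hypothesis space cannot be learnt.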
# Shell to use with Make
SHELL := /bin/bash
# Set important Paths
PROJECT := # Set to your project name
LOCALPATH := $(CURDIR)/$(PROJECT)
PYTHONPATH := $(LOCALPATH)/
PYTHON_BIN := $(VIRTUAL_ENV)/bin
# Export targets not associated with files
import pandas as pd
# http://blog.yhathq.com/static/misc/data/WineKMC.xlsx
# Note: recent pandas versions spell this keyword sheet_name (formerly sheetname)
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()