@whizzmler
whizzmler / create-batchcentroid.json
Last active June 7, 2016 17:42 — forked from jaor/create_batchcentroid.whizzml
One click batchcentroid dataset
{
"name": "Batch-centroid",
"description": "Create a dataset with a new column showing the cluster each instance belongs to",
"inputs": [
{
"name": "source-id",
"type": "source-id",
"description": "Identifier of the input source"
}
@whizzmler
whizzmler / model-or-ensemble.json
Last active May 10, 2016 21:15
Model or Ensemble?
{
"name": "Model or ensemble",
"description": "Select the best option for modeling a source: a model or an ensemble?",
"inputs": [
{
"name": "input-source-id",
"type": "source-id",
"description": "Source for training and testing the model and the ensemble"
}
],
@whizzmler
whizzmler / one-click-dataset.json
Last active November 2, 2019 17:40
One-Click Dataset from a Source
{
"name": "One-Click Dataset",
"description": "Create a new dataset from a source with a click",
"outputs": [
{"name": "dataset-id", "type": "dataset-id", "description": "The new dataset"},
{"name": "rows", "type": "number", "description": "The number of rows of the new dataset"}
],
"inputs": [
{"name": "source-id", "type": "source-id", "default": "", "description": "Source from which to create a new dataset"},
{"name": "source-name", "type": "string", "default": "", "description": "Name for the new dataset"}
@whizzmler
whizzmler / best-first.whizzml
Created April 21, 2016 01:54 — forked from charleslparker/best-first.whizzml
Best-first feature selection via WhizzML
;; A simple function to get the max value in a list
(define (get-max xs) (reduce (lambda (x y) (if (> x y) x y)) (head xs) xs))
;; Get feature names given ids
(define (feature-names dataset-id ids)
(let (fields (get (fetch dataset-id) "fields"))
(map (lambda (id) (get-in fields [id "name"])) ids)))
;; Create a dataset sample
(define (sample-dataset ds-id rate oob)
@whizzmler
whizzmler / gradient-boosting.json
Last active May 10, 2016 21:13 — forked from charleslparker/gradient-boosting.whizzml
A vanilla implementation of gradient boosting in WhizzML
{
"name": "Gradient Boosting",
"description": "Perform gradient boosting for classification",
"inputs": [{
"name": "dataset-id",
"type": "dataset-id",
"description": "Select the training dataset"
}],
"outputs": [{
"name": "model-array",
@whizzmler
whizzmler / normalize-dataset.json
Last active May 15, 2016 03:50 — forked from osroca/README.md
Remove anomalies from dataset
{
"name": "Normalize Dataset",
"description": "Remove the top n anomalies from a dataset",
"inputs": [
{"name": "dataset-id", "type": "dataset-id", "description": "Dataset Id"},
{"name": "top-n", "type": "number", "description": "Top N Anomalies to Remove"}
],
"outputs": [
{"name": "normalized-dataset", "type": "dataset-id", "description": "Normalized Dataset"}
],
@whizzmler
whizzmler / model-by-clusters.json
Last active May 10, 2016 21:20
Model by Clusters
{"description": "A script that generates a cluster and a set of models from its centroid datasets",
"name": "Model by Clusters",
"inputs": [{"name": "source-id", "type": "source-id"}],
"outputs": [{"name": "dataset-id", "type": "dataset-id", "description": "Full dataset from input source"},
{"name": "cluster-id", "type": "cluster-id", "description": "G-means cluster from full dataset"},
{"name": "models", "type": "map", "description": "Map from centroid id to associated predictive model"},
{"name": "names", "type": "map", "description": "Map from centroid id to centroid name"},
{"name": "evaluations", "type": "map", "description": "Evaluations for each of the per centroid models"}]}
@whizzmler
whizzmler / predict-by-clusters.json
Last active May 10, 2016 21:19
Predict by Clusters Library
{"description": "A library with functions to make predictions from the results of a Model by Clusters execution",
"name": "Predict by clusters"}

Stacked generalization

Objective: Improve predictions by modeling the output scores of multiple trained models.

  • Create a training and a holdout set
  • Create n different models on the training set (with some difference among them; e.g., single-tree vs. ensemble vs. logistic regression)
  • Make predictions from those models on the holdout set
  • Train a model to predict the class based on the other models' predictions
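
The four steps above can be sketched in plain Python. This is only an illustration, not the WhizzML script from this gist: the synthetic data, the helper names (`make_data`, `train_stump`, `train_centroid`, `train_meta`, `accuracy`), and the choice of base and meta learners are all assumptions made for the example. The base models are two one-feature threshold stumps and a nearest-centroid classifier; the meta-model is a lookup table that maps each combination of base predictions to the majority class seen on the holdout set.

```python
import random
from collections import Counter, defaultdict

def make_data(n, seed):
    """Synthetic 2-D points; the class is 1 when x0 + x1 > 1 (assumed toy problem)."""
    rng = random.Random(seed)
    rows = [(rng.random(), rng.random()) for _ in range(n)]
    return [(x, int(x[0] + x[1] > 1.0)) for x in rows]

def train_stump(data, feature):
    """Base model: best single threshold on one feature."""
    best_t, best_acc = 0.5, -1.0
    for t in [i / 20 for i in range(1, 20)]:
        acc = sum(int(x[feature] > t) == y for x, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x[feature] > best_t)

def train_centroid(data):
    """Base model: predict the class whose mean point is closest."""
    cents = {}
    for label in (0, 1):
        pts = [x for x, y in data if y == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    def predict(x):
        return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, cents[l])))
    return predict

def train_meta(base_models, holdout):
    """Meta-model: majority class per combination of base-model predictions."""
    votes = defaultdict(Counter)
    for x, y in holdout:
        votes[tuple(m(x) for m in base_models)][y] += 1
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    default = Counter(y for _, y in holdout).most_common(1)[0][0]
    return lambda x: table.get(tuple(m(x) for m in base_models), default)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# 1. Create a training and a holdout set.
data = make_data(400, seed=0)
train, holdout = data[:200], data[200:]

# 2. Create n different models on the training set.
base_models = [train_stump(train, 0), train_stump(train, 1), train_centroid(train)]

# 3 and 4. Predict on the holdout set and train the meta-model on those predictions.
stacked = train_meta(base_models, holdout)

for i, m in enumerate(base_models):
    print("base model", i, "holdout accuracy:", accuracy(m, holdout))
print("stacked holdout accuracy:", accuracy(stacked, holdout))
```

Because the meta-model is fitted on the holdout set, its accuracy on that same set is optimistic; a fair comparison would score all models on a third, untouched set.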
@whizzmler
whizzmler / find-neighbors.json
Last active May 14, 2016 10:37
Cluster neighbors
{
"name": "Find neighbors",
"description": "Find the closest cluster rows to a given one",
"inputs": [
{"name": "cluster-id", "type": "cluster-id", "description": "The cluster to select rows from"},
{"name": "n", "type": "number", "description": "The number of points to return"},
{"name": "instance", "type": "map", "description": "Base row to compute distances, as a map from field identifiers to values"}
],
"outputs": [
{"name": "rows", "type": "list", "description": "The list of the n closest rows to `instance`"}