library(caret)

# Class sizes in the example dataset (target assumed to be a 0/1 label)
len_pos <- nrow(example_dataset[example_dataset$target == 1, ])
len_neg <- nrow(example_dataset[example_dataset$target == 0, ])

# Train a model with repeated 10-fold cross-validation (custom_summary_function is defined elsewhere)
train_model <- function(training_data, labels, model_type, ...) {
  experiment_control <- trainControl(method = "repeatedcv",
                                     number = 10,
                                     repeats = 2,
                                     classProbs = TRUE,
                                     summaryFunction = custom_summary_function)
  # Extra arguments (e.g. tuneGrid, metric) are passed straight through to train()
  train(x = training_data,
        y = labels,
        method = model_type,
        trControl = experiment_control,
        ...)
}

import unittest
import os
from zipfile import ZipFile
from mock import MagicMock, patch, Mock, mock_open

# The functions that are tested:
def function_to_test_zipfile(example_arg):
    # Collect the name of every entry in the archive
    # (this loop body is a placeholder; the original snippet is truncated here)
    names = []
    with ZipFile(example_arg, 'r') as zip_in:
        for input_file in zip_in.infolist():
            names.append(input_file.filename)
    return names
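
The gist stops before the tests themselves. As a minimal sketch (not the original test code), the function above could be unit-tested by patching ZipFile so no real archive is touched; the test class name, the fake entry, and the expected values below are illustrative, and the sketch reuses the unittest and mock imports already shown.

class TestZipfileFunction(unittest.TestCase):
    @patch(f"{__name__}.ZipFile")  # patch ZipFile where the function under test looks it up
    def test_function_to_test_zipfile(self, mock_zipfile):
        # The context manager returned by ZipFile(...) should yield a fake archive
        fake_entry = MagicMock()
        fake_entry.filename = "inner_file.txt"
        fake_archive = MagicMock()
        fake_archive.infolist.return_value = [fake_entry]
        mock_zipfile.return_value.__enter__.return_value = fake_archive

        result = function_to_test_zipfile("archive.zip")

        mock_zipfile.assert_called_once_with("archive.zip", "r")
        fake_archive.infolist.assert_called_once_with()
        self.assertEqual(result, ["inner_file.txt"])

if __name__ == "__main__":
    unittest.main()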

-- First create matches using a UDF; here I am using a combination of Jaro-Winkler and (a normalized version of) Levenshtein.
--
-- Input:  cleaned_table: a table with "cleaned" names
-- Output: tmp_groups: a table with uid - group_id tuples. Each group_id contains all uid's that belong to names that match.
DROP TABLE IF EXISTS #matches;

-- Pair up every two distinct names whose similarity score clears a threshold
-- (name_similarity is a placeholder name for the Jaro-Winkler / Levenshtein UDF; 0.9 is an illustrative threshold)
SELECT a.clean_Name,
       a.uid,
       b.clean_Name AS clean_name_2,
       b.uid        AS uid_2
INTO   #matches
FROM   cleaned_table a
JOIN   cleaned_table b
  ON   a.uid < b.uid
WHERE  dbo.name_similarity(a.clean_Name, b.clean_Name) >= 0.9;
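
The comment above describes the score as a blend of Jaro-Winkler and a normalized Levenshtein distance. As a rough illustration of that idea (not the original UDF), here is a small Python sketch; it assumes the third-party jellyfish package (function names as in recent releases), and the function name, 50/50 weighting, and example strings are made up for the example.

import jellyfish

def combined_similarity(name_a, name_b, jw_weight=0.5):
    """Blend Jaro-Winkler similarity with a length-normalized Levenshtein similarity, both in [0, 1]."""
    jw = jellyfish.jaro_winkler_similarity(name_a, name_b)
    # Normalize the edit distance by the longer string's length, then flip it into a similarity
    max_len = max(len(name_a), len(name_b)) or 1
    lev = 1.0 - jellyfish.levenshtein_distance(name_a, name_b) / max_len
    return jw_weight * jw + (1.0 - jw_weight) * lev

# Near-identical names score close to 1, unrelated names much lower
print(combined_similarity("Jonathan Smith", "Jonathon Smyth"))
print(combined_similarity("Jonathan Smith", "Acme Holdings BV"))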

import findspark
findspark.init("[spark install location]")  # point findspark at the local Spark installation

import pyspark
import string
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils
from pyspark.sql.types import *
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, Tokenizer, RegexTokenizer, StopWordsRemover
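
The imports above set up a text feature-extraction pipeline. As a minimal sketch of how those pieces fit together (the column names and the tiny two-row DataFrame are illustrative, not from the original code), one could tokenize, drop stop words, and fit a CountVectorizer like this:

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Tiny illustrative DataFrame with an id and a free-text column
df = sqlContext.createDataFrame(
    [(0, "The quick brown fox"), (1, "jumps over the lazy dog")],
    ["id", "text"])

# Split on non-word characters (RegexTokenizer also lowercases by default)
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
tokens = tokenizer.transform(df)

# Remove common English stop words
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
filtered = remover.transform(tokens)

# Learn a vocabulary and turn each row into a sparse term-count vector
cv = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=1000, minDF=1.0)
cv_model = cv.fit(filtered)
cv_model.transform(filtered).select("id", "features").show(truncate=False)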