import pandas as pd

def splitDataFrameList(df, target_column, separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_column, separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()  # duplicate the other columns for each split element
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
    return pd.DataFrame(new_rows)
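A note on the design: splitListToRows appends plain dicts to a shared accumulator instead of concatenating DataFrames inside the loop, so the whole split is a single pass over the rows followed by one DataFrame construction at the end, rather than repeated copies of a growing frame.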
import scala.collection.JavaConverters._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row

def identityMatrix(n: Int): Array[Array[String]] = Array.tabulate(n, n)((x, y) => if (x == y) "1" else "0")

def encodeStringOneHot(table: org.apache.spark.sql.DataFrame, column: String) = {
  // Accepts the dataframe and the target column name. Returns a new dataframe in which
  // the target column has been replaced with a one-hot/dummy encoding.
  table.registerTempTable("temp")
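  // A sketch of how the encoding might continue from the temp table; the
  // distinct-value query, schema construction, and join below are assumptions,
  // not recovered source code.
  val distinct_values = sqlContext.sql(s"SELECT DISTINCT $column FROM temp").collect().map(_.getString(0))
  val dummies = identityMatrix(distinct_values.length)
  // One StringType indicator column per distinct value, keyed by the original column
  val schema = StructType(StructField(column, StringType) +: distinct_values.map(v => StructField(s"${column}_$v", StringType)))
  val rows = distinct_values.zip(dummies).map { case (v, flags) => Row.fromSeq(v +: flags.toSeq) }
  val dummy_df = sqlContext.createDataFrame(sc.parallelize(rows), schema)
  // Join the indicators back on the original column, then drop it per the docstring
  table.join(dummy_df, column).drop(column)
}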
// For each (country, elapsed-months) group, build a map from CAMEO code to its count
val cameo_maps = event_data_ag1.rdd
  .groupBy(x => (x.getAs[String]("Country"), x.getAs[Int]("ElapsedMonths")))
  .map { case (group_features, codes) => group_features -> codes
    .map { code => code.getAs[Int]("CAMEO Code") -> code.getAs[Long]("count") }
    .toMap
  }
// The full, sorted set of codes seen anywhere, broadcast so every group aligns identically
val cameos = sc.broadcast(cameo_maps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
// Fixed-length count arrays in the shared code order; codes absent from a group count as 0
val cameo_arrays = cameo_maps.map {
  case ((country, total_months), cameo_map) => (country, total_months) -> cameos.value.map(cameo_map.getOrElse(_, 0L))
}
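With every group's counts aligned to the same broadcast code ordering, each array is a fixed-width feature vector. A minimal sketch of one possible next step, assuming MLlib vectors are the target (this step is not in the original snippet):

import org.apache.spark.mllib.linalg.Vectors
// One dense vector per (country, month) key, in the shared CAMEO-code order
val cameo_vectors = cameo_arrays.mapValues(counts => Vectors.dense(counts.map(_.toDouble)))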
import org.apache.spark.sql.DataFrame

def groupOutcomeFractions(df: DataFrame, outcome: String, outer_group_criteria: Seq[String]): DataFrame = {
  df.registerTempTable("df")
  val count_variable: String = outer_group_criteria.head
  val inner_group_criteria = outer_group_criteria :+ outcome
  // Row counts per outer group, then per (outer group, outcome) subgroup
  val outer_group_query = "SELECT " + outer_group_criteria.mkString(" , ") + s", COUNT($count_variable) AS outer_count FROM df GROUP BY " + outer_group_criteria.mkString(" , ")
  val outer_count = sqlContext.sql(outer_group_query)
  val inner_count_query = "SELECT " + inner_group_criteria.mkString(" , ") + s", COUNT($count_variable) AS inner_count FROM df GROUP BY " + inner_group_criteria.mkString(" , ")
  val inner_count = sqlContext.sql(inner_count_query)
  val combined_counts = inner_count.join(outer_count, outer_group_criteria)
  // Assumed completion of the truncated tail: each subgroup's share of its outer group
  combined_counts.withColumn("fraction", combined_counts("inner_count") / combined_counts("outer_count"))
}
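Hypothetical usage, with the DataFrame and column names invented for illustration:

val outcome_shares = groupOutcomeFractions(events_df, "Outcome", Seq("Country", "Year"))
outcome_shares.orderBy("Country", "Year").show()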
def groupCountFractionals(dataframe, target, outer):
    '''
    dataframe: a pandas dataframe
    target: a string corresponding to the column of interest in the dataframe
    outer: a list of the columns by which the counts should be conditioned
    Returns the fraction of target_criteria_group / outer_criteria_group counts.
    Be mindful to take group sizes (Outer Count) into consideration.
    As outer count gets smaller, the fraction value becomes increasingly sensitive to individual observations.
    '''
    # Body reconstructed from the docstring: count each (outer, target) subgroup, count each outer group, divide
    inner_counts = dataframe.groupby(outer + [target]).size().rename('Inner Count').reset_index()
    outer_counts = dataframe.groupby(outer).size().rename('Outer Count').reset_index()
    merged = inner_counts.merge(outer_counts, on=outer)
    merged['Fraction'] = merged['Inner Count'] / merged['Outer Count']
    return merged
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{callUDF, struct}

// Returns the comma-separated indices of the null fields in a row, or "-1" if none are null
def findNull(row: Row): String = {
  if (row.anyNull) {
    val indices = (0 to row.length - 1).toArray.filter(i => row.isNullAt(i))
    indices.mkString(",")
  }
  else "-1"
}
sqlContext.udf.register("findNull", findNull _)
// Flag every row with the indices of its missing fields, packing all columns into a struct
df = df.withColumn("MissingGroups", callUDF("findNull", struct(df.columns.map(df(_)): _*)))
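To see which missingness patterns actually occur and how often, one can aggregate on the new column (illustrative follow-up, not part of the original snippet):

df.groupBy("MissingGroups").count().show()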
// Keep the feature names whose paired source column is not in the missings set, plus the label column
val retained_features: List[String] = group_mean_columns.filter(x => !missings.contains(x._2)).map(_._1).toList :+ "LogAdjustedDemand"
// Select these columns in the training dataset; select(head, tail: _*) matches the select(String, String*) overload
val model_training_data = training_data_all.select(retained_features.head, retained_features.tail: _*)
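For reference, the same selection can be written against select's Column varargs overload, which avoids the head/tail split (name below is illustrative):

val model_training_data2 = training_data_all.select(retained_features.map(training_data_all(_)): _*)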