Joining CSV Tables in Haskell

This article describes a technique for joining (in an SQL-style) lists of haskell data structures.

Maybe not the fastest, maybe not the smartest, but it works.

Why I like it

The technique of always doing a full outer join, and then reporting on the mismatched sides of the join, is very important when you are doing "data science". We need to track where our data goes missing, and joins are a common place where that happens. these+discrimination makes this easy.

Haskell tech involved

cassava - for reading the CSV files (although you can use any subsitute, as long as you get a list of rows back.)
discrimination - for the join functions
these - for handling the result of the join
GHC >= 8.0, for the DuplicateRecordFields extension. Not strictly necessary but useful.

Read your CSV files

I use cassava, and lean heavily on the generic deriving support. Cassava's generic implementation will map column names in the data to field names in a record. You can find out more here.

import GHC.Generics

data Record1 = Record1 { some_key :: Text, some_data_1 :: Int } deriving (Eq, Show, Generic)
data Record2 = Record1 { some_key :: Text, some_data_2 :: Float } deriving (Eq, Show, Generic)

instance DefaultOrdered Record1 where
instance DefaultOrdered Record2 where

instance FromNamedRecord Record1 where
instance FromNamedRecord Record2 where

instance ToNamedRecord Record1 where
instance ToNamedRecord Record2 where

Our key is some_key, a column of the same name in both csv files.

ASIDE: Resolving collisions in records

Because we are using the Generic capabilities of cassava, we need our fields to match the columns in our CSV files. Sometimes, 2 csv files might have the same column, and we end up with a collision.

We are defining our records in the same Haskell module which means our code will fail to compile - duplicate record fields are not usually allowed within a Haskell module.

The solution is:

Turn on LANGUAGE DuplicateRecordFields. This allows you to use the field name in multiple records.
Write explicit accessor for each of your duplicate fields, with unique names. This is not strictly necessary, but I've found it's often difficult to convince the compiler which Record you are dealing with, and by using a unique accessor instead, we avoid that problem.

record1_some_key :: Record1 -> Text
record1_some_key = some_key

record2_some_key :: Record2 -> Text
record2_some_key = some_key

Joining with Discrimination

Ed Kmett's discrimination library provides functions for linear time sorting and grouping. It also includes linear time joins.

The various joining functions are described here. The one we are most interested in is a full outer join. This function is from discrimination:

-- | /O(n)/. Perform a full outer join with operations defined one row at a time.
--
-- The results are grouped by the discriminator.
--
-- This takes operation time linear in both the input and result sets.
outer
  :: Discriminating f
  => f d           -- ^ the discriminator to use
  -> (a -> b -> c) -- ^ how to join two rows
  -> (a -> c)      -- ^ row present on the left, missing on the right
  -> (b -> c)      -- ^ row present on the right, missing on the left
  -> (a -> d)      -- ^ selector for the left table
  -> (b -> d)      -- ^ selector for the right table 
  -> [a]           -- ^ left table
  -> [b]           -- ^ right table
  -> [[c]]

We can now start assembling the pieces we need to do a join:

We need a discriminator. This is the operation that sorts our keys. If your key is one of the integral types defined here, you can using grouping for this argument.

For text and bytestring argument we need to define a grouper ourselves. It's easy enough to use the Grouping a => Grouping [a] instance to build one up from Grouping Char by unpacking our text or bytestring.

ourKeyGrouper :: Grouping ByteString -- same implementation for Text
ourKeyGrouper = contramap unpack grouping

We need a way of wrapping our matched rows.
We need a way of wrapping our rows present only on the left.
We need a way of wrapping our rows present only on the right.

These 3 arguments are handled by These from Data.These in the these package. The definition looks like this:

data These a b = This a | That b | These a b

 It extends `Either` with a 3rd constructor that handles a pairing of 2 values.
 We use the `This` and `That` constructors for the rows missing a partner, and `These` for the matched results.
 Read the [these](http://hackage.haskell.org/package/these-0.7.4/docs/Data-These.html) package haddocks for more information.

A key selector for the left table.
A key selector for the right table.

These will be our disambiguated colliding some_key field accessors.

Left table.
Right table.

Pretty self explanatory.

Putting it all together

I define a convenient helper that applies the These constructors in the right places, and then unpacks the results into 3 lists that are usually more convenient to handle - the join, and then the no-match lefts and no-match rights.

joiner :: Discriminating f
  => f d      -- "Discriminator", describes how to sort the join key
  -> (a -> d) -- Key function for first collection
  -> (b -> d) -- Key function for second collection
  -> [a]      -- First collection
  -> [b]      -- Second collection
  -> ([(a, b)], ([a], [b])) -- The join, and the missed values
joiner grouper f g a b =
  partitionThese . join . outer grouper These This That f g a $ b

groupByteString :: Group Text
groupByteString = contramap unpack grouping

We can now join our records:

import Data.Csv
import qualified Data.ByteString.Lazy as LBS
import All.The.Stuff.From.Above

main :: IO ()
main =
  do Right (_, r1s) <- decodeByName <$> LBS.readFile "record1s.csv"
     Right (_, r2s) <- decodeByName <$> LBS.readFile "record2s.csv"
     
     let 
       goodJoin :: [(Record1, Record2)]
       r1Only :: [Record1]
       r2Only :: [Record2]
       (goodJoin, (r1Only, r2Only)) = joiner ourKeyGrouper record1_some_key record2_some_key r1s r2s
       
     print (length goodJoin, length r1Only, length r2Only)

nkpart/Joining CSV Tables in Haskell.md