This article describes a technique for joining (in an SQL-style) lists of haskell data structures.
Maybe not the fastest, maybe not the smartest, but it works.
- The technique of always doing a full outer join, and then reporting on the mismatched sides of the join, is very important when you are doing "data science". We need to track where our data goes missing, and joins are a common place where that happens.
these
+discrimination
makes this easy.
cassava
- for reading the CSV files (although you can use any subsitute, as long as you get a list of rows back.)discrimination
- for the join functionsthese
- for handling the result of the join- GHC >= 8.0, for the
DuplicateRecordFields
extension. Not strictly necessary but useful.
I use cassava, and lean heavily on the generic deriving support. Cassava's generic implementation will map column names in the data to field names in a record. You can find out more here.
import GHC.Generics
data Record1 = Record1 { some_key :: Text, some_data_1 :: Int } deriving (Eq, Show, Generic)
data Record2 = Record1 { some_key :: Text, some_data_2 :: Float } deriving (Eq, Show, Generic)
instance DefaultOrdered Record1 where
instance DefaultOrdered Record2 where
instance FromNamedRecord Record1 where
instance FromNamedRecord Record2 where
instance ToNamedRecord Record1 where
instance ToNamedRecord Record2 where
Our key is some_key
, a column of the same name in both csv files.
Because we are using the Generic
capabilities of cassava
, we need our fields to match the columns in our CSV files. Sometimes, 2 csv files might have the same column, and we end up with a collision.
We are defining our records in the same Haskell module which means our code will fail to compile - duplicate record fields are not usually allowed within a Haskell module.
The solution is:
- Turn on
LANGUAGE DuplicateRecordFields
. This allows you to use the field name in multiple records. - Write explicit accessor for each of your duplicate fields, with unique names. This is not strictly necessary, but I've found it's often difficult to convince the compiler which Record you are dealing with, and by using a unique accessor instead, we avoid that problem.
record1_some_key :: Record1 -> Text
record1_some_key = some_key
record2_some_key :: Record2 -> Text
record2_some_key = some_key
Ed Kmett's discrimination
library provides functions for linear time sorting and grouping. It also includes linear time joins.
The various joining functions are described here. The one we are most interested in is a full outer join. This function is from discrimination:
-- | /O(n)/. Perform a full outer join with operations defined one row at a time.
--
-- The results are grouped by the discriminator.
--
-- This takes operation time linear in both the input and result sets.
outer
:: Discriminating f
=> f d -- ^ the discriminator to use
-> (a -> b -> c) -- ^ how to join two rows
-> (a -> c) -- ^ row present on the left, missing on the right
-> (b -> c) -- ^ row present on the right, missing on the left
-> (a -> d) -- ^ selector for the left table
-> (b -> d) -- ^ selector for the right table
-> [a] -- ^ left table
-> [b] -- ^ right table
-> [[c]]
We can now start assembling the pieces we need to do a join:
-
We need a discriminator. This is the operation that sorts our keys. If your key is one of the integral types defined here, you can using
grouping
for this argument.For text and bytestring argument we need to define a grouper ourselves. It's easy enough to use the
Grouping a => Grouping [a]
instance to build one up fromGrouping Char
by unpacking our text or bytestring.
ourKeyGrouper :: Grouping ByteString -- same implementation for Text
ourKeyGrouper = contramap unpack grouping
-
We need a way of wrapping our matched rows.
-
We need a way of wrapping our rows present only on the left.
-
We need a way of wrapping our rows present only on the right.
These 3 arguments are handled by
These
fromData.These
in thethese
package. The definition looks like this:
data These a b = This a | That b | These a b
It extends `Either` with a 3rd constructor that handles a pairing of 2 values.
We use the `This` and `That` constructors for the rows missing a partner, and `These` for the matched results.
Read the [these](http://hackage.haskell.org/package/these-0.7.4/docs/Data-These.html) package haddocks for more information.
- A key selector for the left table.
- A key selector for the right table.
These will be our disambiguated colliding some_key
field accessors.
- Left table.
- Right table.
Pretty self explanatory.
I define a convenient helper that applies the These
constructors in the right places, and then unpacks the results into 3 lists that are usually more convenient to handle - the join, and then the no-match lefts and no-match rights.
joiner :: Discriminating f
=> f d -- "Discriminator", describes how to sort the join key
-> (a -> d) -- Key function for first collection
-> (b -> d) -- Key function for second collection
-> [a] -- First collection
-> [b] -- Second collection
-> ([(a, b)], ([a], [b])) -- The join, and the missed values
joiner grouper f g a b =
partitionThese . join . outer grouper These This That f g a $ b
groupByteString :: Group Text
groupByteString = contramap unpack grouping
We can now join our records:
import Data.Csv
import qualified Data.ByteString.Lazy as LBS
import All.The.Stuff.From.Above
main :: IO ()
main =
do Right (_, r1s) <- decodeByName <$> LBS.readFile "record1s.csv"
Right (_, r2s) <- decodeByName <$> LBS.readFile "record2s.csv"
let
goodJoin :: [(Record1, Record2)]
r1Only :: [Record1]
r2Only :: [Record2]
(goodJoin, (r1Only, r2Only)) = joiner ourKeyGrouper record1_some_key record2_some_key r1s r2s
print (length goodJoin, length r1Only, length r2Only)