@saurabhnanda
Last active April 2, 2018 11:09
Audit Logs in Haskell/Persistent

Context

I've built sizeable products (~100,000+ LOC) using Rails, both at Cleartrip and at my own startup, Vacation Labs. At Cleartrip, before settling on Rails, I tried very hard to make Common Lisp work for us, but it felt like a losing battle: instead of solving business problems we were writing reams of boilerplate code to make Lisp interact with the DB in a sensible manner. At Vacation Labs, I have a hunch that rewriting core parts of our app in Haskell will lead to safer code, fewer bugs, less dependency on testing, and eventually faster product evolution. However, history seems to be repeating itself when it comes to accessing a large, complicated DB schema (100+ tables) from Haskell (Persistent).

I'm currently grappling with audit logs - a core requirement for any app dealing with sensitive customer info (especially e-commerce or payments). Unfortunately, apart from Snaplet ActionLog, I could not find any easy and performant way to build audit logs in Haskell.

Here's my take on how this problem can be solved. I would really appreciate feedback from more experienced Haskell minds.

Creating audit logs using Persistent

  1. Problems with Snaplet ActionLog's approach - http://hackage.haskell.org/package/snaplet-actionlog-0.2.0.1/docs/Snap-Snaplet-ActionLog.html
  2. Take an approach similar to Rails' ActiveModel::Dirty - http://api.rubyonrails.org/classes/ActiveModel/Dirty.html
  3. Global query cache approach

Problems with Snaplet ActionLog's approach

Here's their approach in a nutshell: Don't call Persistent's methods directly. Wrap them up in log-aware functions like loggedInsert, loggedReplace, loggedUpdate, etc. The log-aware functions execute the original DB action, as well as update the audit-logs.

However, the biggest downside of this approach is that, to be able to compute the deltas required by the audit log, one has to fetch the original/current state of the row again. That doubles the number of reads, and it can get worse when dealing with multiple rows of the same table (I still don't know how Persistent avoids the N+1 SQL issue).
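To make the double read concrete, here is a toy simulation of the loggedUpdate pattern. Everything here - the `DB`/`Audit` types and `loggedUpdate` itself - is a stand-in of my own, not snaplet-actionlog's actual code; a real implementation would run against Persistent entities instead of plain maps.

```haskell
import qualified Data.Map.Strict as Map

-- Toy stand-ins for a table and an audit trail (not Persistent types)
type Row   = Map.Map String String            -- column -> value
type DB    = Map.Map Int Row                  -- row id -> row
type Audit = [(Int, String, String, String)]  -- (row id, column, old, new)

-- The snaplet-actionlog style wrapper: to compute the delta it must first
-- re-read the current row (the extra read the text complains about), then
-- apply the update and append one audit entry per changed column.
loggedUpdate :: Int -> Row -> (DB, Audit) -> (DB, Audit)
loggedUpdate rowId newRow (db, audit) =
  case Map.lookup rowId db of        -- <-- the duplicate read
    Nothing  -> (db, audit)
    Just old ->
      let deltas = [ (rowId, col, oldVal, newVal)
                   | (col, newVal) <- Map.toList newRow
                   , let oldVal = Map.findWithDefault "" col old
                   , oldVal /= newVal ]
      in (Map.insert rowId newRow db, audit ++ deltas)
```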

At least in PostgreSQL, there is possibly a way of avoiding the duplicate read by making the UPDATE statement itself return the pre-update version of the row (e.g. via a self-join in the FROM clause combined with RETURNING), but I'm not sure whether that's undefined behaviour or something we can depend on persisting across PostgreSQL versions.

Rails' ActiveModel::Dirty approach

Since ActiveRecord models are not plain hashes/dicts/records but first-class objects, all attributes must be accessed (read or written) through accessor methods. An (over)simplified way of maintaining a change-log within a Rails model (actually, this works with any plain ol' Ruby object as well):

class User
  def initialize
    @changes = {}
  end

  def name
    @name
  end

  def name=(n)
    @changes['name'] = [@name.dup, n] # old and new value of name
    @name = n
  end
end

However, Haskell has no native concept of accessors built into the language. The lens library can provide Getters & Setters to be used instead of raw access to a record's fields. If we try to replicate the Ruby/Rails approach, we can look at building something on top of lenses:

data User = User {
  name :: Text,
  email :: Text,
  passwordHash :: Text,
  postCount :: Int,
  createdAt :: UTCTime,
  updatedAt :: UTCTime,
  _changes :: ChangeLog
} deriving (Show, Eq)

-- This is probably an incorrect definition of the `ChangeLog` type, but I hope
-- it gets the idea across. A map of field-name to old-value
type ChangeLog = HashMap Text PersistValue 

$(makeAuditableLenses ''User)

All the setters created by makeAuditableLenses would also record the respective field's old value in the _changes map.
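For illustration, here is a minimal, hand-written sketch of what one such change-tracking setter might do. Everything in it - the reduced `User` shape, the `setName` helper, and the use of `Map Text Text` instead of `HashMap Text PersistValue` - is a simplification of my own; `makeAuditableLenses` would be expected to generate the equivalent per field.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Map.Strict as Map

-- Simplified stand-in for the ChangeLog above: field name -> old value
type ChangeLog = Map.Map Text Text

data User = User
  { _name    :: Text
  , _changes :: ChangeLog
  } deriving (Show, Eq)

-- A change-tracking setter: before overwriting the field, record its
-- original value in _changes - but only the first time, so the change-log
-- always maps a field to its value as of the start of the transaction.
setName :: Text -> User -> User
setName new u = u
  { _name    = new
  , _changes = Map.insertWith (\_newVal oldVal -> oldVal)
                              "name" (_name u) (_changes u)
  }
```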

Global query-cache approach

A related shortcoming of Persistent is the absence of any sort of query-cache. How is this related to audit logs? Here's a longish explanation.

Think about a request/response cycle in a typical DB-backed web-app:

  1. A request comes in
  2. DB transaction starts
  3. Various DB rows are loaded into the memory, which are mapped to various records/object-graphs.
  4. Some functions are called, which operate on the in-memory records/objects.
  5. Finally the new version of the records/objects are written back to the DB.
  6. DB transaction is committed (if there are any validation errors, the transaction is rolled-back instead)
  7. Execute any side-effects that don't need to be inside the DB transaction, eg. sending emails, notifying internal microservices, etc.
  8. The browser/client is sent back a response.

Now, in step 4, there is a chance that one ends up fetching the same DB row multiple times. This happens because one tries to write decoupled, low-level functions. Example:

-- A function that creates a reservation AND reduces the availability, which is
-- stored in a different table
createReservation :: Reservation -> SqlPersistT m ReservationId

Now, before calling this function, the availability may already have been fetched from the DB. Without a query-cache this results in multiple reads of the same data. One may argue that this is an anti-pattern - that the function should clearly declare its side-effects in its type signature, like so:

createReservation :: Availability -> Reservation -> SqlPersistT m (ReservationId, Availability)

I'm not sure if this is an anti-pattern. I haven't written a large project using the latter style yet, so I don't know how pragmatic/better it is. But, if we continue with this line of thought and assume that for every request/response cycle there is a (mutable?) map being maintained:

-- Pretty sure that this is NOT the way to define this type, but hopefully it gets the idea across
type QueryCache val = HashMap (Key val) (Entity val)

Then we simply replace all low-level DB functions with log-aware DB functions (like in the very first approach) that compute the deltas using the QueryCache instead of re-fetching the rows from the DB. Two birds with one stone.
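As a rough sketch of the read-through behaviour such a cache would need: the names below and the `IORef`-based mutable map are my own assumptions, the "DB" is simulated as a plain IO action, and String keys/values stand in for Persistent's `Key val` / `Entity val`.

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

-- Hypothetical per-request cache: (table name, row id) -> serialised row.
-- A real Persistent integration would key on `Key val` and store `Entity val`.
type Cache = IORef (Map.Map (String, Int) String)

-- Read-through lookup: consult the cache first, fall back to the (simulated)
-- DB action only on a miss, and remember the result for later. The cached
-- copy is also exactly the "old value" an audit-log delta needs.
getCached :: Cache -> (String, Int) -> IO String -> IO String
getCached ref key fetchFromDb = do
  cached <- Map.lookup key <$> readIORef ref
  case cached of
    Just row -> pure row                    -- cache hit: no DB read
    Nothing  -> do
      row <- fetchFromDb                    -- cache miss: one DB read
      modifyIORef' ref (Map.insert key row)
      pure row
```

A second lookup of the same key within the request never touches the DB, so a log-aware update can read the pre-update row from the cache for free.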

Your thoughts?

This is not supposed to be an anti-Haskell gripe, but an attempt to solve a real problem. I do not know enough about using Haskell to "program in the large" to be able to take a sensible direction in this regard. I'm looking for feedback and comments.
