I've built sizeable products (~100,000+ LOC) using Rails at Cleartrip, as well as at my own startup, Vacation Labs. Even at Cleartrip, before settling on Rails, I tried very hard to make Common Lisp work for us. But it felt like a losing battle: instead of solving business problems we were writing reams of boilerplate code to make Lisp interact with the DB in a sensible manner. At Vacation Labs, I have a hunch that rewriting core parts of our app in Haskell will lead to safer code, fewer bugs, less dependency on testing, and eventually faster product evolution. However, history seems to be repeating itself when it comes to accessing a large, complicated DB schema (100+ tables) using Haskell (Persistent).
I'm currently grappling with audit logs - a core requirement for any app dealing with sensitive customer info (especially e-commerce or payments). Unfortunately, apart from Snaplet ActionLog, I could not find any easy+performant way to build audit logs in Haskell.
Here's my take on how this problem can be solved. I would really appreciate feedback from more experienced Haskell minds.
- Problems with Snaplet ActionLog's approach - http://hackage.haskell.org/package/snaplet-actionlog-0.2.0.1/docs/Snap-Snaplet-ActionLog.html
- Take an approach similar to Rails' ActiveModel::Dirty - http://api.rubyonrails.org/classes/ActiveModel/Dirty.html
- Global query cache approach
Here's their approach in a nutshell: don't call Persistent's methods directly; wrap them up in log-aware functions like loggedInsert, loggedReplace, loggedUpdate, etc. The log-aware functions execute the original DB action, as well as update the audit-logs.
However, the biggest downside of this approach is that, to be able to compute the deltas required by the audit log, one has to fetch the original/current state of the row again. That's twice the number of reads, and it can get worse when dealing with multiple rows of the same table (I still don't know how Persistent avoids the N+1 SQL problem).
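To make that concrete, here's a rough sketch of how I imagine such a wrapper working (my own names and types, not snaplet-actionlog's actual implementation). Note the extra get whose only purpose is to compute the delta:

{-# LANGUAGE ConstraintKinds  #-}
{-# LANGUAGE FlexibleContexts #-}

import Control.Monad.IO.Class (MonadIO)
import Database.Persist
import Database.Persist.Sql (SqlBackend, SqlPersistT)

-- A rough sketch of a log-aware replace. The audit-entry writer is passed in
-- as a hypothetical callback so the sketch stays self-contained; in practice
-- it would insert a row into an audit-log table.
loggedReplace
  :: (MonadIO m, PersistRecordBackend record SqlBackend)
  => (record -> record -> SqlPersistT m ())  -- writes the old/new delta to the audit log
  -> Key record
  -> record
  -> SqlPersistT m ()
loggedReplace writeAuditEntry key newVal = do
  mOld <- get key     -- the extra read, needed only to compute the delta
  replace key newVal  -- the write we actually wanted to perform
  case mOld of
    Just old -> writeAuditEntry old newVal
    Nothing  -> pure ()  -- row is gone; nothing to diff against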
At least in PostgreSQL, there is possibly a way of avoiding the duplicate read by making the UPDATE call itself return the pre-update version of the row, but I'm not sure whether that's undefined behaviour, or whether we can depend on it persisting across PostgreSQL versions.
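The trick I have in mind is the self-join UPDATE ... RETURNING pattern, where the joined sub-select still sees the pre-update row. A sketch of it through Persistent's rawSql (table and column names are made up, and whether this visibility is guaranteed is exactly the part I'm unsure about):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (MonadIO)
import Data.Text (Text)
import Database.Persist (toPersistValue)
import Database.Persist.Sql (Single (..), SqlPersistT, rawSql)

-- Hypothetical example: rename a user and get back (old name, new name)
-- in a single round trip, without a separate SELECT.
renameUserReturningOld :: MonadIO m => Int -> Text -> SqlPersistT m [(Single Text, Single Text)]
renameUserReturningOld uid newName = rawSql
  "UPDATE users u SET name = ? \
  \FROM (SELECT id, name FROM users WHERE id = ? FOR UPDATE) old \
  \WHERE u.id = old.id \
  \RETURNING old.name, u.name"
  [toPersistValue newName, toPersistValue uid]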
Since ActiveRecord models are not plain hashes/dicts/records but first-class objects, they have accessor methods through which all attributes must be read or written. An (over)simplified way of maintaining an audit-log within a Rails model (actually, this will work with any plain ol' Ruby object as well):
class User
  def initialize
    @changes = {}
  end

  def name
    @name
  end

  def name=(n)
    @changes['name'] = [@name.dup, n] # old and new value of name
    @name = n
  end
end
However, in Haskell, there is no native concept of accessors built into the language. The lens library can provide getters and setters to be used instead of raw access to a record's fields. If we try to replicate the Ruby/Rails approach, we can look at building something on top of lenses:
data User = User {
    name :: Text,
    email :: Text,
    passwordHash :: Text,
    postCount :: Int,
    createdAt :: UTCTime,
    updatedAt :: UTCTime,
    _changes :: ChangeLog
  } deriving (Show, Eq)
-- This is probably an incorrect definition of the `ChangeLog` type, but I hope
-- it gets the idea across. A map of field-name to old-value
type ChangeLog = HashMap Text PersistValue
$(makeAuditableLenses ''User)
All the setters created by makeAuditableLenses will also record the respective field's old value in the _changes field.
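makeAuditableLenses is Template Haskell I'm imagining, not an existing library, so here is roughly what I'd expect one of the generated setters to expand to, building on the User/ChangeLog definitions above and assuming the lens library:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RankNTypes        #-}

import Control.Lens (Lens', lens)
import qualified Data.HashMap.Strict as HM
import Data.Text (Text)
import Database.Persist (toPersistValue)

-- A hand-written version of a change-tracking lens for the `name` field.
-- The setter records the field's old value in _changes (keeping the earliest
-- recorded value if the field is set more than once) before overwriting it.
nameL :: Lens' User Text
nameL = lens name $ \u newName ->
  u { name = newName
    , _changes = HM.insertWith (\_new old -> old)  -- keep the original old value
                   "name" (toPersistValue (name u)) (_changes u)
    }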
A related shortcoming of Persistent is the absence of any sort of query-cache. How is this related to audit logs? Here's a longish explanation.
Think about a request/response cycle in a typical DB-backed web-app:
- A request comes in
- DB transaction starts
- Various DB rows are loaded into the memory, which are mapped to various records/object-graphs.
- Some functions are called, which operate on the in-memory records/objects.
- Finally, the new versions of the records/objects are written back to the DB.
- DB transaction is committed (if there are any validation errors, the transaction is rolled-back instead)
- Execute any side-effects that don't need to be inside the DB transaction, eg. sending emails, notifying internal microservices, etc.
- The browser/client is sent back a response.
Now, in step 4, there is a chance that one ends up fetching the same DB row multiple times. This happens because one tries to write decoupled, low-level functions. Example:
-- A function that creates a reservation AND reduces the availability, which is
-- stored in a different table
createReservation :: Reservation -> SqlPersistT m ReservationId
Now, before calling this function, the availability may have already been fetched from the DB. Without a query-cache this results in multiple reads of the same data. One may argue that this is an anti-pattern, and that the function should clearly declare its side-effects in its type signature, like so:
createReservation :: Availability -> Reservation -> SqlPersistT m (ReservationId, Availability)
I'm not sure if this is an anti-pattern. I haven't written a large project using the latter style yet, so I don't know how pragmatic/better it is. But, if we continue with this line of thought and assume that for every request/response cycle there is a (mutable?) map being maintained:
-- Pretty sure that this is NOT the way to define this type, but hopefully it gets the idea across
type QueryCache = HashMap (Key val) (Entity val)
Then we simply replace all low-level DB functions with log-aware DB functions (like in the very first approach) that compute the deltas using the QueryCache instead of re-fetching rows from the DB. Kill two birds with one stone.
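To sketch what I mean (this is my own guess at a design, not an existing Persistent feature), a per-request cache plus cache-aware wrappers might look roughly like this; the audit-entry writer is again a hypothetical stand-in:

{-# LANGUAGE ConstraintKinds     #-}
{-# LANGUAGE FlexibleContexts    #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.IORef (IORef, modifyIORef', readIORef)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Proxy (Proxy (..))
import Data.Typeable (TypeRep, Typeable, typeRep)
import Database.Persist
import Database.Persist.Sql (SqlBackend, SqlPersistT)

-- One possible shape for a per-request cache: rows keyed by their record type
-- plus serialised primary key, stored as Dynamic so one map can hold rows
-- from many different tables.
type QueryCache = IORef (Map (TypeRep, [PersistValue]) Dynamic)

cacheKey :: forall record. (Typeable record, PersistEntity record)
         => Key record -> (TypeRep, [PersistValue])
cacheKey key = (typeRep (Proxy :: Proxy record), keyToValues key)

-- Look a row up in the cache, falling back to the DB and filling the cache.
cachedGet
  :: (MonadIO m, Typeable record, PersistRecordBackend record SqlBackend)
  => QueryCache -> Key record -> SqlPersistT m (Maybe record)
cachedGet cacheRef key = do
  cache <- liftIO (readIORef cacheRef)
  case Map.lookup (cacheKey key) cache >>= fromDynamic of
    Just row -> pure (Just row)  -- cache hit: no SQL issued
    Nothing  -> do
      mRow <- get key            -- cache miss: one read
      case mRow of
        Just row -> liftIO $ modifyIORef' cacheRef (Map.insert (cacheKey key) (toDyn row))
        Nothing  -> pure ()
      pure mRow

-- The log-aware replace from earlier, now computing its delta from the cache
-- instead of issuing a second SELECT; the audit writer is again hypothetical.
cachedLoggedReplace
  :: (MonadIO m, Typeable record, PersistRecordBackend record SqlBackend)
  => QueryCache
  -> (Maybe record -> record -> SqlPersistT m ())
  -> Key record -> record -> SqlPersistT m ()
cachedLoggedReplace cacheRef writeAuditEntry key newVal = do
  mOld <- cachedGet cacheRef key  -- old value, usually served from the cache
  replace key newVal
  liftIO $ modifyIORef' cacheRef (Map.insert (cacheKey key) (toDyn newVal))
  writeAuditEntry mOld newVal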
This is not supposed to be an anti-Haskell gripe, but an attempt to solve a real problem. I do not know enough about using Haskell to "program in the large" to be able to take a sensible direction here. I'm looking for feedback and comments.