
@jmgimeno
Forked from philandstuff/euroclojure2014.org
Created June 26, 2014 17:26

EuroClojure 2014, Krakow

Fergal Byrne, Clortex: Machine Intelligence based on Jeff Hawkins’ HTM Theory

  • @fergbyrne
  • HTM = Hierarchical Temporal Memory

big data

  • big data is like teenage sex
    • no one knows how to do it
    • everyone thinks everyone else is doing it
    • so everyone claims to be doing it
    • (Dan Ariely)

machine learning is important

  • people don’t trust other people
    • they have their own agendas
  • so they place too much trust in machines

asimov’s take

  • we gain knowledge faster than we gain wisdom
    • applies to human knowledge
    • applies to data: gathering data is easy, drawing conclusions is not

a problem in neuroscience

  • rate of papers published is growing exponentially
  • 2013: 1 every 32 minutes
  • 2014 so far: 1 every 17 minutes

can AI learn from neuroscience?

Jeff Hawkins’ goals in HTM

  • Study the neocortex and establish its principles
  • open sourced NuPIC in 2013

neocortex

  • the wrinkly part at the surface of the brain
    • grey matter: processing
    • white matter: wiring
  • about 2mm thick, 10cm^2 in area
  • 30-50MM neurons
  • 1G connections
  • hierarchical
  • uniform
    • ie all looks physically the same
    • all regions have the same algorithm

6 key principles

on-line learning from streaming data

  • up to 10 million senses feed the brain
  • we don’t (can’t) store this data
  • we build models from live data
  • models constantly updated

hierarchy of regions

  • sensory data enters at the bottom
  • models are built in every region
  • things change more slowly as you go up
  • hierarchy enables sequences of sequences
    • seq of waves
    • seq of phonemes
    • seq of words
    • seq of sentences
  • hierarchy works upwards and downwards

sequence memory

  • all sensory data involves time
  • sequence memory allows predictions
  • structure in data elaborated over time
  • sequences can be c

sparse distributed representations

  • in each region, many neurons, few active
  • SDRs represent spatial patterns
  • fault-tolerant, semantic ops, high-capacity
  • key to understanding & building intelligent systems
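
The semantic-overlap property can be sketched in a few lines of Clojure (a toy model assuming set-of-indices SDRs, not Clortex’s actual representation):

```clojure
;; Toy SDR sketch: an SDR is a small set of active bit indices out of
;; a large space (e.g. 40 active of 2048). Overlap of active bits is
;; the semantic-similarity operation; disjoint SDRs are unrelated.
(require '[clojure.set :as set])

(defn overlap
  "Number of shared active bits between two SDRs."
  [a b]
  (count (set/intersection a b)))

(def sdr-cat #{3 17 42 99 512})
(def sdr-dog #{3 17 42 77 901})      ; shares 3 bits with sdr-cat
(def sdr-car #{5 66 700 1800 2000})  ; shares none

(overlap sdr-cat sdr-dog) ;=> 3  (semantically close)
(overlap sdr-cat sdr-car) ;=> 0  (unrelated)
```

Because only a tiny fraction of bits are active, even losing a few bits barely changes the overlap, which is where the fault tolerance comes from.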

all regions are both sensory and motor

  • behaviour provides context for sensory data
  • structure in model navigated via behaviour

attention

  • use attention to manage the neocortex
  • planning and previsualisation
  • whole subhierarchies can be switched on and off

layers of neocortex

  • from molecular upwards
  • around 5 or 6

neurons

  • distal dendrites detect coincidence of incoming activity from neighbouring cells
  • you don’t just see what you’re seeing now, you predict what you’re going to see next
  • (reality is much more complicated, but this algorithm is sufficient to explain a lot)

clortex

background: numenta’s nupic

  • in dev since 2005
  • partially implements HTM/CLA
  • python/c++
  • open source

strengths

  • skilled dev team
  • eat their own dog food (grok uses nupic)
  • operates on subset of HTM/CLA principles
  • tunable using swarming on your data
  • works well on streaming scalar data (eg machine-generated)
  • great community – http://numenta.org

limitations

  • codebase has evolved as theory has developed
  • difficult/scary to rewrite for flexibility
  • OO with large, coupled, classes (~1500 LoC per class)
  • need to swarm to find parameters, no real-time control
  • not easy to extend beyond streaming scalar use case

clortex requirements

  • directly analogous to HTM/CLA theory
  • transparently understandable source code
    • a neuroscientist should be able to read & review code
  • directly observable data
  • sufficiently performant
  • useful metrics
  • appropriate platform
    • portability
    • scalability

architectural simplicity

  • first role: be useful!
  • best software is that which is not needed at all
  • human comprehension is king
    • if people can’t understand your code, your code is not finished
    • unit tests are not sufficient in themselves
  • machine sympathy is queen
  • software is a process of R&D
  • software development is challenging & intellectual
    • more science than engineering
      • engineering: you have a good model already, you just have to plug in the particular parameters
      • science: there are a bunch of unknowns which you have to learn & understand

#1: Just use data!

  • maps, vectors, sets
  • all done in a one-page datomic schema

#2: Clojure & its ecosystem

  • clojure data not domain objects

#3: russ miles’ life preserver

  • everything either “core” or “integration”
  • core: a datomic database for the neocortex
  • core: each “patch” of neurons is a graph
  • integration: algorithms, encoders, classifiers, SDRs

key clj libs & tools

  • datomic (+adi)
  • quil/processing
  • incanter
  • lein-midje-doc for literate documentation
  • hoplon-reveal-js for presentations
  • lighttable

review

  • Big Data isn’t just a Machine Intelligence problem
  • HTM is exciting

links

Logan Campbell, Clojure at a Post Office

history:

  • was at clojure user group
  • a guy turns up and says he’s hiring a team of clojure developers
  • he was at Australia Post
    • a million lines of Java worked on by a team in India
    • wanted to bring it back in-house

project: digital mailbox

  • big companies spend a lot of money sending out bills & junk mail
  • product to seamlessly replace that workflow
  • switch from physical mail to cheaper model
  • consumer can sign up to receive water bill online
  • I was brought on as the “clojure expert”
    • (I’d been playing with it for a couple of years)
  • drama:
    • the people they could hire:
      • really experienced java devs
      • keen on FP
    • they said as they were hiring “you might be doing clojure or you might be doing scala”
    • first few people were scala fans
    • scala v clojure battles
      • “we need static typing”
      • “we need OO for domain modelling”
      • “clojure is slow” (?)
      • “what framework do you use?”
  • “we need static typing? okay, we’ll use core.typed”
  • domain modelling:
    • when people are used to domain modelling in OO, telling them to just use maps feels like a cop-out
    • records + protocols kind of feel like classes
    • wasn’t until I showed them code I’d written and compared it with their code that they realized you can just use maps
  • online scala course
    • we did it as a team
    • I also did the exercises in clojure
    • did one exercise three different ways in clojure
      • conditional
      • match
      • stream processing
    • showed them my solutions
      • they already understood the problems because they’d solved them themselves
  • clojure performance was a surprise, because I’d come from ruby (!)
    • clojure is fast
    • there was an underlying feeling that “we need scala for performance”
  • I’m a consultant, so was happy for the team to make the language decisions
    • “if you’re keen on scala, let’s find out a way to pitch it to management”
  • web stack: kept hearing “async async async”
    • felt like premature optimization
    • but still we used http-kit
      • benchmark started to allay fears that clojure was slow

feature: make a payment on a bill

  • not necessarily a full payment
    POST /bills/:bill-id/payments
    Session: user-id
    Post Data: amount
  • GET credit card token for user
    • POST request to payment gateway
  • GET how much left to be paid
  • if payment succeeds: display amount remaining
  • if payment fails: display error
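
The shape of this flow can be sketched with plain futures (the service functions here are hypothetical stubs, not Australia Post’s code); it shows why the token fetch and the remaining-balance fetch are the parts worth running concurrently:

```clojure
;; Hypothetical stubs standing in for the payment gateway and bill
;; service calls; each sleeps to simulate IO latency.
(defn fetch-card-token [user-id] (Thread/sleep 50) "tok-123")
(defn fetch-remaining  [bill-id] (Thread/sleep 50) 80)
(defn charge! [token amount] (Thread/sleep 50) {:status :ok})

(defn pay-bill [user-id bill-id amount]
  ;; the token fetch and remaining-balance fetch are independent, so
  ;; both futures run concurrently; only the charge needs the token
  (let [token     (future (fetch-card-token user-id))
        remaining (future (fetch-remaining bill-id))
        result    (charge! @token amount)]
    (if (= :ok (:status result))
      {:remaining (- @remaining amount)}
      {:error "payment failed"})))

(pay-bill "user-1" "bill-9" 30) ;=> {:remaining 50}
```

The solutions below are mostly different ways of expressing this same dependency graph without blocking threads on the derefs.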

candidates solutions

  • synchronous promises
  • promise monad
  • lamina
  • etc etc

solution 0: synchronous

  • http-kit’s requests return a promise
    • just @deref the promise (blocks the thread)

solution 1.1: promise monad

  • do is aware of promises
    • doesn’t block thread, but waits for promise to be executed before continuing
    • felt natural way to write with promises
    • but incorrect: too much waiting, no concurrency

solution 1.2: promise monad let/do

  • let to define promises
    • do to pseudo-block on them
    • introduces correctness but reduces readability

solution 1.3: let/do/do

  • okay, let’s step away from monads

solution 2: (?)

solution 3: raw promises

  • when to explicitly wait for a particular promise

solution 4: raw callbacks

  • not viable
  • would have just written a hacky little promise library

solution 5: core.async:

  • great! same shape as synchronous code, but correct concurrency

solution 6: lamina

  • didn’t feel totally suited to the situation

solution 7: meltdown (LMAX disruptor based)

  • not appropriate

solution 8: pulsar promises

  • looks exactly the same as the synchronous code, except for one character
  • pulsar rearranges your code at the bytecode level
    • uses JVM agents (normally used for tracing/debugging)
  • pass a fn to one of pulsar’s functions
    • turns synchronous code to async code

solution 9: pulsar actors

  • not appropriate

winners

  • 0: synchronous
  • 5: core.async
  • 8: pulsar

scala solution, for comparison

  • scala futures (basically promises)
  • all monadic
  • I don’t understand it entirely
  • concise
  • battle of the benchmarks, fastest first
    • pulsar-async
    • pulsar-sync
    • core-async
    • raw-callback
    • scala-play-future (significantly less than others)

CQRS (command-query responsibility segregation)

  • want fast reads
  • reduce number of queries
  • don’t want to have to update write code every time we add a new reader

structure

  • service A → cassandra → service B
  • custom triggers in cassandra in clojure (just drop in the .jar!)
    • publish to rabbitmq
    • notify index maintainer
    • write index to cassandra
    • service B reads from cassandra

cassandra triggers

  • can just throw the clojure jar in there
  • everything is byte buffers
    • you need to know the type of all the fields out-of-band
    • not self-describing data at all

microservices

  • I thought we would have a user service and a provider service and a mail service
    • but this gets tricky when you want data about users and providers
  • you need to split things much more fine grained
  • user service →
    • authentication
    • multi-factor auth
    • authorization
    • user profile
    • password reset
      • does it belong in user profile?
      • there’s a bit of workflow here
        • send out email
        • get user to click link
        • enough to warrant its own service
  • drama: needed to talk to systems team to deploy
    • I did things badly
    • I didn’t get anything into production in my 6 months there
    • systems team: we need monitoring and config and stuff
      • if we’d had something early on which had gone through these barriers, we would have had much less stress
      • benchmarks end petty arguments

Q&A

can you share some experience with monitoring & resilience?

  • appdynamics
  • classnames are expected to be java-style class names
    • clojure ones are close enough
  • clj-metrics to expose more high-level metrics
    • requests/second from ring
    • number of bills paid
    • appdynamics could pick it up from jmx
  • nomad for configuration

with http-kit+core.async, what happens when server dies and there’s loads of threads?

  • bottleneck was amount of memory
  • when server runs out, it slows down a lot
  • way to get around that is to monitor resources on your machine and ideally have autoscaling

were the scala guys finally writing clojure in the end?

  • we have one person still hardcore for scala, but sees the merits of clojure
  • a few who did the online scala courses are clojure folks now
  • people who come from the java world of static typing feel they need that
  • but now that they’ve written code that actually works without it, they’re more comfortable

Tom Hall, Escaping DSL Hell by having parens all the way down

  • @thattommyhall

DSLs

  • languages made for specific purposes
    • config mgmt
    • science
    • learning
  • distinction between:
    • internal DSLs: embedded in another language
    • external DSLs: implemented in another language

problems with puppet

  • zen of python:
    • namespaces are a honking great idea, let’s do more of them!

puppet namespaces

  • Exec['install'] in two different modules will result in a naming collision
  • fail :(
  • end up with Exec['tom::install'] but this is a hack

iteration

  • file type lets you pass in an array
  • nagios_host doesn’t
  • iteration is responsibility of type, not language
    • as far as I know

but you need to know ruby anyway

  • if you want to extend puppet, you need ruby
  • if you need to know ruby, why do we bother with the puppet DSL in the first place?

experimental features: lambdas and iteration

  • any language where lambdas arrive late is not a good language

ansible

  • just YAML
    • oh wait, I might want to iterate
    • oh wait, I’ve got embedded Jinja templates in my YAML strings
      • what’s the scope of names in my templates?

if you give people a “language” they will expect loops

  • maybe lambdas
  • probably namespaces
  • this has been done before

chef gets it right

  • it’s embedded in ruby
  • you get iteration and namespaces from ruby

teaching people to program

  • if you design a language:
    • you need a parser, which is hard
    • you need an interpreter/compiler, which is hard
  • if you embed it, you get that stuff for free

geomlab

  • minimal language for teaching
  • talks about pictures
  • intro to FP
  • gets you into recursion early on
  • man $ woman - “next to”
  • man & man - “on top of”
  • (man $ woman) $ tree = man $ (woman $ tree)
  • man $ (woman & tree) – scales nicely to get a nice aspect ratio
  • learn about operator precedence
  • de morgan’s laws
    • although not always held, due to scale
  • define functions
   define manrow(n) = manrow(n-1) $ man when n>1
                    ~ manrow(1) = man
  • builds up to an escher tiling
  • but once you’ve done that, where do we go?
    • only exists in this sim
    • if you want to extend it, you need java
    • “I’m really excited about FP now, but I’ve got nowhere to go”

what if we did it in clojurescript?

  • let’s use ‘below and ‘beside instead of $ and &
  • (below man woman)
  • (beside tree star)
  • http://cljsfiddle.net/fiddle/thattommyhall.geomlab.demo
  • let’s say I want to change man – what does it mean?
    • it’s implemented in the same sort of language
    • I can see there’s a url in there where I fetch an image from the internet
    • I know recursion, because I learned that from the geomlab exercises
    • I can extend the language itself
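
The recursive manrow definition from the geomlab section translates almost directly; a minimal sketch, with the pictures stubbed out as data so the recursion is visible (`man` and `beside` here are stand-ins, not the real cljsfiddle definitions):

```clojure
(def man :man)                    ; stand-in for the man picture
(defn beside [p q] [:beside p q]) ; stand-in for $ / 'beside

(defn manrow
  "A row of n men: manrow(n) = manrow(n-1) $ man, manrow(1) = man."
  [n]
  (if (= n 1)
    man
    (beside (manrow (dec n)) man)))

(manrow 3) ;=> [:beside [:beside :man :man] :man]
```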

science languages

  • R
  • wolfram alpha
  • maple
  • matlab
  • these things just aren’t very good languages, even if they are good at their domain

another problem with DSLs

  • netlogo
  • If you’re based on applets, and Oracle drops applet support, you find you need to port your whole language to a new platform (in this case javascript)
  • again, reimplement in clojurescript?
    • anyone interested in hacking on this with me?

conclusion

  • you probably don’t need to make a new language
  • if you do it will probably be rubbish
    • at least for a while
  • think about power and reach
  • you should embed /deeply/ into clojure

links

Q&A

what makes a good first language?

  • clojure needs a better day 0 story
  • at some coder dojos where I’ve taught kids, some don’t even know about files and folders
    • so if you say “open a terminal, cd into a directory” you’ve lost them
      • and it’s not their fault

have you had any kids look at your examples here?

  • I’ve done the geomlab example
  • otherwise this is all a recent exploration
  • errors in cljsfiddle are not reported well
    • again problematic for day zero

Mathieu Gauthron, JVM-breakglass

troubleshooting a java application

  • debugger
    • only powerful when you can narrow down the problem to a series of breakpoints
    • when the problem is a race condition, it will change the nature of the problem you’re studying
  • log/print statements
    • you need to plan before compilation
    • when the problem is in production, it might be too late
  • jmx
    • again, you need to plan for it in advance
  • ad-hoc interactive mechanism

what is jvm-breakglass

  • open source
  • integrates with any jvm process
  • console onto a jvm process

main features

  • interactive prompt
  • see inside private members
  • call arbitrary methods
  • create new object instances
  • create new classes
  • monitor object state
  • no need to use clojure to develop the app

how does it work?

  • jvm-breakglass runs inside the JVM and starts an nrepl server
  • you can then connect using an nrepl client (eg lein)

how to use it?

  • add it to your maven dependencies
  • add an entry point (as a <bean> or in java code)
  • connect with lein repl :connect localhost:1112

demo (enterprise application)

  • tomcat JVM
  • employee/dept data structure
  • report generation
  • java/spring mvc webapp
  • jvm-breakglass
  • spring data
    • in XML, naturally

homepage

  • oh no! one of the reports isn’t working?
  • “list employees in london” is empty
    • but we know that employee Mick Jagger lives in london
    • what’s going on?

breakglass to the rescue

  • view environment:
    • current directory, System/getProperties
    • view conf directory
  • list all loaded Spring beans
  • introspect into an object’s private members
    • bean builtin fn
    • to-tree to do so recursively
  • view methods or fields for a given object
  • redefine a class
    • in this case, (proxy [Address] ["1 Mayfair", "SW1", "London"] (getCity [] "London")) to define the new version, overriding a method
    • (.setAddress (:Mick employees) address) to inject it into the live data

take a step back

  • remember what it’s like to be a java programmer?
  • working with jmx beans and suchlike to try to understand why production is down
  • this stuff looks like magic

Q: how do you convince production people to put nrepl server in place?

  • short answer: impossible
  • that’s not how you present it
  • either you do it sneakily (that’s bad), and only pull the trump card when the team is desperate
  • or you convince the team that it would be useful in the UAT environment, and “of course it’s never going to be used in production”

Q: have you considered a high-level switch that would prevent you mutating anything in the host application?

  • don’t know how you’d be able to do that
  • have been thinking about it
  • maybe using clojail
  • kind of defeats the point

Q: have you tested this with a scala app?

  • haven’t tried
  • I’ve reverse-engineered the java bytecode, and it’s readable
  • as long as you know how it compiles, it seems reasonable

Q: you were using methods like get-obj and passing a string name. how does breakglass know which object to get?

  • eg if you have multiple instances of Department, how does it know which department?
    • in Spring it’s a Spring bean which is named
    • if you’re not using Spring, what’s your entry point?
      • when you create your NreplServer to enable jvm-breakglass, you can add your entry points there
      • new NreplServer(port).put("department", myObject);
      • static methods & fields can be used too

Gary Crawford, Using Clojure for Sentiment Analysis of the Twittersphere

  • “Leiningen Versus the Ants”, Carl Stephenson
  • leiningen versus apache ant?
  • clojure versus java?
  • FP versus OO?

stratified medicine

  • determine the best treatment for someone based on their genetic makeup to manage their chronic disease

sentiment analysis

  • Paper: “Twitter mood predicts the stock market”
    • predicted Dow Jones average through monitoring tweets
  • people who suffer chronic disease tend to be immunocompromised
    • what would normally be a minor illness can prove fatal
  • can we use twitter to predict spread of disease?

so we tried

  • score tweets for flu symptoms
  • the data science wasn’t very difficult
    • but scaling it was
  • 30 million geo-tagged tweets sent from the UK
  • couldn’t scale, even with
    • HDFS/hadoop
    • mongo/aggregation
    • mongo/mapreduce
    • postgres

how can we do fast, real-time analytics of social media?

  • application: how do people feel about Scotland’s independence referendum?
  • data increases in value as we analyse it
    • tweets
    • analytically prepared data
    • analysis
    • insight
    • predictions
  • the raw data isn’t what you care about
  • don’t store the raw tweets, only store the analytically prepared data
  • stored in redis using ptaoussanis/carmine
    • it has great support for bitmaps

example

  • (car/setbit sentiment tweet-id 1)
  • (car/bitcount "SCOTLAND") – tells me how many tweets have mentioned Scotland
  • how many people in england are happy?
(wcar*
 (car/bitop "AND" "ENGLAND&JOVIALITY" "ENGLAND" "JOVIALITY")
 (car/expire "ENGLAND&JOVIALITY" 10) ;; don't keep the data longer than 10 seconds
 (car/bitcount "ENGLAND&JOVIALITY"))
  • further: “how many people in Scotland are tired or grumpy?”
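
The BITOP-AND-then-BITCOUNT trick can be mimicked in-process with java.util.BitSet, which makes the idea easy to see without a Redis instance (a sketch, not the carmine API):

```clojure
;; One bit per tweet id, one bitmap per tag; AND two bitmaps and
;; count the surviving bits to answer "how many tweets are both X
;; and Y", exactly as BITOP AND + BITCOUNT do in Redis.
(defn bitmap [ids]
  (let [bs (java.util.BitSet.)]
    (doseq [i ids] (.set bs i))
    bs))

(defn and-count
  "Like (car/bitop \"AND\" ...) followed by (car/bitcount ...)."
  [^java.util.BitSet a ^java.util.BitSet b]
  (let [^java.util.BitSet c (.clone a)]
    (.and c b)
    (.cardinality c)))

(and-count (bitmap [1 2 3 5 8])   ; ENGLAND: tweets 1,2,3,5,8
           (bitmap [2 3 9]))      ; JOVIALITY: tweets 2,3,9
;=> 2  (tweets 2 and 3 are both)
```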

getting the data in

  • adamwynne/twitter-api
  • you can specify you only want tweets from a certain geographical locality with a bounding box
    • but this is literally a rectangle
    • need it around Europe
  • LMAX-Exchange/disruptor to communicate
    • journaling
    • syncing
  • business logic

what sentiment?

  • this is hard!
  • “I’m loving #EuroClojure! :D”
  • Positive Affect: enthusiastic, active, alert
  • Negative Affect: subjective distress
  • actually two separate dimensions, not opposites
  • Watson et al, 1988
  • PANAS
  • then PANAS-x
  • then PANAS-t
    • accounts for bias on social media
    • outlines sanitisation
    • validate against 10 real events

sanitisation

where? reverse geocoding

  • don’t want to rely on external services
  • don’t want heavy IO
  • don’t want round trips to database
  • accuracy not too much of a concern
    • we already lose accuracy in interpreting the sentiment of the tweet
  • convert a map of the uk to colours:
    • look up geocode coords in map
    • check colour → get country code
  • problem: the world is a sphere
    • projecting a sphere onto a rectangle
  • prior art in d3.js
  • use JavaFX to exploit it
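
The rasterised-map idea reduces reverse geocoding to a pixel lookup; a toy sketch with a 3×3 in-memory “image” standing in for the real rendered map (the projection here is a made-up linear one, not the d3/JavaFX projection the talk used):

```clojure
;; Rasterise the map once; afterwards reverse geocoding is just
;; "project lat/lon to a pixel, look up its colour, map colour to
;; country code".
(def colour->country {0xFF0000 :sco, 0x00FF00 :eng, 0x000000 nil})

(def pixels ; toy raster, row-major, north at row 0
  [[0xFF0000 0xFF0000 0x000000]
   [0xFF0000 0x00FF00 0x00FF00]
   [0x000000 0x00FF00 0x00FF00]])

(defn latlon->pixel
  "Toy linear projection of lat 50-60, lon -5..1 onto the 3x3 grid
  (a real version must use the projection the map was rendered with)."
  [lat lon]
  [(int (* (/ (- 60 lat) 10) 2.99))
   (int (* (/ (+ lon 5) 6) 2.99))])

(defn country-at [lat lon]
  (let [[row col] (latlon->pixel lat lon)]
    (colour->country (get-in pixels [row col]))))

(country-at 56 -4)     ;=> :sco  (roughly Glasgow)
(country-at 51.5 -0.1) ;=> :eng  (roughly London)
```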

when?

  • there’s a lot of seconds in a day
  • and even more seconds in a year
  • really not interested in seconds anyway
  • want to group tweets by minute
  • and also group by hour
  • and also group by day, and month, and year
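
Bucketing at several granularities at once can be sketched as integer truncation of the timestamp (a minimal stand-in for whatever keying the real system used):

```clojure
;; Truncate an epoch-millis timestamp into minute/hour/day buckets so
;; a counter can exist per granularity simultaneously.
(defn buckets [millis]
  (let [minute (quot millis 60000)]
    {:minute minute
     :hour   (quot minute 60)
     :day    (quot minute 1440)}))

(defn count-by-bucket
  "Per-granularity counts, e.g. {:minute {m n} :hour {h n} :day {d n}}."
  [timestamps]
  (reduce (fn [acc t]
            (reduce-kv (fn [a granularity bucket]
                         (update-in a [granularity bucket] (fnil inc 0)))
                       acc
                       (buckets t)))
          {}
          timestamps))

(count-by-bucket [0 30000 61000])
;=> {:minute {0 2, 1 1}, :hour {0 3}, :day {0 3}}
```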

why?

  • why are we doing this?
  • online social media are surveillance
  • the line between public and private is becoming blurred
  • if we don’t need data, we shouldn’t collect it
    • in this example:
      • we’re never more granular than country
      • we’re never more granular than overall sentiment
      • we’re never more granular than minute
    • hopefully this is enough to prevent anyone being identified
  • Datensparsamkeit

Q: have you used Storm for this?

  • no

Q: any preliminary results on the Scotland referendum analysis?

  • I’ve had more luck with tech than data science?

Q: which way should we vote?

  • haha

Q: how do you verify your results?

  • it’s very crude at the moment?

Paul Ingles, Multi-armed Bandit Optimisation in Clojure

  • @pingles

problem statement

  • product optimisation cycles are long, complex, and inefficient
  • the multi-armed bandit model shows lots of things we’re getting wrong
  • eg: online newspapers
    • fundamentally human-led, editorially-led
  • people behave irrationally
  • Dan Ariely & Daniel Kahneman
  • (@philandstuff suggestion: Stuart Sutherland, Irrationality)
  • economist subscription options
    1. online $59
    2. print $125
    3. print & online $125
    • the ridiculousness of option 2 makes option 3 seem more reasonable
  • need machines to optimise at scale; but need humans to provide stuff only they can
  • running RCTs to optimise sites
    • doing so on a continuing basis
    • measuring big effects work with small numbers of participants
    • but measuring small effects requires ever larger numbers
    • to the extent that you can only run ~12 experiments a year
    • which is not really good enough

Bandit strategies can help

  • a product for procrastinators by a procrastinator
  • Product: Notflix!
    • video website
    • http://notflix.herokuapp.com/
    • shows 3 different videos
    • show good videos at top of page, and less good at bottom
    • show best possible thumbnail for each video
  • optimising with multi-armed bandits
    • optimising order and thumbnails

multi-armed bandit problem

  • slot machine = one-armed bandit
  • problem: you have a bunch of money you want to “invest” in a casino
    • you have a number of different machines to play
    • each machine has a different probability of reward
    • you don’t know what that probability is up front
  • need to balance “exploration” and “exploitation”
    • ie learning about the world vs using that knowledge to maximise income
    • analogy: trying new foods out vs sticking to what you like

bandit model

  • number of arms {1, 2, …, K }
  • number of trials: 1, 2, …, T
  • rewards: {0,1}
  • K-headlines
    • options of different text
  • K-buttons
    • options of button text, colour, etc
  • K-pages
    • whole page redesigns
  • explore this space with notflix

bandit strategy

;; choose which arm to pull
(defn select-arm [arms]
  ...)

;; update arm with feedback
(defn pulled [arm]
  ...)
(defn reward [arm x]
  ...)

(defrecord Arm [name pulls value])

ε-greedy

  • “hello world” algorithm
  • generally exploit
  • ε (epsilon) is the rate of exploration
  • eg if ε = 0.1, your strategy is:
    • with probability 10%, try a random arm with equal probability
    • with probability 90%, try the best arm based on current knowledge
  • if ε = 0, always exploit; if ε = 1, always explore
  • example with bernoulli-bandit
(bernoulli-bandit {:arm1 0.1 :arm2 0.1 :arm3 0.1 :arm4 0.1 :arm5 0.9})
  • with ε=0.2, you converge faster on the best arm
  • but with ε=0.1, you exploit it more once you find it
  • once you’ve found the best arm, you should be able to double down
    • ie explore more at the beginning (when you have least knowledge) and less at the end
    • lots of extensions to ε-greedy to factor things like this in
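
A minimal ε-greedy select-arm in the shape of the strategy functions above (the :name/:pulls/:value fields follow the Arm record from the talk; treating :value as a running reward estimate is an assumption):

```clojure
;; ε-greedy: with probability ε pick a uniformly random arm
;; (explore), otherwise pick the arm with the best current estimate
;; (exploit).
(defn select-arm [epsilon arms]
  (if (< (rand) epsilon)
    (rand-nth arms)                ; explore
    (apply max-key :value arms)))  ; exploit

(defn reward
  "Update an arm's running mean after observing reward x in {0,1}."
  [{:keys [pulls value] :as arm} x]
  (let [n (inc pulls)]
    (assoc arm
           :pulls n
           :value (+ value (/ (- x value) n)))))

(def arms [{:name :a :pulls 10 :value 0.1}
           {:name :b :pulls 10 :value 0.9}])

(:name (select-arm 0.0 arms)) ;=> :b  (ε=0 always exploits)
```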

Thompson sampling

  • Arm model
    • Θ_k: Arm k’s hidden true probability of reward (in range [0,1])
    • can build a distribution for Θ_k based on current knowledge
    • small number of pulls means wide distribution; large number means narrow distribution
    • captures uncertainty in value of Θ_k
  • each iteration, take a random sample from each distribution, take the largest sample
    • algorithm naturally balances exploration/exploitation trade-off
    • the more it learns, the narrower the distributions get, and so the more likely it is to choose an arm with a higher expected value
  • incanter example
  • Thompson-sampling example with same Bernoulli-bandit from above
    • compared with ε-greedy, explores much more much earlier, and exploits much more later on
    • considered optimal convergence
  • we can use it to rank things (not just select)
    • take a sample from each arm distribution, then order arms by that value
    • in notflix, can use for ordering the videos we show
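
For Bernoulli arms this can even be sketched without a stats library (the talk used incanter; this stand-in draws a Beta(s+1, f+1) sample using the order-statistic fact that the k-th smallest of n uniforms is Beta(k, n+1-k)):

```clojure
;; Beta(s+1, f+1) sample = the (s+1)-th smallest of (s+f+1) uniform
;; draws, by the order-statistic identity. Fine for a sketch; a real
;; system would use a proper sampler.
(defn sample-beta [successes failures]
  (let [n (+ successes failures 1)]
    (nth (sort (repeatedly n rand)) successes)))

(defn select-arm
  "Thompson sampling: draw one sample from each arm's posterior and
  play the arm with the largest sample."
  [arms] ; arms: {arm-id {:s successes :f failures}}
  (key (apply max-key
              (fn [[_ {:keys [s f]}]] (sample-beta s f))
              arms)))

(select-arm {:a {:s 1 :f 9} :b {:s 90 :f 10}}) ; almost always :b
```

Ranking instead of selecting is the same idea with `sort-by` over the sampled values rather than `max-key`.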

applied to notflix

  • video rank bandit
  • for each video, a thumbnail bandit
  • at the end, the best video should be at the top
    • and each video should show the best thumbnail

results

  • videos, worst to best
    • “hero of the coconut pain”
    • “100 Danes eat 1000 chillies”
    • “3 year-old with a portal gun”
  • thumbnail bandit data
  • “we built a fictional but amazing product”

links

Q: this model assumes the arms keep the same probability through time

  • can it readapt?
  • Thompson sampling does adapt
    • it won’t change back as quickly

Q: isn’t there an interaction between the two bandits?

  • if the thumbnail is crappy, they might not click the video
  • made an assumption about this
  • in general, if you leave it running over time and let the evidence build, it should be fine in the long run
  • but that is definitely a flaw

Tommi Reiman, Schema and Swagger to improve your web APIs

super simple web api in clojure

  • just using compojure
  • “sausage” as example data
  • PUT /foo/sausage/:id
  • example:
    • in Java: immutable value object
    • in Scala: case class
    • in Clojure:
      • free-form map?
      • constructor fn with bunch of validation?
      • prismatic/schema!

prismatic schema

  • define structure of sausage
  • then call s/validate to validate
  • schema can define functions
(s/defn get-sausage :- (s/maybe Sausage) [id :- Long]
  (@sausages id))

(s/defn ^:always-validate get-sausage2 :- Sausage [id :- Long]
  (@sausages id))

schema coercion

(defmodel Pizza {:id Long
                 :name String
                 :price Double
                 :hot Boolean
                 (s/optional-key :description) String
                 :toppings #{(s/enum :cheese :olives :ham :pepperoni :habanero)}})
  • allows slurping JSON data, but imposing extra types
  • eg above we can slurp toppings from a JSON array into a Clojure set rather than a vector
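
What that coercion does can be illustrated in plain Clojure (a sketch of the idea, not prismatic/schema’s API): JSON can only deliver :toppings as a vector of strings, so it gets coerced into a set of keywords and checked against the enum.

```clojure
;; Plain-Clojure illustration of the :toppings coercion from the
;; Pizza model above.
(def toppings #{:cheese :olives :ham :pepperoni :habanero})

(defn coerce-toppings [json-toppings]
  (let [ts (into #{} (map keyword) json-toppings)]
    (when-not (every? toppings ts)
      (throw (ex-info "invalid topping" {:toppings ts})))
    ts))

(coerce-toppings ["cheese" "ham"]) ;=> #{:cheese :ham}
```

schema's coercion does this generically, driven by the model declaration rather than hand-written per field.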

double schema

  • loose schema for first input
    • (def Customer {...})
  • tighter schema for validated input
    • (def ValidCustomer (merge Customer {...}))

schema selectors

  • accept but remove unrecognised params with select-schema

generative schema

  • generate random orders for test data
  • davegolland/generative-schema.clj

contribs

  • sfx/schema-contrib
  • cddr/integrity

swagger

  • a specification for describing, producing, consuming, visualising RESTful web services
  • https://helloreverb.com/developers/swagger
  • existing adapters
  • clojure options:
    • octohipster
    • swag
    • ring-swagger
      • compojure-api
      • fnhouse-swagger
  • endpoint definitions in JSON
  • data models as a JSON Schema
  • swagger UI
    • visualises the API
  • code gen
    • no clojure support yet (anyone?)
  • swagger-socket
    • run it all on top of websockets

ring-swagger

  • https://github.com/metosin/ring-swagger
  • JSON Schema has date formats
    • but prismatic/schema will never support dates, as it’s more generic
  • higher level abstractions on top of swagger, but nothing for the web developer

compojure-api

  • an extendable web api lib on top of compojure
  • macros & middleware with good defaults
  • schema-based models & coercion
  • GET* macro to define input and output schemas

fnhouse-swagger

  • prismatic/fnhouse
    • launched at clojure/west
  • defnk with metadata → annotated handler
  • fnhouse-swagger
    • metosin/fnhouse-swagger

summary

  • schema is an awesome tool
  • describe, validate, coerce your data
  • building on top of ring-swagger
    • compojure-api → declarative web apis
    • fn-swagger → meta-data done right
    • or do your own!
  • kekkonen.io
    • CQRS-lib

Renzo Borgatti, The Compiler, the Runtime and other interesting beasts from the clojure codebase

an amazing growth:

  • mar 2006: first commit
  • oct 2006: 30k loc (7 months old)
  • oct 2007: clojure announced!
  • oct 2008: invited to Lisp50 to celebrate 50 years of lisp
  • May 2009: 1.0 + book!
  • now: almost 90k loc

initial milestones

  • apr 06: lisp2java sources
  • may 06: boot.clj appears
  • may 06: STM first cut
  • june 06: first persistent data structure
  • sep 06: java2java sources
  • aug 07: java2bytecode started
  • right after: almost all the rest: Refs, LockingTransaction

drew on lots of sources of knowledge

  • collection of papers

high-level view:

  • (def lister (fn [& args] args))
  • read → analyse → emit → eval/compile
  • although the lines between the stages get blurred at times

reader

  • takes stream, returns data structures
  • PersistentList, Symbol, etc

analyser

  • input: data structure
  • output: exprs
    • DefExpr
      • Var
      • FnExpr
        • Sym
        • PersistentList
          • FnMethod
            • LocalBinding(Sym(“args”)),
            • BodyExpr
              • PersistentVector
              • LocalBindingExpr

Emission

  • bytecode generation for Exprs
  • prerequisite for evaluation
  • emit() method in Expr interface
  • Notable exception: called over ??

Evaluation

  • transform Exprs into their “usable form”
  • eg
    • new object
    • a var
    • namespace
  • evaluating a FnExpr is just getCompiledClass().newInstance()
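The "eval a compiled fn" step can be sketched in plain Java. This is a hypothetical stand-in, not the real Compiler code: `GeneratedFn` plays the role of a class the compiler emitted for a fn form, and `evalFnExpr` mirrors the `getCompiledClass().newInstance()` idea.

```java
import java.util.function.Function;

// Hypothetical sketch: evaluating a FnExpr amounts to "load the class the
// compiler emitted, then reflectively construct one instance of it".
// GeneratedFn stands in for bytecode emitted for (fn [x] (inc x)).
public class EvalSketch {
    public static class GeneratedFn implements Function<Integer, Integer> {
        public Integer apply(Integer x) { return x + 1; }
    }

    // Mirrors the shape of FnExpr.eval(): getCompiledClass().newInstance().
    public static Function<Integer, Integer> evalFnExpr() {
        try {
            Class<?> compiled = Class.forName("EvalSketch$GeneratedFn");
            @SuppressWarnings("unchecked")
            Function<Integer, Integer> fn =
                (Function<Integer, Integer>) compiled.getDeclaredConstructor().newInstance();
            return fn;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalFnExpr().apply(41));
    }
}
```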

Compilation

  • Usually coordination for emit
  • Compiler.compile: namespace → file

Emit

  • input: Exprs
  • output: bytecode

monsters!

RT

  • this is how the RT class gets initialised: the first time it gets referenced:
final static private Var REQUIRE = RT.var("clojure.core", "require");
  • simply referring to it here causes the static initializers to run
  • RT has a lot of behaviour in static initializers
    • inside it is the doInit(); call
      • which loads all of clojure.core
    • all just from referring to RT in some otherwise unrelated class!

Compiler

  • inner classes for each Expr type

LispReader

  • inner classes for each token you might encounter
  • <clinit>
    • sets up reader macros
      • macros and dispatchMacros (the latter for #{, #(, #_, #^, etc.)

analyze()

  • not a class, but a family of methods
    • analyzeSeq
    • new ConstantExpr
    • MapExpr.parse
  • FnExpr.parse
    • invokes the compiling phase during parsing phase

emission

  • ASM lib used to generate bytecode
  • FnExpr.emitMethods()
    • generate a method for each of the arities of the function

other beasts

  • LockingTransaction and Ref

DynamicClassLoader

  • clojure.lang.DynamicClassLoader.findClass(String)
    • RT.classForName()
    • Compiler$HostExpr.maybeClass()
  • Class.forName() goes up the hierarchy of classloaders and asks each what they know
    • an instance of DynamicClassloader is created for each namespace
      • and also for each form
    • (this is true for the bootstrap phase; not always true eg in AOT (ahead-of-time) compilation)
  • supporting dynamicity
    • in defineClass:
      • classCache.put(name, new SoftReference(c,rq));
    • in findClass:
      • Reference<Class> cr = classCache.get(name);
    • SoftReferences are used to save PermGen, since if we redef a var we don’t want it to keep consuming PermGen
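The cache shape described above can be sketched in a few lines. This is a simplified, hypothetical version (the real DynamicClassLoader also pairs the SoftReferences with a ReferenceQueue to purge stale entries): values are held softly so a redefined class can be collected instead of pinning memory forever.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of DynamicClassLoader's class cache: classes are
// cached behind SoftReferences so the GC may reclaim them after a redef.
public class ClassCacheSketch {
    static final Map<String, SoftReference<Class<?>>> classCache = new ConcurrentHashMap<>();

    // cf. defineClass: classCache.put(name, new SoftReference(c, rq))
    public static void defineClass(String name, Class<?> c) {
        classCache.put(name, new SoftReference<>(c));
    }

    // cf. findClass: Reference<Class> cr = classCache.get(name)
    public static Class<?> findClass(String name) {
        SoftReference<Class<?>> cr = classCache.get(name);
        return cr == null ? null : cr.get(); // may be null if collected
    }
}
```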

Bonus: clojure was initially implemented in lisp

  • ~1600 loc to implement read, analyse, compile, eval
  • although emitting Java code, not bytecode
  • was also generating C♯

Q: some things in bytecode can’t be expressed in java

  • is there anything which clojure generates which can’t be decompiled back to Java?
    • I’m pretty sure yes, but not sure exactly what
    • Rich:
      • locals-clearing
      • constructs which use goto (which exists in bytecode but not Java)

Rich Hickey, the insides of core.async channels

aside: here’s what clojure looks like in a good IDE

  • (ie IntelliJ)
  • yes, Compiler.java is massive
    • but if your IDE has a structure editor, you can navigate them all easily
    • it’s all in one file because I don’t want 300 files

aside2: the classloader has a cache in a branch

  • fast-load branch

warning! implementation details ahead

  • subject to change!
  • informational only

the problems

  • single channel implementation
    • for use from both dedicated threads and go threads
      • simultaneously, on same channel
  • alt and atomicity
    • Java CSP libraries often didn’t support alt well
    • it’s tricky to do atomically
  • multi-reader/multi-writer
  • concurrency
    • construct deals with the ick of threads and mutexes
  • (this talk: focus on JVM impl; JS version has less of these issues)

API

  • >! >!! put! alt! → channel → <! <!! take! alt!
  • it’s not an RPC mechanism, it’s just a conveyor belt

SPI (service provider interface)

  • >! >!! put! alt! → impl/put! [val handler] → channel → impl/take! [handler] → <! <!! take! alt!

anatomy

  • channel has:
    • pending puts (fifo)
    • a buffer (optional) in the middle
      • contains data
    • pending takes (fifo)
    • flag indicating if channel is closed
  • fifos implemented as linked queues
  • important to distinguish queues of operations from buffer of data

invariants

  • never pending puts and takes simultaneously
  • never takes and anything in buffer
  • never puts and room in buffer
  • take! and put! use channel mutex
  • no global mutex
    • or even multi-channel mutex
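The anatomy and invariants above can be illustrated with a toy channel. This is my own simplified sketch, not the real core.async code (`ToyChannel` is invented; real handlers carry commit/active? logic, and real fifos are concurrent linked queues): the invariants fall out because put/take always drain the opposite queue or the buffer before ever enqueueing themselves.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Toy channel: pending puts (fifo), optional buffer, pending takes (fifo),
// all guarded by the channel's own lock (synchronized methods here).
public class ToyChannel<T> {
    private final Queue<T> puts = new ArrayDeque<>();            // pending put values
    private final Queue<Consumer<T>> takes = new ArrayDeque<>(); // pending take callbacks
    private final Queue<T> buffer = new ArrayDeque<>();
    private final int bufSize;

    public ToyChannel(int bufSize) { this.bufSize = bufSize; }

    public synchronized void put(T val) {
        Consumer<T> taker = takes.poll();
        if (taker != null) { taker.accept(val); return; }          // pair with waiting take
        if (buffer.size() < bufSize) { buffer.add(val); return; }  // room in buffer
        puts.add(val);                                             // park: backpressure
    }

    public synchronized void take(Consumer<T> cb) {
        if (!buffer.isEmpty()) {
            cb.accept(buffer.poll());              // take head of buffer...
            T parked = puts.poll();
            if (parked != null) buffer.add(parked); // ...completing one parked put
        } else if (!puts.isEmpty()) {
            cb.accept(puts.poll());                // unbuffered rendezvous
        } else {
            takes.add(cb);                         // park the take
        }
    }
}
```

Note how a take and a put are never parked at the same time, and the buffer is never non-empty while takes wait: each operation services the other side first.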

put! scenarios

  1. one or more waiting take! operations
    • gets paired up, takes handler gets completed
  2. no waiting takes, but room in the buffer
    • value goes into the buffer; put! succeeds and completes immediately
  3. buffer full (or no buffer)
    • enter puts queue, block
      • results in backpressure
  4. full buffer, but windowed
    • sliding buffer: latest information takes priority, drop head of buffer (oldest item in fifo), put! completes immediately and enters buffer
    • dropping buffer: drop put! on floor, but completes immediately
    • could have more sophisticated policies in future
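The two windowed policies are easy to sketch (hypothetical helper code, not the core.async buffer implementations): a sliding buffer evicts its oldest element to make room, a dropping buffer discards the incoming one; either way the put completes immediately instead of parking.

```java
import java.util.ArrayDeque;

// Sketch of windowed buffer policies at full capacity.
public class WindowedBuffers {
    // Sliding: newest wins — drop the head (oldest item in the fifo).
    public static void slidingAdd(ArrayDeque<Integer> buf, int cap, int val) {
        if (buf.size() == cap) buf.pollFirst();
        buf.addLast(val);
    }

    // Dropping: oldest wins — the incoming value is dropped on the floor.
    public static void droppingAdd(ArrayDeque<Integer> buf, int cap, int val) {
        if (buf.size() < cap) buf.addLast(val);
    }
}
```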

take! scenarios

  1. nothing in buffer
    • enqueued
  2. buffer has stuff, but no puts waiting
    • get data, immediately complete
  3. buffer full (or no buffer), puts pending
    • get something (either head of buffer or get paired with first put!)
    • first waiting put! completes (either enters buffer or hands directly to take!)

close! scenario

  • all pending takes complete with nil (closed)
  • subsequent puts complete with nil (already closed) (relatively new)
  • subsequent takes consume ordinarily until empty
    • any pending puts complete with true
    • takes then complete with nil

queue limits

  • puts and takes queues are not unbounded either
  • 1024 pending ops limit
    • somewhat arbitrary, might change
    • will throw if exceeded
      • if you’re seeing this, it’s an architecture smell
    • most likely if you use put! on the edge of your system

alt(s!!)

  • attempts more than one op
  • on more than one channel
  • without global mutex
  • nor multi-channel locks
  • exactly one op can succeed

implications

  • registration of handlers is not atomic
  • completion might occur before registrations are finished, or any time thereafter
  • completion of one alternative must ‘disable’ the others atomically
  • cleanup

handlers

  • wrapper around a callback
    • callbacks are icky, so we want to hide them
  • SPI
    • active?
    • commit → callback-fn
    • lock-id → unique-id
    • java.util.concurrent.locks.Lock: lock, unlock

take/put handlers

  • simple wrapper on callback
  • lock is no-op
  • lock-id is 0
  • active? always true
  • commit → the callback

alt handlers

  • each op handler wraps its own callback, but delegates rest to shared “flag” handler
  • flag handler has lock
    • a boolean active? flag that starts true and makes one-time atomic transition
  • commit transitions shared flag and returns callback
    • must be called under lock
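The shared-flag idea can be sketched as follows. This is a simplification (the real handlers expose a Lock and commit must run under it; here an AtomicBoolean compare-and-set stands in for that one-time transition): the first alternative to commit wins the flag and gets its callback back, every other alternative sees an inactive handler.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Sketch of alt handlers: each op wraps its own callback but shares one
// flag; commit() makes a one-time atomic transition.
public class AltFlagSketch {
    public static class Flag {
        public final AtomicBoolean active = new AtomicBoolean(true);
    }

    public static class AltHandler<T> {
        public final Flag flag;
        public final Consumer<T> callback;

        public AltHandler(Flag flag, Consumer<T> callback) {
            this.flag = flag;
            this.callback = callback;
        }

        public boolean isActive() { return flag.active.get(); }

        // Returns the callback iff this handler won the one-time transition;
        // losers get null and their alternative is disabled.
        public Consumer<T> commit() {
            return flag.active.compareAndSet(true, false) ? callback : null;
        }
    }
}
```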

alt concurrency

  • no global or multi-channel locking
  • but channel does multi-handler locking
    • some ops commit both a put and a take
  • lock-ids used to ensure consistent lock acquisition order
    • (avoids deadlock)
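The ordered-acquisition trick can be sketched like this (hypothetical helper names, not the core.async source): when an op must hold two handlers' locks, it always acquires them in increasing lock-id order, so two threads locking the same pair can never each hold one lock while waiting on the other.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of deadlock avoidance via lock-ids: impose one global order.
public class LockOrderSketch {
    public static class IdLock {
        public final long id;
        public final ReentrantLock lock = new ReentrantLock();
        public IdLock(long id) { this.id = id; }
    }

    public static void withBoth(IdLock a, IdLock b, Runnable body) {
        IdLock first = a.id <= b.id ? a : b;   // consistent acquisition order
        IdLock second = first == a ? b : a;
        first.lock.lock();
        try {
            second.lock.lock();
            try { body.run(); } finally { second.lock.unlock(); }
        } finally { first.lock.unlock(); }
    }

    // Two threads lock the same pair in opposite argument order; with the
    // id ordering in place, both complete instead of deadlocking.
    public static int demo() {
        IdLock x = new IdLock(1), y = new IdLock(2);
        AtomicInteger n = new AtomicInteger();
        Thread t1 = new Thread(() -> withBoth(x, y, n::incrementAndGet));
        Thread t2 = new Thread(() -> withBoth(y, x, n::incrementAndGet));
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return n.get();
    }
}
```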

alt cleanup

  • “disabled” handlers will still be in queues
  • channel ops purge

SPI revisited

  • handler callback only invoked on async completion
    • only 2 scenarios
  • when not “parked”, op happens immediately
    • callback is not used
    • non-nil return value is op return
  • only time ops park
    • put! when it gets blocked on full buffer
    • take! when it gets blocked on empty buffer
  • only time ops complete asynchronously
    • take! with pending puts
    • put! with pending takes

wiring !/!!

  • blocking ops (!!)
    • create promise
    • callback delivers
    • only deref promise on nil return from op
      • non-nil indicates immediate success (and so callback never gets called)
  • parking go ops (!)
    • IOC state machine code is callback

summary

  • you don’t need to know any of this
  • but understanding the “machine” can help you make good decisions

Q: why use alt! for putting? what’s rationale?

  • taking multiple channels is like a select(2)
  • when you have consumers of different capabilities
    • I want to try to write to everyone, but whenever the first one is ready, I give it to them
    • Q: what’s the difference between that and having four consumers on a single channel?
      • you might have a priority metric, or a cost metric
      • though yes sometimes you can achieve same result two different ways

Q: why is global or multi-channel mutex not good enough?

  • well it would be easy! :)
  • a global mutex could make registration atomic
  • you’d have to make disabling other alts atomic
  • you’d have to make rendezvous atomic
  • you could have two unrelated sets of channel operations, why should they contend?
  • people hate global locks
  • ruled out by my aesthetic sense :)

Q: David Nolen had an example of 10000 go blocks updating a textarea, did he hit the 1024 limit?

  • no I don’t think so, but not sure exactly

Q: are buffer & queue sizes useful metrics to monitor?

  • that would be great, and making them monitorable is on the TODO list

Q: other possible extensions?

  • buffer policies
    • you might have logic about priority
  • core.async has proven its utility and it’s become important
    • go macro is a great PoC of what you can do with a macro with several kLoC behind it
      • has its own subcompiler inside it
      • kind of implements a subset of clojure
    • maybe build async support into the compiler?
      • move locals from the stack to fields on the method object
      • I don’t need the stack anymore
      • I can be paused and resumed on another thread
      • declare a fn as async
      • comply with this SPI
      • could build other things like generators & yield
    • the pride of “look, you can do this with a macro” shouldn’t outweigh the desire to make this performant and more solid
  • Q: continuations? how do they differ?
    • continuations are more general
    • this won’t use continuation-passing-style
    • it’s related
    • it won’t be like call/cc
    • it won’t be first-class
    • you won’t be able to resume it more than once
    • for a specific set of use-cases
    • Oleg gave a talk showing that generators alone are enough for things people think need a lot more

Q: is there something planned for dynamic binding and the go macro?

  • there are fns which allow you to do the conveyance
    • don’t know if go allows all of them to work

Q: channels on the network?

  • it’s easy to have something you call a channel and put over a wire
  • pretty hard to have all the semantics of these channels over the wire
  • already have queues and all sorts of interfaces to do similar things
  • atomic alt! over more than one wire not going to happen
  • maybe semantics for ports
  • or limitations on alt!
  • the wire has its own semantics, this is the key thing here
    • failure, queueing, delays
  • really easy to just take something from the wire and call put!

Q: is there a typical way to monitor a go block?

  • what kind of monitoring?
  • see that it’s still working, still alive?
  • if the channels were monitorable, you could see if things were producing/consuming properly

Q: what other options did you consider & reject in the design of core.async

  • something other than CSP?
  • the generators stuff
  • continuations
  • I liked what golang did
    • they made a good choice
    • there’s a java csp lib that impls the same kinds of ops
    • it’s difficult to get the semantics correct
  • wanted alts! to be a regular fn, not syntax
    • which feels like an enhancement over go
  • what we’re putting on these channels is immutable
    • which gives extra robustness
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment