dreamsmasher · July 12, 2022 21:35
diff --git a/spells.lhs b/spells.lhs

 --- 
 title: Conjuring the sprits of the computer with our spells
 author: Naomi Liu
 patat:
    wrap: true
    margins: 
        left: 10
        right: 10
    incrementalLists: true

 ---

 <!-- 
 Copyright 2021 Naomi Liu.

 This is a literate Haskell file, formatted for presentation in `patat`. If you want to run it locally, you'll need a few things:

 GHC > 8.10.2
 haskeline
 containers
 transformers
 parsec

 -->

 ```python
                   \=========\\===========\                         
                    \888888888\\ssssssssss\\                         
                     \888888888\\ssssssssss\\                        
                      \888888888\\sssssssssss\                       
                       \888888888\\sssssssssss\                      
                        \888888888\\sssssssssss\ ===================
                         \888888888\\sssssssssss\ #>=>>=>>=>>=>>=>>=
   Conjuring the spirits  \888888888\\sssssssssss\ ================
      of the computer      \888888888\\sssssssssss\
     with our spells:      /888888888//ssssssssssss\ 
      how programming     /888888888//ssssssssssssss\ ==============
       languages are     /888888888//ssssssssssssssss\ #=>>=>>=>>=>>
       implemented      /888888888//ssssssss/ \sssssss\ ============
                       /888888888//ssssssss/   \sssssss\        
        (＠＾◡＾)      /888888888//ssssssss/     \sssssss\ 
       Naomi Liu     /888888888//ssssssss/       \sssssss\       
        nliu.net    /888888888//ssssssss/         \sssssss\      
                   /=========/ ========/           \========
 ```
 ---

 Core ideas:
 ----------

 . . .

  - Evaluating code is a combination of:
    - parsing input
    - transforming that code into an AST
    - walking the tree, reducing nodes by evaluating expressions
    - maintaining an environment mapping keys to values, for variable lookup
    
 ---

 Why implement a new language?
 ------
  - you're fed up with current ecosystems
  - you want to experiment with new ideas
  - for the learning experience (brain gains)
  - it's fun, like **really really** fun.

 ---


 \begin{code}
 {-# LANGUAGE LambdaCase, ImportQualifiedPost #-}
 import Text.Printf
 import Debug.Trace
 import Data.Map (Map) 
 import Data.Map qualified as M
 import Data.Maybe
 import Data.Char
 import Data.List
 import Text.ParserCombinators.Parsec as P
 import Text.Parsec.Combinator
 import Control.Monad
 import Control.Arrow
 import Control.Applicative hiding ((<|>))
 import Control.Monad.Trans.Reader
 import System.Console.Haskeline
 import System.IO
 import Control.Monad.IO.Class
 import Control.Exception qualified as E

 \end{code}
 don't worry about these ;)

 ---

 The simplest language to evaluate: **Hutton's razor**

 At a high level, programs are just a tree of expressions:

 \begin{code}
 data Expr = Constant Int | AddTwo Expr Expr
 -- a value of type Expr can be a constant integer
 -- or two Exprs wrapped in a constructor

 -- e.g. 1 + 2 can be expresed as Add (Constant 1) (Constant 2)

 evalHutton :: Expr -> Int
 evalHutton (Constant n) = n
 evalHutton (AddTwo a b) = evalHutton a + evalHutton b

 \end{code}

 we can't directly add two `Expr`s, but we can unwrap them!

 ---


 Visualizing this tree:

 ```haskell
             [ + ]             --- Add ---
 1 + 2   ->  /     \   ->      /           \ 
           1       2     (Constant 1) (Constant 2)

 evalHutton (Constant n) = n
 evalHutton (AddTwo a b) = evalHutton a + evalHutton b
 ```

 . . .

 If we encounter a constant value, we can just return the value.

 . . .

 If we encounter an operator, we walk down the tree, evaluating both arguments before adding them together.

 . . .

 Evaluation is really just a postorder traversal down an n-ary tree.

 ---

 Moving on up
 -----

 To evaluate a node, you need to evaluate its children first

 `f(1, 2) + 3` doesn't make sense unless `f(1, 2)` becomes a number

 ---

 Core steps
 -----

 - Turning human-readable code into something that's more easily transformed

 . . .

 - Tagging expressions with their types

 . . .

 - Rearranging these tagged expressions into a tree

 . . .

 - Walk the tree, looking at the type of a current node and acting accordingly

 ---

 There are two main strategies for turning your human-readable source code into instructions for the computer: 

 . . .

 - compiling the code into a more-direct representation

 . . .

 - using a program that takes in source code as input and evaluates it for you. 

 ---

 The main steps to this process:
 ----

 - Lexing/Tokenizing

 . . .

 - Parsing into IR (intermediate representation)

 . . .

 - Compilers and interpreters diverge here:

 | Compilers | Interpreters |
 |-----------|--------------|
 | Static analysis  | Direct evaluation|
 | Compile-time Optimization*\**  | Runtime optimization (e.g. analyzing hot loops) | 
 | Assembly generation | Bytecode generation (optional) |

 *\**(e.g. dead code elimination, strength reduction)

 . . .

 The bytecode generation step can come before, during, or after runtime optimization. 
 -  (e.g. Java, LuaJIT, Pypy)

 ---

 How do I do the do???
 -----


 write a parser first
 ---------

  - define a grammar for your language, e.g.:
    - expressions always come between parentheses

    - `(+ 1 2 3)`
    
    . . .


    - lists: comma separated values between square brackets

    - `[1, 2, 3]`

    . . .

    - strings: any body of text between two double quotes

    - "UwU"

    . . .

    - comments: anything between either:
      - `//` and `\n`
      - `/*` and `*/`

 ```javascript
        const s = 'foo'; // unnecessary        

        /* i think lodash exports this
        import is-even from 'is-even';
        */
 ```

 ---

 Some example parsing code:

 Scan for a character - if it matches a reserved operator, return the corresponding tag

 \begin{code}

 parsePrim :: Parser PrimOp
 parsePrim = try $ do
    op <- oneOf "+-*/=" 
    pure $ case op of
            '+' -> Add
            '-' -> Sub
            '*' -> Mul
            '/' -> Div
            '=' -> EqP

 parseAtom :: Parser Symbol
 parseAtom = try $ do
    first <- try letter <|> symbols
    rest <- P.many (try letter <|> try digit <|> try symbols)
    pure $ Atom (first : rest)

    -- <|> is a choice between parsers
    -- the start of an Atom is parsed differently
    -- we keep reading until we hit a character that isn't a letter, digit, or symbol
    -- at the end: return the first char + rest of string, tagged as an atomic identifier 

 \end{code}

 ---

 The rough symbolic structure of your language might look like 
 -----

 \begin{code}
 data BuiltinOp = Cons Symbol Symbol -- attach an element onto a list
               | Car Symbol -- get the head of a list
               | Cdr Symbol -- get the tail of a list
               | Define Symbol Symbol -- bind a list to a name
               | If Symbol Symbol Symbol
               deriving (Eq)

 data PrimOp = Add | Sub | Mul | Div | EqP -- Mod | Exp | Eq | Gt | Lt ...
            deriving (Eq)
 \end{code}

 ---

 Next, you would assign rough types to your expressions:
 ----

 \begin{code}
 data Symbol = Atom String -- an identifier
            | Function [Symbol] Symbol -- func (...args) (body)
            | Builtin BuiltinOp -- evaluated differently 
            | Prim PrimOp       -- likewise 
            | String [Char]
            | Boolean Bool
            | Number Int
            | List [Symbol]     -- your function body
            | Quoted Symbol -- a literal list
            | Pair Symbol Symbol -- (cons 1 2)
            | Nil 
            -- pairs are special because they form the basis for lists
            deriving (Eq)

 \end{code}

 ---

 Your semantics should be **clear**

 . . .

 `(func arg1 arg2)` 

 should be parsed the same as 

 . . .

 `(      func    arg1   arg2   )`

 ---



 **Optionally:** write a tokenizer first to make parsing easier
 ----

  - condense multi-character sequences down into atomic symbols
  - e.g. `//` could expressed as a single symbol COMMENT_START

 ---

 ```scheme 
 (func arg1 arg2)
 ``` 
 could be something like
      
 ```haskell
 [ _LeftParen
 , _SymbolName "func"
 , _SymbolName "arg1"
 , _SymbolName "arg2"
 , _RightParen
 ]
 ```

 Don't have to worry about whitespace, contextual character sequences, etc. during the actual parsing phase

 ---

 Any syntactic dividers like parentheses, brackets, etc. 
 only exist when you're writing the code 

 They exist to give hints about the structure of the program.

 . . .

 Parsed, the expression might look like:

 . . .

 ```haskell
 -- (func arg1 arg2)
 List 
  [ Atom "func"
  , Atom "arg1"
  , Atom "arg2" 
  ]
 -- recall our types:
    -- List [ Symbol ]
    -- Atom String 
 ```
    
 ---

 Some languages are more amenable to parsing than others.
 ---------
 ```scheme
 ;; lisp

 (define (fac n) 
  (if (= n 0) 1 
      (* n (fac (- n 1)))))
 ```

 Any expression within parentheses has the form `(operator, ...args)`.
 The more parentheses an expression has, the more deeply-nested it is.

 ---

 ```javascript
 // javascript

 const fac = (n) => { 
  if (n === 0) { 
    return 1;
  }
  return n * fac(n - 1);
 };

 // bonus:
 // facOneLiner = n => n ? n * fac(n - 1) : 1;
 ```

 Curly braces denote code blocks or objects.
 `if` is always followed by `(conditional)`, then a code block/expression.

 ---

 ```python
  # python

 def fac(n): 
  # the scope of a code block is dependent 
  # on its indentation relative to its surroundings
  if n == 0:     # if n == 0:
    return 1     # return 1 # syntax error
  return n * fac(n - 1) # requires context to be known when parsing
 ```

 ---

 <!-- 
 ask: If you had to write a parser for one of these languages RIGHT NOW, which one would you pick ? 
 -->

 Lisp is probably the easiest to parse. 
 - it has the fewest edge-cases 
 - context-free grammar

 Converted into an AST, `fac` might look like:

 ---

 ``` haskell
 Define -- (define 
  List [ 
    Atom "fac", -- (fac n)
    Atom "n"
  ]
  List [
    Atom "if", -- (if 
    List [
      Prim Eq, --   (= n 0)
      Atom "n",
      Number 0
    ],

 ```
    
 ---

 ```haskell
    Number 1, -- 1
    List [    -- `else` is implicit
      Prim Mul, -- (* n
      Atom "n",
      List [
        Atom "fac", -- (fac
        List [
          Prim Sub, -- (- n 1)
          Atom "n",
          Number 1
        ] --))))
      ]
    ]
  ]
 ```

 ---

 Very, VERY rarely do you end up with parse-trees that are human readable.

 . . .

 Lisp restricts every function call to be in prefix position:

 ```scheme
 (defvar arg1 arg2 ...)
 ```

 which makes things a LOT easier for the programmer.

 . . .

 They also lend the language some pretty magical semantics.

 ---

 Now what??
 ---------

 We need to call the function.

 ``` scheme
 (fac 3)
 ```
 When the function is called with arguments, 
 we can maintain a map 

 \begin{code}
 type Environment = Map String Symbol
 \end{code}

 from variables to names to values as we traverse the AST.
 These names can refer to functions, lists, or any representable variable.

 ---

 Let's expand on Hutton's Razor a bit:

 ---

 \begin{code}

 evalPrim :: Environment -> PrimOp -> [Symbol] -> (Environment, Symbol)
 evalPrim env EqP args = case map (snd . eval env) args of
    [] -> (env, Boolean True) -- special case
    (x:xs) -> (env, Boolean $ all (== x) xs)
 evalPrim env sym args = 
    let reducingFunc = case sym of
            Add -> (+) -- we interpret the representation of 
            Sub -> (-) -- primitive operations as actual functions
            Mul -> (*)
            Div -> div

        unwrapAndApply :: (Int -> Int -> Int) -> Symbol -> Symbol -> Symbol
        unwrapAndApply f (Number a) (Number b) = Number (f a b)
        -- we can't directly add two Numbers, we have to unwrap them
        -- then wrap them back up in a constructor

 \end{code}

 ---

 \begin{code}
    -- args :: [Symbol]
    in case args of
        [] -> (env, Number 0)
        -- we need at least one value to apply these functions to
        (head:tail) -> 
            let (newEnv, evaluated)
                  =  threadEnv env (head : tail)
        -- reduce every child AST node into a form we can work with
            in (env, foldr1 (unwrapAndApply reducingFunc) evaluated)
        -- foldr encapsulates recursing over a list
        -- equivalent to js .reduce() (but lazy!)
 \end{code}


 ---

 Where it gets really interesting: threading an environment 
 ----

 If we have a list of expressions at the same level, chances are 
 that we want a local scope to reflect the semantics of our code.

 . . .

 If you write 
 ``` javascript
 let a = 1;
 console.log(a);
 ```

 . . .

 you would hope that `a` contains the value you just assigned to it.

 As we walk down the tree, we need to make sure that nodes at the **same level**
 are able to share an environment!

 \begin{code}
 threadEnv :: Environment -> [Symbol] -> (Environment, [Symbol])
 threadEnv env list = mapAccumL 
                    (\env' node -> eval env' node) 
                    env 
                    list
    -- mapAccumL is like map, except it passes an accumulating state
        -- between invocations
    -- this allows us to gradually change an environment as we evaluate each
    -- AST node in turn
    -- and finally return (new environment, list of results) 

 \end{code}


 ---


 Let's write some boilerplate for `BuiltinOps`:

 \begin{code}

 evalBuiltin :: Environment -> BuiltinOp -> (Environment, Symbol)
 evalBuiltin env = \case
    Cons a (Quoted (List xs)) -> (env, Quoted $ List (a : xs)) -- (1, (2, (3, ...))) is a linked list
    Cons a b -> (env, Pair a b) -- forms a pair otherwise (R6RS Scheme standard)
    Car (Pair a _) -> (env, a)
    Car (List (x:xs)) -> (env, x) -- since we have two ways of expressing lists
    Cdr (Pair _ b) -> (env, b)
    Cdr (List (_:xs)) -> (env, List xs)
    If p x y -> 
      let (Boolean b) = snd $ eval env p in eval env (if b then x else y)
    Define (Atom key) val -> let (_, val') = eval env val
        in (M.insert key val' env, Atom key)
    -- store the arguments and body in our map, stored with the key
    Define (List (Atom key : args)) body -> 
        let funcSym = Function args body
         in (M.insert key funcSym env, Atom key)
    -- save a record of this definition in the environment 
    -- we'll let any other combinations fall through and throw an error

 \end{code}
 ---

 The core of our interpreter: `eval`
 -----
 <!--
 say something interesting here
 -->

 \begin{code}

 eval :: Environment -> Symbol -> (Environment, Symbol)
 eval env sym = case sym of
  Number n -> (env, sym)
  String s -> (env, sym)
  -- we don't need to really compute atomic values,
  -- since there's no further work to do
  -- so we'll use a wildcard to match these basic types

  -- assume that all inputs are valid for now,
    -- error handling is outside our scope

 \end{code}

 ---

 Cases we've already covered:
 ----

 \begin{code}
  Atom a -> (env, fromJust $ M.lookup a env) 
    -- return the stored value, which is a Symbol
  Builtin b -> evalBuiltin env b
  List xs -> case xs of
    -- pattern match on the list 
    (Prim p: args) -> evalPrim env p args 

 \end{code}

 ---

 Recursively evaluating function calls:
 ----

 - ignore the environment returned from the child 
 - this is how we get scoping!

 \begin{code}
    (Atom a: args) ->
      let 
        (_, Function argNames body) = eval env (Atom a)
        -- any symbol at the beginning of an expression list 
        -- HAS to be a function

 \end{code}

 ---

 Creating a new environment with bound arguments
 - we need to evaluate children before 


 \begin{code}
        evaledArgs = map ((\(_, val) -> val) . eval env) args
        -- we don't care about those returned environments
        unwrappedNames = map (\(Atom a) -> a) argNames
        -- we need to unwrap these constructors to get their
        -- underlying strings

 \end{code}

 ---

 \begin{code}
        boundNames = zip unwrappedNames evaledArgs
        -- [(variable name, evaluated value)]
        boundEnv   = foldr 
            (\(name, val) -> M.insert name val)
            env 
            boundNames
        -- {variable name: value}
        -- create a new Map, combining these pairs
        -- with our current env
 \end{code}

 ---

 finally: 
 ----

 evaluate the body with arguments bound to their names

 \begin{code}
       in eval boundEnv body


 \end{code}

 . . .

 \begin{code}
    [singleton] -> eval env singleton
    _ -> (env, Nil)
  _ -> (env, sym)
 -- catchall for other simple values
 \end{code}

 ---

 Now you have a programming language, a spicy job offer, and a pretty nice toy to play around with.

 Are we done? Not yet.

 ---

 Optimizations
 ---------

 - High-level languages offer a faster development/debugging cycle

 . . .

 - at the cost of performance

 . . .

 If you want your language to be used in production, it's time to drop down a few levels.


 ---

 Why are low level languages more efficient?

 - Finer grained control over details

 . . .

  - memory management

 . . .

  - closer to the bare hardware 

 . . .

  - fewer things to take care of (e.g. closures, objects, etc)

 . . .

  - simulating a computer is already intensive, so it's ideal to try and minimize that overhead as much as possible

  . . .

  - the higher you go, the more work has to be done to run your code

  . . .

  - imagine if JS was written in JS? 

 ---

 Writing a runtime
 ---------

 Higher level languages abstract away the details from you

 Naturally, going lower means that you need to care take of those details yourself

 . . .

 Considerations
 ----------------

 Allocating space for closures/objects

 . . .

 - how do keep track of where values are stored? 
 - when they aren't used anymore?
 - freeing memory that's unused?
 - in other words: **collecting garbage**

 ---

 Making your language usable

 . . .

 - You didn't write `express.js`, and you definitely didn't write `http`.

 . . .

 - You'll need to, if you want people to use your language

 . . .

 - Library quality will make or break a new programming language

 . . .

 - Math, networking, sorting, data structures, etc. are core pieces of functionality you'll need to implement

 . . .

 - It also serves as a non-trivial test for your new language, serving as a proof of your project's functionality


 . . .

 - Why build a car without driving it?

 ---

 Summary

 - Writing a new language is a non-trivial undertaking

 . . .

 - You'll gain a greater appreciation and understanding of the work that has gone into our tooling over the years

 . . .

 - If you're lucky, you might gain some popularity

 . . .

 - If you're *REALLY* lucky, you'll become the next `PHP`.

 ---

 Thank you!

 ---

 ^ Presentation ends there

 This is the rest of the parsing/helper code - I wrote this entire document in a night, so there are a few inconsistencies and bugs. Specifically, it will crash on invalid inputs instead of returning a `Maybe` or `Either` type. `Cons`, `Car`, etc. should probably just be a regular enum type instead of the sum type that they are.

 There's also no support for numeric comparisons (Lt, Gt, etc) because it would have added extra complexity outside the scope of the presentation. They would be trivial to add in, though.

 There's no support for macros or lambdas, although lambdas could just be implemented by returning a `Function` symbol directly without modifying the environment.

 As for the code, 90% of it I wrote last night. I took the `Parsec` functions from my old Scheme [interpreter](https://github.com/dreamsmasher/meme), but the evaluation logic is new, and a lot cleaner than my old implementation. On my first try, I was using IORefs under the hood to keep track of environments, but I think passing an `Environment` around is a lot more elegant. I avoided using the `State` monad for the sake of not having to explain monads on top of everything else. As you can see from the type signatures, this is literally just the `State` monad in long/tedious form.

 \begin{code}

 instance Show BuiltinOp where
    show = \case
        Cons a b -> printf "cons %s %s" (show a) (show b)
        Car a -> printf "car %s" (show a)
        Cdr a -> printf "cdr %s" (show a)
        Define a b -> printf "[define] %s -> %s" (show a) (show b)
        If p a b -> printf "%s ? %s : %s" (show p) (show a) (show b)

 instance Show PrimOp where
   show = \case
       Add -> "+"
       Sub -> "-"
       Mul -> "*"
       Div -> "/"
       EqP -> "="

 instance Show Symbol where
    show = \case
        Atom s -> s
        Prim p -> show p
        String s -> show s
        Boolean b -> show b
        Number n -> show n
        Builtin b -> show b
        List xs -> printf "(%s)" (unwords $ map show xs)
        Quoted xs -> show xs
        Pair a b -> printf "(%s . %s)" (show a) (show b) 
        Function args body -> printf "[function] (%s) %s" (unwords $ map show args) (show body)
        Nil -> "nil"


 parseString :: Parser Symbol
 parseString = char '"' >> String <$> P.many (satisfy (/= '"')) <* char '"'

 \end{code}


 \begin{code}

 --"

 -- Parser code for handling the remaining data types

 parseNumber :: Parser Symbol
 parseNumber = try $ do
    sign <- maybe 1 (const (-1)) <$> optionMaybe (char '-')
    n <- some digit
    pure $ Number (sign * read n)

 stripSpc :: Parser a -> Parser a
 stripSpc p = spaces *> p <* spaces

 parseList :: Parser Symbol
 parseList = try $ List <$> stripSpc (sepEndBy parseSymbol spaces)

 symbols :: Parser Char
 symbols = oneOf "!$%&|-*+=/:<=>?@^_~"

 parseNil :: Parser Symbol
 parseNil = try $ Nil <$ string "nil"

 parsePair :: Parser Symbol
 parsePair = try $  do
    h <- endBy parseSymbol spaces
    t <- char '.' >> spaces >> parseSymbol
    pure $ Pair (List h) t

 parseQuoted :: Parser Symbol
 parseQuoted = try $ do
    try $ char '\''
    Quoted <$> parseSymbol

 \end{code}

 \begin{code}

 parseBuiltin :: Parser BuiltinOp
 parseBuiltin = try $ do
    let ops = ["cons", "car", "cdr", "define", "if"]
    let arities = zip ops [2, 1, 1, 2, 3]
    name <- choice (map string ops)
    spaces
    let arity = fromMaybe 0 $ lookup name arities
    case arity of -- super dirty, I know
        3 -> do
            [p, a, b] <- forM [0..2] (const (parseSymbol <* spaces))
            pure $ If p a b
        2 -> do
            let func = if name == "cons" then Cons else Define
            l <- parseSymbol <* spaces
            func l <$> parseSymbol
        1 -> do
            let func = if name == "car" then Car else Cdr
            func <$> parseSymbol

 parseBool :: Parser Symbol
 parseBool = try $
    Boolean . (== "#t") <$> choice (string <$> ["#t", "nil"]) 

 parseSymbol :: Parser Symbol
 parseSymbol = 
    parseNumber 
    <|> parseString
    <|> parseNil
    <|> Prim <$> parsePrim
    <|> Builtin <$> parseBuiltin
    <|> parseAtom
    <|> parseBool
    <|> parseQuoted
    <|> do char '('
           x <- stripSpc (parseList <|> parsePair) 
           char ')'
           pure x

 parseExpr :: String -> Either ParseError Symbol
 parseExpr = runParser parseSymbol () "lisp" 

 replLoop :: Environment -> InputT IO ()
 replLoop env = handleInterrupt (outputStr "" >> replLoop env) $ do
    liftIO (hFlush stdout)
    input <-  getInputLine "λ > "
    let exitNicely = outputStrLn "See you later! (✿ - ‿ ◦)\n" >> pure ()
        handleErr :: E.SomeException -> IO (Environment, Symbol)
        handleErr err = print (show err) >> pure (env, Nil)
    case input of
        Nothing -> exitNicely
        Just ":quit" -> exitNicely
        Just expr -> do
            case parseExpr expr of
                Left err -> outputStrLn (show err) >> replLoop env
                Right p -> do
                  (env', res) <- liftIO (E.catch (pure $ eval env p) handleErr)
                  outputStrLn (show res) >> replLoop env'

 main :: IO ()
 main = runInputT defaultSettings $ withInterrupt (replLoop mempty)
 \end{code}