Skip to content

Instantly share code, notes, and snippets.

@dreamsmasher
Last active July 12, 2022 21:35
Show Gist options
  • Save dreamsmasher/7f3cbe474ca823e99eb2a913a32c6e4e to your computer and use it in GitHub Desktop.
Save dreamsmasher/7f3cbe474ca823e99eb2a913a32c6e4e to your computer and use it in GitHub Desktop.
Conjuring the spirits of the computer with our spells (HRR50 Senior Presentation)
---
title: Conjuring the sprits of the computer with our spells
author: Naomi Liu
patat:
wrap: true
margins:
left: 10
right: 10
incrementalLists: true
---
<!--
Copyright 2021 Naomi Liu.
This is a literate Haskell file, formatted for presentation in `patat`. If you want to run it locally, you'll need a few things:
GHC > 8.10.2
haskeline
containers
transformers
parsec
-->
```python
\=========\\===========\
\888888888\\ssssssssss\\
\888888888\\ssssssssss\\
\888888888\\sssssssssss\
\888888888\\sssssssssss\
\888888888\\sssssssssss\ ===================
\888888888\\sssssssssss\ #>=>>=>>=>>=>>=>>=
Conjuring the spirits \888888888\\sssssssssss\ ================
of the computer \888888888\\sssssssssss\
with our spells: /888888888//ssssssssssss\
how programming /888888888//ssssssssssssss\ ==============
languages are /888888888//ssssssssssssssss\ #=>>=>>=>>=>>
implemented /888888888//ssssssss/ \sssssss\ ============
/888888888//ssssssss/ \sssssss\
(@^◡^) /888888888//ssssssss/ \sssssss\
Naomi Liu /888888888//ssssssss/ \sssssss\
nliu.net /888888888//ssssssss/ \sssssss\
/=========/ ========/ \========
```
---
Core ideas:
----------
. . .
- Evaluating code is a combination of:
- parsing input
- transforming that code into an AST
- walking the tree, reducing nodes by evaluating expressions
- maintaining an environment mapping keys to values, for variable lookup
---
Why implement a new language?
------
- you're fed up with current ecosystems
- you want to experiment with new ideas
- for the learning experience (brain gains)
- it's fun, like **really really** fun.
---
\begin{code}
{-# LANGUAGE LambdaCase, ImportQualifiedPost #-}
import Text.Printf
import Debug.Trace
import Data.Map (Map)
import Data.Map qualified as M
import Data.Maybe
import Data.Char
import Data.List
import Text.ParserCombinators.Parsec as P
import Text.Parsec.Combinator
import Control.Monad
import Control.Arrow
import Control.Applicative hiding ((<|>))
import Control.Monad.Trans.Reader
import System.Console.Haskeline
import System.IO
import Control.Monad.IO.Class
import Control.Exception qualified as E
\end{code}
don't worry about these ;)
---
The simplest language to evaluate: **Hutton's razor**
At a high level, programs are just a tree of expressions:
\begin{code}
data Expr = Constant Int | AddTwo Expr Expr
-- a value of type Expr can be a constant integer
-- or two Exprs wrapped in a constructor
-- e.g. 1 + 2 can be expresed as Add (Constant 1) (Constant 2)
evalHutton :: Expr -> Int
evalHutton (Constant n) = n
evalHutton (AddTwo a b) = evalHutton a + evalHutton b
\end{code}
we can't directly add two `Expr`s, but we can unwrap them!
---
Visualizing this tree:
```haskell
[ + ] --- Add ---
1 + 2 -> / \ -> / \
1 2 (Constant 1) (Constant 2)
evalHutton (Constant n) = n
evalHutton (AddTwo a b) = evalHutton a + evalHutton b
```
. . .
If we encounter a constant value, we can just return the value.
. . .
If we encounter an operator, we walk down the tree, evaluating both arguments before adding them together.
. . .
Evaluation is really just a postorder traversal down an n-ary tree.
---
Moving on up
-----
To evaluate a node, you need to evaluate its children first
`f(1, 2) + 3` doesn't make sense unless `f(1, 2)` becomes a number
---
Core steps
-----
- Turning human-readable code into something that's more easily transformed
. . .
- Tagging expressions with their types
. . .
- Rearranging these tagged expressions into a tree
. . .
- Walk the tree, looking at the type of a current node and acting accordingly
---
There are two main strategies for turning your human-readable source code into instructions for the computer:
. . .
- compiling the code into a more-direct representation
. . .
- using a program that takes in source code as input and evaluates it for you.
---
The main steps to this process:
----
- Lexing/Tokenizing
. . .
- Parsing into IR (intermediate representation)
. . .
- Compilers and interpreters diverge here:
| Compilers | Interpreters |
|-----------|--------------|
| Static analysis | Direct evaluation|
| Compile-time Optimization*\** | Runtime optimization (e.g. analyzing hot loops) |
| Assembly generation | Bytecode generation (optional) |
*\**(e.g. dead code elimination, strength reduction)
. . .
The bytecode generation step can come before, during, or after runtime optimization.
- (e.g. Java, LuaJIT, Pypy)
---
How do I do the do???
-----
write a parser first
---------
- define a grammar for your language, e.g.:
- expressions always come between parentheses
- `(+ 1 2 3)`
. . .
- lists: comma separated values between square brackets
- `[1, 2, 3]`
. . .
- strings: any body of text between two double quotes
- "UwU"
. . .
- comments: anything between either:
- `//` and `\n`
- `/*` and `*/`
```javascript
const s = 'foo'; // unnecessary
/* i think lodash exports this
import is-even from 'is-even';
*/
```
---
Some example parsing code:
Scan for a character - if it matches a reserved operator, return the corresponding tag
\begin{code}
parsePrim :: Parser PrimOp
parsePrim = try $ do
op <- oneOf "+-*/="
pure $ case op of
'+' -> Add
'-' -> Sub
'*' -> Mul
'/' -> Div
'=' -> EqP
parseAtom :: Parser Symbol
parseAtom = try $ do
first <- try letter <|> symbols
rest <- P.many (try letter <|> try digit <|> try symbols)
pure $ Atom (first : rest)
-- <|> is a choice between parsers
-- the start of an Atom is parsed differently
-- we keep reading until we hit a character that isn't a letter, digit, or symbol
-- at the end: return the first char + rest of string, tagged as an atomic identifier
\end{code}
---
The rough symbolic structure of your language might look like
-----
\begin{code}
data BuiltinOp = Cons Symbol Symbol -- attach an element onto a list
| Car Symbol -- get the head of a list
| Cdr Symbol -- get the tail of a list
| Define Symbol Symbol -- bind a list to a name
| If Symbol Symbol Symbol
deriving (Eq)
data PrimOp = Add | Sub | Mul | Div | EqP -- Mod | Exp | Eq | Gt | Lt ...
deriving (Eq)
\end{code}
---
Next, you would assign rough types to your expressions:
----
\begin{code}
data Symbol = Atom String -- an identifier
| Function [Symbol] Symbol -- func (...args) (body)
| Builtin BuiltinOp -- evaluated differently
| Prim PrimOp -- likewise
| String [Char]
| Boolean Bool
| Number Int
| List [Symbol] -- your function body
| Quoted Symbol -- a literal list
| Pair Symbol Symbol -- (cons 1 2)
| Nil
-- pairs are special because they form the basis for lists
deriving (Eq)
\end{code}
---
Your semantics should be **clear**
. . .
`(func arg1 arg2)`
should be parsed the same as
. . .
`( func arg1 arg2 )`
---
**Optionally:** write a tokenizer first to make parsing easier
----
- condense multi-character sequences down into atomic symbols
- e.g. `//` could expressed as a single symbol COMMENT_START
---
```scheme
(func arg1 arg2)
```
could be something like
```haskell
[ _LeftParen
, _SymbolName "func"
, _SymbolName "arg1"
, _SymbolName "arg2"
, _RightParen
]
```
Don't have to worry about whitespace, contextual character sequences, etc. during the actual parsing phase
---
Any syntactic dividers like parentheses, brackets, etc.
only exist when you're writing the code
They exist to give hints about the structure of the program.
. . .
Parsed, the expression might look like:
. . .
```haskell
-- (func arg1 arg2)
List
[ Atom "func"
, Atom "arg1"
, Atom "arg2"
]
-- recall our types:
-- List [ Symbol ]
-- Atom String
```
---
Some languages are more amenable to parsing than others.
---------
```scheme
;; lisp
(define (fac n)
(if (= n 0) 1
(* n (fac (- n 1)))))
```
Any expression within parentheses has the form `(operator, ...args)`.
The more parentheses an expression has, the more deeply-nested it is.
---
```javascript
// javascript
const fac = (n) => {
if (n === 0) {
return 1;
}
return n * fac(n - 1);
};
// bonus:
// facOneLiner = n => n ? n * fac(n - 1) : 1;
```
Curly braces denote code blocks or objects.
`if` is always followed by `(conditional)`, then a code block/expression.
---
```python
# python
def fac(n):
# the scope of a code block is dependent
# on its indentation relative to its surroundings
if n == 0: # if n == 0:
return 1 # return 1 # syntax error
return n * fac(n - 1) # requires context to be known when parsing
```
---
<!--
ask: If you had to write a parser for one of these languages RIGHT NOW, which one would you pick ?
-->
Lisp is probably the easiest to parse.
- it has the fewest edge-cases
- context-free grammar
Converted into an AST, `fac` might look like:
---
``` haskell
Define -- (define
List [
Atom "fac", -- (fac n)
Atom "n"
]
List [
Atom "if", -- (if
List [
Prim Eq, -- (= n 0)
Atom "n",
Number 0
],
```
---
```haskell
Number 1, -- 1
List [ -- `else` is implicit
Prim Mul, -- (* n
Atom "n",
List [
Atom "fac", -- (fac
List [
Prim Sub, -- (- n 1)
Atom "n",
Number 1
] --))))
]
]
]
```
---
Very, VERY rarely do you end up with parse-trees that are human readable.
. . .
Lisp restricts every function call to be in prefix position:
```scheme
(defvar arg1 arg2 ...)
```
which makes things a LOT easier for the programmer.
. . .
They also lend the language some pretty magical semantics.
---
Now what??
---------
We need to call the function.
``` scheme
(fac 3)
```
When the function is called with arguments,
we can maintain a map
\begin{code}
type Environment = Map String Symbol
\end{code}
from variables to names to values as we traverse the AST.
These names can refer to functions, lists, or any representable variable.
---
Let's expand on Hutton's Razor a bit:
---
\begin{code}
evalPrim :: Environment -> PrimOp -> [Symbol] -> (Environment, Symbol)
evalPrim env EqP args = case map (snd . eval env) args of
[] -> (env, Boolean True) -- special case
(x:xs) -> (env, Boolean $ all (== x) xs)
evalPrim env sym args =
let reducingFunc = case sym of
Add -> (+) -- we interpret the representation of
Sub -> (-) -- primitive operations as actual functions
Mul -> (*)
Div -> div
unwrapAndApply :: (Int -> Int -> Int) -> Symbol -> Symbol -> Symbol
unwrapAndApply f (Number a) (Number b) = Number (f a b)
-- we can't directly add two Numbers, we have to unwrap them
-- then wrap them back up in a constructor
\end{code}
---
\begin{code}
-- args :: [Symbol]
in case args of
[] -> (env, Number 0)
-- we need at least one value to apply these functions to
(head:tail) ->
let (newEnv, evaluated)
= threadEnv env (head : tail)
-- reduce every child AST node into a form we can work with
in (env, foldr1 (unwrapAndApply reducingFunc) evaluated)
-- foldr encapsulates recursing over a list
-- equivalent to js .reduce() (but lazy!)
\end{code}
---
Where it gets really interesting: threading an environment
----
If we have a list of expressions at the same level, chances are
that we want a local scope to reflect the semantics of our code.
. . .
If you write
``` javascript
let a = 1;
console.log(a);
```
. . .
you would hope that `a` contains the value you just assigned to it.
As we walk down the tree, we need to make sure that nodes at the **same level**
are able to share an environment!
\begin{code}
threadEnv :: Environment -> [Symbol] -> (Environment, [Symbol])
threadEnv env list = mapAccumL
(\env' node -> eval env' node)
env
list
-- mapAccumL is like map, except it passes an accumulating state
-- between invocations
-- this allows us to gradually change an environment as we evaluate each
-- AST node in turn
-- and finally return (new environment, list of results)
\end{code}
---
Let's write some boilerplate for `BuiltinOps`:
\begin{code}
evalBuiltin :: Environment -> BuiltinOp -> (Environment, Symbol)
evalBuiltin env = \case
Cons a (Quoted (List xs)) -> (env, Quoted $ List (a : xs)) -- (1, (2, (3, ...))) is a linked list
Cons a b -> (env, Pair a b) -- forms a pair otherwise (R6RS Scheme standard)
Car (Pair a _) -> (env, a)
Car (List (x:xs)) -> (env, x) -- since we have two ways of expressing lists
Cdr (Pair _ b) -> (env, b)
Cdr (List (_:xs)) -> (env, List xs)
If p x y ->
let (Boolean b) = snd $ eval env p in eval env (if b then x else y)
Define (Atom key) val -> let (_, val') = eval env val
in (M.insert key val' env, Atom key)
-- store the arguments and body in our map, stored with the key
Define (List (Atom key : args)) body ->
let funcSym = Function args body
in (M.insert key funcSym env, Atom key)
-- save a record of this definition in the environment
-- we'll let any other combinations fall through and throw an error
\end{code}
---
The core of our interpreter: `eval`
-----
<!--
say something interesting here
-->
\begin{code}
eval :: Environment -> Symbol -> (Environment, Symbol)
eval env sym = case sym of
Number n -> (env, sym)
String s -> (env, sym)
-- we don't need to really compute atomic values,
-- since there's no further work to do
-- so we'll use a wildcard to match these basic types
-- assume that all inputs are valid for now,
-- error handling is outside our scope
\end{code}
---
Cases we've already covered:
----
\begin{code}
Atom a -> (env, fromJust $ M.lookup a env)
-- return the stored value, which is a Symbol
Builtin b -> evalBuiltin env b
List xs -> case xs of
-- pattern match on the list
(Prim p: args) -> evalPrim env p args
\end{code}
---
Recursively evaluating function calls:
----
- ignore the environment returned from the child
- this is how we get scoping!
\begin{code}
(Atom a: args) ->
let
(_, Function argNames body) = eval env (Atom a)
-- any symbol at the beginning of an expression list
-- HAS to be a function
\end{code}
---
Creating a new environment with bound arguments
- we need to evaluate children before
\begin{code}
evaledArgs = map ((\(_, val) -> val) . eval env) args
-- we don't care about those returned environments
unwrappedNames = map (\(Atom a) -> a) argNames
-- we need to unwrap these constructors to get their
-- underlying strings
\end{code}
---
\begin{code}
boundNames = zip unwrappedNames evaledArgs
-- [(variable name, evaluated value)]
boundEnv = foldr
(\(name, val) -> M.insert name val)
env
boundNames
-- {variable name: value}
-- create a new Map, combining these pairs
-- with our current env
\end{code}
---
finally:
----
evaluate the body with arguments bound to their names
\begin{code}
in eval boundEnv body
\end{code}
. . .
\begin{code}
[singleton] -> eval env singleton
_ -> (env, Nil)
_ -> (env, sym)
-- catchall for other simple values
\end{code}
---
Now you have a programming language, a spicy job offer, and a pretty nice toy to play around with.
Are we done? Not yet.
---
Optimizations
---------
- High-level languages offer a faster development/debugging cycle
. . .
- at the cost of performance
. . .
If you want your language to be used in production, it's time to drop down a few levels.
---
Why are low level languages more efficient?
- Finer grained control over details
. . .
- memory management
. . .
- closer to the bare hardware
. . .
- fewer things to take care of (e.g. closures, objects, etc)
. . .
- simulating a computer is already intensive, so it's ideal to try and minimize that overhead as much as possible
. . .
- the higher you go, the more work has to be done to run your code
. . .
- imagine if JS was written in JS?
---
Writing a runtime
---------
Higher level languages abstract away the details from you
Naturally, going lower means that you need to care take of those details yourself
. . .
Considerations
----------------
Allocating space for closures/objects
. . .
- how do keep track of where values are stored?
- when they aren't used anymore?
- freeing memory that's unused?
- in other words: **collecting garbage**
---
Making your language usable
. . .
- You didn't write `express.js`, and you definitely didn't write `http`.
. . .
- You'll need to, if you want people to use your language
. . .
- Library quality will make or break a new programming language
. . .
- Math, networking, sorting, data structures, etc. are core pieces of functionality you'll need to implement
. . .
- It also serves as a non-trivial test for your new language, serving as a proof of your project's functionality
. . .
- Why build a car without driving it?
---
Summary
- Writing a new language is a non-trivial undertaking
. . .
- You'll gain a greater appreciation and understanding of the work that has gone into our tooling over the years
. . .
- If you're lucky, you might gain some popularity
. . .
- If you're *REALLY* lucky, you'll become the next `PHP`.
---
Thank you!
---
^ Presentation ends there
This is the rest of the parsing/helper code - I wrote this entire document in a night, so there are a few inconsistencies and bugs. Specifically, it will crash on invalid inputs instead of returning a `Maybe` or `Either` type. `Cons`, `Car`, etc. should probably just be a regular enum type instead of the sum type that they are.
There's also no support for numeric comparisons (Lt, Gt, etc) because it would have added extra complexity outside the scope of the presentation. They would be trivial to add in, though.
There's no support for macros or lambdas, although lambdas could just be implemented by returning a `Function` symbol directly without modifying the environment.
As for the code, 90% of it I wrote last night. I took the `Parsec` functions from my old Scheme [interpreter](https://github.com/dreamsmasher/meme), but the evaluation logic is new, and a lot cleaner than my old implementation. On my first try, I was using IORefs under the hood to keep track of environments, but I think passing an `Environment` around is a lot more elegant. I avoided using the `State` monad for the sake of not having to explain monads on top of everything else. As you can see from the type signatures, this is literally just the `State` monad in long/tedious form.
\begin{code}
instance Show BuiltinOp where
show = \case
Cons a b -> printf "cons %s %s" (show a) (show b)
Car a -> printf "car %s" (show a)
Cdr a -> printf "cdr %s" (show a)
Define a b -> printf "[define] %s -> %s" (show a) (show b)
If p a b -> printf "%s ? %s : %s" (show p) (show a) (show b)
instance Show PrimOp where
show = \case
Add -> "+"
Sub -> "-"
Mul -> "*"
Div -> "/"
EqP -> "="
instance Show Symbol where
show = \case
Atom s -> s
Prim p -> show p
String s -> show s
Boolean b -> show b
Number n -> show n
Builtin b -> show b
List xs -> printf "(%s)" (unwords $ map show xs)
Quoted xs -> show xs
Pair a b -> printf "(%s . %s)" (show a) (show b)
Function args body -> printf "[function] (%s) %s" (unwords $ map show args) (show body)
Nil -> "nil"
parseString :: Parser Symbol
parseString = char '"' >> String <$> P.many (satisfy (/= '"')) <* char '"'
\end{code}
\begin{code}
--"
-- Parser code for handling the remaining data types
parseNumber :: Parser Symbol
parseNumber = try $ do
sign <- maybe 1 (const (-1)) <$> optionMaybe (char '-')
n <- some digit
pure $ Number (sign * read n)
stripSpc :: Parser a -> Parser a
stripSpc p = spaces *> p <* spaces
parseList :: Parser Symbol
parseList = try $ List <$> stripSpc (sepEndBy parseSymbol spaces)
symbols :: Parser Char
symbols = oneOf "!$%&|-*+=/:<=>?@^_~"
parseNil :: Parser Symbol
parseNil = try $ Nil <$ string "nil"
parsePair :: Parser Symbol
parsePair = try $ do
h <- endBy parseSymbol spaces
t <- char '.' >> spaces >> parseSymbol
pure $ Pair (List h) t
parseQuoted :: Parser Symbol
parseQuoted = try $ do
try $ char '\''
Quoted <$> parseSymbol
\end{code}
\begin{code}
parseBuiltin :: Parser BuiltinOp
parseBuiltin = try $ do
let ops = ["cons", "car", "cdr", "define", "if"]
let arities = zip ops [2, 1, 1, 2, 3]
name <- choice (map string ops)
spaces
let arity = fromMaybe 0 $ lookup name arities
case arity of -- super dirty, I know
3 -> do
[p, a, b] <- forM [0..2] (const (parseSymbol <* spaces))
pure $ If p a b
2 -> do
let func = if name == "cons" then Cons else Define
l <- parseSymbol <* spaces
func l <$> parseSymbol
1 -> do
let func = if name == "car" then Car else Cdr
func <$> parseSymbol
parseBool :: Parser Symbol
parseBool = try $
Boolean . (== "#t") <$> choice (string <$> ["#t", "nil"])
parseSymbol :: Parser Symbol
parseSymbol =
parseNumber
<|> parseString
<|> parseNil
<|> Prim <$> parsePrim
<|> Builtin <$> parseBuiltin
<|> parseAtom
<|> parseBool
<|> parseQuoted
<|> do char '('
x <- stripSpc (parseList <|> parsePair)
char ')'
pure x
parseExpr :: String -> Either ParseError Symbol
parseExpr = runParser parseSymbol () "lisp"
replLoop :: Environment -> InputT IO ()
replLoop env = handleInterrupt (outputStr "" >> replLoop env) $ do
liftIO (hFlush stdout)
input <- getInputLine "λ > "
let exitNicely = outputStrLn "See you later! (✿ - ‿ ◦)\n" >> pure ()
handleErr :: E.SomeException -> IO (Environment, Symbol)
handleErr err = print (show err) >> pure (env, Nil)
case input of
Nothing -> exitNicely
Just ":quit" -> exitNicely
Just expr -> do
case parseExpr expr of
Left err -> outputStrLn (show err) >> replLoop env
Right p -> do
(env', res) <- liftIO (E.catch (pure $ eval env p) handleErr)
outputStrLn (show res) >> replLoop env'
main :: IO ()
main = runInputT defaultSettings $ withInterrupt (replLoop mempty)
\end{code}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment