Skip to content

Instantly share code, notes, and snippets.

@alok
Last active November 15, 2024 07:10
Show Gist options
  • Save alok/16bb1114f677b7c916e286f7f2ce58e0 to your computer and use it in GitHub Desktop.
Save alok/16bb1114f677b7c916e286f7f2ce58e0 to your computer and use it in GitHub Desktop.
A parser for Gwern's GTX format in Lean 4, to play with the syntax
import Lean.Parser
import Std.Internal.Parsec
import Std.Data.HashMap
open Lean.Parser
open Std.Internal.Parsec
/-!
{- GTX: custom document format for Gwern.net, for lightweight writing of annotations.
Author: Gwern Branwen
Date: 2024-02-28
When: Time-stamp: "2024-09-06 19:06:44 gwern"
License: CC-0
A 'GTX' (short for 'Gwern text' until I come up with a better name) text file is a UTF-8 text file
with the extension '.gtx'. It is designed for writing Gwern.net 'annotations': document excerpts/summaries
used for the popups. (It replaces earlier solutions using Haskell Read/Show tuple formatted data files & YAML.)
It avoids named fields, quoting/escaping, or error-prone indentation/nesting/hierarchy, to make it as easy to hand-write entries as possible,
and allow easily appending new entries.
A GTX is a newline-delimited format of records delimited by '---\n' separators. A record contains the following lines:
0. initial '---\n' separator (mandatory)
1. a naked URL (valid URI, text); this is the only mandatory field, all others can be left blank.
2. a title (UTF-8 HTML string)
3. comma-separated list of authors (UTF-8 HTML string)
4. date (YYYY[-MM[-DD]] digit-hyphen ASCII)
- OPTIONAL: a GTX implementation must guarantee that date values are either valid ISO 8601 dates or the empty string; it may however attempt to heuristically parse invalid dates into valid ones, for user convenience (eg. accepting 'September 1st, 1981' and rewriting it to '1981-09-01').
If a GTX implementation does this, and a pseudo-date field cannot be successfully parsed, it should error out rather than return an empty string, as that indicates a serious error in the GTX metadata which needs to be fixed by hand.
5. a 'naked DOI' or a Haskell key-value dictionary
A line for key-value dictionaries (association-lists), for storing miscellaneous information. The most common case is a DOI global identifier. This line must parse by GHC Haskell as a `[(String, String)]` list. It is written out as a blank line for empty lists `[]`, and as a sorted, unique, key-value association list otherwise.
A naked DOI must be: UTF-8 text, no white-space, must contain one '/' forward slash. It must not start with a LEFT SQUARE BRACKET character (which may be technically allowed by the DOI standard), or it will be misparsed as a K-V. Naked DOIs can be read, but will be written out as a key-value list. The purpose of this is to allow convenient writing, without having to generate all of the wrapper like `[("doi", "...")]`; it is then converted to the canonical list at some point later to enable easier editing, like inserting an additional entry.
6. tags (space-separated alpha-numerical strings, which *must* correspond to on-disk directories in `doc/*`); parses into a list of strings. (Tags are semi-mandatory on Gwern.net: ideally every URL would have at least one tag.)
7. an 'abstract': an 'abstract' is all HTML (TODO: permit Markdown as well, as defined by whether the abstract begins with '<' or not) text after the tags and until the next '---\n' separator; it, and no other field, may contain arbitrarily many lines.
An example GTX file with 3 entries of increasing completeness:
```
---
https://jackcook.com/2024/02/23/mamba.html
ai/nn/rnn
---
http://www.demarcken.org/carl/papers/ITA-software-travel-complexity/ITA-software-travel-complexity.pdf
Computational Complexity of Air Travel Planning
Carl de Marcken
2003
cs/computable
---
https://arxiv.org/abs/1003.0358#schmidhuber
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, Juergen Schmidhuber
1 March 2010
10.1162/NECO_a_00052
ai/nn/fully-connected ai/scaling/hardware
<p>Good old on-line back-propagation for plain multi-layer perceptrons yields
a very low 0.35% error rate on the famous <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a>
handwritten digits benchmark.</p> <p>All we need to achieve this best result so
far are many hidden layers, many neurons per layer, numerous deformed training
images, and [Nvidia] graphics cards to greatly speed up learning.</p
```
Note that this format is intended to be extremely inflexible and tailored to the exact use case of writing annotations for Gwern.net, which avoids any need to care about newlines, quote marks, colons, indentation depth lining up, complex parsing etc, that were constant papercuts in writing annotations in more powerful formats like YAML.
Problems with YAML: indentation was easy to get wrong; the YAML writer alternates seemingly at random between single & double-quotes, so editing old annotations (eg. to add a link) would unpredictably break them; the *lack* of quotes also caused the same problem, when editing added a colon character & broke it; the YAML writer forcibly wraps lines at a rather short line length, which breaks many searches (which would work if it only line-broke at a more logical place like the end of a block element like `</p>`); dates could be frustrating to write because while `2000-01-01` would parse as a string, `2000` would *not* and had to be quoted as `"2000"` to ensure it wasn’t turned into an integer; it was easy to omit a field (because if I had required them to be labeled as key-value pairs, that would have meant a lot more typing/repetition); it was impossible to think about having Markdown entries (which would be ideal for writing convenience) because it would have to be wrapped in quotes and whitespace escaped and rendered un-editable natively since they are then hard to read & you might indentation...
The previous format, Haskell, lacked many of YAML’s "helpful" shortcuts and was highly regular; but unfortunately, it still requires whitespace & quotes to be escaped, was slow to parse/write, and the usual encoding, as a list, meant that it could not be appended to (because one would have to move the closing-bracket).
One alternative I didn’t explore too thoroughly was the idea of writing each annotation as a Markdown section; this might have worked but would have required somewhat unnatural formatting like requiring newlines between each field (so they could be unambiguously parsed as separate `Para` AST nodes) or ordered/unordered lists. This would have worked poorly with the increasingly-complicated HTML inside many annotations.
-}
-/
/--
Defines a list/array comprehension syntax similar to Haskell/Python.
Supports filtering with optional predicates.
Examples:
[x * 2 | for x in xs] -- Map
[x | for x in xs if x > 0] -- Filter and map
[f x y | for x in xs, for y in ys] -- Nested comprehension
-/
/--
List comprehension syntax.
Examples:
#[x * 2 | for x in xs] -- Map
#[x | for x in xs if x > 0] -- Filter and map
#[f x y | for x in xs, for y in ys] -- Nested comprehension
-/
syntax "[" term "|" "for" ident "in" term ("," "for" ident "in" term)* ("if" term)? "]" : term
def _root_.List.size (xs : List α) : Nat := xs.length
macro_rules
| `([$t | for $x in $xs if $pred]) => `(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
if $pred then
out := out.push ($t)
out)
| `([$t | for $x in $xs]) => `(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
out := out.push ($t)
out)
| `([$t | for $x in $xs, for $y in $ys if $pred]) =>
`(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
for $y in $ys do
if $pred then
out := out.push ($t)
out)
| `([$t | for $x in $xs, for $y in $ys]) =>
`(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
for $y in $ys do
out := out.push ($t)
out)
/--
Array variant of list comprehension syntax.
Supports filtering with optional predicates.
Examples:
#[x * 2 | for x in xs] -- Map
#[x | for x in xs if x > 0] -- Filter and map
#[f x y | for x in xs, for y in ys] -- Nested comprehension
-/
syntax "#[" term "|" "for" ident "in" term ("," "for" ident "in" term)* ("if" term)? "]" : term
macro_rules
| `(#[$t | for $x in $xs if $pred]) => `(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
if $pred then
out := out.push $t
out)
| `(#[$t | for $x in $xs]) => `(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
out := out.push $t
out)
| `(#[$t | for $x in $xs, for $y in $ys if $pred]) =>
`(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
for $y in $ys do
if $pred then
out := out.push $t
out)
| `(#[$t | for $x in $xs, for $y in $ys]) =>
`(Id.run do
let mut out := Array.mkEmpty ($xs).size
for $x in $xs do
for $y in $ys do
out := out.push $t
out)
#eval #[x * 2 | for x in #[1, 2, 3]]
/-- Hashmap literal syntax -/
syntax "hm![" term,* "]" : term
/--
Hashmap literal syntax.
Example: `hm!["a" -> 1, "b" -> 2]["b"]`
-/
macro_rules
| `(hm![]) => `(Std.HashMap.empty)
| `(hm![ $k -> $v]) => `((Std.HashMap.singleton $k $v)
| `(hm![$k -> $v, $kvs,*]) => `((hm![$kvs,*]).insert $k $v)
#eval hm!["a" -> 1, "b" -> 2]["b"]
#eval hm!["a" -> 1]["a"]
#eval (hm![] : Std.HashMap String Nat)
structure GTXEntry where
url: String
title: String
authors: Array String
date: String
metadata: String -- Either DOI or key-value list
tags: Array String
abstract: String
deriving Repr
instance : ToString GTXEntry where
toString e :=
let url := if e.url.isEmpty then "" else s!"url: {e.url}\n"
let title := if e.title.isEmpty then "" else s!"title: {e.title}\n"
let authors := if e.authors.isEmpty then "" else s!"authors: {e.authors}\n"
let date := if e.date.isEmpty then "" else s!"date: {e.date}\n"
let metadata := if e.metadata.isEmpty then "" else s!"metadata: {e.metadata}\n"
let tags := if e.tags.isEmpty then "" else s!"tags: {e.tags}\n"
let abstract := if e.abstract.isEmpty then "" else s!"abstract: {e.abstract}\n"
s!"GTXEntry(\n{url}{title}{authors}{date}{metadata}{tags}{abstract})"
def parseEntry (input: String) : Option GTXEntry := Id.run do
-- Split into lines
let lines := input.splitOn "\n" |>.toArray
-- Must start with separator
if lines[0]? ≠ some "---" then
return none
-- Parse required URL (line 1)
let url := lines[1]?.getD ""
if url.isEmpty then return none
-- Parse optional fields
let title := lines[2]?.getD ""
let authors := #[x.trim | for x in (lines[3]?.getD "").splitOn ", "]
let date := lines[4]?.getD ""
let metadata := lines[5]?.getD ""
-- Parse tags (space separated)
let tags := #[x | for x in (lines[6]?.getD "").splitOn " "]
-- Parse abstract (remaining lines until next separator)
let mut abstract := ""
let mut i := 7
while i < lines.size && lines[i]?.getD "" ≠ "---" do
if let some line := lines[i]? then
abstract := abstract ++ line ++ "\n"
i := i + 1
return some {
url,
title,
authors,
date,
metadata,
tags,
abstract
}
/-- Parse a GTX file into entries -/
def parseGTX (input : String) : Array GTXEntry := Id.run do
-- Split on "---" separator
let blocks := input.splitOn "---"
let mut entries := #[]
for block in blocks do
if let some entry ← parseEntry ("---" ++ block) then
entries := entries.push entry
return entries
/-- Test if a string is valid GTX -/
def testGTX (input : String) : Bool := Id.run do
let entries := parseGTX input
dbg_trace s!"Parsed entries: {entries}"
return entries.size > 0
/-- Example test cases matching the new format -/
def gtxTests : Array String := #[
"---
https://example.com
Title
Author1, Author2
2024-03-06
10.1234/example
tag1 tag2
This is the abstract
---",
"---
https://minimal.com
ai/nn
Simple entry with just URL and tag
---"
]
#eval #[testGTX x | for x in gtxTests]
def main : IO Unit :=
IO.println s!"Hello, {hello}!"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment