Last active
November 15, 2024 07:10
-
-
Save alok/16bb1114f677b7c916e286f7f2ce58e0 to your computer and use it in GitHub Desktop.
A parser for Gwern's GTX format in Lean 4, to play with the syntax
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
import Lean.Parser | |
import Std.Internal.Parsec | |
import Std.Data.HashMap | |
open Lean.Parser | |
open Std.Internal.Parsec | |
/-! | |
{- GTX: custom document format for Gwern.net, for lightweight writing of annotations. | |
Author: Gwern Branwen | |
Date: 2024-02-28 | |
When: Time-stamp: "2024-09-06 19:06:44 gwern" | |
License: CC-0 | |
A 'GTX' (short for 'Gwern text' until I come up with a better name) text file is a UTF-8 text file | |
with the extension '.gtx'. It is designed for writing Gwern.net 'annotations': document excerpts/summaries | |
used for the popups. (It replaces earlier solutions using Haskell Read/Show tuple formatted data files & YAML.) | |
It avoids named fields, quoting/escaping, or error-prone indentation/nesting/hierarchy, to make it as easy to hand-write entries as possible, | |
and allow easily appending new entries. | |
A GTX is a newline-delimited format of records delimited by '---\n' separators. A record contains the following lines: | |
0. initial '---\n' separator (mandatory) | |
1. a naked URL (valid URI, text); this is the only mandatory field, all others can be left blank. | |
2. a title (UTF-8 HTML string) | |
3. comma-separated list of authors (UTF-8 HTML string) | |
4. date (YYYY[-MM[-DD]] digit-hyphen ASCII) | |
- OPTIONAL: a GTX implementation must guarantee that date values are either valid ISO 8601 dates or the empty string; it may however attempt to heuristically parse invalid dates into valid ones, for user convenience (eg. accepting 'September 1st, 1981' and rewriting it to '1981-09-01'). | |
If a GTX implementation does this, and a pseudo-date field cannot be successfully parsed, it should error out rather than return an empty string, as that indicates a serious error in the GTX metadata which needs to be fixed by hand. | |
5. a 'naked DOI' or a Haskell key-value dictionary | |
A line for key-value dictionaries (association-lists), for storing miscellaneous information. The most common case is a DOI global identifier. This line must parse by GHC Haskell as a `[(String, String)]` list. It is written out as a blank line for empty lists `[]`, and as a sorted, unique, key-value association list otherwise. | |
A naked DOI must be: UTF-8 text, no white-space, must contain one '/' forward slash. It must not start with a LEFT SQUARE BRACKET character (which may be technically allowed by the DOI standard), or it will be misparsed as a K-V. Naked DOIs can be read, but will be written out as a key-value list. The purpose of this is to allow convenient writing, without having to generate all of the wrapper like `[("doi", "...")]`; it is then converted to the canonical list at some point later to enable easier editing, like inserting an additional entry. | |
6. tags (space-separated alpha-numerical strings, which *must* correspond to on-disk directories in `doc/*`); parses into a list of strings. (Tags are semi-mandatory on Gwern.net: ideally every URL would have at least one tag.) | |
7. an 'abstract': an 'abstract' is all HTML (TODO: permit Markdown as well, as defined by whether the abstract begins with '<' or not) text after the tags and until the next '---\n' separator; it, and no other field, may contain arbitrarily many lines. | |
An example GTX file with 3 entries of increasing completeness: | |
``` | |
--- | |
https://jackcook.com/2024/02/23/mamba.html | |
ai/nn/rnn | |
--- | |
http://www.demarcken.org/carl/papers/ITA-software-travel-complexity/ITA-software-travel-complexity.pdf | |
Computational Complexity of Air Travel Planning | |
Carl de Marcken | |
2003 | |
cs/computable | |
--- | |
https://arxiv.org/abs/1003.0358#schmidhuber | |
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition | |
Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, Juergen Schmidhuber | |
1 March 2010 | |
10.1162/NECO_a_00052 | |
ai/nn/fully-connected ai/scaling/hardware | |
<p>Good old on-line back-propagation for plain multi-layer perceptrons yields | |
a very low 0.35% error rate on the famous <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a> | |
handwritten digits benchmark.</p> <p>All we need to achieve this best result so | |
far are many hidden layers, many neurons per layer, numerous deformed training | |
images, and [Nvidia] graphics cards to greatly speed up learning.</p | |
``` | |
Note that this format is intended to be extremely inflexible and tailored to the exact use case of writing annotations for Gwern.net, which avoids any need to care about newlines, quote marks, colons, indentation depth lining up, complex parsing etc, that were constant papercuts in writing annotations in more powerful formats like YAML. | |
Problems with YAML: indentation was easy to get wrong; the YAML writer alternates seemingly at random between single & double-quotes, so editing old annotations (eg. to add a link) would unpredictably break them; the *lack* of quotes also caused the same problem, when editing added a colon character & broke it; the YAML writer forcibly wraps lines at a rather short line length, which breaks many searches (which would work if it only line-broke at a more logical place like the end of a block element like `</p>`); dates could be frustrating to write because while `2000-01-01` would parse as a string, `2000` would *not* and had to be quoted as `"2000"` to ensure it wasn’t turned into an integer; it was easy to omit a field (because if I had required them to be labeled as key-value pairs, that would have meant a lot more typing/repetition); it was impossible to think about having Markdown entries (which would be ideal for writing convenience) because it would have to be wrapped in quotes and whitespace escaped and rendered un-editable natively since they are then hard to read & you might indentation... | |
The previous format, Haskell, lacked many of YAML’s "helpful" shortcuts and was highly regular; but unfortunately, it still requires whitespace & quotes to be escaped, was slow to parse/write, and the usual encoding, as a list, meant that it could not be appended to (because one would have to move the closing-bracket). | |
One alternative I didn’t explore too thoroughly was the idea of writing each annotation as a Markdown section; this might have worked but would have required somewhat unnatural formatting like requiring newlines between each field (so they could be unambiguously parsed as separate `Para` AST nodes) or ordered/unordered lists. This would have worked poorly with the increasingly-complicated HTML inside many annotations. | |
-} | |
-/ | |
/-- | |
Defines a list/array comprehension syntax similar to Haskell/Python. | |
Supports filtering with optional predicates. | |
Examples: | |
[x * 2 | for x in xs] -- Map | |
[x | for x in xs if x > 0] -- Filter and map | |
[f x y | for x in xs, for y in ys] -- Nested comprehension | |
-/ | |
/-- | |
List comprehension syntax. | |
Examples: | |
#[x * 2 | for x in xs] -- Map | |
#[x | for x in xs if x > 0] -- Filter and map | |
#[f x y | for x in xs, for y in ys] -- Nested comprehension | |
-/ | |
syntax "[" term "|" "for" ident "in" term ("," "for" ident "in" term)* ("if" term)? "]" : term | |
def _root_.List.size (xs : List α) : Nat := xs.length | |
macro_rules | |
| `([$t | for $x in $xs if $pred]) => `(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
if $pred then | |
out := out.push ($t) | |
out) | |
| `([$t | for $x in $xs]) => `(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
out := out.push ($t) | |
out) | |
| `([$t | for $x in $xs, for $y in $ys if $pred]) => | |
`(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
for $y in $ys do | |
if $pred then | |
out := out.push ($t) | |
out) | |
| `([$t | for $x in $xs, for $y in $ys]) => | |
`(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
for $y in $ys do | |
out := out.push ($t) | |
out) | |
/-- | |
Array variant of list comprehension syntax. | |
Supports filtering with optional predicates. | |
Examples: | |
#[x * 2 | for x in xs] -- Map | |
#[x | for x in xs if x > 0] -- Filter and map | |
#[f x y | for x in xs, for y in ys] -- Nested comprehension | |
-/ | |
syntax "#[" term "|" "for" ident "in" term ("," "for" ident "in" term)* ("if" term)? "]" : term | |
macro_rules | |
| `(#[$t | for $x in $xs if $pred]) => `(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
if $pred then | |
out := out.push $t | |
out) | |
| `(#[$t | for $x in $xs]) => `(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
out := out.push $t | |
out) | |
| `(#[$t | for $x in $xs, for $y in $ys if $pred]) => | |
`(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
for $y in $ys do | |
if $pred then | |
out := out.push $t | |
out) | |
| `(#[$t | for $x in $xs, for $y in $ys]) => | |
`(Id.run do | |
let mut out := Array.mkEmpty ($xs).size | |
for $x in $xs do | |
for $y in $ys do | |
out := out.push $t | |
out) | |
#eval #[x * 2 | for x in #[1, 2, 3]] | |
/-- Hashmap literal syntax -/ | |
syntax "hm![" term,* "]" : term | |
/-- | |
Hashmap literal syntax. | |
Example: `hm!["a" -> 1, "b" -> 2]["b"]` | |
-/ | |
macro_rules | |
| `(hm![]) => `(Std.HashMap.empty) | |
| `(hm![ $k -> $v]) => `((Std.HashMap.singleton $k $v) | |
| `(hm![$k -> $v, $kvs,*]) => `((hm![$kvs,*]).insert $k $v) | |
#eval hm!["a" -> 1, "b" -> 2]["b"] | |
#eval hm!["a" -> 1]["a"] | |
#eval (hm![] : Std.HashMap String Nat) | |
structure GTXEntry where | |
url: String | |
title: String | |
authors: Array String | |
date: String | |
metadata: String -- Either DOI or key-value list | |
tags: Array String | |
abstract: String | |
deriving Repr | |
instance : ToString GTXEntry where | |
toString e := | |
let url := if e.url.isEmpty then "" else s!"url: {e.url}\n" | |
let title := if e.title.isEmpty then "" else s!"title: {e.title}\n" | |
let authors := if e.authors.isEmpty then "" else s!"authors: {e.authors}\n" | |
let date := if e.date.isEmpty then "" else s!"date: {e.date}\n" | |
let metadata := if e.metadata.isEmpty then "" else s!"metadata: {e.metadata}\n" | |
let tags := if e.tags.isEmpty then "" else s!"tags: {e.tags}\n" | |
let abstract := if e.abstract.isEmpty then "" else s!"abstract: {e.abstract}\n" | |
s!"GTXEntry(\n{url}{title}{authors}{date}{metadata}{tags}{abstract})" | |
def parseEntry (input: String) : Option GTXEntry := Id.run do | |
-- Split into lines | |
let lines := input.splitOn "\n" |>.toArray | |
-- Must start with separator | |
if lines[0]? ≠ some "---" then | |
return none | |
-- Parse required URL (line 1) | |
let url := lines[1]?.getD "" | |
if url.isEmpty then return none | |
-- Parse optional fields | |
let title := lines[2]?.getD "" | |
let authors := #[x.trim | for x in (lines[3]?.getD "").splitOn ", "] | |
let date := lines[4]?.getD "" | |
let metadata := lines[5]?.getD "" | |
-- Parse tags (space separated) | |
let tags := #[x | for x in (lines[6]?.getD "").splitOn " "] | |
-- Parse abstract (remaining lines until next separator) | |
let mut abstract := "" | |
let mut i := 7 | |
while i < lines.size && lines[i]?.getD "" ≠ "---" do | |
if let some line := lines[i]? then | |
abstract := abstract ++ line ++ "\n" | |
i := i + 1 | |
return some { | |
url, | |
title, | |
authors, | |
date, | |
metadata, | |
tags, | |
abstract | |
} | |
/-- Parse a GTX file into entries -/ | |
def parseGTX (input : String) : Array GTXEntry := Id.run do | |
-- Split on "---" separator | |
let blocks := input.splitOn "---" | |
let mut entries := #[] | |
for block in blocks do | |
if let some entry ← parseEntry ("---" ++ block) then | |
entries := entries.push entry | |
return entries | |
/-- Test if a string is valid GTX -/ | |
def testGTX (input : String) : Bool := Id.run do | |
let entries := parseGTX input | |
dbg_trace s!"Parsed entries: {entries}" | |
return entries.size > 0 | |
/-- Example test cases matching the new format -/ | |
def gtxTests : Array String := #[ | |
"--- | |
https://example.com | |
Title | |
Author1, Author2 | |
2024-03-06 | |
10.1234/example | |
tag1 tag2 | |
This is the abstract | |
---", | |
"--- | |
https://minimal.com | |
ai/nn | |
Simple entry with just URL and tag | |
---" | |
] | |
#eval #[testGTX x | for x in gtxTests] | |
def main : IO Unit := | |
IO.println s!"Hello, {hello}!" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment