@emilypi
Last active April 11, 2020 03:42
Yaml (Tweag Fellowship Proposal Annex)


This proposal represents the current understanding of the intent and implementation schedule for a new YAML library called Yaml that is YAML 1.2 spec-compliant, written in pure Haskell, and released under the BSD 3-Clause license. In addition to the core library, we will implement a range of dependent libraries that provide streaming, optic, and stream-transducer support in a modular way, with a minimal dependency footprint.

Introduction

Haskell has two existing implementations of YAML (YAML Ain't Markup Language) publicly available through the Hackage ecosystem: yaml and HsYaml. Both of these packages represent years of work and value for Haskell developers and businesses alike, and have developed into small ecosystems unto themselves in terms of downstream dependencies. Both libraries, unfortunately, also have irreconcilable flaws that warrant the creation of a newer, more modular, and more open set of YAML packages.

Motivation

To be concrete about these flaws:

  • yaml depends upon libyaml, and serves as a wrapper for that C library. It is also a "one size fits all" library: it covers parsing, streaming, and pretty printing, ships two executables for JSON conversion along with library-level JSON support, and includes Template Haskell bindings. It was built primarily as a driver for Yesod development, and reflects those biases deeply in the code. Its streaming support is considered highly experimental. Its parser does not implement the YAML 1.2 specification fully, and has surprising behavior at the edges. The dependency footprint is also very heavy.

  • HsYaml is a standalone pure Haskell implementation of the YAML 1.2 specification, but features nothing more than a library, with little downstream ecosystem support aside from HsYaml-aeson (a JSON support module for conversions between JSON and YAML). It has no streaming or optic support to speak of. It is also released under a GPL-2.0-only license, which makes it a hard sell in the United States, both for business users to adopt and for other Haskell developers to extend with downstream libraries.

This may seem like a harsh critique, but it is done with respect, through the hindsight of 10 years of existence between both projects. I have even discussed this with Herbert (aka HVR, author of HsYaml) and he gave it his blessing once we confirmed that he was bound by previous engagements to not change the license for HsYaml.

In particular, I make these critiques because this presents an opportunity for Haskell developers to step in and write a new, more modern and modular YAML implementation that addresses these problems and improves the ecosystem for all Haskell developers at the same time. The result will be a production-ready suite of modular, platform-agnostic YAML libraries with a minimal dependency footprint, that can be easily extended and built upon by the rest of the Haskell community.

Project Description

The project will be split into four individual libraries:

  1. Yaml: this will be a pure implementation of the YAML 1.2 specification in Haskell, released under a BSD 3-Clause license. Importantly, this library will focus on parsing performance and totality of specification coverage.

  2. Yaml-lens and Yaml-microlens: these libraries will provide optics for the Yaml core library, built on lens and microlens respectively. In the same way that aeson has the celebrated optical interface in lens-aeson, Yaml can have one too!

  3. Yaml-streaming: this library represents YAML streaming support using the streaming family of libraries in conjunction with the Yaml core library. YAML is often used for streaming data events, and streaming is a relatively new addition to the traditional Haskell streaming ecosystem. It represents a step forward in terms of performance and design. Libraries such as pipes and io-streams can be supported later, perhaps by other Haskellers as a GSoC or Tweag Fellowship project.

  4. Yaml-machines: this library represents support for a machines-based stream-transducer library for Yaml core. This has been requested by members of the community currently working on consulting projects making use of machines and machines-encoding, and so has concrete business value.

Project Phases

This project will be broken down into 2 phases: implementation of the core Yaml library, and implementation of the peripheral downstream libraries.

In phase one, translating the YAML 1.2 spec to Haskell will be the most time-consuming task, as the spec is large. However, Haskell has prior art here, and I am in frequent communication with HVR, who has already implemented the specification and can offer guidance with regard to totality of the spec. I have expertise in performance tuning libraries, and have many libraries already to my name that make use of more advanced features in that area.

Phase One - The Spec

Concretely, Yaml as a processor needs to support the 4 layers of the YAML lifecycle: Native, Representation, Serialization, and Presentation, as depicted by this graph:

[Figure: Yaml 1.2]

This graph can be addressed with two modules:

  • one to handle lexing and parsing of character streams into events and presenting events as character streams.
  • one to translate node representations into native data structures (and also vice versa).
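As a rough illustration of how the middle layers relate, here is a minimal sketch of a Serialization-layer event stream and a Representation-layer node tree, with conversions in both directions. Every name here (`Event`, `Node`, `serialize`, `compose`) is hypothetical and not part of the proposal, and the model is heavily simplified: it ignores anchors, aliases, tags, and documents.

```haskell
-- Hypothetical, simplified types; not the proposed library's API.

-- | Serialization layer: a flat stream of events.
data Event
  = ScalarE String
  | SequenceStart
  | SequenceEnd
  | MappingStart
  | MappingEnd
  deriving (Eq, Show)

-- | Representation layer: simplified here to a tree of nodes.
data Node
  = Scalar String
  | Sequence [Node]
  | Mapping [(Node, Node)]
  deriving (Eq, Show)

-- | Representation -> Serialization: flatten a node into events.
serialize :: Node -> [Event]
serialize (Scalar s) = [ScalarE s]
serialize (Sequence ns) = SequenceStart : concatMap serialize ns ++ [SequenceEnd]
serialize (Mapping kvs) =
  MappingStart : concatMap (\(k, v) -> serialize k ++ serialize v) kvs ++ [MappingEnd]

-- | Rebuild one node from the front of an event stream, returning the
-- node together with the events that were not consumed.
composeNode :: [Event] -> Maybe (Node, [Event])
composeNode (ScalarE s : rest) = Just (Scalar s, rest)
composeNode (SequenceStart : rest) = goSeq [] rest
  where
    goSeq acc (SequenceEnd : es) = Just (Sequence (reverse acc), es)
    goSeq acc es = do
      (n, es') <- composeNode es
      goSeq (n : acc) es'
composeNode (MappingStart : rest) = goMap [] rest
  where
    goMap acc (MappingEnd : es) = Just (Mapping (reverse acc), es)
    goMap acc es = do
      (k, es1) <- composeNode es
      (v, es2) <- composeNode es1
      goMap ((k, v) : acc) es2
composeNode _ = Nothing

-- | Serialization -> Representation: succeeds only if the whole stream
-- forms exactly one well-nested node.
compose :: [Event] -> Maybe Node
compose es = case composeNode es of
  Just (n, []) -> Just n
  _ -> Nothing
```

Under these assumptions, `compose (serialize n)` returns `Just n` for any `n`, which is the kind of round-trip property a test suite for the real library could check at each layer boundary.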

The parsing strategy is still open. HsYaml makes use of a faithful aeson-like approach which is relatively performant. However, waargonaut makes use of a succinct zipper-based approach that would scale better to streaming YAML documents. The choice will be subject to performance analysis.

Once the core parser is implemented, I propose an aeson-esque typeclass-driven interface of To/FromYAML and a Generic-driven SYB/SOP-derived instance approach for record conversion. This would complete the interface. The hardest part will obviously be the implementation of the spec. A user-facing interface will reveal itself as we progress.
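To make the shape of such an interface concrete, here is a minimal sketch of an aeson-style pair of typeclasses. The class names follow the To/FromYAML naming above; everything else (the simplified `Node` type, the method names `toYAML` and `fromYAML`, the `field` helper, and the `Server` record) is an assumption made for illustration. The Generic-derived default instances are omitted, so the record instance is written by hand.

```haskell
import Text.Read (readMaybe)

-- Hypothetical, simplified node type with string keys only.
data Node
  = Scalar String
  | Sequence [Node]
  | Mapping [(String, Node)]
  deriving (Eq, Show)

-- | Encode a native Haskell value as a node.
class ToYAML a where
  toYAML :: a -> Node

-- | Decode a node into a native Haskell value, or report an error.
class FromYAML a where
  fromYAML :: Node -> Either String a

instance ToYAML Int where
  toYAML = Scalar . show

instance FromYAML Int where
  fromYAML (Scalar s) = maybe (Left ("not an integer: " ++ s)) Right (readMaybe s)
  fromYAML _ = Left "expected a scalar"

instance ToYAML Bool where
  toYAML b = Scalar (if b then "true" else "false")

instance FromYAML Bool where
  fromYAML (Scalar "true") = Right True
  fromYAML (Scalar "false") = Right False
  fromYAML _ = Left "expected true or false"

instance ToYAML a => ToYAML [a] where
  toYAML = Sequence . map toYAML

instance FromYAML a => FromYAML [a] where
  fromYAML (Sequence ns) = traverse fromYAML ns
  fromYAML _ = Left "expected a sequence"

-- | Look up a key in a mapping and decode its value (hypothetical helper).
field :: FromYAML a => String -> [(String, Node)] -> Either String a
field k kvs = maybe (Left ("missing key: " ++ k)) fromYAML (lookup k kvs)

-- | An example user-defined record with hand-written instances.
data Server = Server { port :: Int, tls :: Bool }
  deriving (Eq, Show)

instance ToYAML Server where
  toYAML (Server p t) = Mapping [("port", toYAML p), ("tls", toYAML t)]

instance FromYAML Server where
  fromYAML (Mapping kvs) = Server <$> field "port" kvs <*> field "tls" kvs
  fromYAML _ = Left "expected a mapping"
```

In this sketch, a Generic-driven approach would replace the two hand-written `Server` instances with derived defaults; the class signatures themselves would stay the same.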

Phase Two - Periphery

This will be the easy part: I have implemented optical interfaces for half of my packages already, and I work on streaming distributed data structures using the streaming library as a day job. I am not at all worried about any of the peripheral libraries - only the core. Lenses, Prisms, and Traversals are self-evident once the core data structures are decided upon.
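As an illustration of why the optics follow directly from the data structures, here is a minimal sketch that uses only base and hand-rolled van Laarhoven traversals rather than lens or microlens. The `Node` type and the names `_Scalar`, `values`, `key`, `toListOf`, and `over` are assumptions modelled loosely on the lens-aeson style, not the proposed API; `_Scalar` is written as a traversal with at most one target, standing in for a prism.

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Functor.Const (Const (..))
import Data.Functor.Identity (Identity (..))

-- Hypothetical, simplified node type with string keys only.
data Node
  = Scalar String
  | Sequence [Node]
  | Mapping [(String, Node)]
  deriving (Eq, Show)

-- | A van Laarhoven traversal, defined locally to avoid a lens dependency.
type Traversal' s a = forall f. Applicative f => (a -> f a) -> s -> f s

-- | Focus the text of a scalar node; other nodes have no target.
_Scalar :: Traversal' Node String
_Scalar f (Scalar s) = Scalar <$> f s
_Scalar _ n = pure n

-- | Focus every element of a sequence node.
values :: Traversal' Node Node
values f (Sequence ns) = Sequence <$> traverse f ns
values _ n = pure n

-- | Focus every value stored under the given key of a mapping node.
key :: String -> Traversal' Node Node
key k f (Mapping kvs) = Mapping <$> traverse visit kvs
  where
    visit (k', v)
      | k' == k = (,) k' <$> f v
      | otherwise = pure (k', v)
key _ _ n = pure n

-- | Collect all targets of a traversal.
toListOf :: Traversal' s a -> s -> [a]
toListOf t = getConst . t (\a -> Const [a])

-- | Modify all targets of a traversal.
over :: Traversal' s a -> (a -> a) -> s -> s
over t g = runIdentity . t (Identity . g)
```

With these definitions, traversals compose with ordinary function composition: for example, `toListOf (key "hosts" . values . _Scalar) doc` collects the scalar text of every element under the `hosts` key of a hypothetical document `doc`.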

Outcomes

I expect people to use these libraries for years to come. There are no concrete professional benefits other than the knowledge gained in the implementation of these libraries, and the satisfaction of seeing people use them. My aim is to improve Haskell as a whole by contributing my time and energy to making our collective experience better through personal effort.

Scope and Takeover

The MVP for this project will be the core Yaml library and the Yaml-machines stream-transducer package. I will consider the project a success if those two are written and released within the given time frame, and a total success if all four libraries are released within that time frame.

There will be no takeover. I intend to maintain these projects as the author and main implementor for the foreseeable future. There is no requirement for Tweag to take over maintenance of these packages once they are released.

Timeline

Here is a candidate timeline for the 12-week period:

  • Week 1: Research

    • Read the YAML 1.2 spec thoroughly
    • Construct preliminary data structures
  • Week 2: Tokenizer

    • Implementation of the tokenizer and event presentation mechanism
  • Week 3: Parser

    • Implementation of the main parser
  • Week 4: API

    • Implementation of the user-facing API and Generic SYB/SOP support for user-defined data structures
  • Week 5/6: Testing/Benchmarking

    • Regressions against existing YAML libraries
    • Performance improvements and fine tuning
    • Full test coverage
  • Week 7/8: Machines

    • Implementation of Yaml-machines
  • Week 9/10: Streaming

    • Implementation of Yaml-streaming
  • Week 11/12: Optics

    • Implementation of the Yaml-lens and Yaml-microlens libraries.

Potential drawbacks include the extra time needed to write proper test harnesses that achieve full test coverage of the YAML 1.2 spec. This may come down to writing a fuzzer against the YAML 1.2 ABNF, which could add an extra week of implementation time.
