Skip to content

Instantly share code, notes, and snippets.

@malcolmsparks
Created March 29, 2020 18:01
Show Gist options
  • Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Parsing the RFC 7231 Accept-Charset header with reap
;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;; "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
(let [charset-with-weight
(p/sequence-group
(p/alternatives
(p/pattern-parser
(re-pattern charset))
(p/pattern-parser
(re-pattern (re/re-str \*))))
(p/optionally (weight)))]
(p/cons
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
charset-with-weight))
(p/zero-or-more
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "%s%s" OWS ","))))
(p/optionally
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern OWS)))
charset-with-weight))))))))]
(criterium.core/quick-bench
(let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
(parser matcher))))
@malcolmsparks
Copy link
Author

Reap is here: https://github.com/juxt/reap

A blog is in the works.

@souenzzo
Copy link

How about use instaparse to parse the specification as is?

@malcolmsparks
Copy link
Author

Hi @souenzzo - we did once try to compare the regex approach used by reap (an earlier incarnation in jinx) with instaparse. A primitive performance comparison showed reap is 2-3 orders of magnitude faster (because Java regular expressions compile to hightly-optimized interpreters).

user=> (time (dotimes [_ 2000] (insta test-str :start :IRI)))
"Elapsed time: 18674.490482 msecs"
nil
user=> (time (dotimes [_ 2000] (jinx.patterns/re-group-by-name (jinx.patterns/matched jinx.patterns/IRI "https://jon:[email protected]/malcolm?foo#bar") "host")))
"Elapsed time: 27.511034 msecs"

@malcolmsparks
Copy link
Author

malcolmsparks commented Apr 1, 2020

Also, even if you use instaparse, you still have most of the work ahead of you if you want to parse a non-trivial request header. The purpose of reap is to provide these parsers for you, out of the box.

@malcolmsparks
Copy link
Author

Here's the latest Accept header

;; Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
;;  OWS ( media-range [ accept-params ] ) ] ) ]

(defn accept []
  (let [media-range-parameter
        (p/first
         (p/sequence-group
          (p/ignore
           (p/pattern-parser
            (re-pattern (re/re-concat OWS \; OWS))))
          (parameter)))

        accept-params (accept-params)

        ;; The reason why we can't just use `media-range` is that we
        ;; need to resolve the ambiguity whereby the "q" parameter
        ;; separates media type parameters from Accept extension
        ;; parameters. This is more fully discussed in RFC 7231
        ;; Section 5.3.2.
        ;;
        ;; The trick is to create a parameter parser modelled on
        ;; `zero-or-more` which attempts to match `accept-params` for
        ;; each parameter. Since `accept-params` matches on a leading
        ;; `weight`, a weight parameter will be detected and cause the
        ;; loop to end.
        parameters-weight-accept-params
        (fn [matcher]
          (loop [matcher matcher
                 result {:parameters []}]
            (if-let [accept-params (accept-params matcher)]
              (merge result accept-params)
              (if-let [match (media-range-parameter matcher)]
                (recur matcher (update result :parameters conj match))
                result))))]
    (p/optionally
     (p/cons
      (p/alternatives
       (p/ignore
        (p/pattern-parser #","))
       (p/comp
        #(apply merge %)
        (p/sequence-group
         (media-range-without-parameters)
         parameters-weight-accept-params)))
      (p/zero-or-more
       (p/first
        (p/sequence-group
         (p/ignore
          (p/pattern-parser
           (re-pattern
            (re/re-concat OWS ","))))
         (p/optionally
          (p/first
           (p/sequence-group
            (p/ignore
             (p/pattern-parser
              (re-pattern OWS)))
            (p/comp
             #(apply merge %)
             (p/sequence-group
              (media-range-without-parameters)
              parameters-weight-accept-params))))))))))))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment