Skip to content

Instantly share code, notes, and snippets.

@malcolmsparks
Created March 29, 2020 18:01
Show Gist options
  • Select an option

  • Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.

Select an option

Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Parsing the RFC 7231 Accept-Charset header with reap
;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;; "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
(let [charset-with-weight
(p/sequence-group
(p/alternatives
(p/pattern-parser
(re-pattern charset))
(p/pattern-parser
(re-pattern (re/re-str \*))))
(p/optionally (weight)))]
(p/cons
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
charset-with-weight))
(p/zero-or-more
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "%s%s" OWS ","))))
(p/optionally
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern OWS)))
charset-with-weight))))))))]
(criterium.core/quick-bench
(let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
(parser matcher))))
@malcolmsparks
Copy link
Copy Markdown
Author

Performance:

Evaluation count : 110616 in 6 samples of 18436 calls.
             Execution time mean : 5.626027 µs
    Execution time std-deviation : 118.634985 ns
   Execution time lower quantile : 5.490290 µs ( 2.5%)
   Execution time upper quantile : 5.769129 µs (97.5%)
                   Overhead used : 8.246236 ns

@malcolmsparks
Copy link
Copy Markdown
Author

Result:

(("UTF-8" 0.8) ("shift_JIS" 0.4) ("a") ("b"))

@malcolmsparks
Copy link
Copy Markdown
Author

This variant provides a nice result:

;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;;  "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
      (let [charset-with-weight
            (p/as-map
             (p/sequence-group
              (p/alternatives
               (p/as-entry
                :charset
                (p/pattern-parser
                 (re-pattern charset)))
               (p/pattern-parser
                (re-pattern (re/re-str \*))))
              (p/optionally
               (p/as-entry
                :weight (weight)))))]
        (p/cons
         (p/first
          (p/sequence-group
           (p/ignore
            (p/pattern-parser
             (re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
           charset-with-weight))
         (p/zero-or-more
          (p/first
           (p/sequence-group
            (p/ignore
             (p/pattern-parser
              (re-pattern (re/re-compose "%s%s" OWS ","))))
            (p/optionally
             (p/first
              (p/sequence-group
               (p/ignore
                (p/pattern-parser
                 (re-pattern OWS)))
               charset-with-weight))))))))]
  (criterium.core/quick-bench
   (let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
     (parser matcher))))
=>
({:charset "UTF-8", :weight 0.8}
 {:charset "shift_JIS", :weight 0.4}
 {:charset "a", :weight nil}
 {:charset "b", :weight nil})

at a marginal cost:

Evaluation count : 91548 in 6 samples of 15258 calls.
             Execution time mean : 6.791045 µs
    Execution time std-deviation : 124.832798 ns
   Execution time lower quantile : 6.656914 µs ( 2.5%)
   Execution time upper quantile : 6.965258 µs (97.5%)
                   Overhead used : 8.246236 ns

@malcolmsparks
Copy link
Copy Markdown
Author

Reap is here: https://github.com/juxt/reap

A blog is in the works.

@souenzzo
Copy link
Copy Markdown

How about use instaparse to parse the specification as is?

@malcolmsparks
Copy link
Copy Markdown
Author

Hi @souenzzo - we did once try to compare the regex approach used by reap (an earlier incarnation in jinx) with instaparse. A primitive performance comparison showed reap is 2-3 orders of magnitude faster (because Java regular expressions compile to hightly-optimized interpreters).

user=> (time (dotimes [_ 2000] (insta test-str :start :IRI)))
"Elapsed time: 18674.490482 msecs"
nil
user=> (time (dotimes [_ 2000] (jinx.patterns/re-group-by-name (jinx.patterns/matched jinx.patterns/IRI "https://jon:pither@juxt.pro/malcolm?foo#bar") "host")))
"Elapsed time: 27.511034 msecs"

@malcolmsparks
Copy link
Copy Markdown
Author

malcolmsparks commented Apr 1, 2020

Also, even if you use instaparse, you still have most of the work ahead of you if you want to parse a non-trivial request header. The purpose of reap is to provide these parsers for you, out of the box.

@malcolmsparks
Copy link
Copy Markdown
Author

Here's the latest Accept header

;; Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
;;  OWS ( media-range [ accept-params ] ) ] ) ]

(defn accept []
  (let [media-range-parameter
        (p/first
         (p/sequence-group
          (p/ignore
           (p/pattern-parser
            (re-pattern (re/re-concat OWS \; OWS))))
          (parameter)))

        accept-params (accept-params)

        ;; The reason why we can't just use `media-range` is that we
        ;; need to resolve the ambiguity whereby the "q" parameter
        ;; separates media type parameters from Accept extension
        ;; parameters. This is more fully discussed in RFC 7231
        ;; Section 5.3.2.
        ;;
        ;; The trick is to create a parameter parser modelled on
        ;; `zero-or-more` which attempts to match `accept-params` for
        ;; each parameter. Since `accept-params` matches on a leading
        ;; `weight`, a weight parameter will be detected and cause the
        ;; loop to end.
        parameters-weight-accept-params
        (fn [matcher]
          (loop [matcher matcher
                 result {:parameters []}]
            (if-let [accept-params (accept-params matcher)]
              (merge result accept-params)
              (if-let [match (media-range-parameter matcher)]
                (recur matcher (update result :parameters conj match))
                result))))]
    (p/optionally
     (p/cons
      (p/alternatives
       (p/ignore
        (p/pattern-parser #","))
       (p/comp
        #(apply merge %)
        (p/sequence-group
         (media-range-without-parameters)
         parameters-weight-accept-params)))
      (p/zero-or-more
       (p/first
        (p/sequence-group
         (p/ignore
          (p/pattern-parser
           (re-pattern
            (re/re-concat OWS ","))))
         (p/optionally
          (p/first
           (p/sequence-group
            (p/ignore
             (p/pattern-parser
              (re-pattern OWS)))
            (p/comp
             #(apply merge %)
             (p/sequence-group
              (media-range-without-parameters)
              parameters-weight-accept-params))))))))))))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment