Created
March 29, 2020 18:01
-
-
Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Parsing the RFC 7231 Accept-Charset header with reap
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS | |
;; "," [ OWS ( ( charset / "*" ) [ weight ] ) ] ) | |
(let [parser | |
(let [charset-with-weight | |
(p/sequence-group | |
(p/alternatives | |
(p/pattern-parser | |
(re-pattern charset)) | |
(p/pattern-parser | |
(re-pattern (re/re-str \*)))) | |
(p/optionally (weight)))] | |
(p/cons | |
(p/first | |
(p/sequence-group | |
(p/ignore | |
(p/pattern-parser | |
(re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS))))) | |
charset-with-weight)) | |
(p/zero-or-more | |
(p/first | |
(p/sequence-group | |
(p/ignore | |
(p/pattern-parser | |
(re-pattern (re/re-compose "%s%s" OWS ",")))) | |
(p/optionally | |
(p/first | |
(p/sequence-group | |
(p/ignore | |
(p/pattern-parser | |
(re-pattern OWS))) | |
charset-with-weight))))))))] | |
(criterium.core/quick-bench | |
(let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")] | |
(parser matcher)))) |
Result:
(("UTF-8" 0.8) ("shift_JIS" 0.4) ("a") ("b"))
This variant provides a nice result:
;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;; "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
(let [charset-with-weight
(p/as-map
(p/sequence-group
(p/alternatives
(p/as-entry
:charset
(p/pattern-parser
(re-pattern charset)))
(p/pattern-parser
(re-pattern (re/re-str \*))))
(p/optionally
(p/as-entry
:weight (weight)))))]
(p/cons
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
charset-with-weight))
(p/zero-or-more
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "%s%s" OWS ","))))
(p/optionally
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern OWS)))
charset-with-weight))))))))]
(criterium.core/quick-bench
(let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
(parser matcher))))
=>
({:charset "UTF-8", :weight 0.8}
{:charset "shift_JIS", :weight 0.4}
{:charset "a", :weight nil}
{:charset "b", :weight nil})
at a marginal cost:
Evaluation count : 91548 in 6 samples of 15258 calls.
Execution time mean : 6.791045 µs
Execution time std-deviation : 124.832798 ns
Execution time lower quantile : 6.656914 µs ( 2.5%)
Execution time upper quantile : 6.965258 µs (97.5%)
Overhead used : 8.246236 ns
Reap is here: https://github.com/juxt/reap
A blog is in the works.
How about use instaparse to parse the specification as is?
Hi @souenzzo - we did once try to compare the regex approach used by reap (an earlier incarnation in jinx) with instaparse. A primitive performance comparison showed reap is 2-3 orders of magnitude faster (because Java regular expressions compile to hightly-optimized interpreters).
user=> (time (dotimes [_ 2000] (insta test-str :start :IRI)))
"Elapsed time: 18674.490482 msecs"
nil
user=> (time (dotimes [_ 2000] (jinx.patterns/re-group-by-name (jinx.patterns/matched jinx.patterns/IRI "https://jon:[email protected]/malcolm?foo#bar") "host")))
"Elapsed time: 27.511034 msecs"
Also, even if you use instaparse, you still have most of the work ahead of you if you want to parse a non-trivial request header. The purpose of reap is to provide these parsers for you, out of the box.
Here's the latest Accept header
;; Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
;; OWS ( media-range [ accept-params ] ) ] ) ]
(defn accept []
(let [media-range-parameter
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-concat OWS \; OWS))))
(parameter)))
accept-params (accept-params)
;; The reason why we can't just use `media-range` is that we
;; need to resolve the ambiguity whereby the "q" parameter
;; separates media type parameters from Accept extension
;; parameters. This is more fully discussed in RFC 7231
;; Section 5.3.2.
;;
;; The trick is to create a parameter parser modelled on
;; `zero-or-more` which attempts to match `accept-params` for
;; each parameter. Since `accept-params` matches on a leading
;; `weight`, a weight parameter will be detected and cause the
;; loop to end.
parameters-weight-accept-params
(fn [matcher]
(loop [matcher matcher
result {:parameters []}]
(if-let [accept-params (accept-params matcher)]
(merge result accept-params)
(if-let [match (media-range-parameter matcher)]
(recur matcher (update result :parameters conj match))
result))))]
(p/optionally
(p/cons
(p/alternatives
(p/ignore
(p/pattern-parser #","))
(p/comp
#(apply merge %)
(p/sequence-group
(media-range-without-parameters)
parameters-weight-accept-params)))
(p/zero-or-more
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern
(re/re-concat OWS ","))))
(p/optionally
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern OWS)))
(p/comp
#(apply merge %)
(p/sequence-group
(media-range-without-parameters)
parameters-weight-accept-params))))))))))))
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Performance: