response to alphapapa.md

I haven't used phpinspect.el before. It looks like good software for working with PHP.

Thanks! IMO It is good software for working with PHP, I use it every day at my dayjob. I'll be the first to admit that it still needs work though, and is not yet competitive with the commercial offerings available on many fronts.

Would you tell us a bit about how this splay tree library is hooked into the incremental parser? e.g. if the user inserts a single character, what chain of operations does that set off?

Ha, there's a lot involved in the process so I'm struggling to find a good place to start my explanation, but here I go. Within phpinspect's codebase there is the concept of a "bmap" or "Buffer Map". This is not called a map in the sense that it is a map datastructure, but because it "maps" tokens to their metadata and vice versa. A token in phpinspect is a simple list. For example, the text "{call(foo, bar)}" becomes the following list after parsing:

(:block (:word "call") (:list (:word "foo") (:comma ",") (:word "bar")))

As you can see, a token in and of itself is an n-ary tree of nested lists. This makes it easy to process the tokens and interpret their meaning using basic lisp constructs. In a live edited buffer though, phpinspect needs some way to attach metadata to a token, like its start and end positions. For this purpose there is the phpinspect-meta object, the rough layout of which is as follows:

phpinspect-meta 
    start     [ Start position ]
    end       [ End position ]
    token     [ The token, e.g: (:list (:word "foo") (:comma ",") (:word "bar")) ]
    children  [ A splay tree containing phpinspect-meta objects for "foo, bar" ]

The buffer map of the example from before would roughly look something like this:

phpinspect-bmap
    starts (hash table of start-pos => metadata):
        1 => [ meta object for :block token]
        2 => [ meta object for (:word "call) ]
        6 => [ meta object for (:list ...) ]
        etc...
    meta (metadata tree):
                  [meta (:root ...)]
                          |
                  [meta (:block ...)]
                  /               \
[meta (:word "call")]   [meta (:list .........)]
                         /        |            \
         [meta (:word "foo")] [meta (:comma ",")] [meta (:word "bar")]

When a single character is inserted, for example the letted "d" after "foo", after-change-functions are triggered. Phpinspect has registered its own "after change function" which is called to notify the "edit tracker" of the change. For the purpose of this example, just see the edit tracker as a way to keep track of "tainted"/edited regions in a buffer. When contextual information is requested, for example to display an eldoc string or complete a function name, phpinspect starts incrementally parsing the buffer. The incremental parsing logic is:

Do we know about a token at point? (uses hash table)
If we do, is it tainted?
If it is tainted, start parsing text from point (recurses into same logic but nested in parser for new token)
if it is not tainted, absorb token, skip over {token length} chars and goto 1.

By inserting "d" after "foo", "foo" as well as all parent tokens have become tainted.

                      Tainted!
                         | 
                  [meta (|:root ...)]
                         | |
                  [meta (|:block ...)]
                  /      |         \
[meta (:word "call")]   [|meta (:list .........)]
                         |/        |            \
        [meta (:word "foo|")] [meta (:comma ",")] [meta (:word "bar")]

Because the parser keeps checking for existing tokens after recursing, (:block (:list (:word "foo") will be reparsed, but (:comma ",") (:word "bar") will be reused while parsing the rest of (:list ...).

Reusing a token means detaching its metadata node from the previously parsed tree and attaching it to a parent in the new tree that will be the result of the current parse. If it has children, they stay attached and come with. As you can imagine, skipping over untainted tokens can result in saving on LOT of parsing.

Is the splay tree modified when the user moves point without changing the buffer contents?

Yes and no. The content of all of the splay trees would remain the same, but during the first query for information, the relevant splay trees would be re-balanced to quick successive access to the tokens near point. Imagine the following layout of the metadata tree:

[meta (:block ...)]:
    children (splay tree):
        [splayt-node {pos x} [meta (:word "call")]]
            \
            [splayt-node {pos x} [meta (:list ...)]

[meta (:list ...)] (the child of (:block ...) above):
    children (splay tree):
        [splayt-node {pos x} [meta (:comma ",")]]
            /                                 \
[splayt-node {pos x} [meta (:word "foo")]]   [splayt-node {pos x} [meta (:word "bar")]]

We now move point to the word "foo" and eldoc queries the buffer map to determine if it has information to display at point. After the query, the ordering of the splay trees changes:

[meta (:block ...)]:
    children (splay tree):
        [splayt-node {pos x} [meta (:list ...)]
            /
[splayt-node {pos x} [meta (:word "call")]]

[meta (:list ...)] (the child of (:block ...) above):
    children (splay tree):
        [splayt-node {pos x} [meta (:word "foo")]]
            \ 
            [splayt-node {pos x} [meta (:comma ",")]]
                \
                [splayt-node {pos x} [meta (:word "bar")]]

(:list ...) and all of its parents now have their splay trees of children ordered to quickly access their last accessed child and nearby tokens.

Also, a bit off-topic, but would it be feasible to port the incremental parser in phpinspect.el to work with languages other than PHP? Could be an interesting alternative to LSP servers for some cases.

I think so. There is some phpinspect-specific code in the function phpinspect-bmap-register to keep track of some "tables of content" for notable tokens. But the rest of the code is reasonably decoupled from the package. I will say that the parser is pretty "dumb" in the sense that tokens are parsed with simple lisp functions that can do anything they want to produce a token, as long as they move point to some place after they started. Writing a parser is not as simple as defining a "grammar" using some regular expressions as you would expect to do with common parser frameworks. It can be a bit of a tedious process to end up with the desired result :).

Some examples of parser "handlers", responsible for a single token:

(phpinspect-defhandler comma (comma &rest _ignored)
  "Handler for comma tokens"
  ((regexp . ","))
  (phpinspect-munch-token-without-attribs comma :comma))
  
(phpinspect-defhandler variable (start-token &rest _ignored)
  "Handler for tokens indicating reference to a variable"
  ((regexp . "\\$"))
  (forward-char (length start-token))
  (if (looking-at (phpinspect-handler-regexp word)) ;; macro that inlines regexp of another handler
      (phpinspect-munch-token-without-attribs (match-string 0) :variable)
    (list :variable nil)))

After defining the handlers for different kinds of tokens, it should be smooth sailing from there. Creating a parser function from them can easily be done with phpinspect-defparser. Here's an example of a parser for a PHP function declaration, which defines the function phpinspect--parse-declaration as entrypoint:

(phpinspect-defparser declaration
  :tree-keyword "declaration" ;; Name of the keyword to put in the car of token (:declaration)
  :handlers '(comment word list terminator tag) ;; use comment, word, list, terminator and tag handlers
  :delimiter-predicate #'phpinspect-end-of-token-p) ;; Stop parsing after encountering a token satisfying end-of-token-p

Thanks for sharing your work.

And thank you for taking a look at it :)

hugot/response to alphapapa.md

hugot commented Aug 31, 2024