This guide gives you a starting point on writing a tree-sitter major mode. Remember, don’t panic and check your manuals!
Build Emacs with tree-sitter
You can either install tree-sitter by your package manager, or from
source:
git clone https://github.com/tree-sitter/tree-sitter.git
cd tree-sitter
make
make install
To build and run Emacs 29:
git clone https://git.savannah.gnu.org/git/emacs.git -b emacs-29
cd emacs
./autogen.sh
./configure
make
src/emacs
Require the tree-sitter package with (require 'treesit)
. Note that tree-sitter always appear as treesit
in symbols. Now check if Emacs is successfully built with tree-sitter library by evaluating (treesit-available-p)
.
Tree-sitter stuff in Emacs can be categorized into two parts: the tree-sitter API itself, and integration with fontification, indentation, Imenu, etc. You can use shortdoc to glance over all the tree-sitter API functions by typing M-x shortdoc RET treesit RET
. The integration are described in the rest of the post.
Install language definitions
Tree-sitter by itself doesn’t know how to parse any particular language. It needs the language grammar (a dynamic library) for a language to be able to parse it.
First, find the repository for the language grammar, eg, tree-sitter-python. Take note of the Git clone URL of it, eg, https://github.com/tree-sitter/tree-sitter-python.git
. Now check where is the parser.c file in that repository, usually it’s in src
.
Make sure you have Git, C and C++ compiler, and run the treesit-install-grammar
command, it will prompt for the URL and the directory of parser.c, leave other prompts at default unless you know what you are doing.
You can also manually clone the repository and compile it, and put the dynamic library at a standard library location. Emacs will be able to find it. If you wish to put it somewhere else, set treesit-extra-load-path
so Emacs can find it.
Tree-sitter major modes
Tree-sitter modes should be separate major modes, usually named xxx-ts-mode
. I know I said tree-sitter always appear as treesit
in symbols, this is the only exception.
If the tree-sitter mode and the “native” mode could share some setup code, you can create a “base mode”, which only contains the common setup. For example, there is python-base-mode (shared), and both python-mode (native), and python-ts-mode (tree-sitter) derives from it.
In the tree-sitter mode, check if we can use tree-sitter with treesit-ready-p
, it will emit a warning if tree-sitter is not ready (tree-sitter not built with Emacs, can’t find the language grammar, buffer too large, etc).
Fontification
Tree-sitter works like this: It parses the buffer and produces a parse tree. You provide a query made of patterns and capture names, tree-sitter finds the nodes that match these patterns, tag the corresponding capture names onto the nodes and return them to you. The query function returns a list of (capture-name . node)
.
For fontification, we simply use face names as capture names. And the captured node will be fontified in their capture name (the face).
The capture name could also be a function, in which case (NODE OVERRIDE START END)
is passed to the function for fontification. START
and END
are the start and end of the region to be fontified. The function should only fontify within that region. The function should also allow more optional arguments with &rest _
, for future extensibility. For OVERRIDE
check out the docstring of treesit-font-lock-rules
.
Query syntax
There are two types of nodes: “named nodes”, like (identifier)
, (function_definition)
, and “anonymous nodes”, like "return"
, "def"
, "("
, ";"
. Parent-child relationship is expressed as
(parent (child) (child) (child (grand_child)))
Eg, an argument list (1, "3", 1)
would be:
(argument_list "(" (number) (string) (number) ")")
Children could have field names:
(function_definition name: (identifier) type: (identifier))
To match any one in the list:
["true" "false" "none"]
Capture names can come after any node in the pattern:
(parent (child) @child) @parent
The query above captures both the parent and the child.
The query below captures all the keywords with capture name
"keyword"
:
["return" "continue" "break"] @keyword
These are the common syntax, check out the full syntax in the manual: Pattern Matching.
Query references
But how do one come up with the queries? Take python for an example, open any python source file, type M-x treesit-explore-mode RET
. You should see the parse tree in a separate window, automatically updated as you select text or edit the buffer. Besides this, you can consult the grammar of the language definition. For example, Python’s grammar file is at
https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js
Neovim also has a bunch of queries to reference from.
The manual explains how to read grammar files in the bottom of Language Grammar.
Debugging queries
If your query has problems, use treesit-query-validate
to debug the query. It will pop a buffer containing the query (in text format) and mark the offending part in red. Set treesit--font-lock-verbose
to t
if you want the font-lock function to report what it’s doing.
Set up font-lock
To enable tree-sitter font-lock, set treesit-font-lock-settings
and treesit-font-lock-feature-list
buffer-locally and call treesit-major-mode-setup
. For example, see python--treesit-settings
in python.el. Below is a snippet of it.
Note that like the current font-lock system, if the to-be-fontified region already has a face (ie, an earlier match fontified part/all of the region), the new face is discarded rather than applied. If you want later matches always override earlier matches, use the :override
keyword.
Each rule should have a :feature
, like function-name
, string-interpolation
, builtin
, etc. This way users can enable/disable each feature individually.
Read the manual section Parser-based Font-Lock for more detail.
Example from python.el:
(defvar python--treesit-settings
(treesit-font-lock-rules
:feature 'comment
:language 'python
'((comment) @font-lock-comment-face)
:feature 'string
:language 'python
'((string) @python--treesit-fontify-string)
:feature 'string-interpolation
:language 'python
:override t
'((interpolation (identifier) @font-lock-variable-name-face))
...))
In python-ts-mode
:
(treesit-parser-create 'python)
(setq-local treesit-font-lock-settings python--treesit-settings)
(setq-local treesit-font-lock-feature-list
'(( comment definition)
( keyword string type)
( assignment builtin constant decorator
escape-sequence number property string-interpolation )
( bracket delimiter function operator variable)))
...
(treesit-major-mode-setup)
Concretely, something like this:
(define-derived-mode python-ts-mode python-base-mode "Python"
"Major mode for editing Python files, using tree-sitter library.
\\{python-ts-mode-map}"
:syntax-table python-mode-syntax-table
(when (treesit-ready-p 'python)
(treesit-parser-create 'python)
(setq-local treesit-font-lock-feature-list
'(( comment definition)
( keyword string type)
( assignment builtin constant decorator
escape-sequence number property string-interpolation )
( bracket delimiter function operator variable)))
(setq-local treesit-font-lock-settings python--treesit-settings)
(setq-local imenu-create-index-function
#'python-imenu-treesit-create-index)
(setq-local treesit-defun-type-regexp (rx (or "function" "class")
"_definition"))
(setq-local treesit-defun-name-function
#'python--treesit-defun-name)
(treesit-major-mode-setup)
(when python-indent-guess-indent-offset
(python-indent-guess-indent-offset))))
Indentation
Indentation works like this: We have a bunch of rules that look like
(MATCHER ANCHOR OFFSET)
When the indenting a line, let NODE
be the node at the beginning of the current line, we pass this node to the MATCHER
of each rule, one of them will match the node (eg, “this node is a closing bracket!”). Then we pass the node to the ANCHOR
, which returns a point (eg, the beginning of NODE
’s parent). We find the column number of that point (eg, 4), add OFFSET
to it (eg, 0), and that is the column we want to indent the current line to (4 + 0 = 4).
Matchers and anchors are functions that takes (NODE PARENT BOL &rest _)
. Matches return nil/non-nil for no match/match, and anchors return the anchor point. An Offset is usually a number or a variable, but it can also be a function. Below are some convenient builtin matchers and anchors.
For MATHCER
we have
(parent-is TYPE) => matches if PARENT’s type matches TYPE as regexp
(node-is TYPE) => matches NODE’s type
(query QUERY) => matches if querying PARENT with QUERY
captures NODE.
(match NODE-TYPE PARENT-TYPE NODE-FIELD
NODE-INDEX-MIN NODE-INDEX-MAX)
=> checks everything. If an argument is nil, don’t match that. Eg,
(match nil TYPE) is the same as (parent-is TYPE)
For ANCHOR
we have
first-sibling => start of the first sibling
parent => start of parent
parent-bol => BOL of the line parent is on.
prev-sibling => start of previous sibling
no-indent => current position (don’t indent)
prev-line => start of previous line
There is also a manual section for indent: Parser-based Indentation.
When writing indent rules, you can use treesit-check-indent
to
check if your indentation is correct. To debug what went wrong, set
treesit--indent-verbose
to t
. Then when you indent, Emacs
tells you which rule is applied in the echo area.
Here is an example:
(defvar typescript-mode-indent-rules
(let ((offset 'typescript-indent-offset))
`((typescript
;; This rule matches if node at point is ")", ANCHOR is the
;; parent node’s BOL, and offset is 0.
((node-is ")") parent-bol 0)
((node-is "]") parent-bol 0)
((node-is ">") parent-bol 0)
((node-is "\\.") parent-bol ,offset)
((parent-is "ternary_expression") parent-bol ,offset)
((parent-is "named_imports") parent-bol ,offset)
((parent-is "statement_block") parent-bol ,offset)
((parent-is "type_arguments") parent-bol ,offset)
((parent-is "variable_declarator") parent-bol ,offset)
((parent-is "arguments") parent-bol ,offset)
((parent-is "array") parent-bol ,offset)
((parent-is "formal_parameters") parent-bol ,offset)
((parent-is "template_substitution") parent-bol ,offset)
((parent-is "object_pattern") parent-bol ,offset)
((parent-is "object") parent-bol ,offset)
((parent-is "object_type") parent-bol ,offset)
((parent-is "enum_body") parent-bol ,offset)
((parent-is "arrow_function") parent-bol ,offset)
((parent-is "parenthesized_expression") parent-bol ,offset)
...))))
Then you set treesit-simple-indent-rules
to your rules, and call treesit-major-mode-setup
.
Set treesit-simple-imenu-settings
and call treesit-major-mode-setup
.
Navigation
Set treesit-defun-type-regexp
, treesit-defun-name-function
, and call treesit-major-mode-setup
.
C-like languages
c-ts-mode.el has some goodies for handling indenting and filling block comments.
These two rules should take care of indenting block comments.
((and (parent-is "comment") c-ts-mode--looking-at-star)
c-ts-mode--comment-start-after-first-star -1)
((parent-is "comment") prev-adaptive-prefix 0)
Set c-ts-mode-indent-block-type-regexp
and these two rules should take care of indenting statements in “{}” blocks and closing bracket “}”.
((node-is "
") point-min c-ts-mode--close-bracket-offset)
((parent-is "compound_statement")
point-min c-ts-mode--statement-offset)
}
c-ts-mode-comment-setup
will set up comment and filling for you.
Multi-language modes
Refer to the manual: Multiple Languages.
Common Tasks
M-x shortdoc RET treesit RET
will give you a complete list.
How to...
Get the buffer text corresponding to a node?
(treesit-node-text node)
Don’t confuse this with treesit-node-string
.
Scan the whole tree for stuff?
(treesit-search-subtree)
(treesit-search-forward)
(treesit-induce-sparse-tree)
Find/move to to next node that...?
(treesit-search-forward node ...)
(treesit-search-forward-goto node ...)
Get the root node?
(treesit-buffer-root-node)
Get the node at point?
(treesit-node-at (point))