This sample code parses an INSERT INTO columns... VALUES (values...) statement and extracts the column names and values using ANTLR4 and a SQL grammar.
Create a console project
Add the following Nuget packages:
- Antlr4.Runtime.Standard 4.9.3
- Antlr4BuildTasks 8.17.0
Grab the following files from https://github.com/antlr/grammars-v4/tree/master/sql/tsql and add them to your project folder
- TSqlLexer.g4 (https://raw.githubusercontent.com/antlr/grammars-v4/master/sql/tsql/TSqlLexer.g4)
- TSqlParser.g4 (https://raw.githubusercontent.com/antlr/grammars-v4/master/sql/tsql/TSqlParser.g4)
In the File Properties of the two .g4 files, set the following:
- Build Action: ANTLR 4 grammar
- Custom Tool NameSpace: Namespace for the generated classes, e.g. Parser
- Generate listener: Yes
- Generate Visitor: Yes
Make sure the project compiles without errors.
NOTE: Compiling the parser takes a bit long, as the grammar is a complete SQL grammar. You could trim it to fit your needs (get only the rules you need) but unless you know what you are doing, it can be quite demanding due to the number of rules. You also may need to trim the rules themselves.
There is a trimmed down version here (https://gist.github.com/RupertAvery/ced40e72962390661bab0b93886250f9)
There should be files generated at
obj\Debug\netcoreapp3.1\TSqlLexer.cs
obj\Debug\netcoreapp3.1\TSqlParser.cs
To actually parse, you need to override methods in a class that is inheriting from TSqlParserBaseListener
(see SqlParserListener.cs
).
TSqlParserBaseListener
should exist in the Custom Tool NameSpace you provided. So if you put Parser
you should add
using Parser;
to the top of your listener.
What methods should you override? This bears a bit of explanation and is much better discussed in other pages like
- https://tomassetti.me/getting-started-with-antlr-in-csharp/
- https://putridparrot.com/blog/antlr-in-c/
ANTLR grammers have a set of chained rules in the grammar (.g4) files. For example, the INSERT statement rule in TSqlParser.g4
is as follows:
(HINT: Search for rules in Visual Studio using regex search "^rule_name" to anchor the rule_name to the beginning of the line.)
insert_statement
: with_expression?
INSERT (TOP '(' expression ')' PERCENT?)?
INTO? (ddl_object | rowset_function_limited)
insert_with_table_hints?
('(' insert_column_name_list ')')?
output_clause?
insert_statement_value
for_clause? option_clause? ';'?
;
A rule can contain references to other rules, tokens (defined in the lexer), regex-like modifiers and literals enclosed in single quotes.
We're interested in the list of column names:
insert_column_name_list
: col+=insert_column_id (',' col+=insert_column_id)*
;
and the values:
insert_statement_value
: table_value_constructor
| derived_table
| execute_statement
| DEFAULT VALUES
;
which leads to:
table_value_constructor
: VALUES '(' exps+=expression_list ')' (',' '(' exps+=expression_list ')')*
;
So apparently, VALUES supports multiple lists.
expression_list
: exp+=expression (',' exp+=expression)*
;
We're only interested in the first one. So how does that translate to code?
We start off by creating our listener class. It should inherit from TSqlParserBaseListener
. The listener contains methods that will
be fired each time a rule is detected. We need to override the methods we are interested in.
public class SqlParserListener : TSqlParserBaseListener
{
private List<string> columnNames;
private List<string> values;
public IReadOnlyCollection<string> ColumnNames => columnNames;
public IReadOnlyCollection<string> Values => values;
}
In our SqlParserListener
class, the base class has several overridable methods that correspond to the rules.
For example insert_statement
has EnterInsert_statement
and ExitInsert_statement
. Each method has an argument that is a context
specific to that rule. The context holds the parsable data. Enter* methods are fired before that rule is executed, while Exit*
methods are fired after. You usually only need to override Enter* methods, and usually you use Enter methods for initialization of
your own variables and Exit* methods for finalizing stuff after all inner rules have completed.
So we use EnterInsert_statement
just to setup our lists that will hold stuff.
public override void EnterInsert_statement([NotNull] TSqlParser.Insert_statementContext context)
{
base.EnterInsert_statement(context);
columnNames = new List<string>();
values = new List<string>();
}
To parse the column names, we need to override EnterInsert_column_name_list
. Notice that the rule is a lot like Regex:
a insert_column_id
followed by zero or more insert_column_id
col+=insert_column_id (',' col+=insert_column_id)*
Our context will have a insert_column_id()
method that returns an array od insert_column_id
contexts.
There is also a insert_column_id(int index)
method that returns a specific instance. We loop through the contexts, and call
GetText()
to get the column definition. This will contain the square brackets. You could delve deeper into insert_column_id
and
see what else is parsed, but for now this should be good enough.
public override void EnterInsert_column_name_list([NotNull] TSqlParser.Insert_column_name_listContext context)
{
base.EnterInsert_column_name_list(context);
var columns = context.insert_column_id();
foreach (var column in columns)
{
columnNames.Add(column.GetText());
}
}
To parse the values, we have to get the first expression_list
, then grab the value expressions
public override void EnterTable_value_constructor([NotNull] TSqlParser.Table_value_constructorContext context)
{
base.EnterTable_value_constructor(context);
var expList = context.expression_list(0);
var expressions = expList.expression();
foreach (var expression in expressions)
{
values.Add(expression.GetText());
}
}
Note that this contains the raw parseable values like N'My small, cozy house'
, again, you will have to delve deeper into the expressionm,
possibly check the context Type to see what it is (a string literal? a NULL?)
You can use the following code to run the parser using the listener you just created.
static void Parse(string input)
{
ICharStream stream = CharStreams.fromString(input);
var lcaseStream = new CaseChangingCharStream(stream, true);
ITokenSource lexer = new Parser.TSqlLexer(lcaseStream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new Parser.TSqlParser(tokens);
parser.BuildParseTree = true;
var tree = parser.tsql_file();
var listener = new SqlParserListener();
ParseTreeWalker.Default.Walk(listener, tree);
var c = listener.ColumnNames;
var v = listener.Values;
}
The important part here is the entry rule:
var tree = parser.tsql_file();
This tells the parser to start reading from the tsql_file
rule. You could also go directly to the insert_statement
rule.
With this grammar you could parse a complete SQL stored proc, and detect syntax errors.