Skip to content

Instantly share code, notes, and snippets.

@RupertAvery
Last active December 11, 2021 01:05
Show Gist options
  • Save RupertAvery/9d8bfd862ce46369ec356b25832305a9 to your computer and use it in GitHub Desktop.
Save RupertAvery/9d8bfd862ce46369ec356b25832305a9 to your computer and use it in GitHub Desktop.
ANTLR4 TSQLParser Sample Project

This sample code parses an INSERT INTO columns... VALUES (values...) statement and extracts the column names and values using ANTLR4 and a SQL grammar.

Setup

Create a console project

Add the following Nuget packages:

  • Antlr4.Runtime.Standard 4.9.3
  • Antlr4BuildTasks 8.17.0

Grab the following files from https://github.com/antlr/grammars-v4/tree/master/sql/tsql and add them to your project folder

In the File Properties of the two .g4 files, set the following:

  • Build Action: ANTLR 4 grammar
  • Custom Tool NameSpace: Namespace for the generated classes, e.g. Parser
  • Generate listener: Yes
  • Generate Visitor: Yes

Make sure the project compiles without errors.

NOTE: Compiling the parser takes a bit long, as the grammar is a complete SQL grammar. You could trim it to fit your needs (get only the rules you need) but unless you know what you are doing, it can be quite demanding due to the number of rules. You also may need to trim the rules themselves.

There is a trimmed down version here (https://gist.github.com/RupertAvery/ced40e72962390661bab0b93886250f9)

There should be files generated at

  • obj\Debug\netcoreapp3.1\TSqlLexer.cs
  • obj\Debug\netcoreapp3.1\TSqlParser.cs

Parsing

To actually parse, you need to override methods in a class that is inheriting from TSqlParserBaseListener (see SqlParserListener.cs). TSqlParserBaseListener should exist in the Custom Tool NameSpace you provided. So if you put Parser you should add using Parser; to the top of your listener.

What methods should you override? This bears a bit of explanation and is much better discussed in other pages like

ANTLR grammers have a set of chained rules in the grammar (.g4) files. For example, the INSERT statement rule in TSqlParser.g4 is as follows:

(HINT: Search for rules in Visual Studio using regex search "^rule_name" to anchor the rule_name to the beginning of the line.)

insert_statement
    : with_expression?
      INSERT (TOP '(' expression ')' PERCENT?)?
      INTO? (ddl_object | rowset_function_limited)
      insert_with_table_hints?
      ('(' insert_column_name_list ')')?
      output_clause?
      insert_statement_value
      for_clause? option_clause? ';'?
    ;

A rule can contain references to other rules, tokens (defined in the lexer), regex-like modifiers and literals enclosed in single quotes.

We're interested in the list of column names:

insert_column_name_list
    : col+=insert_column_id (',' col+=insert_column_id)*
    ;

and the values:

insert_statement_value
    : table_value_constructor
    | derived_table
    | execute_statement
    | DEFAULT VALUES
    ;

which leads to:

table_value_constructor
    : VALUES '(' exps+=expression_list ')' (',' '(' exps+=expression_list ')')*
    ;

So apparently, VALUES supports multiple lists.

expression_list
    : exp+=expression (',' exp+=expression)*
    ;

We're only interested in the first one. So how does that translate to code?

Builing the parser

We start off by creating our listener class. It should inherit from TSqlParserBaseListener. The listener contains methods that will be fired each time a rule is detected. We need to override the methods we are interested in.

    public class SqlParserListener : TSqlParserBaseListener
    {
        private List<string> columnNames;
        private List<string> values;

        public IReadOnlyCollection<string> ColumnNames => columnNames;
        public IReadOnlyCollection<string> Values => values;
    }

In our SqlParserListener class, the base class has several overridable methods that correspond to the rules. For example insert_statement has EnterInsert_statement and ExitInsert_statement. Each method has an argument that is a context specific to that rule. The context holds the parsable data. Enter* methods are fired before that rule is executed, while Exit* methods are fired after. You usually only need to override Enter* methods, and usually you use Enter methods for initialization of your own variables and Exit* methods for finalizing stuff after all inner rules have completed.

So we use EnterInsert_statement just to setup our lists that will hold stuff.

    public override void EnterInsert_statement([NotNull] TSqlParser.Insert_statementContext context)
    {
        base.EnterInsert_statement(context);

        columnNames = new List<string>();
        values = new List<string>();
    }

To parse the column names, we need to override EnterInsert_column_name_list. Notice that the rule is a lot like Regex:

a insert_column_id followed by zero or more insert_column_id

col+=insert_column_id (',' col+=insert_column_id)*

Our context will have a insert_column_id() method that returns an array od insert_column_id contexts. There is also a insert_column_id(int index) method that returns a specific instance. We loop through the contexts, and call GetText() to get the column definition. This will contain the square brackets. You could delve deeper into insert_column_id and see what else is parsed, but for now this should be good enough.

    public override void EnterInsert_column_name_list([NotNull] TSqlParser.Insert_column_name_listContext context)
    {
        base.EnterInsert_column_name_list(context);

        var columns = context.insert_column_id();

        foreach (var column in columns)
        {
            columnNames.Add(column.GetText());
        }
    }

To parse the values, we have to get the first expression_list, then grab the value expressions

    public override void EnterTable_value_constructor([NotNull] TSqlParser.Table_value_constructorContext context)
    {
        base.EnterTable_value_constructor(context);

        var expList = context.expression_list(0);

        var expressions = expList.expression();

        foreach (var expression in expressions)
        {
            values.Add(expression.GetText());
        }
    }

Note that this contains the raw parseable values like N'My small, cozy house', again, you will have to delve deeper into the expressionm, possibly check the context Type to see what it is (a string literal? a NULL?)

Putting it all together

You can use the following code to run the parser using the listener you just created.

    static void Parse(string input)
    {
        ICharStream stream = CharStreams.fromString(input);
        var lcaseStream = new CaseChangingCharStream(stream, true);
        ITokenSource lexer = new Parser.TSqlLexer(lcaseStream);
        ITokenStream tokens = new CommonTokenStream(lexer);
        var parser = new Parser.TSqlParser(tokens);
        parser.BuildParseTree = true;
        var tree = parser.tsql_file();
        var listener = new SqlParserListener();
        ParseTreeWalker.Default.Walk(listener, tree);

        var c = listener.ColumnNames;
        var v = listener.Values;

    }

The important part here is the entry rule:

var tree = parser.tsql_file();

This tells the parser to start reading from the tsql_file rule. You could also go directly to the insert_statement rule.

With this grammar you could parse a complete SQL stored proc, and detect syntax errors.

using Antlr4.Runtime;
using Antlr4.Runtime.Misc;
namespace TSQLParser
{
/// <summary>
/// This class supports case-insensitive lexing by wrapping an existing
/// <see cref="ICharStream"/> and forcing the lexer to see either upper or
/// lowercase characters. Grammar literals should then be either upper or
/// lower case such as 'BEGIN' or 'begin'. The text of the character
/// stream is unaffected. Example: input 'BeGiN' would match lexer rule
/// 'BEGIN' if constructor parameter upper=true but getText() would return
/// 'BeGiN'.
/// </summary>
public class CaseChangingCharStream : ICharStream
{
private ICharStream stream;
private bool upper;
/// <summary>
/// Constructs a new CaseChangingCharStream wrapping the given <paramref name="stream"/> forcing
/// all characters to upper case or lower case.
/// </summary>
/// <param name="stream">The stream to wrap.</param>
/// <param name="upper">If true force each symbol to upper case, otherwise force to lower.</param>
public CaseChangingCharStream(ICharStream stream, bool upper)
{
this.stream = stream;
this.upper = upper;
}
public int Index
{
get
{
return stream.Index;
}
}
public int Size
{
get
{
return stream.Size;
}
}
public string SourceName
{
get
{
return stream.SourceName;
}
}
public void Consume()
{
stream.Consume();
}
[return: NotNull]
public string GetText(Interval interval)
{
return stream.GetText(interval);
}
public int LA(int i)
{
int c = stream.LA(i);
if (c <= 0)
{
return c;
}
char o = (char)c;
if (upper)
{
return (int)char.ToUpperInvariant(o);
}
return (int)char.ToLowerInvariant(o);
}
public int Mark()
{
return stream.Mark();
}
public void Release(int marker)
{
stream.Release(marker);
}
public void Seek(int index)
{
stream.Seek(index);
}
}
}
using Antlr4.Runtime;
using Antlr4.Runtime.Tree;
using System;
namespace TSQLParser
{
class Program
{
static void Main(string[] args)
{
Parse("INSERT INTO [Address] ([AddressLine1] ,[AddressLine2] ,[City] ,[StateProvinceID] ,[PostalCode] ,[ModifiedDate]) VALUES (N'My small, cozy house',N'Small town',N'Small country',1,NULL,'2021-06-30 12:45:30.000')");
}
static void Parse(string input)
{
ICharStream stream = CharStreams.fromString(input);
var lcaseStream = new CaseChangingCharStream(stream, true);
ITokenSource lexer = new Parser.TSqlLexer(lcaseStream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new Parser.TSqlParser(tokens);
parser.BuildParseTree = true;
var tree = parser.tsql_file();
var listener = new SqlParserListener();
ParseTreeWalker.Default.Walk(listener, tree);
Console.WriteLine("Column Names:");
foreach (var columnName in listener.ColumnNames)
{
Console.WriteLine(columnName);
}
Console.WriteLine();
Console.WriteLine("Values:");
foreach (var value in listener.Values)
{
Console.WriteLine(value);
}
}
}
}
using Antlr4.Runtime.Misc;
using Parser;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace TSQLParser
{
public class SqlParserListener : TSqlParserBaseListener
{
private List<string> columnNames;
private List<string> values;
public IReadOnlyCollection<string> ColumnNames => columnNames;
public IReadOnlyCollection<string> Values => values;
public override void EnterInsert_statement([NotNull] TSqlParser.Insert_statementContext context)
{
base.EnterInsert_statement(context);
// Initialize lists when entering an insert statement
columnNames = new List<string>();
values = new List<string>();
}
public override void EnterInsert_column_name_list([NotNull] TSqlParser.Insert_column_name_listContext context)
{
base.EnterInsert_column_name_list(context);
var columns = context.insert_column_id();
foreach (var column in columns)
{
columnNames.Add(column.GetText());
}
}
public override void EnterTable_value_constructor([NotNull] TSqlParser.Table_value_constructorContext context)
{
base.EnterTable_value_constructor(context);
var expList = context.expression_list(0);
var expressions = expList.expression();
foreach (var expression in expressions)
{
values.Add(expression.GetText());
}
}
}
}
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp3.1</TargetFramework>
</PropertyGroup>
<ItemGroup>
<None Remove="TSqlLexer.g4" />
<None Remove="TSqlParser.g4" />
</ItemGroup>
<ItemGroup>
<Antlr4 Include="TSqlLexer.g4">
<Package>Parser</Package>
</Antlr4>
<Antlr4 Include="TSqlParser.g4">
<Package>Parser</Package>
</Antlr4>
</ItemGroup>
<ItemGroup>
<PackageReference Include="Antlr4.Runtime.Standard" Version="4.9.3" />
<PackageReference Include="Antlr4BuildTasks" Version="8.17.0" />
</ItemGroup>
</Project>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment