Overview
In this blog post, I will introduce you all to an open-source parser generator called ANTLR (pronounced ant-ler) with a simple programming question that I encountered in a job interview.
Who this tutorial is intended for
Someone who...
- Knows what a compiler is
  - I mean, if you are delving into parser generators like ANTLR... ;)
- Can program reasonably well in Python
  - We will be using Python 3 and its type annotations (only syntactic sugar for now)
- Understands object-oriented programming
  - Classes are everywhere
- Understands data structures
  - Trees are enough
- Is comfortable with regular expressions
  - You can write a regular expression for simple things like floating-point numbers.
- Has basic knowledge of context-free grammars
  - Just knowing what they are conceptually is enough.
Software needed to follow this tutorial
- ANTLR v4 (latest version as of February 2020)
  - After downloading the JAR file, set up your classpath as described there.
- Python 3 (version 3.8 used here)
  - The global interpreter (the one you get from python.org) or the one inside Anaconda is fine, but I am using the global interpreter.
- Java (openjdk 13 used here)
  - This is used to run ANTLR and to test our parser quickly.
- ANTLR v4 run-time for Python 3
  - pip install antlr4-python3-runtime
- Recommended: Visual Studio Code IDE and its ANTLR and Python extensions
  - You can use Notepad, or Vim if you are hardcore.
- Recommended: Git Bash if on Windows
  - You can also use the Windows shell script (Batch), but I am using Bash.
The Problem Statement
Given an input message, what is the number of words in the longest sentence? Assume the message consists only of spaces, letters, periods, exclamation marks, and question marks. A sentence ends with punctuation, just like in English.
Here are a few examples.
| Input | Output |
|---|---|
| "" | 0 |
| "hello" | 1 |
| "hello world!" | 2 |
| "today. it is nice." | 3 |
| "." | 0 |
| "no...no!?" | 1 (an ellipsis is interpreted as 3 periods) |
| "why?do?you?keep?asking?" | 1 |
Easy Python approach
Before we attempt the ANTLR solution, I want to discuss the simple double for-loop approach in Python. This will be used to verify and benchmark the ANTLR solution.
Algorithm:
- Loop over the entire string.
  - Each time a sentence delimiter is hit, store everything up to that point into a variable.
  - Add this variable to a list.
- Loop over the list of strings.
  - Split each string on whitespace and count the words.
  - Update the current maximum if the count is greater.
- Return the maximum.
# Finds the word count of the longest sentence
def longest_sentence(s: str) -> int:
    sentences = []
    temp = ""
    for c in s:
        if c not in '.?!':
            temp += c
        else:
            sentences.append(temp)
            temp = ""
    sentences.append(temp)

    length = 0
    for sentence in sentences:
        current_length = 0
        for word in sentence.split():
            if word.isalpha():
                current_length += 1
        length = max(length, current_length)
    return length

if __name__ == '__main__':
    while True:
        try:
            print(longest_sentence(input()))
        except EOFError:
            break
Complexity analysis: The nested for-loop itself only does linear work, since each character of the message is visited a constant number of times across the two passes. The expensive part is the repeated string concatenation (temp += c), which can cost O(n^2) in the worst case, so the run-time complexity is quadratic.
ANTLR Solution
Why bother?
You are right. If it's a simple and quick solution you want (say, a 45-minute interview question), there is no need to reach for parser generators like ANTLR. But if your problem changes a little from time to time or grows more complicated, ANTLR is a better choice for maintaining your English parser.
What is ANTLR?
Simply put, ANTLR is a program that generates another program for you. You put in the rules for how to parse your message, and ANTLR spews out a parser (a program) that does it for you. There are other parser generators like ANTLR, such as flex & bison and JavaCC. What makes ANTLR special is its popularity, cool extra features, and abundance of target languages. You can write a program in the ANTLR language (.g4 file extension) and produce parsers in Python, Java, C#, and many more. The downsides of ANTLR are the lack of (free) learning resources and its slower parsing algorithm compared to flex & bison.
Now into ANTLR!
Since we are doing something simple, there is just going to be one ANTLR .g4 file.
[Image: Empty grammar .g4 file]
[Image: ANTLR grammar rules allow regex!]
[Image: Tokens in ANTLR]
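The screenshots do not reproduce here, so below is a minimal sketch of what Sentences.g4 could look like. This is not the exact grammar from the images; it is an assumption pieced together from the rule names (start, sentence) and the Word token that the listener code uses later in this post (for instance, the parse trees shown later contain \r\n tokens inside sentences, which this sketch would simply skip).

grammar Sentences;

// A message is a (possibly empty) sentence followed by any number of
// punctuation-terminated sentences.
start    : sentence (PUNCT sentence)* EOF ;

// A sentence is just a run of words; it may be empty.
sentence : Word* ;

// Lexer rules: a word is a run of letters (this is where the regex-like
// notation comes in), PUNCT ends a sentence, and whitespace is skipped
// before the parser ever sees it.
Word  : [a-zA-Z]+ ;
PUNCT : [.!?] ;
WS    : [ \t\r\n]+ -> skip ;

With rules along these lines, digits are not matched by any token, so an input like "13" triggers a token recognition error, and ctx.Word() in the listener later returns the list of Word tokens inside one sentence.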
Generate and compile the parser
Now that we have the grammar finished, let's see if it works. To do this, be sure to have set up the Java classpath environment variable on your system as described in the installation. I am using Windows 10 with Git Bash, which requires the most complicated setup of all (-.-). For people following in my footsteps, set the CLASSPATH environment variable for your user. Then you just need to define shorter aliases in your ~/.bashrc.

# antlr aliases, once CLASSPATH has been set properly in Windows
alias antlr4='java -Xmx500M org.antlr.v4.Tool'
alias grun='java -Xmx500M org.antlr.v4.gui.TestRig'
$ source ~/.bashrc
To run the ANTLR tool, you can use the alias in the directory of your ANTLR file.
$ antlr4 Sentences.g4 -no-listener
$ javac *.java
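If the tool ran cleanly, the directory should now contain the generated Java sources next to the grammar. The listing below is only what I would expect from ANTLR's usual naming for a combined grammar called Sentences generated with -no-listener; your output may differ slightly.

$ ls
Sentences.g4       Sentences.interp       Sentences.tokens
SentencesLexer.java  SentencesLexer.interp  SentencesLexer.tokens
SentencesParser.java  (plus the .class files produced by javac)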
Running our parser with grun
We will use the grun alias to run the Java target of our parser. This is a fast way of knowing if our grammar file is wrong, before we generate the Python target. Use Ctrl+Z (shown ^Z) on Windows to send an end-of-file (EOF) token to the parser. On Linux, use Ctrl+D.

$ grun Sentences start
hello world!
^Z
$
So the input is accepted as it should be! The syntax of the command 'grun' is the name of your grammar followed by the start rule of your grammar. Your start rule may have a different name, e.g. the C++ ANTLR grammar start rule is called 'translationunit'. If you are wondering, you can input a grammar rule that is not the start rule. That will make grun run the parser from that rule only, which is an excellent way of testing your individual grammar rules when the .g4 grammar gets long.
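For example, to exercise just the sentence rule on its own (assuming the rule name used in this grammar), something like this should work:

$ grun Sentences sentence -tree
I like turtles
^Z

I am omitting the printed tree here because it depends on your exact grammar, but grun will start parsing from the sentence rule instead of start.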
Let's try something that the parser should reject.
$ grun Sentences start
13 is the lucky number!JK.
^Z
line 1:0 token recognition error at: '1'
line 1:1 token recognition error at: '3'
$
Visualizing the parse tree with grun
This feature is what sets ANTLR apart. You can easily visualize the parse tree of an input. As of today, this feature seems to be available only in the Java target. But that shouldn't be a problem for us, unless the Python target has errors.
Textual visualization
To visualize the parsed input textually, we use the -tree flag on grun.

$ grun Sentences start -tree
this!I like
^Z
(start (sentence this) ! (sentence I like \r\n))
$
Graphical visualization
To visualize the parsed input graphically, we use the -gui flag on grun. You do not need to combine it with -tree, but you can.

$ grun Sentences start -tree -gui
this! I like it very much!
^Z
(start (sentence this) ! (sentence I like it very much) ! (sentence \r\n))
$
[Image: ANTLR's visualization feature in Java targets]
Generating and running the Python 3 target
Let's finally turn the ANTLR file into a Python parser by using the -Dlanguage flag.

$ antlr4 Sentences.g4 -Dlanguage=Python3
We also need a small driver script, Main.py, which feeds standard input to the generated lexer and parser.

from antlr4 import *
from SentencesLexer import SentencesLexer
from SentencesParser import SentencesParser

def main():
    input_stream = StdinStream()
    lexer = SentencesLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SentencesParser(stream)
    tree = parser.start()

if __name__ == "__main__":
    main()
Now we can run the Python 3 parser by calling our main script.
$ python Main.py
k k
^Z
$
Great! The main script is working. It is time we actually return the word count of the longest sentence. To do this, we will start to code a Listener child class in Python.
Accessing the parse tree with a listener
The listener is the last unused Python file generated by ANTLR. In our case, it is called SentencesListener.py. We will start a new Python file called LongestSentence.py. This file only contains a class definition.

from SentencesListener import SentencesListener

class LongestSentence(SentencesListener):  # must inherit
    pass
The automatically generated listener class has two methods per grammar rule: one for entering the rule and one for exiting it. The listener object-oriented design pattern removes the burden of walking the parse tree ourselves. We just say what the program should do when it enters or leaves a rule. If you really want to walk down the tree yourself, you can tell ANTLR to generate a visitor with the -visitor flag.
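For reference, generating the Python target together with the visitor classes would look like this; the -visitor flag is standard, and SentencesVisitor.py is the file name I would expect ANTLR to emit for our grammar. We will stick with the listener in this post.

$ antlr4 Sentences.g4 -Dlanguage=Python3 -visitor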
In our problem, we are interested in the number of words, so we will only override the exit rule methods. Let's just try one exit rule: the start rule.
from SentencesListener import SentencesListener
from SentencesParser import SentencesParser

class LongestSentence(SentencesListener):  # must inherit

    def exitStart(self, ctx:SentencesParser.StartContext):
        print("parsed")
We also update Main.py to walk the parse tree with our new listener.

from antlr4 import *
from SentencesLexer import SentencesLexer
from SentencesParser import SentencesParser
from LongestSentence import LongestSentence

def main():
    input_stream = StdinStream()
    lexer = SentencesLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SentencesParser(stream)
    tree = parser.start()
    ls = LongestSentence()
    walker = ParseTreeWalker()
    walker.walk(ls, tree)

if __name__ == "__main__":
    main()
$ python Main.py
e
^Z
parsed
$
Now we know that it works! Let's count the number of words in each sentence. Thanks to ANTLR's generated parse tree, we can take advantage of its runtime library to get the length of the word list. Inside the constructor, we define a member field to hold the current maximum, which is printed out at the end of the walk.
from SentencesListener import SentencesListener
from SentencesParser import SentencesParser

class LongestSentence(SentencesListener):  # must inherit

    def __init__(self):
        super().__init__()
        self.count_seen_so_far = 0

    def exitStart(self, ctx:SentencesParser.StartContext):
        print(self.count_seen_so_far)

    def exitSentence(self, ctx:SentencesParser.SentenceContext):
        self.count_seen_so_far = max(
            self.count_seen_so_far,
            len(ctx.Word()))
$ python Main.py
e e e
^Z
3
$
Well done! :)
Complexity analysis of ANTLR parsers: I said before that ANTLR isn't as fast as flex & bison in terms of its parsing algorithm. According to a research paper by ANTLR's authors, the parsing algorithm used in ANTLR is O(n^4) in the worst case (yikes), but empirically it is acceptably fast, comparable to state-of-the-art parsers. However, the paper omitted a comparison against flex & bison, so ANTLR may not be the fastest.
On the other hand, working in ANTLR is simpler and faster than in flex & bison, so the slower parsing time is compensated by faster development and fewer headaches. If you are using Python instead of C++, you know what I mean :). I come from the flex & bison world, so I am carrying a bias when evaluating ANTLR's learning curve. We haven't touched ambiguous grammars yet, but since ANTLR allows parse tree visualization, I expect it to be easier to deal with than solving shift-reduce conflicts in flex & bison.
Previously, we wrote a tiny Python script by hand for this specific parsing problem, and that script has an O(n^2) runtime. In a future blog post, we will benchmark the simple Python program against the ANTLR parser.
Conclusion
Congratulations on making your first ANTLR parser! Don't worry if it was difficult, because parser generators are kind of an academic topic. I am trying to simplify this as much as I can, but there is so much background knowledge I must assume to keep this tutorial tidy.

There are still many topics in ANTLR that I haven't covered, but everything we did here covers the bare essentials of getting something working in ANTLR. Further topics include custom syntax error messages and grammar actions. There is also the reference book on ANTLR, though it is not free.
In addition, given ANTLR's (relative) popularity, there are already plenty of grammars available on GitHub. If you want to elevate your context-free grammar skills, be sure to check out how others write theirs. Chances are, there is already a grammar written for you. Still, copy-paste responsibly and always test the things you copy from the internet.
Until next time, love ya neighbor,
Bob the Cod