Overview
In this blog post, I will introduce you all to an open-source parser generator called ANTLR (pronounced ant-ler) with a simple programming question that I encountered in a job interview.
Who this tutorial is intended for
Someone who...
- Knows what a compiler is
  - I mean, if you are delving into parser generators like ANTLR... ;)
- Can program reasonably well in Python
  - We will be using Python 3 and its type annotations (only syntactic sugar for now)
- Understands object-oriented programming
  - Classes are everywhere
- Understands data structures
  - Trees are enough
- Is comfortable with regular expressions
  - You can write a regular expression for simple things like floating-point numbers.
- Has basic knowledge of context-free grammars
  - Just knowing what they are conceptually is enough.
Software needed to follow this tutorial
- ANTLR v4 (latest version as of February 2020)
  - After downloading the JAR file, set up your classpath as described there.
- Python 3 (version 3.8 used here)
  - The global interpreter (the one you get from python.org) or the one inside Anaconda is fine, but I am using the global interpreter.
- Java (openjdk 13 used here)
  - This is used to run ANTLR and to test our parser quickly.
- ANTLR v4 run-time for Python 3
  - pip install antlr4-python3-runtime
- Recommended: Visual Studio Code IDE and its ANTLR and Python extensions
  - You can use Notepad, or Vim if you are hardcore.
- Recommended: Git Bash if on Windows
  - You can also use the Windows shell script (Batch), but I am using Bash.
The Problem Statement
Given an input message, what is the number of words in the longest sentence? Assume the message consists only of spaces, letters, periods, exclamation marks, and question marks. A sentence ends with punctuation, just like in English.
Here are a few examples.
| Input | Output |
|---|---|
| "" | 0 |
| "hello" | 1 |
| "hello world!" | 2 |
| "today. it is nice." | 3 |
| "." | 0 |
| "no...no!?" | 1 (an ellipsis is interpreted as 3 periods) |
| "why?do?you?keep?asking?" | 1 |
Easy Python approach
Before we attempt the ANTLR solution, I want to discuss the simple double for-loop approach in Python. This will be used to verify and benchmark the ANTLR solution.
Algorithm:
- Loop over the entire string.
  - Each time a sentence delimiter is hit, store everything up to that point into a variable.
  - Add this variable to a list.
- Loop over the list of strings.
  - Split each string on whitespace and count the words.
  - Update the current maximum if the count is greater.
- Return the maximum.
# Finds the word count of the longest sentence
def longest_sentence(s: str) -> int:
    sentences = []
    temp = ""
    for c in s:
        if c not in '.?!':
            temp += c
        else:
            sentences.append(temp)
            temp = ""
    sentences.append(temp)

    length = 0
    for sentence in sentences:
        current_length = 0
        for word in sentence.split():
            if word.isalpha():
                current_length += 1
        length = max(length, current_length)
    return length

if __name__ == '__main__':
    while True:
        try:
            print(longest_sentence(input()))
        except EOFError:
            break
Complexity analysis: The nested for-loop itself only does linear work, since each character of the message is visited a constant number of times across the two passes. The expensive part is the repeated string concatenation (temp += c), which can cost O(n^2) in the worst case, so the run-time complexity is quadratic.
ANTLR Solution
Why bother?
You are right. If it's a simple and quick solution you want (say, a 45-minute interview question), there is no need to reach for parser generators like ANTLR. But if your problem changes a little from time to time or grows more complicated, ANTLR is a better choice for maintaining your English parser.
What is ANTLR?
Simply put, ANTLR is a program that generates another program for you. You put in the rules for how to parse your message, and ANTLR spews out a parser (a program) that does it for you. There are other parser generators like ANTLR, such as flex & bison and JavaCC. What makes ANTLR special is its popularity, cool extra features, and abundance of target languages. You can write a program in the ANTLR language (.g4 file extension) and produce parsers in Python, Java, C#, and many more. The downsides of ANTLR are the lack of (free) learning resources and its slower parsing algorithm compared to flex & bison.
Now into ANTLR!
Since we are doing something simple, there is just going to be one ANTLR .g4 file.
[Image: Empty grammar .g4 file]
[Image: ANTLR grammar rules allow regex!]
[Image: Tokens in ANTLR]
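The screenshots do not reproduce here, so below is a minimal sketch of what Sentences.g4 could look like. This is not the exact grammar from the images; it is an assumption pieced together from the rule names (start, sentence) and the Word token that the listener code uses later in this post (for instance, the parse trees shown later contain \r\n tokens inside sentences, which this sketch would simply skip).

grammar Sentences;

// A message is a (possibly empty) sentence followed by any number of
// punctuation-terminated sentences.
start    : sentence (PUNCT sentence)* EOF ;

// A sentence is just a run of words; it may be empty.
sentence : Word* ;

// Lexer rules: a word is a run of letters (this is where the regex-like
// notation comes in), PUNCT ends a sentence, and whitespace is skipped
// before the parser ever sees it.
Word  : [a-zA-Z]+ ;
PUNCT : [.!?] ;
WS    : [ \t\r\n]+ -> skip ;

With rules along these lines, digits are not matched by any token, so an input like "13" triggers a token recognition error, and ctx.Word() in the listener later returns the list of Word tokens inside one sentence.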
Generate and compile the parser
Now that we have the grammar finished, let's see if it works. To do this, be sure to have set up the Java classpath environment variable on your system as described in the installation. I am using Windows 10 with Git Bash, which requires the most complicated setup of all (-.-). For people following in my footsteps, set the CLASSPATH environment variable for your user. Then you just need to define shorter aliases in your ~/.bashrc.

# antlr aliases, once CLASSPATH has been set properly in Windows
alias antlr4='java -Xmx500M org.antlr.v4.Tool'
alias grun='java -Xmx500M org.antlr.v4.gui.TestRig'
$ source ~/.bashrc
To run the ANTLR tool, you can use the alias in the directory of your ANTLR file.
$ antlr4 Sentences.g4 -no-listener
$ javac *.java
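If the tool ran cleanly, the directory should now contain the generated Java sources next to the grammar. The listing below is only what I would expect from ANTLR's usual naming for a combined grammar called Sentences generated with -no-listener; your output may differ slightly.

$ ls
Sentences.g4       Sentences.interp       Sentences.tokens
SentencesLexer.java  SentencesLexer.interp  SentencesLexer.tokens
SentencesParser.java  (plus the .class files produced by javac)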
Running our parser with grun
We will use the grun alias to run the Java target of our parser. This is a fast way of knowing if our grammar file is wrong, before we generate the Python target. Use Ctrl+Z (shown ^Z) on Windows to send an end-of-file (EOF) token to the parser. On Linux, use Ctrl+D.

$ grun Sentences start
hello world!
^Z
$
So the input is accepted as it should be! The syntax of the command 'grun' is the name of your grammar followed by the start rule of your grammar. Your start rule may have a different name, e.g. the C++ ANTLR grammar start rule is called 'translationunit'. If you are wondering, you can input a grammar rule that is not the start rule. That will make grun run the parser from that rule only, which is an excellent way of testing your individual grammar rules when the .g4 grammar gets long.
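For example, to exercise just the sentence rule on its own (assuming the rule name used in this grammar), something like this should work:

$ grun Sentences sentence -tree
I like turtles
^Z

I am omitting the printed tree here because it depends on your exact grammar, but grun will start parsing from the sentence rule instead of start.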
Let's try something that the parser should reject.
$ grun Sentences start
13 is the lucky number!JK.
^Z
line 1:0 token recognition error at: '1'
line 1:1 token recognition error at: '3'
$
Visualizing the parse tree with grun
This feature is what sets ANTLR apart. You can easily visualize the parse tree of an input. As of today, this feature seems to be available only in the Java target. But that shouldn't be a problem for us, unless the Python target has errors.
Textual visualization
To visualize the parsed input textually, we use the -tree flag on grun.

$ grun Sentences start -tree
this!I like
^Z
(start (sentence this) ! (sentence I like \r\n))
$
Graphical visualization
To visualize the parsed input graphically, we use the -gui flag on grun. You do not need to combine it with -tree, but you can.

$ grun Sentences start -tree -gui
this! I like it very much!
^Z
(start (sentence this) ! (sentence I like it very much) ! (sentence \r\n))
$
[Image: ANTLR's visualization feature in Java targets]
Generating and running the Python 3 target
Let's finally turn the ANTLR file into a Python parser by using the -Dlanguage flag.

$ antlr4 Sentences.g4 -Dlanguage=Python3
We also need a small driver script, Main.py, which feeds standard input to the generated lexer and parser.

from antlr4 import *
from SentencesLexer import SentencesLexer
from SentencesParser import SentencesParser

def main():
    input_stream = StdinStream()
    lexer = SentencesLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SentencesParser(stream)
    tree = parser.start()

if __name__ == "__main__":
    main()
Now we can run the Python 3 parser by calling our main script.
$ python Main.py
k k
^Z
$
Great! The main script is working. It is time we actually return the word count of the longest sentence. To do this, we will start to code a Listener child class in Python.
Accessing the parse tree with a listener
The listener is the last unused Python file generated by ANTLR. In our case, it is called SentencesListener.py. We will start a new Python file called LongestSentence.py. This file only contains a class definition.

from SentencesListener import SentencesListener

class LongestSentence(SentencesListener):  # must inherit
    pass
The automatically generated listener class has two methods per grammar rule: one for entering the rule and one for exiting it. The listener object-oriented design pattern removes the burden of walking the parse tree ourselves. We just say what the program should do when it enters or leaves a rule. If you really want to walk down the tree yourself, you can tell ANTLR to generate a visitor with the -visitor flag.
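For reference, generating the Python target together with the visitor classes would look like this; the -visitor flag is standard, and SentencesVisitor.py is the file name I would expect ANTLR to emit for our grammar. We will stick with the listener in this post.

$ antlr4 Sentences.g4 -Dlanguage=Python3 -visitor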
In our problem, we are interested in the number of words, so we will only override the exit rule methods. Let's just try one exit rule: the start rule.
from SentencesListener import SentencesListener
from SentencesParser import SentencesParser

class LongestSentence(SentencesListener):  # must inherit

    def exitStart(self, ctx:SentencesParser.StartContext):
        print("parsed")
We also update Main.py to walk the parse tree with our new listener.

from antlr4 import *
from SentencesLexer import SentencesLexer
from SentencesParser import SentencesParser
from LongestSentence import LongestSentence

def main():
    input_stream = StdinStream()
    lexer = SentencesLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = SentencesParser(stream)
    tree = parser.start()
    ls = LongestSentence()
    walker = ParseTreeWalker()
    walker.walk(ls, tree)

if __name__ == "__main__":
    main()
$ python Main.py
e
^Z
parsed
$
Now we know that it works! Let's count the number of words in each sentence. Thanks to ANTLR's generated parse tree, we can take advantage of its runtime library to get the length of the word list. Inside the constructor, we define a member field to hold the current maximum, which is printed out at the end of the walk.
from SentencesListener import SentencesListener
from SentencesParser import SentencesParser

class LongestSentence(SentencesListener):  # must inherit

    def __init__(self):
        super().__init__()
        self.count_seen_so_far = 0

    def exitStart(self, ctx:SentencesParser.StartContext):
        print(self.count_seen_so_far)

    def exitSentence(self, ctx:SentencesParser.SentenceContext):
        self.count_seen_so_far = max(
            self.count_seen_so_far,
            len(ctx.Word()))
$ python Main.py
e e e
^Z
3
$
Well done! :)
Complexity analysis of ANTLR parsers: I said before that ANTLR isn't as fast as flex & bison in terms of its parsing algorithm. According to a research paper by ANTLR's authors, the parsing algorithm used in ANTLR is O(n^4) in the worst case (yikes), but empirically it is acceptably fast, comparable to state-of-the-art parsers. However, the paper omitted a comparison against flex & bison, so ANTLR may not be the fastest.
On the other hand, working in ANTLR is simpler and faster than in flex & bison, so the slower parsing time is compensated by faster development and fewer headaches. If you are using Python instead of C++, you know what I mean :). I come from the flex & bison world, so I am carrying a bias when evaluating ANTLR's learning curve. We haven't touched ambiguous grammars yet, but since ANTLR allows parse tree visualization, I expect it to be easier to deal with than solving shift-reduce conflicts in flex & bison.
Previously, we wrote a tiny Python script by hand for this specific parsing problem, and that script has an O(n^2) runtime. In a future blog post, we will benchmark the simple Python program against the ANTLR parser.
Conclusion
Congratulations on making your first ANTLR parser! Don't worry if it was difficult, because parser generators are kind of an academic topic. I am trying to simplify this as much as I can, but there is so much background knowledge I must assume to keep this tutorial tidy.

There are still many topics in ANTLR that I haven't covered, but everything we did here covers the bare essentials of getting something working in ANTLR. Further topics include custom syntax error messages and grammar actions. There is also the reference book on ANTLR, though it is not free.
In addition, given ANTLR's (relative) popularity, there are already plenty of grammars available on GitHub. If you want to elevate your context-free grammar skills, be sure to check out how others write theirs. Chances are, there is already a grammar written for you. Still, copy-paste responsibly and always test the things you copy from the internet.
Until next time, love ya neighbor,
Bob the Cod