Tokenizers
Home
Introduction
Philosophy
General techniques
Sorting
Searching
Factory
Persistence
Logging
Streaming
Tokenizers
Parsing
File Searching
Command
PseudoPatterns
Compiling
Downloads
FeedBack

What is a tokenizer ?

[Syntax coloring]
[
Sample tokenizers]

Tokenizers are constructs (objects or procedures) that take an input stream and recognize the “parts” of it. To be able to recognize the parts, there has to be an agreement as to what forms these parts.  For this there is a need for a language definition. The language definition tells us : “This string is well-formed, this string isn’t.”

The languages that can be recognized by tokenizers can be described by things such as regular expressions.  A tokenizer is something capable of doing matches on something like a regular expression.

There is a large established literature on tokenizers (aka lexical analyzers, aka scanners). It talks about what exactly is a regular expression. The entire theory, including deterministic and non-deterministic finite automata can be read in “Elements of the theory of computation”.  This is very dense stuff and I had to learn it all when I was in the university.

Computer programs able to do lexical analysis have been known for many years.  The all-time classic that describes them is the so-called “dragon book”. It is must-read material for anybody ever involved in compiler-writing.

The tokenizers we are going to write are not of this kind. Our tokenizers will be less-than-optimal but -hopefully- much easier to understand and maintain.  If you want to know where it came from, you can read “building parsers in Java” (ouch ! a Java book !).There was also an article in JavaPro of september 2001 (“ a tokenizer for your collection). Finally, there is a short hint at this implementation in the GoF book (discussing the Interpreter pattern).

Ok, let’s get on with the story. For tokenizers, there are 4 main constructions :

A

Matches a literal “A” in the input

A | B

Matches either A or B in the input

A*

Matches zero or more occurences of A

A B

Matches an A followed by B

With any combination of these expressions you can match anything which can be matched by a regular expression. However, this generally makes notation very hard and unreadable.  Therefore, in general the following additions are made :

A+

Matches one or more occurences of A. Equivalent to :
A A*

A {n,m}

Matches between and m occurences of A. For n = 2 and m = 4 this is equivalent to :
AA | AAA | AAA.

[ A ]

An optional (one or zero) occurence of A.  Euivalent to :
A |

A-B

Matches all “characters” in the range from A to B. For A = “R” and B = “U” this is euivalent to :
R | S | T | U

cToken

Tokenizers can be used for two purposes : match the entire string or match a part that is as long as possible and return it.  The next part will be matched where the previous was stopped. In general it is a good idea to consider something like a “Token” to be returned by the tokenizer. Here is what my token class looks like :

class cToken:
Property Get StartPos() As cTokenOffset
Property Get EndPos() As cTokenOffset
Property Get UniqueID() As Long
Property Get Value() As String

I could have made “token” and abstract interface, or even a memento to be interpreted by the program that receives the output from the tokenizer.  I chose not to do this because it would preclude writing standard tokenizers.

Tokens have positional information in the form of a StartPosition (At what position did the match start ?), an EndPosition (what was the last character position for the match). These are of class cTokenOffset, which is actually just a tripple (streamOffset (starting at 0, lineNumber(starting at1), and lineOffset (at 0))

Tokens can possibly have a unique ID to distinguish the categories of tokens.  For a programming language, you will typically define an enum for all tokens (strings, number, operator, ...) in your language. The UniqueID allows you to see what got matched. Unique ID are generally most usefull as input for parsers.

The last property of a token is the value. It contains the string that got matched.

With out little token class in place, we are able to have a look at the main interface for this component :

ITokenizer

A tokenizer is something that reads input and returns a token if a match was found.  If no match was found, it returns Nothing.

This can be converted into the following interface definition

interface ITokenizer
Function NextToken(ByVal objInput As cInputStreamBuffered) As cToken

If you read the short article about streams, you migh have expected objInput to be of type IInputStream. This is not the case.  Why ?

The answer is simple.  Our tokenizers are trial-and-error tokenizers.  We have a bunch of them and let each of them try in turn to see if he can match the input. If he is not able to find a match, we have to ‘push back’ the characters he reads.  Then wecan let the next tokenizer have a try at exactly the same position in the input.

Since this functionality is easy to build on top of an existing IInputstream, we have created a new class called cInputStreamBuffered. It will work on any existing or non-existing IInputStream.

It accomplishes what it has to do, by keeping a buffer of all characters read so far.  In case of a rollback, it just resets the position and reads from the buffer until it is exhausted, after which it starts reading from the IInputstream again.  When it is reading from the inputstream, it always makes sure it adds the read characters to the buffer for a possible next rollback.

Here are the relevant methods of CInputStreamBuffered :

cInputStreamBuffered:
Function ReadChar() As String
Property
Get Bookmark() As Object
Property Set Bookmark(ByVal objVal As Object)
Sub CommitBookMark (ByVal objVal As Object)
Property Get Offset() As cTokenOffset
Property Get EOF() As Boolean

The Bookmark methods are the one enabling backtracking.  They are emitted as Object to make them blackboxes (Memento’s) to the user of the class. The CommitBookmark is the one that gives the sign to clear the buffer. (Without this method, the buffer would contain the contents of the entire IInputStream, which is not the idea)

There was no need at all to make cInputStreamBuffered an abstract interface since there is basically only one implementation (possible).

A first Tokenizer : cTokenizerChar

It is time to implement our first Tokenizer. It will be one of our most basic building blocks : cTokenizerChar. This tokenizer is capable of recognizing a single character. Here is what the code looks like :

Implements iTokenizer

Private mChar       As String
Private
mTokenID    As Long
Private mCompareMethod As VbCompareMethod

Friend Sub Initialize(ByVal strChar As String, _
                     ByVal nTokenID As Long, _
                     Optional ByVal bIgnoreCase As Boolean = False)
   mChar = strChar
   mTokenID = nTokenID
   mCompareMethod = IIf(bIgnoreCase, vbTextCompare, vbBinaryCompare)
End Sub

Private
Function iTokenizer_NextToken(ByVal objInput As cInputStreamBuffered) As cToken
  Dim objPosition       As cTokenOffset
  Dim objBookMark       As Object
  Dim strCurrent        As String
 
  With objInput
       Set objPosition = .Offset
       Set objBookMark = .Bookmark
      
       strCurrent = .ReadChar
      
       If StrComp(strCurrent, mChar, mCompareMethod) = 0 Then
             Set iTokenizer_NextToken = CreateToken(objPosition, .Offset, mTokenID, strCurrent)
       Else
          
' no match found
           Set iTokenizer_NextToken = Nothing
           Set .Bookmark = objBookMark
' roll back
       End If
       Set objPosition = Nothing
       Set objBookMark = Nothing
  End With
End
Function

As always, I constructed a factory method for this object. It is currently a private object, keeping tthe implementation details hidden.

Other primitives

If you remember the 4 primitives in the table above, you known that we need to implement 3 more primitives. These are (in the order of the table) cTokenizerChoice (A| B), cTokenizerKleeneStar(A *) and cTokenizerConcat ( A B).

[Note : Kleene is a person.  the one who invented the * and + notation and did some theoretical work on tokenizers]

The implementation of the 3 is basically the same. I will give you a snippet of cTokenizerKleeneStar :

Private Function iTokenizer_NextToken(ByVal objInput As cInputStreamBuffered) As cToken
  Dim objPosition       As cTokenOffset
  Dim objBookMark       As Object
  Dim objSubToken       As cToken
  Dim strLexeme         As String
 
  Set iTokenizer_NextToken = Nothing
  With objInput
       Set objPosition = .Offset
       Set objBookMark = .Bookmark
       Set objSubToken = mSubPattern.NextToken(objInput)
      
       strLexeme =
""
      
       While (Not objSubToken Is Nothing) And Not (.EOF)
           strLexeme = strLexeme & objSubToken.Value
           Set objBookMark = .Bookmark
           Set objSubToken = mSubPattern.NextToken(objInput)
       Wend
       Set .Bookmark = objBookMark
' roll back to last position returning a subtoken
      
      
' no way to not have a match
       Set iTokenizer_NextToken = CreateToken(objPosition, .Offset, mTokenID, strLexeme)
       Set objPosition = Nothing
       Set objBookMark = Nothing
       Set objSubToken = Nothing
  End With
End
Function

With these 4 primitive tokenizers, we are theoretically settled for all regular languages.

 To create a Tokenizer that matches any number of A’s followed by a B, you would write

CreateTokenizerConcat(CreateTokenizerKleeneStar(CreateTokenizerChar(“A”)),CreateT okenizerChar(“B”))

Needless to say, this code is very hard to read. We come back to the problem of readability later.

Other Standard Tokenizers

Now, I wouldn’t have mentioned the tokenizers in the second table if I was not going to implement them.  So I have written a cTokenizerRepeat(n,m), cTokenizerKleenePlus (+) and a cTokenizerRange.

Then it was time for the interesting work. If you take a look at the code of cKleeneStar, you will see that the value of the token is a concatenation of the values of the subpattern.  An obvious variant would be a decorator to ignore the result (the Value in the token) of a submatch.  With this decorator it becomes possible to match a pattern like p.e. : (1233) (45) - 899, while only putting the numeric values in the ‘Value’ of the token. It also enables you to parse VB strings, chopping of the surrounding quotes and replacing double quotes with single quotes (by ignoring the second quote).

Ok, we’re only getting started.  How about a “pseudo”-tokenizer that matches the beginning of the stream (CTokenizerBOF) and one that matches the end of the stream (cTokenizerEOF). This allows you to define an exact match for the input stream. They are mainly useful for pattern matching.

In language parsing, it is common to ignore certain tokens (comments and white space come to mind). We therefore wrote a simple decorator that is able to ignore a token based on its ID. (cTokenizerIgnoreID)

There is also a very important tokenizer for language parsing : cTokenizerCascade. This object will take any number of sub-tokenizers and evaluate them in-order. The first tokenizer that finds a match is the result of the cTokenizerCascade. Notice that it is similar in behavior to a cTokenizerChoice, but it doesn’t mess with the UniqueID returned by the sub-tokenizer.

In addition I wrote some Facade methods to allow for easy creation of certain common tokenizers :

createTokenizerString

Creates a concatenation of all characters in the string

createTokenizerStringChoice

Creates a choice of all characters in the string

Currently, these two methods are Facades, but there is nothing preventing me of creating specialized tokenizers to optimize performance. As a component user you will never know.

Making the code more readable

There is a huge problem with all these tokenizers. If you did not write them yourself, you could quickly get lost.  To worsen things, the code to generate them tends to be very LISP-like and unreadable.

I therefore set out to write a little parser which is able to convert a string into a myriad of tokenizers. You never need to know which tokenizers get generated or how they function.  The parser will generate them for you.

Most regular expression matchers use a language which comes from the unix world and which became popular with the tool grep (or was it vi ?) It uses very cryptic syntax, like in :

^a*(b|c [o-v]+){2,3}$

I can read it but I don’t like the syntax for several reasons. I like a more verbose syntax.  Therefore I invented my own.  The previous expression would become :

START 0MORE(“a”) REPEAT(OR( “b” , 1MORE (RANGE(“o”, “v”))), 2, 3) END

You might argue that this is hardly more readable. Let’s just say that this a largely a matter of opinion. If you don’t like the syntax, you can always roll your own little parser.

Enough said, the grammar goes like this :

Tokenizer ::= BaseTokenizer EOF
           | BaseTokenizer Tokenizer
           ;

BaseTokenizer ::= REPEAT ( TOKENIZER , NUMBER [ , NUMBER ] )
               | 0MORE ( TOKENIZER )
               | 1MORE ( TOKENIZER )
               | IGNORE ( TOKENIZER )
               | RANGE ( STRING , STRING)
               | START
               | END
               | OR ( TokenizerList )
               | DIGIT
               | DIGIT ( NUMBER )
               | DIGIT ( NUMBER , NUMBER)
               | LETTER
               | LETTER ( NUMBER )
               | LETTER ( NUMBER , NUMBER)
               | ALFA
               | ALFA ( NUMBER )
               | ALFA ( NUMBER , NUMBER)
               ;

TokenizerList ::= Tokenizer
               | Tokenizer , TokenizerList
               ;

The parser is a recursive-descend parser which constructs the necessary tokenizers on the fly.  What’s really interesting is that it uses the tokenizer classes to specify its own tokenizer. This always gives a marvelous impression common to bootstrap problems.  To conclude the discussion, the parser is encapsulated behind the facade method CreateTokenizerFromScript.

What’s ahead

The code for all this is published in the downloads section.

I am seriously thinking about making the basic tokenizers visitable. This would make it possible to write visitors that generate the VB  code that is behind the compound tokenizers generated by a script.  It also makes it possible to restore these visitors to a normalized expression in my little matching languages. We’ll see what we will do.

I wrote these tokenizers as part of a much larger project that involves parsing VB6 code.  The idea of that large project is to take an existing VB6 project and convert it to VB.NET (or C#) the way I want it to be converted.  After all - a lot off the classes I wrote are available in .NET (be it in a much more powerful version).

Conclusion

These few classes (21 - interfaces counted in) make it possible to quickly write your own tokenizer. As many of the other components they are not screamingly fast, but they are understandeable and maintainable. What’s more, they are extensible.  You could easily write your own high-performance tokenizers and combine them with existing ones. Or you could write a tokenizer that swaps the input of two sub-tokenizers making it really powerful.  I hope you enjoy using them as much as I enjoyed writing them.  Let me know.

These classes have the benefit of not requiring a generator ( like Lex, Flex, PCLex, VisualParse, ...) to deal with lexical analysis. Therefore you never need to leave your favorite tool and you don’t have a hard time doing cross-language debugging.

If you like this section , you can read about related stuff :

[Syntax coloring]
[
Sample tokenizers]

Subitems :

[Syntax coloring]
[
Sample tokenizers]

 

Site updated : Monday, February 17, 2003