Compiler | 3 Lexical Analyzer

This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 3 Lexical Analyzer.

3.1 The Role of the Lexical Analyzer

At the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.

Strip out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
Correlate error messages generated by the compiler with the source program.
The expansion of macros.

Sometimes, lexical analyzers are divided into a cascade of two processes:

Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

3.1.2 Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

In many programming languages, the following classes cover most or all of the tokens:

One token for each keyword. The pattern for a keyword is the same as the keyword itself.
Tokens for the operators, either individually or in classes such as the token comparison (< or > or <= or >= or == or !=).
One token representing all identifiers.
One or more tokens representing constants, such as numbers and literal strings.
Tokens for each punctuation symbols, such as left and right parentheses, comma, and semicolon.

3.2 Input Buffering

3.2.1 Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternatively reloaded, as suggested in Fig 3.3.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g. 4096 bytes.

Two pointers to the input are maintained:

Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine.
Pointer forward scans ahead until a pattern match is found.

Advancing forward requires that we first test whether we have readched the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

3.2.2 Sentinels

The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

3.3 Specification of Tokens

regular expressions

3.3.1 Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another example of an alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theories, the terms "sentence" and "word" are often used as synonyms for "string". The length of a string s, usually written |s|, is the number of occurrences of symbols in s. The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet. Note that the definition of "language" does not require that any meaning be ascribed to the strings in the language.

3.3.2 Operations on Languages

3.3.3 Regular Expressions

The regular expressions are built recursively out of smaller regular expressions, using the rules described below. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions. Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.

BASIS: There are two rules that form the basis:

ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

(r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
(r) (s) is a regular expression denoting the language L(r)L(s).
(r)* is a regular expression denoting (L(r))*.
(r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.

As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:

a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

A language that can be defined by a regular expression is called a regular set.

3.3.4 Regular Definitions

3.3.5 Extensions of Regular Expressions

One or more instances. If r us a regular expression, then (r)⁺ denotes the language (L(r))⁺.
Zero or one instance. r? is equivalent to r|ε, or put in another way, L(r?) = L(r) ∪ {ε}.
Character classes. A regular expression a₁|a₂|···|a_n, where the a_i's are each symbols of the alphabet, can be replaced by the shorthand [a₁a₂···a_n]. More importantly, when a₁|a₂|···|a_n form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a₁-a_n, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|···|z.

3.4 Recognition of Tokens

3.4.1 Transition Diagrams

Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one if several patterns. We may think of a state as summarizing all we need to know about what characters we have seen between the lexemeBegin pointer and the forward pointer (as in the situation of Fig 3.3).

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads. We shall assume that all out transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some conventions about transition diagrams:

Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken —— typically returning a token and an attribute value to the parse —— we shall attach that action to the accepting state.
In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. (If necessary, we can attach any number of *'s to the accepting state.)
One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

3.4.2 Recognition of Reserved Words and Identifiers

There are two ways that we can handle reserved words that look like identifiers:

Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.
Create separate transition diagrams for each keyword.

3.4.4 Architecture of a Transition-Diagram-Based Lexical Analyzer

A variable state is holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.

To place the simulation of one transition diagram in perspective, let us consider the ways code like Fig 3.18 could fit into the entire lexical analyzer.

We could arrange for the transition diagrams for each token to be tried sequentially.
We could run the various transition diagrams "in parallel", feeding the next input character to all of them and allowing each one to make whatever transitions it required. Note that a normal rule is to take the longest prefix of the input that matches any pattern.
The preferred approach is to combine all the transition diagrams into one. We allow the transition diagram to read input until there is no possible next state, and then take the longest lexeme that matched any pattern.

3.5 The Lexical-Analyzer Generator Lex

3.6 Finite Automata

Finite automata are essentially graphs like transition diagrams, with a few differences:

Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
Finite automata come in two flavors:
(a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and ε, the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet exactly one edge with that symbol leaving that state.

Both deterministic and nondeterministic finite automata are capable of recognizing the same languages. In fact these languages are exactly the same languages, called the regular languages, that regular expressions can describe.

3.6.1 Nondeterministic Finite Automata

A nondeterministic finite automata (NFA) consists of:

A finite set of states S.
A set of input symbols ∑, the input alphabet. We assume that ε, which stands for the empty string, is never a member of ∑.
A transition function that gives, for each state, and for each symbol in ∑∪{ε} a set of next states.
A state s₀ from S that is distinguished as the start state (or initial state).
A set of states F, a subset of S, that is distinguished as the accepting states (or final states).

We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function. There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a. This graph is very much like a transition diagram, except:

a) The same symbol can label edges from one state to several different states and

b) An edge may be labeled by ε, the empty string, instead of, or in addition to, symbols from the input alphabet.

3.6.2 Transition Tables

We can also represent an NFA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and ε. The entry for a given state and input is the value of the transition function applied to those arguments. If the transition function has no in formation about that state-input pair, we put Ø in the table for the pair.

3.6.3 Acceptance of Input Strings by Automata

An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states, such that the symbols along the path spell out x.

Remember that an NFA accepts a string as long as some path labeled by that string leads from the start to accepting state. The existance of other paths leading to a nonaccepting state is irrelevant.

The language defined (or accepted) by an NFA is the set of strings labeled some path from the start to an accepting state. We may use L(A) to stand for the language accepted by automaton A.

3.6.4 Deterministic Finite Automata

A deterministic finite automaton (DFA) is a special case of an NFA where:

There are no movers on input ε, and
For each state s and input symbol a, there is exactly one edge out of s labeled a.

While NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings. It is fortunate indeed that every regular expression and every NFA can be converted to a DFA accepting the same language, because it is the DFA that we really implement or simulate when building lexical analyzers.

3.7 From Regular Expressions to Automata

3.7.1 Conversion of an NFA to a DFA

The general idea behind the subset construction is that each state of the constructed DFA corresponds to a set of NFA states. After reading the input a₁a₂···a_n, the DFA is in that state which corresponds to the set of states that the NFA can reach, from its start state, following paths labeled a₁a₂···a_n.

It is possible that the number of DFA states is exponential in the number of NFA states, which could lead to difficulties when we try to implement this DFA. However, part of the power of the automaton-based approach to lexical analysis is that for real languages, the NFA and DFA have approximately the same number of states, and the exponential behavior is not seen.

(It is rare in my notes to put an example, but the following one is an exception for that this example provides a concrete supply for the above algorithm.)

3.7.2 Simulation of an NFA

3.7.3 Efficiency of NFA Simulation

O (k(m+n)), in which k is the length of the input, m+n is the size of the transition graph.

Compiler | 3 Lexical Analyzer