Compiler | 3 Lexical Analyzer

This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 3 Lexical Analyzer.


3.1 The Role of the Lexical Analyzer

At the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.

  • Strip out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).

  • Correlate error messages generated by the compiler with the source program.

  • The expansion of macros.

Sometimes, lexical analyzers are divided into a cascade of two processes:

  • Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

  • Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

3.1.2 Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

  • A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.

  • A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

  • A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

In many programming languages, the following classes cover most or all of the tokens:

  1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.

  2. Tokens for the operators, either individually or in classes such as the token comparison (< or > or <= or >= or == or !=).

  3. One token representing all identifiers.

  4. One or more tokens representing constants, such as numbers and literal strings.

  5. Tokens for each punctuation symbols, such as left and right parentheses, comma, and semicolon.


3.2 Input Buffering

3.2.1 Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternatively reloaded, as suggested in Fig 3.3.

Each buffer is of the same size N, and N is usually the size of a disk block, e.g. 4096 bytes.

Two pointers to the input are maintained:

  1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine.

  2. Pointer forward scans ahead until a pattern match is found.

Advancing forward requires that we first test whether we have readched the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer.

3.2.2 Sentinels

The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.


3.3 Specification of Tokens

regular expressions

3.3.1 Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. Unicode, which includes approximately 100,000 characters from alphabets around the world, is another example of an alphabet.

A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theories, the terms "sentence" and "word" are often used as synonyms for "string". The length of a string s, usually written |s|, is the number of occurrences of symbols in s. The empty string, denoted ε, is the string of length zero.

A language is any countable set of strings over some fixed alphabet. Note that the definition of "language" does not require that any meaning be ascribed to the strings in the language.

3.3.2 Operations on Languages

3.3.3 Regular Expressions

The regular expressions are built recursively out of smaller regular expressions, using the rules described below. Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions. Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.

BASIS: There are two rules that form the basis:

  1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.

  2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.

INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

  1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).

  2. (r) (s) is a regular expression denoting the language L(r)L(s).

  3. (r)* is a regular expression denoting (L(r))*.

  4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.

As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs of parentheses if we adopt the conventions that:

a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

A language that can be defined by a regular expression is called a regular set.

3.3.4 Regular Definitions

3.3.5 Extensions of Regular Expressions

  1. One or more instances. If r us a regular expression, then (r)+ denotes the language (L(r))+.

  2. Zero or one instance. r? is equivalent to r|ε, or put in another way, L(r?) = L(r) ∪ {ε}.

  3. Character classes. A regular expression a1|a2|···|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2···an]. More importantly, when a1|a2|···|an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|···|z.


3.4 Recognition of Tokens

3.4.1 Transition Diagrams

Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one if several patterns. We may think of a state as summarizing all we need to know about what characters we have seen between the lexemeBegin pointer and the forward pointer (as in the situation of Fig 3.3).

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a. If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads. We shall assume that all out transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some conventions about transition diagrams:

  1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken —— typically returning a token and an attribute value to the parse —— we shall attach that action to the accepting state.

  2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. (If necessary, we can attach any number of *'s to the accepting state.)

  3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

3.4.2 Recognition of Reserved Words and Identifiers

There are two ways that we can handle reserved words that look like identifiers:

  1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.

  2. Create separate transition diagrams for each keyword.

3.4.4 Architecture of a Transition-Diagram-Based Lexical Analyzer

A variable state is holding the number of the current state for a transition diagram. A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state. Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.

To place the simulation of one transition diagram in perspective, let us consider the ways code like Fig 3.18 could fit into the entire lexical analyzer.

  1. We could arrange for the transition diagrams for each token to be tried sequentially.

  2. We could run the various transition diagrams "in parallel", feeding the next input character to all of them and allowing each one to make whatever transitions it required. Note that a normal rule is to take the longest prefix of the input that matches any pattern.

  3. The preferred approach is to combine all the transition diagrams into one. We allow the transition diagram to read input until there is no possible next state, and then take the longest lexeme that matched any pattern.


3.5 The Lexical-Analyzer Generator Lex


3.6 Finite Automata

Finite automata are essentially graphs like transition diagrams, with a few differences:

  1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.

  2. Finite automata come in two flavors:
    (a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges. A symbol can label several edges out of the same state, and ε, the empty string, is a possible label.
    (b) Deterministic finite automata (DFA) have, for each state, and for each symbol of its input alphabet exactly one edge with that symbol leaving that state.

Both deterministic and nondeterministic finite automata are capable of recognizing the same languages. In fact these languages are exactly the same languages, called the regular languages, that regular expressions can describe.

3.6.1 Nondeterministic Finite Automata

A nondeterministic finite automata (NFA) consists of:

  1. A finite set of states S.

  2. A set of input symbols ∑, the input alphabet. We assume that ε, which stands for the empty string, is never a member of ∑.

  3. A transition function that gives, for each state, and for each symbol in ∑∪{ε} a set of next states.

  4. A state s0 from S that is distinguished as the start state (or initial state).

  5. A set of states F, a subset of S, that is distinguished as the accepting states (or final states).

We can represent either an NFA or DFA by a transition graph, where the nodes are states and the labeled edges represent the transition function. There is an edge labeled a from state s to state t if and only if t is one of the next states for state s and input a. This graph is very much like a transition diagram, except:

a) The same symbol can label edges from one state to several different states and

b) An edge may be labeled by ε, the empty string, instead of, or in addition to, symbols from the input alphabet.

3.6.2 Transition Tables

We can also represent an NFA by a transition table, whose rows correspond to states, and whose columns correspond to the input symbols and ε. The entry for a given state and input is the value of the transition function applied to those arguments. If the transition function has no in formation about that state-input pair, we put Ø in the table for the pair.

3.6.3 Acceptance of Input Strings by Automata

An NFA accepts input string x if and only if there is some path in the transition graph from the start state to one of the accepting states, such that the symbols along the path spell out x.

Remember that an NFA accepts a string as long as some path labeled by that string leads from the start to accepting state. The existance of other paths leading to a nonaccepting state is irrelevant.

The language defined (or accepted) by an NFA is the set of strings labeled some path from the start to an accepting state. We may use L(A) to stand for the language accepted by automaton A.

3.6.4 Deterministic Finite Automata

A deterministic finite automaton (DFA) is a special case of an NFA where:

  1. There are no movers on input ε, and

  2. For each state s and input symbol a, there is exactly one edge out of s labeled a.

While NFA is an abstract representation of an algorithm to recognize the strings of a certain language, the DFA is a simple, concrete algorithm for recognizing strings. It is fortunate indeed that every regular expression and every NFA can be converted to a DFA accepting the same language, because it is the DFA that we really implement or simulate when building lexical analyzers.


3.7 From Regular Expressions to Automata

3.7.1 Conversion of an NFA to a DFA

The general idea behind the subset construction is that each state of the constructed DFA corresponds to a set of NFA states. After reading the input a1a2···an, the DFA is in that state which corresponds to the set of states that the NFA can reach, from its start state, following paths labeled a1a2···an.

It is possible that the number of DFA states is exponential in the number of NFA states, which could lead to difficulties when we try to implement this DFA. However, part of the power of the automaton-based approach to lexical analysis is that for real languages, the NFA and DFA have approximately the same number of states, and the exponential behavior is not seen.

(It is rare in my notes to put an example, but the following one is an exception for that this example provides a concrete supply for the above algorithm.)

3.7.2 Simulation of an NFA

3.7.3 Efficiency of NFA Simulation

O (k(m+n)), in which k is the length of the input, m+n is the size of the transition graph.

3.7.4 Construction of an NFA from a Regular Expression

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,324评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,303评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,192评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,555评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,569评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,566评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,927评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,583评论 0 257
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,827评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,590评论 2 320
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,669评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,365评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,941评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,928评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,159评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,880评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,399评论 2 342

推荐阅读更多精彩内容