手写语法分析使用递归下降分析法和算符优先分析法。
BNF
语法分析对应上下文无关文法。定义时一般用BNF描述出来。lua的BNF大致如下
chunk ::= block
block ::= {stat [";"]}
stat ::=
"do" block "end" |
"while" exp "do" block "end" |
"if" exp "then" block {"elseif" exp "then" block} ["else" block] "end" |
"for" Name "=" exp "," exp ["," exp] "do" block "end" |
"for" namelist "in" explist "do" block "end" |
"function" funcname funcbody |
"local" "function" Name funcbody |
"local" namelist ["=" explist] |
"return" [explist] |
"break" |
varlist "=" explist |
Name {tableindex | funccall} funccall
namelist ::= Name {"," Name}
varlist ::= var {"," var}
var ::= Name [{tableindex | funccall} tableindex]
funcname ::= Name {"." Name} [":" Name]
funcbody ::= "(" [parlist] ")" block "end"
parlist ::= Name {"," Name} ["," "..."] | "..."
explist ::= {exp ","} exp
tableconstructor ::= "{" [fieldlist] "}"
fieldlist ::= field {fieldsep field} [fieldsep]
field ::= "[" exp "]" "=" exp | Name "=" exp | exp
fieldsep ::= "," | ";"
exp ::= mainexp | exp binop exp
mainexp ::= nil | false | true | Number | String |
"..." | function | tableconstructor |
prefixexp |
unop exp|
function ::= "function" funcbody
prefixexp ::= (Name | "(" exp ")") {tableindex | funccall}
tableindex ::= "[" exp "]" | "." Name
funccall ::= args | ":" Name args
args ::= "(" [explist] ")" | tableconstructor
binop ::= "+" | "-" | "*" | "/" | "^" | "%" | ".." |
"<" | "<=" | ">" | ">=" | "==" | "~=" |
"and" | "or"
unop ::= "-" | "not" | "#"
这个语法定义经过整理,便于手写解析器。
递归下降解析
递归下降分析法用于处理常规语句。上面的BNF,经过特殊设计,读取至多两个前缀单词就可以确定选什么解析函数,可以算是LL(2)
文法了。
代码实现比较无脑,差不多是一个BNF语句对应一个解析函数和一个语法树结点类型。读取前缀单词,看看应该选择什么解析函数。
以stat为例伪代码
Block ParseBlock()
{
var block = new Block();
for (; ; )
{
SyntaxTree statement = null;
var token_ahead = LookAhead();
switch (token_ahead.m_type)
{
case (int)';':
NextToken(); continue;
case (int)TokenType.DO:
statement = ParseDoStatement(); break;
case (int)TokenType.WHILE:
statement = ParseWhileStatement(); break;
case (int)TokenType.IF:
statement = ParseIfStatement(); break;
case (int)TokenType.FOR:
statement = ParseForStatement(); break;
case (int)TokenType.FOREACH:
statement = ParseForEachStatement(); break;
case (int)TokenType.FUNCTION:
statement = ParseFunctionStatement(); break;
case (int)TokenType.LOCAL:
statement = ParseLocalStatement(); break;
case (int)TokenType.RETURN:
statement = ParseReturnStatement(); break;
case (int)TokenType.BREAK:
statement = ParseBreakStatement(); break;
case (int)TokenType.CONTINUE:
statement = ParseContinueStatement(); break;
default:
statement = ParseOtherStatement();
break;
}
if (statement == null)
break;
block.statements.Add(statement);
}
return block;
}
一个特殊的地方,最后两个stat,赋值语句和函数调用语句不容易做区分。
他们的开头结构类似,都是Name {tableindex | funccall}
,区别是函数调用语句的结尾是funccall
,赋值语句不是。
也就是需要比较后缀结构,但是结构是一样的,用同一个函数解析,检查结果类型就行了。
bool IsVar(SyntaxTree t)
{
return t is TableAccess || t is Terminator;
}
SyntaxTree ParseOtherStatement()
{
// lua做了限制,其他语句只有两种,assign statement and func call
SyntaxTree exp;
if (LookAhead().m_type == (int)TokenType.NAME)
{
exp = ParsePrefixExp();
if(IsVar(exp))
{
// assign statement
var assign_statement = new AssignStatement();
assign_statement.var_list.Add(exp);
while(LookAhead().m_type != (int)'=')
{
if (NextToken().m_type != (int)',')
throw new ParserException("expect ',' to split var-list");
if (LookAhead().m_type != (int)TokenType.NAME)
throw new ParserException("expect 'id' to start var");
exp = ParsePrefixExp();
if (!IsVar(exp))
throw new ParserException("expect var here");
assign_statement.var_list.Add(exp);
}
NextToken();// skip '='
assign_statement.exp_list = ParseExpList();
return assign_statement;
}
else
{
Debug.Assert(exp is FuncCall);
return exp;
}
}
else
{
if (IsMainExp())
throw new ParserException("incomplete statement");
return null;
}
}
算符优先分析法
上面的BNF有个地方是递归下降方法解析不了的,文法定义里存在递归结构,基本所有的语言都有这类表达式语句。
exp ::= mainexp | exp binop exp
这儿做了相当的简化,mainexp
包含了"(" exp ")" | unop exp
,简化后表达式就是最简单的二元表达式了。
二元表达式是最基本的表达式了,算符优先分析法可以很好的解析这种语法结构。
文字简述过程:
设置两个栈,操作数栈和操作符栈。
扫描表达式,遇到操作数直接入栈,遇到操作符时,做下优先级判断确定下一步操作。
如果操作符优先级高于栈顶操作符,则入栈;否则,弹出栈顶操作符并从操作数栈里弹出两个操作数,组合成二元表达式,当成新的操作数入栈,并继续这个过程,直到当前操作符入栈。
写代码时有个特别的技巧,可以用函数调用栈代替操作数栈和操作符栈。
SyntaxTree ParseExp(int left_priority = 0)
{
var exp = ParseMainExp();
while(true)
{
int right_priority = GetOpPriority(LookAhead());
if (left_priority < right_priority ||(left_priority == right_priority && IsRightAssociation(LookAhead())))
{
// C++的函数参数执行顺序没有明确定义,方便起见,不在函数参数里搞出两个有依赖的函数调用,方便往C++里迁移
var op = NextToken();
exp = new BinaryExpression(exp, op, ParseExp(right_priority));
}
else
{
return exp;
}
}
}