在使用Babel的时间,总是遇到AST(抽象语法树)的问题,觉得有点高深。其实这些东西在大学编译原理这么课程中学到过,但长久不用,早已还给了喜子姐。昨天看Babel文档,看到这个超小编译器的demo,就决定看下源码回顾一下。
编译步骤(此处不讨论完整的编译流程,只针对下面例子编译的步骤进行说明):
- 词法分析(Lexical Analysis):从左到右一个一个的读入源程序,识别一个单词或符号,并进行归类。
- 语法分析(Syntactic Analysis):在词法分析的基础上,将单词序列分解成各类语法短语,如“程序”,“语句”,“表达式”等。称之为中间表示或者抽象语法树。
- 转换(Transformation):在上一步生成的AST上进行修改,将之前的语言翻译为另外一种语言。当操作AST时,我们可以通过添加、删除、替换属性来操作节点,也可以增加或删除节点,甚至可以基于之前的AST创建一个全新的AST。
- 代码生成(Code Generation):一般情况下意味着将AST字符串化为代码。
例子:
将LISP语言的函数调用翻译为C语言的函数调用。
LISP | C | |
---|---|---|
2 + 2 | (add 2 2) | add(2, 2) |
4 - 2 | (subtract 4 2) | subtract(4, 2) |
2 + (4 - 2) | (add 2 (subtract 4 2)) | add(2, subtract(4, 2)) |
(add 2 (subtract 4 2))经过词法分析得到
[
{ type: 'paren', value: '(' },
{ type: 'name', value: 'add' },
{ type: 'number', value: '2' },
{ type: 'paren', value: '(' },
{ type: 'name', value: 'subtract' },
{ type: 'number', value: '4' },
{ type: 'number', value: '2' },
{ type: 'paren', value: ')' },
{ type: 'paren', value: ')' },
]
然后经过语法分析得到AST
{
type: 'Program',
body: [{
type: 'CallExpression',
name: 'add',
params: [{
type: 'NumberLiteral',
value: '2',
}, {
type: 'CallExpression',
name: 'subtract',
params: [{
type: 'NumberLiteral',
value: '4',
}, {
type: 'NumberLiteral',
value: '2',
}]
}]
}]
}
然后进行转换得到
{
type: 'Program',
body: [{
type: 'ExpressionStatement',
expression: {
type: 'CallExpression',
callee: {
type: 'Identifier',
name: 'add'
},
arguments: [{
type: 'NumberLiteral',
value: '2'
}, {
type: 'CallExpression',
callee: {
type: 'Identifier',
name: 'subtract'
},
arguments: [{
type: 'NumberLiteral',
value: '4'
}, {
type: 'NumberLiteral',
value: '2'
}]
}
}
}]
}
最后通过代码生成得到
add(2, subtract(4, 2))
源码分析
- tokenizer(用来词法分析)
function tokenizer(input) {
// 创建一个用来标记代码中当前位置的变量
let current = 0;
// 创建一个tokens数组用来保存识别出的token
let tokens = [];
// We start by creating a `while` loop where we are setting up our `current`
// variable to be incremented as much as we want `inside` the loop.
//
// We do this because we may want to increment `current` many times within a
// single loop because our tokens can be any length.
while (current < input.length) {
// We're also going to store the `current` character in the `input`.
let char = input[current];
// 首先检测开括号
if (char === '(') {
// If we do, we push a new token with the type `paren` and set the value
// to an open parenthesis.
tokens.push({
type: 'paren',
value: '(',
});
// Then we increment `current`
current++;
// And we `continue` onto the next cycle of the loop.
continue;
}
// 检测闭括号,增加一个新的token
// increment `current`, and `continue`.
if (char === ')') {
tokens.push({
type: 'paren',
value: ')',
});
current++;
continue;
}
// 识别空白符号并抛弃
// So here we're just going to test for existence and if it does exist we're
// going to just `continue` on.
let WHITESPACE = /\s/;
if (WHITESPACE.test(char)) {
current++;
continue;
}
// 下一个token类型是number
// So we start this off when we encounter the first number in a sequence.
let NUMBERS = /[0-9]/;
if (NUMBERS.test(char)) {
// We're going to create a `value` string that we are going to push
// characters to.
let value = '';
// Then we're going to loop through each character in the sequence until
// we encounter a character that is not a number, pushing each character
// that is a number to our `value` and incrementing `current` as we go.
while (NUMBERS.test(char)) {
value += char;
char = input[++current];
}
// After that we push our `number` token to the `tokens` array.
tokens.push({ type: 'number', value });
// And we continue on.
continue;
}
// 增加对双引号包围字符串的支持
// We'll start by checking for the opening quote:
if (char === '"') {
// Keep a `value` variable for building up our string token.
let value = '';
// We'll skip the opening double quote in our token.
char = input[++current];
// Then we'll iterate through each character until we reach another
// double quote.
while (char !== '"') {
value += char;
char = input[++current];
}
// Skip the closing double quote.
char = input[++current];
// And add our `string` token to the `tokens` array.
tokens.push({ type: 'string', value });
continue;
}
// 最后一种token是 `name` token. 这是一个字母的序列,在我们对的lisp语法中是函数的名字
let LETTERS = /[a-z]/i;
if (LETTERS.test(char)) {
let value = '';
// Again we're just going to loop through all the letters pushing them to
// a value.
while (LETTERS.test(char)) {
value += char;
char = input[++current];
}
// And pushing that value as a token with the type `name` and continuing.
tokens.push({ type: 'name', value });
continue;
}
// Finally if we have not matched a character by now, we're going to throw
// an error and completely exit.
throw new TypeError('I dont know what this character is: ' + char);
}
return tokens;
}
- parser(进行语法分析生成AST)
function parser(tokens) {
// Again we keep a `current` variable that we will use as a cursor.
let current = 0;
// But this time we're going to use recursion instead of a `while` loop. So we
// define a `walk` function.
function walk() {
// Inside the walk function we start by grabbing the `current` token.
let token = tokens[current];
// We're going to split each type of token off into a different code path,
// starting off with `number` tokens.
//
// We test to see if we have a `number` token.
if (token.type === 'number') {
// If we have one, we'll increment `current`.
current++;
// And we'll return a new AST node called `NumberLiteral` and setting its
// value to the value of our token.
return {
type: 'NumberLiteral',
value: token.value,
};
}
// If we have a string we will do the same as number and create a
// `StringLiteral` node.
if (token.type === 'string') {
current++;
return {
type: 'StringLiteral',
value: token.value,
};
}
// Next we're going to look for CallExpressions. We start this off when we
// encounter an open parenthesis.
if (
token.type === 'paren' &&
token.value === '('
) {
// We'll increment `current` to skip the parenthesis since we don't care
// about it in our AST.
token = tokens[++current];
// We create a base node with the type `CallExpression`, and we're going
// to set the name as the current token's value since the next token after
// the open parenthesis is the name of the function.
let node = {
type: 'CallExpression',
name: token.value,
params: [],
};
// We increment `current` *again* to skip the name token.
token = tokens[++current];
// And now we want to loop through each token that will be the `params` of
// our `CallExpression` until we encounter a closing parenthesis.
//
// Now this is where recursion comes in. Instead of trying to parse a
// potentially infinitely nested set of nodes we're going to rely on
// recursion to resolve things.
//
// To explain this, let's take our Lisp code. You can see that the
// parameters of the `add` are a number and a nested `CallExpression` that
// includes its own numbers.
//
// (add 2 (subtract 4 2))
//
// You'll also notice that in our tokens array we have multiple closing
// parenthesis.
//
// [
// { type: 'paren', value: '(' },
// { type: 'name', value: 'add' },
// { type: 'number', value: '2' },
// { type: 'paren', value: '(' },
// { type: 'name', value: 'subtract' },
// { type: 'number', value: '4' },
// { type: 'number', value: '2' },
// { type: 'paren', value: ')' }, <<< Closing parenthesis
// { type: 'paren', value: ')' }, <<< Closing parenthesis
// ]
//
// We're going to rely on the nested `walk` function to increment our
// `current` variable past any nested `CallExpression`.
// So we create a `while` loop that will continue until it encounters a
// token with a `type` of `'paren'` and a `value` of a closing
// parenthesis.
while (
(token.type !== 'paren') ||
(token.type === 'paren' && token.value !== ')')
) {
// we'll call the `walk` function which will return a `node` and we'll
// push it into our `node.params`.
node.params.push(walk());
token = tokens[current];
}
// Finally we will increment `current` one last time to skip the closing
// parenthesis.
current++;
// And return the node.
return node;
}
// Again, if we haven't recognized the token type by now we're going to
// throw an error.
throw new TypeError(token.type);
}
// Now, we're going to create our AST which will have a root which is a
// `Program` node.
let ast = {
type: 'Program',
body: [],
};
// And we're going to kickstart our `walk` function, pushing nodes to our
// `ast.body` array.
//
// The reason we are doing this inside a loop is because our program can have
// `CallExpression` after one another instead of being nested.
//
// (add 2 2)
// (subtract 4 2)
//
while (current < tokens.length) {
ast.body.push(walk());
}
// At the end of our parser we'll return the AST.
return ast;
}
- traverser(深度遍历指定AST,并让visitor执行相关行为)
function traverser(ast, visitor) {
// A `traverseArray` function that will allow us to iterate over an array and
// call the next function that we will define: `traverseNode`.
function traverseArray(array, parent) {
array.forEach(child => {
traverseNode(child, parent);
});
}
// `traverseNode` will accept a `node` and its `parent` node. So that it can
// pass both to our visitor methods.
function traverseNode(node, parent) {
// We start by testing for the existence of a method on the visitor with a
// matching `type`.
let methods = visitor[node.type];
// If there is an `enter` method for this node type we'll call it with the
// `node` and its `parent`.
if (methods && methods.enter) {
methods.enter(node, parent);
}
// Next we are going to split things up by the current node type.
switch (node.type) {
// We'll start with our top level `Program`. Since Program nodes have a
// property named body that has an array of nodes, we will call
// `traverseArray` to traverse down into them.
//
// (Remember that `traverseArray` will in turn call `traverseNode` so we
// are causing the tree to be traversed recursively)
case 'Program':
traverseArray(node.body, node);
break;
// Next we do the same with `CallExpression` and traverse their `params`.
case 'CallExpression':
traverseArray(node.params, node);
break;
// In the cases of `NumberLiteral` and `StringLiteral` we don't have any
// child nodes to visit, so we'll just break.
case 'NumberLiteral':
case 'StringLiteral':
break;
// And again, if we haven't recognized the node type then we'll throw an
// error.
default:
throw new TypeError(node.type);
}
// If there is an `exit` method for this node type we'll call it with the
// `node` and its `parent`.
if (methods && methods.exit) {
methods.exit(node, parent);
}
}
// Finally we kickstart the traverser by calling `traverseNode` with our ast
// with no `parent` because the top level of the AST doesn't have a parent.
traverseNode(ast, null);
}
- transformer(生成新的语法树)
function transformer(ast) {
// We'll create a `newAst` which like our previous AST will have a program
// node.
let newAst = {
type: 'Program',
body: [],
};
// Next I'm going to cheat a little and create a bit of a hack. We're going to
// use a property named `context` on our parent nodes that we're going to push
// nodes to their parent's `context`. Normally you would have a better
// abstraction than this, but for our purposes this keeps things simple.
//
// Just take note that the context is a reference *from* the old ast *to* the
// new ast.
ast._context = newAst.body;
// We'll start by calling the traverser function with our ast and a visitor.
traverser(ast, {
// The first visitor method accepts any `NumberLiteral`
NumberLiteral: {
// We'll visit them on enter.
enter(node, parent) {
// We'll create a new node also named `NumberLiteral` that we will push to
// the parent context.
parent._context.push({
type: 'NumberLiteral',
value: node.value,
});
},
},
// Next we have `StringLiteral`
StringLiteral: {
enter(node, parent) {
parent._context.push({
type: 'StringLiteral',
value: node.value,
});
},
},
// Next up, `CallExpression`.
CallExpression: {
enter(node, parent) {
// We start creating a new node `CallExpression` with a nested
// `Identifier`.
let expression = {
type: 'CallExpression',
callee: {
type: 'Identifier',
name: node.name,
},
arguments: [],
};
// Next we're going to define a new context on the original
// `CallExpression` node that will reference the `expression`'s arguments
// so that we can push arguments.
node._context = expression.arguments;
// Then we're going to check if the parent node is a `CallExpression`.
// If it is not...
if (parent.type !== 'CallExpression') {
// We're going to wrap our `CallExpression` node with an
// `ExpressionStatement`. We do this because the top level
// `CallExpression` in JavaScript are actually statements.
expression = {
type: 'ExpressionStatement',
expression: expression,
};
}
// Last, we push our (possibly wrapped) `CallExpression` to the `parent`'s
// `context`.
parent._context.push(expression);
},
}
});
// At the end of our transformer function we'll return the new ast that we
// just created.
return newAst;
}
- codeGenerator(递归调用自己将每个节点转成字符串)
function codeGenerator(node) {
// We'll break things down by the `type` of the `node`.
switch (node.type) {
// If we have a `Program` node. We will map through each node in the `body`
// and run them through the code generator and join them with a newline.
case 'Program':
return node.body.map(codeGenerator)
.join('\n');
// For `ExpressionStatement` we'll call the code generator on the nested
// expression and we'll add a semicolon...
case 'ExpressionStatement':
return (
codeGenerator(node.expression) +
';' // << (...because we like to code the *correct* way)
);
// For `CallExpression` we will print the `callee`, add an open
// parenthesis, we'll map through each node in the `arguments` array and run
// them through the code generator, joining them with a comma, and then
// we'll add a closing parenthesis.
case 'CallExpression':
return (
codeGenerator(node.callee) +
'(' +
node.arguments.map(codeGenerator)
.join(', ') +
')'
);
// For `Identifier` we'll just return the `node`'s name.
case 'Identifier':
return node.name;
// For `NumberLiteral` we'll just return the `node`'s value.
case 'NumberLiteral':
return node.value;
// For `StringLiteral` we'll add quotations around the `node`'s value.
case 'StringLiteral':
return '"' + node.value + '"';
// And if we haven't recognized the node, we'll throw an error.
default:
throw new TypeError(node.type);
}
}
- compiler(将以上方法串联起来)
function compiler(input) {
let tokens = tokenizer(input);
let ast = parser(tokens);
let newAst = transformer(ast);
let output = codeGenerator(newAst);
// and simply return the output!
return output;
}