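[Common] An Overview of Compiler Principles

Acting on a whim, round two~ This time I followed the Stanford course: https://courses.edx.org/courses/course-v1:StanfordOnline+SOE.YCSCS1+1T2020/course/
(Once again, I strongly recommend CS50 as an introduction to computer science, even though it is Harvard's.)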

Intro

※ Interpreters and Compilers

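First, the difference between the two. An interpreter takes the program and the data together and produces the output directly, so it works online: execution happens as the input arrives. A compiler takes only the program and produces an executable; you then feed data to that executable to get the output. The executable is a static artifact, so compilation is an offline step rather than part of the run.

[Figure: interpreters and compilers]

An interpreter is therefore roughly a box that translates and runs in one go, which tends to make it slower: even when the program is identical to the previous run, the translation work is repeated. In exchange, the same program can run on any platform that has the interpreter, whereas a compiler has to emit a different executable for each target platform.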

※ The phases of a compiler:

lexical analysis
parsing
semantic analysis
optimization
code generation

※ Step 1: recognize words

You can read 'this is a sentence' at a glance, but 'thi sis ase ntence' feels awkward because the word boundaries are in the wrong places.

The goal of lexical analysis, then, is to divide the
program text into its words, or what we call in compiler speak, the tokens.

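For example, in if x == y then z = 1; else z = 2; the lexer has to pick out the variable names x, y and z; the keywords if, then and else; the constants 1 and 2; the operators == and =; and the separators such as whitespace and the semicolon.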

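※ Step 2: parsing (diagramming the sentence as a tree)

First, an example of how an English sentence is organized into a subject, a verb, an object and so on:

[Figure: parse of an English sentence]

In the same way, a programming construct such as if-else can be broken down into a tree, and that is the parse:

[Figure: parse of an if-else statement]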

※ Step 3: semantic analysis

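Semantic analysis mostly looks for inconsistencies in the program itself.

For example: Jack said Jerry left his assignment at home. The "his" in this sentence is ambiguous: it could refer to Jack or to Jerry.

A programming example: semantic analysis has to work out variable bindings, scopes and the like: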

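#include <iostream>

int main() {
  int jack = 4;        // outer binding
  {
    int jack = 3;      // inner binding shadows the outer one
    std::cout << jack; // prints 3: the innermost definition wins
  }
}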

This phase also has to detect a number of errors~

※ Step 4: optimization

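The goal of optimization is to make the program run faster and use less memory. For example, x = y * 0 can be simplified to x = 0. (That holds for integers; for floating point it is not safe in general, because NaN * 0 is NaN.)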

※ Step 5: code generation

Usually the output is assembly code, but it can also be some other language~

※ Programming Languages

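Q1: Why are there so many programming languages?
Because application domains have conflicting requirements, and no single language can satisfy all of them.
For example, scientific computing and business applications need different things: the scientific side wants strong numerical and floating-point support, while the business side cares more about data analysis, report generation and the like.

Q2: Why do new languages keep appearing?
During the information revolution, the needs of all those application domains gave rise to many programming languages.
Technology keeps changing, but an established language becomes more and more stable and resistant to change, so new languages appear alongside. The main cost of a new language is training its users. That is also why new languages tend to resemble old ones in places: it makes them easier to pick up. Java, for example, looks a lot like C++.

Q3: What makes a good programming language?
There is no universally accepted standard for language design. Whether a language becomes widely used is not decided by technical merit alone; it also depends on things like how complete its supporting tools are and which application domains it serves.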

Lexical Analysis

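This step splits the input at its dividers and identifies the role of each substring (lexeme), much like identifying the subject, verb and object in an English sentence.

These roles are the token classes, for example identifier / keyword / whitespace / number / ( / ) / ...

Lexical analysis outputs a sequence of tokens to the parser, where each token is a <token class, lexeme> pair:

[Figure: tokens]

The next example comes from Fortran, where whitespace is insignificant: la la and lala are the same variable name.

Here a single punctuation mark changes the meaning of the whole statement. With a comma (DO 5 I = 1,25) it is a do loop in which i runs from 1 to 25; with a period (DO 5 I = 1.25) it is an assignment with no loop at all, which simply stores 1.25 in the variable do5i:

[Figure: an example]

How do we cope with this? When the lexer has seen DO, it cannot yet tell whether this starts a do loop or is part of a variable name, so it needs lookahead.

So one requirement of lexical analysis is the ability to look ahead, as in this example where it must find out whether a comma or some other symbol follows. One goal of lexical analysis is also to keep the amount of lookahead as small as possible: the nearer it has to look, the better.

Another lookahead example: to avoid reading == as =, the lexer has to look one more character ahead.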

※ regular languages

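Lexical analysis uses regular expressions to recognize tokens~ To begin with, we can write regular expressions for numbers, identifiers, whitespace (blank, newline, tab) and keywords~

The procedure is roughly: starting at position 0, match prefixes of the input (positions 0 through n) against the union of identifier / number / keyword / whitespace; once something matches, remove it from the front of the input and start matching again on the rest.

What if positions 0 to 1 match = but positions 0 to 2 match ==?
Take the longer match.
What if positions 0 to n match both keyword and identifier?
Take keyword, which has the higher priority.
What if nothing matches at all; how is the error handled?
Add a regular expression for errors that covers everything the other classes do not, and give it the lowest priority. Input that no other class matches ends up in this class.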
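The three rules above (longest match, priority on ties, and a lowest-priority error class) can be sketched in a few lines of Python. This is only an illustration: the token class names, the patterns and the `tokenize` function are assumptions made for this example and do not come from the course.

```python
import re

# Token classes in priority order (earlier = higher priority).
# Names and patterns are illustrative assumptions, not a real language spec.
TOKEN_SPEC = [
    ("KEYWORD",    re.compile(r"if|then|else")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER",     re.compile(r"[0-9]+")),
    ("OPERATOR",   re.compile(r"==|=")),
    ("SEMICOLON",  re.compile(r";")),
    ("WHITESPACE", re.compile(r"[ \t\n]+")),
    ("ERROR",      re.compile(r".")),  # lowest priority: one unmatched character
]

def tokenize(text):
    """Return (token class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_lexeme = None, ""
        for kind, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            # A strictly longer match wins (maximal munch); on a tie the
            # earlier, higher-priority class is kept.
            if match and len(match.group()) > len(best_lexeme):
                best_kind, best_lexeme = kind, match.group()
        if best_kind != "WHITESPACE":
            tokens.append((best_kind, best_lexeme))
        pos += len(best_lexeme)
    return tokens

if __name__ == "__main__":
    for token in tokenize("if x == y then z = 1; else z = 2;"):
        print(token)
```

Note how "ifx" comes out as one identifier rather than the keyword if followed by x, because the longer match wins before priority is even considered.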

※ Finite Automata

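A finite automaton is an implementation of a regular expression. The idea is that reading an input character in state 1 moves the machine to state 2: S1 ---input---> S2.

Read a string one character at a time, making one transition per character; if the machine is in an accepting state when the input ends, the string is accepted~

[Figure: automaton diagram]

An example:

[Figure: recognizing 1...0]

Sadly, I had written up NFA/DFA, the conversion between them, and the corresponding state transition tables (using the table to process the input and so implement the regular expression) right here, and Jianshu swallowed all of it... (so annoying!)

The usual pipeline is: regular expression to NFA, then NFA to DFA, then DFA to table.

[Figure: the conversion pipeline]

Deterministic Finite Automaton (DFA): each state has at most one transition per input symbol and there are no ε-moves, so the next state is always determined and matching is fast.
Nondeterministic Finite Automaton (NFA): a state can have several transitions on the same input symbol, as well as ε-moves, so the same input can lead to several possible states. Matching is slower, but the automaton and its table take less memory.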

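※ Regular expression to NFA

Converting a regular expression to an NFA follows a fixed set of recipes. A regular expression is built from only a few operations (ε, single characters, concatenation, union and Kleene star), so you just apply the matching construction for each one:

[Figure: from RegExp to NFA]

[Figure: an example]

※ NFA to DFA

Next comes NFA to DFA. The key concept is the epsilon closure: the set of all states reachable from a given state using only ε-moves. Starting from such a set, reading one input character leads to another set of states (again closed under ε-moves), and each of these sets becomes a single DFA state:

[Figure: an example]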
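Since the original worked example is an image, here is a small sketch of the same idea (often called the subset construction). The NFA encoding (a dict from `(state, symbol)` to a set of states, with `None` standing for ε) and the tiny `EXAMPLE_NFA` are assumptions made up for this illustration.

```python
def epsilon_closure(states, nfa):
    """All states reachable from `states` using only epsilon (None) moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in nfa.get((state, None), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)

def nfa_to_dfa(nfa, start, alphabet):
    """Return (DFA start state, DFA transitions); DFA states are frozensets."""
    start_set = epsilon_closure({start}, nfa)
    transitions = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for symbol in alphabet:
            moved = set()
            for state in current:
                moved |= set(nfa.get((state, symbol), ()))
            if not moved:
                continue  # no transition on this symbol
            target = epsilon_closure(moved, nfa)
            transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return start_set, transitions

# Hypothetical NFA for a*b: state 0 is the start, state 3 is accepting.
EXAMPLE_NFA = {
    (0, None): {1, 2},
    (1, "a"): {0},
    (2, "b"): {3},
}

if __name__ == "__main__":
    start, dfa = nfa_to_dfa(EXAMPLE_NFA, 0, "ab")
    for (source, symbol), target in dfa.items():
        print(sorted(source), symbol, sorted(target))
```

In this example the three NFA states 0, 1 and 2 collapse into a single DFA state, because 1 and 2 are both reachable from 0 by ε-moves.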

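※ From DFA to implementation

A DFA is deterministic: reading a character in a given state always moves to one fixed new state. So we can build a table indexed by (current state, input character) that gives the next state, and write the code to follow the table. It is also possible to go straight from the NFA to a transition table~

[Figure: DFA implementation]

So the DFA is faster while the NFA is more compact~ Either way the core idea is the same: the input read so far determines the current state, and the next input character drives the next transition.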
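A table-driven DFA is short enough to show in full. The sketch below assumes the automaton from the earlier "1...0" example accepts any number of 1s followed by a single 0; the state names `A` and `B` and the function `accepts` are made up for this illustration.

```python
# Transition table: (current state, input character) -> next state.
# A missing entry means the DFA gets stuck and the input is rejected.
TRANSITIONS = {
    ("A", "1"): "A",  # stay in A while reading 1s
    ("A", "0"): "B",  # a single 0 moves to the accepting state
}
START_STATE = "A"
ACCEPTING_STATES = {"B"}

def accepts(text):
    """Run the DFA over `text` and report whether it ends in an accepting state."""
    state = START_STATE
    for character in text:
        state = TRANSITIONS.get((state, character))
        if state is None:
            return False
    return state in ACCEPTING_STATES

if __name__ == "__main__":
    for sample in ["1110", "0", "101", ""]:
        print(repr(sample), accepts(sample))
```

The loop does one dictionary lookup per input character, which is why the DFA form is the fast one.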

Parsing

We need some way of describing the valid strings of tokens and
then some kind of algorithm for distinguishing the valid and invalid
strings of tokens from each other.

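※ Context Free Grammars

For reference: https://www.jianshu.com/p/e1d47de41331

[Figure: components of a CFG]

A CFG consists of a set of non-terminals, a non-empty finite set of terminals, a start symbol (a non-terminal), and a set of productions. The productions play the role of grammar rules: a CFG is used to break a sentence down into the parts the grammar prescribes, which tells us whether the sentence is grammatical.

[Figure: example]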

※ Derivation

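[Figure: derivation]

A derivation can be drawn as a tree that shows how the CFG breaks the input down~ The leaves are all terminals, the interior nodes are all non-terminals, and an in-order traversal of the leaves gives back the input expression~

Left-most and right-most derivations differ only in which non-terminal gets replaced at each step (the left-most one or the right-most one); the resulting parse tree is the same.

[Figure: left-most derivation example]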

※ Ambiguity

A grammar is ambiguous if it has more than one Parse tree for some string.

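One way to remove ambiguity is to rewrite the grammar so that every string has a unique parse tree. The other is to add precedence and associativity declarations, which is the more common approach in practice. For if-else, the intended reading is that each else matches the closest unmatched if.

[Figure: removing the if-else ambiguity]

[Figure: removing ambiguity with left associativity]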

※ Error Handling

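For example, panic mode: when an error is detected, keep discarding input tokens until one with a clear role (a synchronizing token) is found, then continue parsing from there. Another option is error productions: add productions to the grammar that recognize common mistakes.

※ Abstract Syntax Trees

An abstract syntax tree is a tree representation of the abstract syntactic structure of the source code, in which each node stands for a construct in the source. It is called abstract because it does not record every detail of the concrete syntax. Nested parentheses, for instance, are implied by the shape of the tree rather than appearing as nodes.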

※ Recursive Descent Parsing

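Try the productions one by one; when a choice turns out to be wrong, backtrack and try the next one, so the parse works by exhausting the possibilities.

An example: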
E → E' | E' + E
E' → -E' | id | (E)

E
E'
-E'
id
(E)
E' + E
-E' + E
id + E
id + E'
id + -E'
id + id

※ Recursive Descent Algorithm

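[Figure: recursive descent implementation]

To start the parser, we initialize the next pointer to point to the first token in the input stream and invoke the function that matches anything derivable from the start symbol.

[Figure: example]

Note, though: if the input is int * int, then T() returns as soon as its first alternative (int) matches the first int, even though it is T's second alternative that actually matches the whole input. So there is a backtracking problem here.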
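The limitation is easy to reproduce. The sketch below assumes the grammar behind the lost figure is E → T | T + E and T → int | int * T | ( E ), which matches the description above; the class and method names are my own. Each non-terminal tries its alternatives in order and commits to the first one that succeeds, so there is no way to come back and try a later alternative.

```python
class Parser:
    """Recursive descent with limited backtracking (first success wins)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.next = 0  # index of the next unread token

    def term(self, expected):
        # Compare the next token with `expected`, then advance regardless.
        ok = self.next < len(self.tokens) and self.tokens[self.next] == expected
        self.next += 1
        return ok

    def _first_success(self, alternatives):
        save = self.next
        for alternative in alternatives:
            self.next = save  # rewind before trying each alternative
            if alternative():
                return True   # commit: later alternatives are never tried
        return False

    def E(self):  # E -> T | T + E
        return self._first_success([
            lambda: self.T(),
            lambda: self.T() and self.term("+") and self.E(),
        ])

    def T(self):  # T -> int | int * T | ( E )
        return self._first_success([
            lambda: self.term("int"),
            lambda: self.term("int") and self.term("*") and self.T(),
            lambda: self.term("(") and self.E() and self.term(")"),
        ])

def parse(tokens):
    """Accept only if E matches and the whole input was consumed."""
    parser = Parser(tokens)
    return parser.E() and parser.next == len(tokens)

if __name__ == "__main__":
    print(parse(["(", "int", ")"]))    # accepted
    print(parse(["int", "*", "int"]))  # rejected, although it is in the language
```

The second call fails because T commits to its first alternative after matching int, E then reports success, and the leftover * int is never consumed.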

※ Left Recursion

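With a left-recursive production, recursive descent loops forever: the left-most symbol on the right-hand side is the same S as on the left-hand side (S → S α), so S() keeps calling itself without consuming any input.

[Figure: left recursion]

The fix is to rewrite the grammar using right-recursive productions:

[Figure: from left recursion to right recursion]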

Predictive Parsing

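This part is about predicting which production to use without ever guessing wrong 0.0

It mostly relies on lookahead, i.e. LL(k) grammars: Left-to-right scan, Leftmost derivation, k tokens of lookahead. In other words, read from left to right and peek at the next k tokens.

[Figure: a grammar for which int * int is hard to predict]

The problem above is that T has two productions that both start with int, so the parser faces a choice. The rewrite below avoids this by left-factoring (much like combining like terms):

[Figure: rewritten so that there is only one choice]

LL(1) TABLE

The parser uses the table together with a stack that records what remains to be parsed. If the top of the stack is a terminal, it is matched against the next input token and popped; if it is a non-terminal, the LL(1) table entry for that non-terminal and the next input token decides which production's right-hand side replaces it:

[Figure: example]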
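Here is a sketch of the stack-and-table loop. It assumes the left-factored grammar is E → T X, X → + E | ε, T → ( E ) | int Y, Y → * T | ε (the original figure is lost, so treat this grammar and the table as my reconstruction); `TABLE`, `NONTERMINALS` and `ll1_parse` are illustrative names.

```python
# LL(1) table: (non-terminal, lookahead token) -> right-hand side to push.
# An empty list is an epsilon production; "$" marks the end of the input.
TABLE = {
    ("E", "int"): ["T", "X"],
    ("E", "("):   ["T", "X"],
    ("X", "+"):   ["+", "E"],
    ("X", ")"):   [],
    ("X", "$"):   [],
    ("T", "int"): ["int", "Y"],
    ("T", "("):   ["(", "E", ")"],
    ("Y", "*"):   ["*", "T"],
    ("Y", "+"):   [],
    ("Y", ")"):   [],
    ("Y", "$"):   [],
}
NONTERMINALS = {"E", "X", "T", "Y"}

def ll1_parse(tokens):
    """Return True if `tokens` is derivable from E, using one token of lookahead."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "E"]  # the top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                return False  # empty table cell: syntax error
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        else:
            if top != lookahead:
                return False  # terminal mismatch
            pos += 1
    return pos == len(tokens)

if __name__ == "__main__":
    print(ll1_parse(["int", "*", "int"]))
    print(ll1_parse(["int", "int"]))
```

Unlike the backtracking parser, this one never rewinds: each table lookup picks exactly one production, so int * int is accepted in a single left-to-right pass.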

※ First set & Follow set

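First set: the set of terminals that can appear first in the strings derived from a symbol's productions.

Take a grammar in which S has two productions: S->AB and S->bC.

Start with the easy case. In S->bC the first symbol on the right-hand side is clearly the terminal b, so this production contributes b.

Next, S->AB. Here the first symbol on the right is A, a non-terminal, so we cannot read off a terminal directly and have to bring in A's productions.

A's productions are A-># and A->b (# stands for the empty string). Substituting them into S->AB gives S->B (strictly S->#B, but the # can be dropped) and S->bB. From S->bB we can see at once that the first terminal is b.

Now look at S->B. B is not a terminal either, so we bring in B's productions (B->aD and B->#), which gives S->aD and S->#.

So first(S)={b, a, #}

In short, the First set is the union of all leading terminals and the First sets of the leading non-terminals~ If a leading non-terminal can derive the empty string, also add the First set of the symbol that follows it, and so on.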
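The same computation can be run as a fixed-point iteration. The sketch below uses the grammar from the walkthrough; the text never gives productions for C and D, so the ones here (C -> c, D -> d) are placeholders I made up just to keep the grammar complete.

```python
EPSILON = "#"

# Grammar from the walkthrough above. The productions for C and D are
# hypothetical placeholders, since the notes do not define them.
GRAMMAR = {
    "S": [["A", "B"], ["b", "C"]],
    "A": [[EPSILON], ["b"]],
    "B": [["a", "D"], [EPSILON]],
    "C": [["c"]],
    "D": [["d"]],
}

def first_sets(grammar):
    """Compute First sets by repeating the rules until nothing changes."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, productions in grammar.items():
            for rhs in productions:
                symbols = [s for s in rhs if s != EPSILON]
                found = set()
                all_nullable = True
                for symbol in symbols:
                    if symbol not in grammar:  # a terminal ends the scan
                        found.add(symbol)
                        all_nullable = False
                        break
                    found |= first[symbol] - {EPSILON}
                    if EPSILON not in first[symbol]:
                        all_nullable = False
                        break
                if all_nullable:  # every symbol can vanish (or rhs is empty)
                    found.add(EPSILON)
                if not found <= first[nonterminal]:
                    first[nonterminal] |= found
                    changed = True
    return first

if __name__ == "__main__":
    for nonterminal, terminals in first_sets(GRAMMAR).items():
        print(nonterminal, sorted(terminals))
```

For S this gives b, a and #, which agrees with the hand derivation above.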

Follow set: the set of terminals that can appear immediately after the symbol in some derivation.

[Figure: follow set]

※ LL1 Parsing Tables

Our goal is to construct a parsing table T for
a context free grammar G.

[Figure: construction rules]

For many grammars the table cannot be built with a single choice in every move (every cell), which means the grammar is not LL(1).

If any entry is multiply defined in the parsing table, then the grammar is not LL(1). In fact, this is the definition of an LL(1) grammar, so the mechanical way to check that a grammar is LL(1) is to build the LL(1) parsing table and see whether every entry in the table is unique.

※ Bottom-Up Parsing

Bottom up parsing is more general
than deterministic top down parsing.

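[Figure: bottom-up parsing]

A bottom-up parser traces a rightmost derivation in reverse. Read in the forward direction, every step of that derivation expands the right-most non-terminal.

※ Handles

The left-most simple phrase of a sentential form is called its handle. A handle is a substring that matches the right-hand side of some production, and reducing it to the non-terminal on that production's left-hand side is one step of the rightmost derivation in reverse.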

Semantic Analysis

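[Figure: front-end parsing]

Some errors are not context free, i.e. they depend on context, so the grammar cannot catch them. That is why the semantic analysis phase exists, to perform checks like these:

[Figure: checklist]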

※ Scope

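Some languages use static scope and some use dynamic scope. A typical scoping issue is a class being used before the point where it is defined.

※ Symbol Tables

Push an entry onto a stack each time a variable is declared, look names up by finding the nearest (most recent) entry on the stack, and pop entries when their scope ends. That is enough to check whether a variable has been defined.

[Figure: tracking scope with a table]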
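A minimal sketch of such a table, using a stack of per-scope dictionaries. The class and its method names (`enter_scope`, `exit_scope`, `add_symbol`, `find_symbol`) are chosen for this illustration and are not claimed to be the course's exact interface.

```python
class SymbolTable:
    """A stack of scopes; the innermost scope is the last list element."""

    def __init__(self):
        self.scopes = [{}]  # start with one global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def add_symbol(self, name, info):
        self.scopes[-1][name] = info

    def find_symbol(self, name):
        # Search from the innermost scope outwards; the nearest definition wins.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None  # not defined in any enclosing scope

if __name__ == "__main__":
    table = SymbolTable()
    table.add_symbol("jack", "outer int")
    table.enter_scope()
    table.add_symbol("jack", "inner int")
    print(table.find_symbol("jack"))  # the inner definition shadows the outer
    table.exit_scope()
    print(table.find_symbol("jack"))  # back to the outer definition
```

This mirrors the earlier jack example: inside the inner block the lookup finds the inner jack, and after the block ends the outer one is visible again.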

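This does not work for checking whether a class has been defined. That needs one pass over the program first to collect all the class definitions, and then a second pass to do the checking.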

※ Types

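[Figure: types]

[Figure: type checking in different languages]

Type checking works like this: if e1 is an int and e2 is also an int, then e1+e2 should be an int as well.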

※ Type Environments

So what is a free variable? A variable is free in an expression if it is not defined within that expression.

The type environment encodes this information: a type environment is a function from object identifiers (variable names) to types.
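To make this concrete, here is a toy checker for the e1 + e2 rule, with the type environment passed in as a plain dictionary. The tuple encoding of expressions and the function name `type_of` are assumptions made for this sketch.

```python
def type_of(expr, env):
    """Return the static type of `expr`; `env` maps variable names to types.

    Expressions are tuples: ("int", 3), ("var", "x"), ("add", e1, e2).
    """
    kind = expr[0]
    if kind == "int":
        return "Int"
    if kind == "var":
        name = expr[1]
        if name not in env:
            # A free variable with no entry in the environment cannot be typed.
            raise TypeError(f"undeclared variable: {name}")
        return env[name]
    if kind == "add":
        left = type_of(expr[1], env)
        right = type_of(expr[2], env)
        if left == "Int" and right == "Int":
            return "Int"  # e1 : Int and e2 : Int implies e1 + e2 : Int
        raise TypeError(f"cannot add {left} and {right}")
    raise ValueError(f"unknown expression kind: {kind}")

if __name__ == "__main__":
    expression = ("add", ("int", 1), ("var", "x"))
    print(type_of(expression, {"x": "Int"}))
```

The expression 1 + x has no type on its own because x is free in it; it only type-checks once the environment says what x is.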

※ Implementing Type Checking

[Figure: implementation]

※ Static vs. Dynamic Typing

The static type of a variable will be its given type. The dynamic type of that variable will depend on what is assigned to it during program execution.

※ Self Type

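SELF_TYPE stands for the actual type of the self object at runtime. If a method hard-codes the parent class as its return type, its result cannot be assigned to a variable of a subclass type; if it is declared to return SELF_TYPE instead, it can~

But note that SELF_TYPE is itself a static type, not a dynamic type~~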

The best way to think of an occurrence of self-type is that it's a type variable that ranges over all the sub-classes of the class in which it appears.

※ Error Recovery

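One approach: an identifier with no declared type is treated as having type Object, and checking continues:

[Figure: treating undeclared identifiers as Object]

But this causes a cascade of errors. For instance, if x is assumed to be Object, then x+2 is an illegal operation; x+2 is in turn given type Object, and then assigning it to y, which is an int, is yet another error.

The other approach is to introduce No_Type, which is a subtype of every type:

[Figure: No_Type]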

Runtime Organization

The main thing we're going to cover in this sequence of videos is the management of runtime resources, and in particular I'm going to be stressing the correspondence and the distinction between static and dynamic structures. Static structures are things that exist at compile time; dynamic structures are the things that exist or happen at runtime.

Memory is drawn with low addresses at the top and high addresses at the bottom, like this:

[Figure: code gen]

※ Activations

An activation is one invocation of a procedure~

The activation tree depends on the runtime behavior of the program: the runtime values determine exactly which procedures are called and what the activation tree turns out to be.

Now, this was not illustrated in our examples but it
should be obvious that the activation tree can be different for different inputs.

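When a procedure is called, its activation is pushed onto a stack, and when it finishes and returns, it is popped. In memory, the activation stack that records this chain of calls sits below the code area.

[Figure: memory]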

※ Activation Records

An activation record is all the information that's needed to manage the execution of one procedure activation. Often this is also called a frame, which means exactly the same thing as activation record; these are just two names for the same thing.

[Figure: stack frame]

※ Globals and Heap

Globals are valid everywhere, so they cannot be stored in an activation record. The way global variables are implemented is that every global is assigned a fixed address once, and variables with fixed addresses are said to be statically allocated because they are allocated essentially at compile time.

The memory layout with static data added:

[Figure: layout]

Now many language implementations use both the heap and the stack, and there is a bit of an issue here because both the heap and the stack grow. So we have to take care that they don't grow into each other and step on each other's data. There is a very nice and simple solution to this: start the heap and the stack at opposite ends of memory and let them grow towards each other.

[Figure: memory]

※ Alignment

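32-bit and 64-bit machines have 4-byte and 8-byte words respectively, and data is word aligned when it begins at a word boundary. If a piece of data does not fill a whole number of words, padding is added after it so that the next item starts on a word boundary:

[Figure: memory alignment]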

※ Stack Machines

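For 7+5, for example, 7 and 5 are pushed onto the stack, then both are popped and added, and the result 12 is pushed back:

[Figure: stack]

For op = e1 + e2 + ... + en with an accumulator: evaluate e1 and push its result, and do the same for each following operand; en is only evaluated into the accumulator and is not pushed. Then the n-1 earlier results are popped and added in, leaving the sum in the accumulator, which is the register that holds the value just computed~

[Figure: adding several operands]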
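The evaluation order for e1 + ... + en can be traced with a short sketch. The expression encoding (an int, or a tuple starting with "+") and the function name `evaluate` are assumptions for this example; the local variable `acc` plays the role of the accumulator.

```python
def evaluate(expr, stack):
    """Evaluate `expr` on a stack machine with one accumulator.

    `expr` is an int or a tuple ("+", e1, ..., en). The returned value is
    what ends up in the accumulator; `stack` is left as it was found.
    """
    if isinstance(expr, int):
        return expr  # a constant is loaded straight into the accumulator
    operator, *operands = expr
    if operator != "+":
        raise ValueError(f"unsupported operator: {operator}")
    for operand in operands[:-1]:
        acc = evaluate(operand, stack)
        stack.append(acc)  # save e1 .. e(n-1) on the stack
    acc = evaluate(operands[-1], stack)  # en stays in the accumulator
    for _ in operands[:-1]:
        acc += stack.pop()  # pop the n-1 saved results and add them in
    return acc

if __name__ == "__main__":
    stack = []
    print(evaluate(("+", 7, 5), stack))
    print(evaluate(("+", 3, ("+", 7, 5)), stack))
    print(stack)  # empty again: every push was matched by a pop
```

A useful property to notice is that the stack has the same contents after an expression is evaluated as before it, which is what lets sub-expressions nest safely.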

Code Generation

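MIPS (short for Microprocessor without Interlocked Pipelined Stages, and also a pun on Millions of Instructions Per Second) is a reduced instruction set (RISC) processor architecture. The earliest MIPS architecture was 32-bit, and later versions are 64-bit.

Code generation here really amounts to implementing the stack machine in MIPS. Register $a0 serves as the accumulator, and $sp is the stack pointer, which holds the address of the next free location on the stack:

[Figure: implementation]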
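As a rough sketch of what the generator emits, the function below produces MIPS text for integer constants and binary +, assuming the convention that $a0 is the accumulator, $sp points at the next free stack slot (so the top of the stack is at 4($sp)), and $t1 is a scratch register. The Python function `cgen` and the tuple encoding of expressions are illustrative choices.

```python
def cgen(expr):
    """Return a list of MIPS instructions that leave the value of `expr` in $a0."""
    if isinstance(expr, int):
        return [f"li $a0, {expr}"]  # load the constant into the accumulator
    operator, left, right = expr
    if operator != "+":
        raise ValueError(f"unsupported operator: {operator}")
    return (
        cgen(left)
        + ["sw $a0, 0($sp)",      # push the left result
           "addiu $sp, $sp, -4"]
        + cgen(right)             # the right result is now in $a0
        + ["lw $t1, 4($sp)",      # reload the saved left result
           "add $a0, $t1, $a0",   # accumulator = left + right
           "addiu $sp, $sp, 4"]   # pop
    )

if __name__ == "__main__":
    print("\n".join(cgen(("+", 7, 5))))
```

Every push (sw plus addiu -4) is paired with a pop (addiu +4), so the generated code preserves the stack no matter how deeply the additions nest.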

※ Temporaries

The improvement that we're going to make is to have the code generator assign a fixed location in the activation record for each temporary.

We're going to pre-allocate memory, or a spot in the activation record, for each temporary, and then we will be able to save and restore the temporary without having to do the stack pointer manipulations.

If we know in advance how many temporaries are needed, then we can allocate the space for them in the activation record rather than pushing and popping from the stack at runtime. In other words, the pre-allocated temporary area avoids frequent push/pop.

※ Object Layout

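[Figure: objects in Cool]

Now the class tag is an integer which just identifies the class of the object. So the compiler will number all of the classes.

Object size is also an integer, which is just the size of the object in words.

Dispatch pointer is a pointer to a table of methods, so the methods are stored off to the side and the dispatch pointer is a pointer to that table.

All of this is laid out in a contiguous chunk of memory.

The object's attributes come right after the dispatch pointer~

Q: Why are attributes embedded directly in the object, while the method table is reached through a pointer?
A: Because even if 100 objects have an attribute with the same name, they cannot share one storage location: each object's attribute values are independent. The method table, on the other hand, is identical for all 100 objects of a class, so they can all point to one shared table. Methods carry no per-object data, so sharing them is safe and saves space.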

Local Optimization & Global Optimization

※ Intermediate Code

Intermediate Language is just that, it's a language that's intermediate between the source language and the target language

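Probably something like bitcode~ Generating intermediate code is very similar to generating the assembly target code~

The main difference between generating assembly code and generating intermediate code is that we can use any number of registers in the Intermediate Language to hold intermediate results.

Advantages of an intermediate language:

It is independent of any particular machine, so one intermediate language can serve code generation for many different target machines.
Machine-independent optimizations can be applied to it, which helps improve the quality of the target code.
Mapping the source program to intermediate code and then to target code splits the work into stages, which makes the compilation algorithms clearer.

An intermediate language should be not only machine independent but also convenient for code generation.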

※ Optimization Overview

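[Figure: at which language level should optimization be applied]

An optimization example: t = 2 * x; s = t + x can be simplified to s = 3 * x, provided t is not used anywhere else~

[Figure: what to optimize]

[Figure: three kinds of optimization]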

※ Local Optimization

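This is mainly algebraic simplification, constant folding and dead code elimination.

For example, x = x + 0 and x = x * 1 can be deleted outright, and x = x * 0 can become x = 0. These are all local optimizations: they apply within a single basic block, with no need to look at the rest of the control flow. x = x * 8 can also be rewritten as x = x << 3.

if 2 < 0 jump P can be removed, because the condition is always false. Similarly, code guarded by if DEBUG then is present in debug builds and gets optimized away in release builds.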
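The rewrites listed above are simple enough to express as a single function over one three-address instruction. The tuple format `(dest, op, a, b)` and the result forms `"copy"` and `"const"` are conventions invented for this sketch, and the rules assume integer arithmetic.

```python
import operator

FOLDABLE = {"+": operator.add, "*": operator.mul}

def simplify(instruction):
    """Apply one local rewrite to a (dest, op, a, b) instruction.

    Operands are variable names (str) or integer constants (int).
    """
    dest, op, a, b = instruction
    if op in FOLDABLE and isinstance(a, int) and isinstance(b, int):
        return (dest, "const", FOLDABLE[op](a, b), None)  # constant folding
    if op == "+" and b == 0:
        return (dest, "copy", a, None)                    # x + 0 is x
    if op == "*" and b == 1:
        return (dest, "copy", a, None)                    # x * 1 is x
    if op == "*" and b == 0:
        return (dest, "const", 0, None)                   # x * 0 is 0 (integers)
    if op == "*" and isinstance(b, int) and b > 1 and (b & (b - 1)) == 0:
        return (dest, "<<", a, b.bit_length() - 1)        # power of two: shift
    return instruction

if __name__ == "__main__":
    print(simplify(("x", "*", "x", 8)))
    print(simplify(("t", "+", 2, 3)))
    print(simplify(("x", "+", "x", 0)))
```

A copy whose destination equals its source, such as the result for x = x + 0, is exactly the kind of instruction a later pass can delete.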

[Figure: optimization example]

[Figure: unused variables can be removed]

Each local optimization actually does very little by itself. Some of these transformations don't make the program run faster at all. They don't make it run slower either, but by themselves they don't make any improvement to the program. Typically, though, the optimizations interact: performing one optimization enables another. Some optimizations look useless on their own but set things up for later ones; optimization proceeds step by step, with each step influencing the next.

※ Peephole Optimization

Peephole optimization is a technique that optimizes the assembly code directly.

[Figure: peephole]

※ Global Optimization

In dataflow terms: if neither branch of an if-else modifies x, then later uses of x can still be replaced by the constant it was originally assigned:

[Figure: replacing a variable with a constant across control flow]

Checking conditions like this is the job of "global dataflow analysis". It is called "global" because it requires an analysis of the entire control-flow graph.

Register Allocation

One problem with intermediate code is that it uses an unlimited number of registers. The solution is a many-to-one mapping from temporaries to registers:

[Figure: many temporaries to one register]

[Figure: an example]

So, if I have two temporaries t1 and t2, I want to know when they can share a register. They are allowed to be in the same register if they are not live at the same time.

※ Graph Coloring

control

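First, the register interference graph (RIG) has an edge between two temporaries whenever they are live at the same time at some point in the program. Only two nodes with no edge between them can be placed in the same register.

Register allocation then becomes a graph coloring problem: we have k colors, one per available register, and must color the graph so that neighbors get different colors:

[Figure: example of a colored graph]

[Figure: the same code using registers]

How to color: first find a node with fewer than k neighbors, push it onto a stack and remove it from the graph, and repeat until every node has been pushed. Then pop the nodes off the stack one at a time and assign each a color, the rule being that it must differ from the colors already assigned to its neighbors:

[Figure: coloring rules]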
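The heuristic translates almost line for line into code. In this sketch the interference graph is a dict from each node to the set of its neighbors (assumed symmetric), colors are the integers 0 to k-1, and the function name `color_graph` is my own. When no node with fewer than k neighbors is left, the sketch simply gives up, which is the point where a real allocator would spill a temporary to memory.

```python
def color_graph(graph, k):
    """Try to k-color `graph`; return {node: color} or None if the heuristic is stuck."""
    remaining = {node: set(neighbors) for node, neighbors in graph.items()}
    stack = []
    while remaining:
        # Pick any node with fewer than k neighbors still in the graph.
        candidate = next(
            (node for node in sorted(remaining) if len(remaining[node]) < k),
            None,
        )
        if candidate is None:
            return None  # a real allocator would choose a node to spill here
        stack.append(candidate)
        for neighbor in remaining.pop(candidate):
            remaining[neighbor].discard(candidate)
    colors = {}
    while stack:
        node = stack.pop()
        used = {colors[n] for n in graph[node] if n in colors}
        # Fewer than k neighbors were present when `node` was removed,
        # so at least one of the k colors is still free.
        colors[node] = next(c for c in range(k) if c not in used)
    return colors

# Hypothetical interference graph: a triangle a-b-c with d attached to c.
EXAMPLE_RIG = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

if __name__ == "__main__":
    print(color_graph(EXAMPLE_RIG, 3))
    print(color_graph(EXAMPLE_RIG, 2))  # the triangle cannot be 2-colored
```

With three registers the example graph colors fine; with two, the triangle a-b-c leaves every node with two neighbors and the heuristic gets stuck.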

※ Managing Caches

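Registers are very fast to access, but they are expensive, so there are few of them. Cache is slower and somewhat larger; main memory is slower again but much larger; and disk offers even more capacity.

[Figure: storage comparison]

Reordering nested loops so that the busiest one ends up innermost is also a compiler optimization~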

※ Automatic Memory Management

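Memory that is no longer used gets freed. Which memory counts as unused? It works much like GC in Java: an object that can no longer be reached by following references has become an unreachable island.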

※ Mark and Sweep

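When memory runs out, the mark and sweep algorithm runs.

Mark: scan starting from the root set and mark every live (reachable) object (each object has one bit reserved for the mark)
Sweep: walk linearly through the heap from start to end, reclaiming the memory of unreachable objects
Drawback: it fragments memory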
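The two phases can be simulated on a toy heap. Here the heap is modeled as a dict from object name to the list of objects it references, and a Python set stands in for the per-object mark bit; both choices, and the function name `mark_and_sweep`, are simplifications made for this sketch.

```python
def mark_and_sweep(heap, roots):
    """Delete unreachable objects from `heap` in place; return the marked set."""
    # Mark phase: trace everything reachable from the roots.
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked:
            continue
        marked.add(obj)
        worklist.extend(heap[obj])
    # Sweep phase: one linear pass over the heap, freeing unmarked objects.
    for obj in list(heap):
        if obj not in marked:
            del heap[obj]
    return marked

if __name__ == "__main__":
    heap = {
        "A": ["B"],
        "B": [],
        "C": ["D"],  # C and D reference each other but nothing reaches them
        "D": ["C"],
    }
    mark_and_sweep(heap, roots=["A"])
    print(sorted(heap))
```

In the example, C and D point at each other but are unreachable from the root, so both are swept.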

※ Stop and Copy

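Copy all live objects from the current heap to another heap; whatever was not copied is garbage.

This approach has two costs:

It needs two heaps and switches back and forth between them, so it uses an extra heap's worth of space.
Once the program reaches a steady state it may produce little or no garbage, yet the live objects still get copied back and forth, which is wasteful.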

※ Reference Counting

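Each object stores the number of pointers that point to it; when the count drops to zero, the object should be reclaimed.

[Figure: reference counting]

The advantages are that it is easy to implement and reclaims memory promptly. The drawbacks are that objects in a reference cycle are never freed, and that updating the counts on every assignment is slow. By comparison, a tracing GC can run in parallel with the program, so it tends to be somewhat more efficient.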

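Forgive me for slacking off here... One last small question: what language are compilers written in?

The first C compiler was presumably written in assembly, while the first mature C compiler was probably written in a mix of assembly and C.

Compiler courses talk about the "bootstrapping compiler". The gist: first use a lower-level language (presumably assembly) to write a C compiler that works but is very inefficient (low-level code is hard to optimize); once that compiler exists, you can use C itself to write a proper one.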