This lecture began by introducing some preliminaries of data flow analysis:
Program Points:
Program points are the points before and after each statement in a program.
Input & Output States:
Each program point is associated with a program state. For a specific statement, we call the state at the program point before it the input state and the state at the point after it the output state. Each execution of an IR (Intermediate Representation) statement transforms an input state into an output state. Where control flows merge, we use the meet operator \(\wedge\) to combine the incoming states.
To abstract further, we can view this transformation as a function. For a statement \(s\) with input state \({\rm IN}[s]\) and output state \({\rm OUT}[s]\), there are two directions of analysis: in forward analysis we inspect the function \({\rm OUT}[s] = f_s({\rm IN}[s])\), and in backward analysis we inspect the function \({\rm IN}[s] = f_s({\rm OUT}[s])\).
We can also generalize these notations to BBs (Basic Blocks) in a CFG (Control Flow Graph). For a block \(B\) consisting of the statements \(s_1, \dots, s_n\) in order, we define \({\rm IN}[B] = {\rm IN}[s_1]\) and \({\rm OUT}[B] = {\rm OUT}[s_n]\).
In forward analysis, \({\rm OUT}[B] = f_B({\rm IN}[B])\), where \(f_B = f_{s_n} \circ \dots \circ f_{s_2} \circ f_{s_1}\). Where control flows meet at the point before \(B\), we also have \({\rm IN}[B] = \bigwedge_{P \to B} {\rm OUT}[P]\). Moreover, backward analysis is simply forward analysis on the reverse of the given graph, so the corresponding notations in backward analysis contexts are naturally symmetric.
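The composition \(f_B = f_{s_n} \circ \dots \circ f_{s_1}\) can be sketched directly in code. The following is an illustrative example only (the transfer functions `f1`, `f2` and the set-of-facts state representation are assumptions, not from the lecture):

```python
from functools import reduce

def compose(fs):
    """Compose per-statement transfer functions in statement order:
    f_B = f_sn ∘ ... ∘ f_s1, i.e. apply f_s1 first."""
    return lambda state: reduce(lambda st, f: f(st), fs, state)

# Toy transfer functions over a set-of-facts state (assumed representation):
f1 = lambda s: s | {"d1"}               # statement 1 generates fact d1
f2 = lambda s: (s - {"d1"}) | {"d2"}    # statement 2 kills d1, generates d2

f_B = compose([f1, f2])
print(f_B(set()))       # {'d2'}: d1 is generated by f1 but killed by f2
print(f_B({"d0"}))      # {'d0', 'd2'}: unrelated facts pass through
```

Note that the statement order matters: `f2(f1(state))`, not the other way around.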
The lecture continued by introducing applications of data flow analysis. Complex issues such as method calls and aliases are deferred to later lectures; here we focus on intra-procedural analysis.
The first application is Reaching Definitions Analysis, which is defined as
A definition d at program point p reaches a point q if there is a path from p to q such that d is not "killed" along the path,
where a definition \(d\) of a variable \(v\) is a statement that assigns a value to \(v\), and \(d\) is "killed" if a new definition of \(v\) occurs along the path.
Reaching definitions analysis can be useful in compiler optimization, or in simple detection of undefined variables. For example, we can introduce a dummy definition for each variable at the entry (beginning node) of the CFG; if a dummy definition reaches a point p where the corresponding variable is used, a bug may occur there.
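The dummy-definition trick can be sketched as follows. This is illustrative only; the variable-to-bit mapping and the helper name are assumptions, not lecture code:

```python
# One dummy definition per variable, placed at the CFG entry.
# Each dummy definition occupies one bit in the reaching-definitions vector.
dummy_defs = {"x": 0b01, "y": 0b10}  # assumed bit index per variable

def may_be_undefined(var: str, reaching_at_use: int) -> bool:
    """If the dummy definition of `var` still reaches this use site,
    `var` may be used before any real definition."""
    return bool(reaching_at_use & dummy_defs[var])

# Suppose at some use point only y's dummy definition still reaches:
print(may_be_undefined("x", 0b10))  # False: x was redefined on every path
print(may_be_undefined("y", 0b10))  # True: y may be used before definition
```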
To achieve a sound analysis of reaching definitions, we should apply safe approximation (over-approximation) when implementing the transfer function.
For a definition D: v = x op y,
the transfer function generates a definition D of variable v and "kills" all the other definitions of v in the program, while leaving the remaining incoming definitions unaffected. Formally, \({\rm OUT}[B] = gen_B \cup ({\rm IN}[B] - kill_B)\), where \({\rm IN}\) and \({\rm OUT}\) refer to the bit-vector encoding of the reaching definitions: a bit is 1 if the corresponding definition reaches and 0 if it is killed (or absent). Compressing states into a bit vector is a common technique in algorithms over states, such as search or dynamic programming.
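With the bit-vector encoding, the set operations become bitwise operations: \(\cup\) is OR and set difference is AND-NOT. A minimal sketch, assuming Python ints as bit vectors and an illustrative numbering of definitions:

```python
def transfer(in_state: int, gen: int, kill: int) -> int:
    """OUT[B] = gen_B ∪ (IN[B] - kill_B), on bit vectors."""
    return gen | (in_state & ~kill)

# Example: 4 definitions d0..d3 (bit i = definition di).
# The block generates d2 and kills d0.
gen_B  = 0b0100
kill_B = 0b0001
print(bin(transfer(0b0011, gen_B, kill_B)))  # 0b110: d1 and d2 reach, d0 killed
```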
And where control flows meet, we naturally require \({\rm IN}[B] = \bigcup_{P \to B}{\rm OUT}[P]\).
With this groundwork, we can design an iterative algorithm for data flow analysis (the algorithm is a template, not only for reaching definitions analysis):
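The iterative algorithm can be sketched as below. This is a minimal illustration, assuming the CFG is given as predecessor lists and per-block gen/kill bit vectors; all names and the toy CFG are assumptions, not the lecture's code:

```python
def reaching_definitions(blocks, preds, gen, kill):
    """Iterate the transfer and meet equations until no OUT changes."""
    IN = {b: 0 for b in blocks}
    OUT = {b: gen[b] for b in blocks}  # initialize OUT[B] to gen_B
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # Meet: IN[B] = union of OUT[P] over all predecessors P.
            IN[b] = 0
            for p in preds[b]:
                IN[b] |= OUT[p]
            # Transfer: OUT[B] = gen_B | (IN[B] & ~kill_B).
            new_out = gen[b] | (IN[b] & ~kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT

# Toy CFG: B1 -> B2, with 3 definitions d0..d2.
blocks = ["B1", "B2"]
preds = {"B1": [], "B2": ["B1"]}
gen  = {"B1": 0b001, "B2": 0b010}
kill = {"B1": 0b000, "B2": 0b001}  # B2 redefines the variable of d0
IN, OUT = reaching_definitions(blocks, preds, gen, kill)
print(bin(IN["B2"]), bin(OUT["B2"]))  # 0b1 0b10
```

Iterating in reverse post-order would converge faster, but any order reaches the same fixed point.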
We can give a brief proof that the algorithm terminates in finitely many steps. Whenever a compressed state changes from \(a\) to \(b\), the implementation of our transfer function guarantees \(a \vee b = b\). In other words, a bit can only change from \(0\) to \(1\); \(1 \to 0\) is impossible. So in the worst case every state becomes \(111\dots1\), at which point no further change can happen and the algorithm stops.