-
Notifications
You must be signed in to change notification settings - Fork 147
Writing a Data Flow Analysis
Why I cannot write a data-flow analysis on-the-fly? Writing a data-flow analysis is a challenging task and can be tough. Therefore, you should be familiar with the underlying theory in order to be able to develop a novel data-flow analysis. We compiled a (incomplete) list of literature for you which you may consult before writing an analysis. Please have a look at Useful Literature.
PhASAR provides several sophisticated mechanisms that allow you to write your own data-flow analysis. In general, PhASAR is designed in such a way that the analysis writer has to choose from several possible interfaces which he can use for a concrete static analysis. Depending on your analysis problem some interfaces are more suited than others. When having found the right interface for your problem, the analysis writer usually just has to provide a new class which implements the interfaces missing functionality which then in turn is the problem description for the analysis. This concrete problem description is then handed over to a corresponding solver, which solves the problem in a completely automatic fashion. In the following the possible data-flow solvers are ordered according to their power (and difficulty to provide an analysis for).
In general, all data-flow analysis is performed on the codes control-flow graph. When writing an analysis a user has to choose a control flow graph for his analysis to work. For that reason all of our solvers work either on the CFG.h (intra-procedural control-flow graph) or on the ICFG.h (inter-procedural control-flow graph) interface.
For instance, when writing a simple intra-procedural data-flow analysis using the monotone framework the use must use one of CFG.h's concrete implementation or provide his own implementation for this interface. Usually the pre-implemented LLVMBasedCFG.h should do the job and can be used out-of-the-box.
The inter-procedural call-string approach and the IFDS/ IDE frameworks solve a concrete analysis based on an inter-procedural control-flow graph describing the structure of the code under analysis. Depending on the analysis needs, you can use a forward-, backward-, or bi-directional inter-procedural control-flow graph. However, most of the time the 'LLVMBasedICFG' should work just fine.
If necessary it is also possible to provide you own implementation of an ICFG. In that case just provide another class for which you provide a reasonable name. You implementation must at least implement the interface that is defined by the ICFG.h interface.
In the following section some useful coding shortcuts are presented which may come very handy when writing a new analysis within PhASAR.
When overriding the classes in order to describe the problems within the monotone framework, oftentimes set operations like set union, set intersect or set difference are required. Writing these functions yourself is tedious. Therefore it makes much sense to use the existing set operation functions which are defined in the std::algorithm header file. Many useful functionality is provided there such as:
...
std::includes /* subset */
std::set_difference
std::set_intersection
std::set_union
...
The neat thing about these functions is that they are completely generic as they operating an iterators and provide several useful overloads. They work on different container types which follow the concept required by these functions. In the following a small code example is presented:
std::set<int> a = {1, 2, 3};
std::set<int> b = {6, 4, 3, 2, 8, 1};
std::set<int> c;
std::set_union(a.begin(), a.end(),
b.begin(), b.end(),
inserter(c, c.begin()));
bool isSubSet = std::includes(b.begin(), b.end(), a.begin(), a.end());
When defining flow functions in IFDS or IDE, sometimes a certain flow function type occurs more than once. For instance, the Kill flow function that kills a specific data-flow fact is often needed many times. As the writer of an IFDS or IDE analysis you can find several useful pre-defined flow functions that can be used directly. Another very useful example is the Identity flow function. Some of these flow functions are also defined as a singleton if possible in order to keep the amount of overhead as small as possible. You can use the pre-defined flow functions inside your flow function factories using std::make_shared(). We also provide some LLVM specific flow functions and some general edge functions that may be used when implementing an IDE analysis.
Here is a small code example:
// in this example let domain D be const llvm::Value*
shared_ptr<FlowFunction<const llvm::Value*>> getNormalFlowFunction(....) {
// check the type of the instruction
if( ...) {
// do some work
} else if (...) {
// kill some data-flow fact
return make_shared<Kill<const llvm::Value*>>(/* some fact to be killed */);
} else {
// just treat everything else as Identity
return Identity<const llvm::Value*>::getInstance();
}
}
The code is written in a very generic way. For that reason we use a lot of template parameters. Here we describe the most important template parameters:
- D
- The type of the data-flow facts of your data-flow domain D. Very often you probably would like to use const llvm::Value *.
- N
- The type of nodes in you inter-procedural control-flow graph (or statements/ instructions). When using analysis on llvm IR it will always be const llvm::Instruction *.
- M
- The type of functions/ methods used by the framework. When using llvm it will be const llvm::Function *.
- I
- The type of the inter-procedural control-flow graph to be used. Usually it will be some reference to a type implementing the ICFG.h interface. For example: LLVMBasedICFG &.
- V/L
- The is the type for the second - value domain - of IDE problem. What this should be really depends of your concrete analysis. When using IFDS you do not have to worry about this type, since it is automatically chosen for you as:
enum class BinaryDomain {
BOTTOM = 0,
TOP = 1
};
This is probably the easiest analysis you can write and if the analysis writer is a beginner, he should definitely start at this point. Using an intra-procedural monotone framework analysis, an data-flow analysis problem can be solved within one single function (caution: function calls within the function under analysis are not followed, but the call-sites are still in the code of course). In order to formulate such an analysis the user has to implement the IntraMonotoneProblem.h interface. His implemented analysis is then handed over to the corresponding IntraMonotonSolver.h which solves his analysis problem.
The implementation will be finished soon.
This analysis can be used when inter-procedural data-flow problems must be solved. It uses the classical monotone framework combined with the call-string approach to achieve k-context sensitivity. The k can be specified by the analysis implementor. The interface the analysis writer has to implement (InterMonotoneProblem) contains a few more functions than the IntraMonotoneProblem.h and thus is slightly more complex. Please note that this solver has scaling problems for a large k on large programs. If the analysis writer demands a scalable analysis with infinite context sensitivity, he may would like to formulate his data-flow problem with an IFDS or IDE analysis (caution: IFDS and IDE can only be used when the flow functions used are distributive).
When you would like to write your own data-flow analysis using IFDS you basically just have to implement a single class. It is a good idea to create a new directory for your new analysis that lives below 'ifdside_problems' and is name after the naming conventions that you will find there and contains the name of the analysis in one form or another.
To make your class an analysis problem our solver is able to solve, you let your class inherit from 'DefaultIFDSTabulationProblem'. The concrete analysis is formulated by overwriting all abstract functions of the 'DefaultIFDSTabulationProblem'. The member functions you have to override are:
-
getNormalFlowFunction()
- Here you formulate you flow function(s) that is (are) applied to every instruction within a functions body.
-
getCallFlowFunction()
- Here you express what the solver should do when it hits a call-site. In this flow function the actual parameters are usually mapped to the formal parameters of the called function.
-
getRetFlowFunction()
- Here you usually propagate the information at the end of the callee back to the caller.
-
getCallToRetFlowFunction()
- Every information that is not touched by the function called is handled here. Usually you just want to pass every information as identity.
-
Optional: getSummaryFlowFunction()
- Default: Implement to return nullptr. But can be used to model llvm.intrinsic functions or libc function that you do not want to follow as no implementation is available.
-
initialSeeds()
- The initial seeds are the starting points of your analysis. An analysis can start at one or more points in your program. The functions must return start points as well as a set of data-flow facts that hold at these program points. Unless your analysis requires otherwise you would just give the first instruction of the main function as a starting point and use the special zero fact that holds at the analysis start.
-
createZeroValue()
- This function must define what value your analysis should use as the special zero value.
-
Constructor
- The constructor of you analysis receives the ICFG that shall be used for this analysis as a parameter. Furthermore, the constructor of the DefaultIFDSTabulationProblem that you inherit from must be called AND the special zero value must be initialized in a suitable way. Here is an example of how your constructor can looks like:
IFDSSolverTest::IFDSSolverTest(LLVMBasedICFG &I, vector<string> EntryPoints)
: DefaultIFDSTabulationProblem<const llvm::Instruction *,
const llvm::Value *, const llvm::Function *,
LLVMBasedICFG &>(I), EntryPoints(EntryPoints) {
DefaultIFDSTabulationProblem::zerovalue = createZeroValue();
}
If you read this, you made it very far and will now explore the most complex solver we currently provide within PhASAR (but do not worry, we are already planing to include another even more abstract solver). When writing an IDE analysis you only have to implement a single class as well. The general concept is very similar to writing an IFDS analysis. But this time your analysis class has to inherit from 'DefaultIDETabulationProblem'. In addition to this documentation, please also refer to the built-in implementations of IDE data-flow analysis of PhASAR.
The member functions you have to provide some implementations for are:
-
getNormalFlowFunction()
- See Writing an IFDS analysis
-
getCallFlowFuntion()
- See writing an IFDS analysis
-
getRetFlowFunction()
- See writing an IFDS analysis
-
getCallToRetFlowFunction()
- See writing an IFDS analysis
-
Optional: getSummaryFlowFunction()
- see writing an IFDS analysis
-
initialSeeds()
- See writing an IFDS analysis
-
topElement()
- A function that returns the top element of the lattice the analysis is using -> meaning no information at all
-
bottomElement()
- A function that returns the bottom element of the lattice the analysis is using -> meaning all information (the most imprecise lattice elemnt)
-
join()
- A function that defines how information is joined (the merge operator of the lattice), which gets you higher up in the lattice (making the result less precise).
-
allTopFunction()
- Function that returns the a special edge function allTop that can be viewed as the function representation of the top value.
-
getNormalEdgeFunction()
- Here you formulate your edge function(s) that is (are) applied to every instruction within a functions body.
-
getCallEdgeFunction()
- Here you express what the solver should do when it hits a call-site. In this edge function the actual parameters are usually mapped to the formal parameters of the called function.
-
getReturnEdgeFunction()
- Express the edge function that is applied to map data-flow facts that hold at the end of a callee back to the caller.
-
getCallToReturnEdgeFunction()
- Here the edge function is defined that is applied for data-flow facts that 'go around' the call-site.
-
Constructor
- See Constructor of Writing an IFDS analysis
Please also refer to the IDE analysis IDELinearConstantAnalysis.
WPDS is the latest solver of distributive data-flow problems that has been implemented recently in PhASAR . WPDS is similar to IDE. We obtain WPDS by adding weights to PDS. The weights in WPDS corresponds to distributive summary functions in IDE.
In our implementation of WPDS, firstly we construct an exploded super graph, using IDESolver, afterwards we use the WALi library to generated rule sets from the constructed ESG. The following variants of WPDS can be created in our implementation, using WALi library, :
- WPDS
- EWPDS
- FWPDS
- SWPDS
In order to write a WPDS analysis you need to write an IDE analysis, so the class that you need to implement is similar to when you write an IDE analysis. Your analysis class has to inherit from 'DefaultIDETabulationProblem'.
The member functions that you have to provide some implementations for are:
-
getNormalFlowFunction()
- See Writing an IFDS analysis
-
getCallFlowFuntion()
- See writing an IFDS analysis
-
getRetFlowFunction()
- See writing an IFDS analysis
-
getCallToRetFlowFunction()
- See writing an IFDS analysis
-
Optional: getSummaryFlowFunction()
- see writing an IFDS analysis
-
initialSeeds()
- See writing an IFDS analysis
-
topElement()
- see writing an IDE analysis
-
bottomElement()
- see writing an IDE analysis
-
join()
- see writing an IDE analysis
-
allTopFunction()
- see writing an IDE analysis
-
getNormalEdgeFunction()
- see writing an IDE analysis
-
getCallEdgeFunction()
- see writing an IDE analysis
-
getReturnEdgeFunction()
- see writing an IDE analysis
-
getCallToReturnEdgeFunction()
- see writing an IDE analysis
-
getSummaryEdgeFunction()
- see writing an IDE analysis
-
Constructor
- See Constructor of Writing an IFDS analysis
Similar to IDE analysis you can refer to WPDSLinearConstantAnalysis.
- Home
- Reference Material
- Getting Started:
- Building PhASAR
- Using PhASAR with Docker
- A few uses of PhASAR
- Coding Conventions
- Contributing to PhASAR
- Errors and bug reporting
- Update to Newer LLVM Versions
- OS Support
- FAQ