Skip to content

Writing a Data Flow Analysis

LinusJungemann edited this page Jul 29, 2020 · 15 revisions

Before writing a static analysis

Why I cannot write a data-flow analysis on-the-fly? Writing a data-flow analysis is a challenging task and can be tough. Therefore, you should be familiar with the underlying theory in order to be able to develop a novel data-flow analysis. We compiled a (incomplete) list of literature for you which you may consult before writing an analysis. Please have a look at Useful Literature.

Writing a static analysis

PhASAR provides several sophisticated mechanisms that allow you to write your own data-flow analysis. In general, PhASAR is designed in such a way that the analysis writer has to choose from several possible interfaces which he can use to implement a concrete static analysis. Depending on a user's analysis problem, some interfaces might be more suitable than others. When having found the right interface for the problem, the analysis writer usually only has to provide a new class which implements the interface's missing functionality which serves as the problem description for the analysis. This concrete problem description is handed over to the corresponding solver, which solves the problem in a fully automated manner. In the following the possible data-flow solvers are ordered according to their expressiveness (and complexity).

Choosing a control-flow graph

In general, all data-flow analyses are performed on a control-flow graph. When writing an analysis, a user has to choose a control-flow graph for their analysis. For that reason, all of our solvers work either on the CFG (intra-procedural control-flow graph) or on the ICFG (inter-procedural control-flow graph) interface.

For instance, when writing a simple intra-procedural data-flow analysis using the monotone framework, the user has to use one of CFG's concrete implementations or provide his own implementation for that interface. The pre-implemented LLVMBasedCFG.h should usually do the job and can be used out-of-the-box.

The inter-procedural call-string approach and the IFDS/IDE/WPDS frameworks solve a concrete analysis based on an inter-procedural control-flow graph. Depending on the analysis's needs, a forward-, backward-, or bi-directional inter-procedural control-flow graph may be used. For instance, if a backward analysis is required, the LLVMBasedBackwardICFG has to be used. However, most of the time the 'LLVMBasedICFG' should work just fine.

It is also possible to provide a custom implementation of the ICFG interface if necessary. In that case, a user just needs to provide another class for which they provide a reasonable name. The implementation must at least implement the interface that is defined by the ICFG.h interface.

Important template parameters

The code for the data-flow solvers as well as the additional helper analyses is written in a very generic way. For that reason we use a lot of template parameters and type aliases. In the following, we describe the most important template parameters and type alias names:

  • D | d_t
    • Data-flow fact --- Specifies the type of an individual data-flow fact that is propagated through the program under analysis. While analysing LLVM IR, in most cases the type const llvm::Value * will suffice, for more specialised analysis, custom types may be required. An example for this can be found in the IDESecureHeapPropagation. Additionally PhASAR provides the class ExtendedValue which extends the capabilities of the standard const llvm::Value * type. For details please refer to the implementation found in include/phasar/PhasarLLVM/Domain/ExtendedValue.h
  • N | n_t
    • (Control-flow) Node --- Specifies the type of a node in the (inter-procedural) control-flow graph and can be though of as an individual statement or instruction of the program. When using analysis targeting LLVM IR it will always be const llvm::Instruction *.
  • F | f_t
    • Function --- Specifies the type of functions. Represents a function in the analysed code. When using analysis targeting LLVM IR it will always be const llvm::Function *.
  • T | t_t
    • (User-defined) type --- Specifies the type of a user-defined (i.e. struct or class) data type. This often corresponds with the const llvm::StructType* and is used by the TypeHierarchy.
  • V | v_t
    • (Pointer) value --- Specifies the type of pointers. In most cases this will correspond to the const llvm::Value * type. This parameter is used by the Points-to-Analysis that generates Points-to-Information for our analysis.
  • C | c_t
    • Intra-procedural control flow --- Specifies the type of the control-flow graph to be used. This is usually one of the included graphs that is located in the include/phasar/PhasarLLVM/ControlFlow folder which implements the CFG interface or a custom type that implements the same interface.
  • I | i_t
    • Inter-procedural control flow --- Specifies the type of the inter-procedural control-flow graph to be used. This is usually one of the included graphs that is located in the include/phasar/PhasarLLVM/ControlFlow folder which implements the ICFG interface or a custom type that implements the same interface.
  • L | l_t
    • Lattice element --- Specifies the type of the underlying lattice; the value computation domain IDE's edge functions or WPDS's weights operate on. When using IFDS, one does not have to worry about this type, since it is automatically chosen for the analysis writer as:
enum class BinaryDomain {
    BOTTOM = 0,
    TOP = 1
};

Type Parameter Bundles for Data-Flow Analysis

To avoid having to specify lots of template parameters for a concrete client data-flow analysis, an AnalysisDomain type has been added that bundles all type parameters important to data-flow analysis. For a concrete analysis, one can specify a domain by declaring a new domain type that inherits from AnalysisDomain and overrides the type aliases (default is void). The bundle is then used as a template parameter to provide the problem description with your chosen configuration of the types mentioned above. Make sure to make your types usable in your analysis by including:

using ProblemType =
      ProblemDescription<MyNewAnalysisDomain>;
  using typename ProblemType::d_t;
  using typename ProblemType::f_t;
  using typename ProblemType::i_t;
  using typename ProblemType::l_t;
  using typename ProblemType::n_t;
  using typename ProblemType::t_t;
  using typename ProblemType::v_t;

Where ProblemDescription is replaced by the name of the actual abstract problem class e.g. IDETabulationProblem.

A standard set of template parameters is bundeled in the LLVMAnalysisDomainDefault struct which can be found in include/phasar/PhasarLLVM/Domain/AnalysisDomain.h together with the implementation of the AnalysisDomain. For this struct of standard parameters the following types are used:

struct LLVMAnalysisDomainDefault : public AnalysisDomain {
  using d_t = const llvm::Value *;
  using n_t = const llvm::Instruction *;
  using f_t = const llvm::Function *;
  using t_t = const llvm::StructType *;
  using v_t = const llvm::Value *;
  using c_t = LLVMBasedCFG;
  using i_t = LLVMBasedICFG;
};

If these standard Parameters are relatively close to the parameters used in your analysis, just let your new Domain inherit from LLVMAnalysisDomainDefault and override some of the type aliases.

Useful shortcuts

In the following section some useful coding shortcuts are presented which may come in handy when writing a new analysis within PhASAR.

The std::algorithm header

When overriding the classes in order to describe the problems within the monotone framework, oftentimes set operations like set union, set intersect or set difference are required. Writing these functions yourself is tedious. Therefore it makes much sense to use the existing set operation functions which are defined in the std::algorithm header file such as:

    ...
    std::includes /* subset */
    std::set_difference
    std::set_intersection
    std::set_union
    ...

The neat thing about these functions is that they are completely generic as they operate on iterators and provide several useful overloads. Thus, they work on all container types that follow the concept's required by these functions. In the following a small code example is presented:

std::set<int> a = {1, 2, 3};
std::set<int> b = {6, 4, 3, 2, 8, 1};
std::set<int> c;
std::set_union(a.begin(), a.end(),
               b.begin(), b.end(),
               inserter(c, c.begin()));
bool isSubSet = std::includes(b.begin(), b.end(), a.begin(), a.end());
The pre-defined flow function classes

When defining flow functions in IFDS, IDE or WPDS, sometimes a certain flow function type occurs more than once. For instance, the Kill flow function that kills a specific data-flow fact is often needed many times. An analysis writer can find several useful pre-defined flow (and edge) functions that can be directly used. A very useful example of such a recurring flow function is the Identity flow function. Some of the flow functions are also defined as a singleton in order to keep the amount of overhead as small as possible. This is always possible if the flow function has no state. A user may use the pre-defined flow functions inside their flow function factories. We also provide some LLVM specific flow functions for parameter mapping at call and return sites. We also provide some general edge functions that may be used when implementing an IDE analysis.

Here is a small example that makes use of the pre-defined flow functions:

// in this example let domain D be const llvm::Value*

shared_ptr<FlowFunction<const llvm::Value*>> getNormalFlowFunction(....) {
    // check the type of the instruction
    if( ...) {
        // do some work
    } else if (...) {
        // kill some data-flow fact
        return make_shared<Kill<const llvm::Value*>>(/* some fact to be killed */);
    } else {
        // just treat everything else as Identity
        return Identity<const llvm::Value*>::getInstance();
    }
}
Clone this wiki locally