Skip to content

Writing a Data Flow Analysis

Fabian Schiebel edited this page Jun 17, 2023 · 15 revisions

Before Writing a Static Analysis

Why can't I write a data-flow analysis on-the-fly? Writing a data-flow analysis is a challenging task and can be tough. Therefore, you should be familiar with the underlying theory in order to be able to develop a proper data-flow analysis. We compiled a (incomplete) list of literature for you which you may consult before writing an analysis. Please have a look at Useful Literature.

Writing a Static Analysis

PhASAR provides several sophisticated mechanisms that allow you to write your own data-flow analysis. In general, PhASAR is designed in such a way that analysis writers can choose from several possible interfaces which they can use to implement a concrete static analysis. Depending on a user's analysis problem, some interfaces might be more suitable than others. When having found the right interface for the problem, an analysis writer usually only has to provide a new class which implements the interface's missing functionality which serves as the problem description for the analysis. This concrete problem description is handed over to the corresponding solver, which solves the problem in a fully automated manner. In the following the possible data-flow solvers are ordered according to their expressiveness (and complexity).

Choosing a Control-Flow Graph

In general, all data-flow analyses are performed on a control-flow graph. When writing an analysis, a user has to choose a control-flow graph for their analysis. For that reason, all of our solvers work either on the CFG (intra-procedural control-flow graph) or on the ICFG (inter-procedural control-flow graph) interface.

For instance, when writing a simple intra-procedural data-flow analysis using the monotone framework, the user has to use one of CFG's concrete implementations or provide their own implementation for that interface. The pre-implemented LLVMBasedCFG should usually do the job and can be used out-of-the-box.

The inter-procedural call-string approach and the IFDS/IDE frameworks solve a concrete analysis based on an inter-procedural control-flow graph. Depending on the analysis's needs, a forward- or backward ICFG graph may be used. For instance, if a backward analysis is required, the LLVMBasedBackwardICFG has to be used. However, most of the time the LLVMBasedICFG should work just fine.

It is also possible to provide a custom implementation of the ICFG interface if necessary. In that case, a user just needs to provide another class. The implementation must at least implement the interface that is defined by the ICFG interface.

Important template parameters

The code for the data-flow solvers as well as the additional helper analyses is written in a very generic way. For that reason we use a lot of template parameters and type aliases. In the following, we describe the most important template parameters and type alias names:

  • D | d_t: Data-flow fact --- Specifies the type of an individual data-flow fact that is propagated through the program under analysis. While analyzing LLVM IR, in most cases the type const llvm::Value * will suffice; for more specialized analyses, custom types may be required. An example for this can be found in the IDESecureHeapPropagation.
  • N | n_t: (Control-flow) Node --- Specifies the type of a node in the (inter-procedural) control-flow graph and can be though of as an individual statement or instruction of the program. When using an analysis targeting LLVM IR, it will always be const llvm::Instruction *.
  • F | f_t: Function --- Specifies the type of functions. Represents a function in the code under analysis. When using an analysis targeting LLVM IR it will always be const llvm::Function *.
  • T | t_t: (User-defined) type --- Specifies the type of a user-defined (i.e. struct or class) data type. This often corresponds with the const llvm::StructType* and is used by the LLVMTypeHierarchy.
  • V | v_t: (Pointer) value --- Specifies the type of pointers. In most cases this will correspond to the const llvm::Value * type. This parameter is used by the Pointer Analysis that generates Points-to/Alias information for our analysis.
  • C | c_t: Intra-procedural control flow --- Specifies the type of the control-flow graph to be used. This is usually one of the included graphs that is located in the include/phasar/PhasarLLVM/ControlFlow folder which implements the CFG interface or a custom type that implements the same interface.
  • I | i_t: Inter-procedural control flow --- Specifies the type of the inter-procedural control-flow graph to be used. This is usually one of the included graphs that is located in the include/phasar/PhasarLLVM/ControlFlow folder which implements the ICFG interface or a custom type that implements the same interface.
  • db_t: IRDB --- Specifies the type of IR database that provides access to the code under analysis. Usually LLVMProjectIRDB.
  • L | l_t: Lattice element --- Specifies the type of the value lattice for IDE problems. This is the the value computation domain where IDE's edge functions operate on. When using IFDS, one does not have to worry about this type, since it is automatically chosen for the analysis writer as:
    enum class BinaryDomain {
        BOTTOM,
        TOP
    };

Type Parameter Bundles for Data-Flow Analysis

To avoid having to specify lots of template parameters for a concrete client data-flow analysis, we use an AnalysisDomain type that bundles all type parameters that are important to data-flow analysis. For a concrete analysis, one can specify a domain by declaring a new domain type that inherits from AnalysisDomain and overrides the type aliases (default is void). The bundle is then used as a template parameter to provide the problem description with your chosen configuration of the types mentioned above. Make sure to make your types usable in your analysis by including:

using ProblemType =
      ProblemDescription<MyNewAnalysisDomain>;
  using typename ProblemType::d_t;
  using typename ProblemType::db_t;
  using typename ProblemType::f_t;
  using typename ProblemType::i_t;
  using typename ProblemType::l_t;
  using typename ProblemType::n_t;
  using typename ProblemType::t_t;
  using typename ProblemType::v_t;

Where ProblemDescription is a placeholder for the name of the actual abstract problem class e.g. IDETabulationProblem.

A standard set of template parameters is bundeled in the LLVMAnalysisDomainDefault struct. For this struct of standard parameters the following types are used:

struct LLVMAnalysisDomainDefault : public AnalysisDomain {
  using d_t = const llvm::Value *;
  using n_t = const llvm::Instruction *;
  using f_t = const llvm::Function *;
  using t_t = const llvm::StructType *;
  using v_t = const llvm::Value *;
  using c_t = LLVMBasedCFG;
  using i_t = LLVMBasedICFG;
  using db_t = LLVMProjectIRDB;
};

If these standard Parameters are relatively close to the parameters used in your analysis, just let your new Domain inherit from LLVMAnalysisDomainDefault and override some of the type aliases.

Useful shortcuts

In the following section some useful coding shortcuts are presented which may come in handy when writing a new analysis within PhASAR.

The <algorithm> Header

When overriding the classes in order to describe the problems within the monotone framework, oftentimes set operations like set union, set intersect or set difference are required. Writing these functions yourself is tedious. Therefore it makes much sense to use the existing set operation functions which are defined in the <algorithm> header file such as:

    ...
    std::includes /* subset */
    std::set_difference
    std::set_intersection
    std::set_union
    ...

The neat thing about these functions is that they are completely generic as they operate on iterators and provide several useful overloads. Thus, they work on all container types that follow the concept's required by these functions. You may use them as follows:

std::set<int> a = {1, 2, 3};
std::set<int> b = {6, 4, 3, 2, 8, 1};
std::set<int> c;
std::set_union(a.begin(), a.end(),
               b.begin(), b.end(),
               inserter(c, c.begin()));
bool isSubSet = std::includes(b.begin(), b.end(), a.begin(), a.end());

Note: When using the monotone framework, your also may want to checkout our BitVectorSet class that features several utilities that we have found useful for monotone-framework analyses.

Pre-defined Flow Function Templates

When defining flow functions in IFDS/IDE, oftentimes a certain flow function type occurs more than once. For instance, the Kill flow function that kills a specific data-flow fact is often needed many times. An analysis writer, you can find several useful pre-defined flow functions and edge functions that can be directly used. A very useful example of such a recurring flow function is the identity flow function. Some of the flow functions are also defined as a singleton in order to keep the amount of overhead as small as possible. This is always possible if the flow function has no state. We also provide some LLVM specific flow functions that you can find here.

Here is a small example that makes use of the pre-defined flow functions:

auto getNormalFlowFunction(...) -> FlowFunctionPtrType {
    // check the type of the instruction
    if(...) {
        // do some work
    } 
    else if(...) {
        // unconditionally generate some new data-flow fact
        return generateFromZero(/*some fact to be generated*/);
    }
    else if (...) {
        // kill some data-flow fact
        return killFlow(/* some fact to be killed */);
    }

    // just treat everything else as Identity
    return identityFlow();

}
Clone this wiki locally