Taint tracking in Python

Overview

Taint tracking follows how arbitrary values, called "taint", flow through a program. This is useful for finding whether potentially malicious input can be used in an insecure way, whether dangerous arguments are passed to vulnerable functions, and whether confidential or sensitive data can leak. It is also useful for tracking invalid, insecure, or untrusted data in other analyses.

Taint tracking differs from basic data flow in that it takes non-value-preserving steps into account as well. For example, in the assignment dir = path + "/", if path is tainted then dir is also tainted, even though there is no data flow from path to path + "/".
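
The difference between the two kinds of step can be seen directly in Python. The following is a minimal sketch; the names alias and dir_path are chosen for illustration only (dir_path plays the role of dir in the text above).

```python
path = "/home/user"

# Value-preserving step: `alias` refers to the very same object as `path`,
# so ordinary data flow already connects the two.
alias = path

# Non-value-preserving step: concatenation builds a new string. The value of
# `path` itself does not flow to `dir_path`, but `dir_path` is derived from
# it, so taint tracking treats `dir_path` as tainted whenever `path` is.
dir_path = path + "/"
```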

Fundamentals of taint tracking

Any taint tracking query has three explicit components, one of which is optional, and an implicit component. The explicit components are:

One or more sources, represented by the TaintSource class.
One or more sinks, represented by the TaintSink class.
Zero or more sanitizers, represented by the Sanitizer class.

The implicit component is the "kind" of taint, represented by the TaintKind class.

A taint tracking query gives results when taint flows from a source to a sink and that flow is not blocked by a sanitizer.
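
To make the three roles concrete, the following sketch shows where each would sit in a piece of analyzed Python code. The function build_query and its check are hypothetical, and how a query would model them with TaintSource, Sanitizer and TaintSink is not shown here.

```python
def build_query(user_name):
    """Hypothetical example: builds an SQL string from externally supplied text."""
    # Source: `user_name` is assumed to arrive from outside the program.

    # Sanitizer: a check like this one is the sort of thing a sanitizer would
    # model. Past this point the value contains only letters and digits.
    if not user_name.isalnum():
        raise ValueError("rejected input: " + repr(user_name))

    # Sink: the string built here would be handed to a database in real code.
    return "SELECT * FROM users WHERE name = '" + user_name + "'"


safe_query = build_query("alice")
```

With the check in place, only sanitized values reach the string construction; without it, the flow from source to sink would be unblocked.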

The kind of taint determines which additional, non-value-preserving steps are possible. In the above example dir = path + "/", taint flows from path to dir if the taint is a string, but not if the taint is None.
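
The dependence on kind mirrors what happens at run time. As a small illustration (the helper name add_slash is an assumption made for this example), concatenation produces a derived value for a string but cannot produce any value for None:

```python
def add_slash(path):
    """Mirror of the expression `path + "/"` discussed above."""
    return path + "/"


# A string operand yields a new string derived from it, so string taint
# on the operand carries over to the result.
derived = add_slash("/tmp")

# None does not support `+` with a string, so there is no result value for
# a "None" taint to flow into.
try:
    add_slash(None)
    none_flows = True
except TypeError:
    none_flows = False
```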

Limitations

Although taint tracking is a powerful technique, it depends on the underlying data flow graph. Creating a data flow graph that is both accurate and covers a large enough part of a program is a challenge, especially for a dynamic language like Python. The call graph might be incomplete, the reachability of code is an approximation, and certain constructs, like eval, are just too dynamic to analyze.
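
As an illustration of the last point, consider a call whose target is computed by eval. The names call_by_name, unsafe and harmless are hypothetical; the sketch only shows why the callee cannot, in general, be determined without running the program.

```python
def unsafe(arg):
    return "unsafe(" + arg + ")"


def harmless(arg):
    return "harmless(" + arg + ")"


def call_by_name(namespace, name, arg):
    # The callee is whatever `name` evaluates to at run time. A static call
    # graph has no edge from here to `unsafe` unless it can work out the
    # string, which is not possible in general.
    func = eval(name, {}, namespace)
    return func(arg)


functions = {"unsafe": unsafe, "harmless": harmless}
result = call_by_name(functions, "un" + "safe", "user data")
```

Here the name is assembled from two literals, but it could just as well come from a file or from user input, in which case no static analysis could resolve it.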

Using taint-tracking for Python

A simple (non-path) taint tracking query has the basic form:

/**
 * @name ...
 * @description ...
 * @kind problem
 */

import semmle.python.security.TaintTracking

from TaintSource src, TaintSink sink
where src.flowsToSink(sink)
select sink, "Alert message, including reference to $@.", src, "string describing the source"

For this query to work, a source and a sink must be defined. This can be done either by importing pre-existing sources and sinks, or by defining them explicitly.

As a contrived example, here is a query that looks for flow from an HTTP request to a function called "unsafe". The sources are pre-defined and accessed by importing the library semmle.python.web.HttpRequest. The sink is defined by using a custom TaintSink class.

/* Expose the string taint kind needed by our custom sink */
import semmle.python.security.strings.Untrusted

/* Sources */
import semmle.python.web.HttpRequest

/* Sink */
/** A class representing any argument in a call to a function called "unsafe" */
class UnsafeSink extends TaintSink {

    UnsafeSink() {
        exists(FunctionObject unsafe |
            unsafe.getName() = "unsafe" and
            unsafe.getACall().(CallNode).getAnArg() = this
        )
    }

    override predicate sinks(TaintKind kind) {
        kind instanceof StringKind
    }

}

from TaintSource src, TaintSink sink
where src.flowsToSink(sink)
select sink, "This argument to 'unsafe' depends on $@.", src, "a user-provided value"
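
For orientation, this is the shape of Python code the query is aimed at. It is a minimal sketch: unsafe and handle_request are hypothetical, and a plain dictionary stands in for the request object of a web framework. Whether a real request is recognized as a source depends on the models in semmle.python.web.HttpRequest, which are not shown here.

```python
def unsafe(value):
    """Hypothetical dangerous function; each of its arguments is a sink."""
    return "executed: " + value


def handle_request(query_params):
    # Stand-in for reading a parameter from an HTTP request (the source).
    name = query_params.get("name", "")
    # Concatenation is a non-value-preserving step that keeps string taint.
    greeting = "hello " + name
    # The tainted string reaches an argument of `unsafe` (the sink).
    return unsafe(greeting)


response = handle_request({"name": "world"})
```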

Implementing path queries

Although the taint tracking query above tells which sources flow to which sinks, it doesn't tell us how. For that we need a path query.

A standard taint tracking query can be converted to a path query by changing @kind problem to @kind path-problem, adding an import and changing the format of the query clauses. The import is simply:

import semmle.python.security.Paths

And the format of the query becomes:

from TaintedPathSource src, TaintedPathSink sink
where src.flowsTo(sink)
select sink.getSink(), src, sink, "Alert message, including reference to $@.", src.getSource(), "string describing the source"

Thus, our example query becomes:

/**
 * ...
 * @kind path-problem
 * ...
 */

/* This computes the paths */
import semmle.python.security.Paths

/* Expose the string taint kinds needed by our custom sink */
import semmle.python.security.strings.Untrusted

/* Sources */
import semmle.python.web.HttpRequest

/* Sink */
/** A class representing any argument in a call to a function called "unsafe" */
class UnsafeSink extends TaintSink {

    UnsafeSink() {
        exists(FunctionObject unsafe |
            unsafe.getName() = "unsafe" and
            unsafe.getACall().(CallNode).getAnArg() = this
        )
    }

    override predicate sinks(TaintKind kind) {
        kind instanceof StringKind
    }

}

from TaintedPathSource src, TaintedPathSink sink
where src.flowsTo(sink)
select sink.getSink(), src, sink, "This argument to 'unsafe' depends on $@.", src.getSource(), "a user-provided value"
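
A path explanation matters most when the flow passes through several functions. In the following sketch (all function names are hypothetical, and a plain dictionary stands in for a real request object), the value takes three hops before it reaches unsafe. A plain problem query would report only the two ends, whereas a path query can show the intermediate steps. Whether the analysis follows each hop depends on the taint kind and on the library models.

```python
def unsafe(value):
    """Hypothetical dangerous function used as the sink."""
    return "executed: " + value


def normalize(text):
    # Hop 2: a method call on the tainted string.
    return text.strip()


def decorate(text):
    # Hop 3: concatenation around the normalized value.
    return "[" + normalize(text) + "]"


def handle_request(query_params):
    # Hop 1: the value is read from the stand-in request.
    raw = query_params["q"]
    return unsafe(decorate(raw))


response = handle_request({"q": "  report  "})
```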

Custom taint kinds and flows

In the above examples, we have assumed the existence of a suitable TaintKind, but sometimes it is necessary to model the flow of other objects, such as database connections, or None.

The TaintSource and TaintSink classes have predicates that determine which kind of taint they source and sink, respectively.

abstract class TaintSource {
    abstract predicate isSourceOf(TaintKind kind);
    ...
}

abstract class TaintSink {
    abstract predicate sinks(TaintKind taint);
    ...
}

The TaintKind itself is just a string (a QL string, not a QL entity representing a Python string), which provides methods to extend flow and allows the kind of taint to change along the path. The TaintKind class has many predicates that allow flow to be modified. The simplest TaintKind does not override any predicates, meaning that it only flows as opaque data. An example of this is the Hardcoded credentials query, which defines the simplest possible taint kind class, HardcodedValue, along with custom source and sink classes.

class HardcodedValue extends TaintKind {
    HardcodedValue() {
        this = "hard coded value"
    }
}

class HardcodedValueSource extends TaintSource {
    ...
    override predicate isSourceOf(TaintKind kind) {
        kind instanceof HardcodedValue
    }
}

class CredentialSink extends TaintSink {
    ...
    override predicate sinks(TaintKind kind) {
        kind instanceof HardcodedValue
    }
}
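
The Python code such a query targets might look like the following sketch. The names DB_PASSWORD and connect are hypothetical. Because HardcodedValue overrides no predicates, its taint would be expected to follow only value-preserving steps such as assignment and argument passing, not derived values.

```python
DB_PASSWORD = "hunter2"  # hard-coded literal: the kind of value a source would mark


def connect(host, password):
    """Hypothetical stand-in for a library call that takes a credential."""
    return {"host": host, "authenticated": password == "hunter2"}


# Value-preserving steps: the literal is assigned and passed along unchanged,
# so an opaque taint kind still reaches the credential argument.
password = DB_PASSWORD
session = connect("db.example.com", password)

# Derived value: a new string is computed from the literal. An opaque kind
# defines no extra flow steps, so this result would not carry its taint.
masked = DB_PASSWORD[:2] + "****"
```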
