By Mate Kukri
The naive method to looking for patterns in supply code is to make use of regular expressions; a greater manner is to parse the code with a customized parser, however each of those approaches have limitations. Throughout my internship, I prototyped an inside software known as Syntex that does looking on Clang ASTs to keep away from these limitations. On this put up, I’ll focus on Syntex’s distinctive method to syntactic sample matching.
Looking, however with context
Syntex addresses two overarching issues with conventional pattern-searching instruments.
First, present instruments are susceptible to producing false negatives. These instruments often comprise their very own parsers that they use relying on the language of the codebase they’re looking in. For C and C++ codebases, these instruments often parse supply code with out performing macro enlargement, looking via non-macro-expanded code as a substitute of the macro-expanded code that an actual compiler like Clang would produce. This implies these instruments can not guarantee correct outcomes. A consumer of such a software gained’t be capable of confidently say, “listed below are all of the occurrences of this sample” or “this sample by no means happens.” In idea, these instruments might match makes use of of macros in non-macro-expanded code, however in apply, they might be capable of match solely top-level makes use of of macros, that means that false negatives are doubtless.
One other drawback with these instruments is that their inside parsers don’t use the identical illustration of the code as an actual compiler would, and they don’t have an understanding of the semantics of the supply code. That’s, these instruments produce plaintext output highlighting their outcomes, so they can not present semantic details about the code wherein their outcomes seem. With out such data, it’s troublesome to additional analyze the output, particularly utilizing different evaluation instruments. It isn’t strictly talking unimaginable to entry the supply code internally parsed by these instruments, however it might not be significantly helpful in a multi-stage evaluation pipeline requiring entry to semantic data out there solely to a compiler. All this severely limits these instruments’ usefulness as a basis on which to make use of different evaluation instruments to extra deeply analyze the given code.
For example, let’s say we’re looking for code wherein the size of a string within the argument listing of a name to some dangerous perform is implicitly truncated. We would have our software seek for the sample
$func($... $str->len $...). The software will doubtless discover a superset of code snippets that we really care about. We ought to have the ability to semantically filter these search outcomes to test that len is the construction discipline of curiosity and that its use induces an implicit downcast. Nevertheless, no matter software we select to make use of wouldn’t be capable of do that filtering as a result of it might perceive solely the construction of the code, not the semantics. And due to its lack of semantic understanding, it’s tougher to introduce different evaluation instruments to assist additional analyze the outcomes.
Syntex solves each of those issues by working on precise Clang ASTs. As a result of Syntex makes use of the identical ASTs that the compiler makes use of, it eliminates the inaccuracies of typical pattern-searching instruments and gives semantic data for additional evaluation. Syntex produces outcomes with references to AST nodes, permitting builders to conduct follow-up semantic evaluation. For example, a consumer enumerating the downcast sample above will be capable of make choices based mostly on the kind and nature of the submatches similar to each
Syntex matches syntax patterns in opposition to listed code
Parsing C and C++ code is a notoriously troublesome job, in that it requires implementing unbounded lookaheads and executing Turing-complete templates simply to acquire a parse tree. Syntex solves the issue of parsing supply code by counting on Clang, however how does it parse Syntex queries themselves?
Apart from queries containing $-prefixed meta variables, Syntex queries are syntactically legitimate C and C++ code. Ideally, we’d parse Syntex queries with Clang, then unify the parsed queries and parsed supply code to establish matches. Sadly, life just isn’t really easy: Syntex queries lack the mandatory syntactic/semantic context that might permit them to be parsed. For instance, the sample
foo(0) yields different parse trees relying on the kind of
Syntex doesn’t straight resolve the sting circumstances of C and C++ syntax; as a substitute, it considers all attainable ambiguous interpretations whereas parsing queries. Nevertheless, as a substitute of defining the ambiguous language patterns itself, Syntex derives its sample grammar from the Clang compiler’s AST. Utilizing this method, we are able to assure that patterns might be accepted for each assemble showing within the listed supply code.
Synthesizing the grammar
At code constructing and indexing time, Syntex creates a context-free grammar by recursively strolling via the Clang AST and recording the relationships between AST nodes. Nodes with youngsters correspond to non-terminals; every look of such a node provides a manufacturing rule of the shape
mother or father -> child_0 ... child_n. Nodes with no youngsters turn into terminal symbols within the generated grammar. For example, the grammar (manufacturing guidelines and terminals) similar to the AST in determine 1 is as follows:
DECL_REF_EXPR -> IDENTIFIER INTEGER_LITERAL -> NUMERIC_CONSTANT IMPLICIT_CAST_EXPR -> DECL_REF_EXPR BINARY_OPERATOR -> IMPLICIT_CAST_EXPR PLUS IMPLICIT_CAST_EXPR PARENT_EXPR -> L_PARENTHESIS BINARY_OPERATOR R_PARENTHESIS BINARY_OPERATOR -> PAREN_EXPR SLASH INTEGER_LITERAL VAR -> KEYWORD_INT IDENTIFIER EQUAL BINARY_OPERATOR DECL_STMT -> VAR SEMI IDENTIFIER, NUMERIC_CONSTANT, PLUS, L_PARENTHESIS, R_PARENTHESIS, SLASH, KEYWORD_INT, EQUAL, SEMI
If we interpret
DECL_STMT because the “begin image” of this grammar, then the grammar is deterministic, and a parser that accepts strings might be generated with the generally used LR algorithm. Nevertheless, when parsing search queries, Syntex doesn’t really know the beginning image that the question ought to cut back to. For instance, if the question consists of an
IDENTIFIER token, then Syntex can parse that token as an
DECL_REF_EXPR containing an identifier, or an
IMPLICIT_CAST_EXPR containing a
DECL_REF_EXPR containing an identifier. Because of this, in apply, Syntex assumes that each image might be a begin image and retroactively deduces which guidelines are begin guidelines based mostly on whether or not they cowl all the enter question.
Parsing Syntex queries
Conceptually, step one in parsing a question is to carry out tokenization (or lexical evaluation). Syntex performs tokenization utilizing a hand-coded state machine. One distinction between Syntax’s tokenizer and people utilized in typical compilers is that Syntex’s tokenizer returns all attainable interpretations of the enter characters as a substitute of simply the greediest interpretation. For instance, Syntex would tokenize the string
”<<“ as each
<< and two
< symbols following one another. That manner, the tokenizer doesn’t have to concentrate on which interpretation is important wherein context.
Syntex parses queries in opposition to artificial sample grammars utilizing a memoizing chart parser. Memoization prevents the parsing course of from leading to (probably) exponential runtime complexity, and the ensuing memo desk serves because the in-memory illustration of a question parse forest. The matcher (described within the subsequent part) makes use of this desk to determine which listed ASTs match the question. This method implies that Syntex doesn’t must materialize express timber for every attainable interpretation of a question. Determine 2 presents a memoization desk for the question string “
This desk exhibits that the string at index
0 may be interpreted because the tokens
++ and that the manufacturing rule
UNARY_OPERATOR -> PLUS_PLUS DECL_REF_EXPR is matched at this index. To acquire that match, the parser, after seeing that the left nook of the manufacturing above can match
PLUS_PLUS, recursively obtains the parses at index
2. With this information, it will probably enumerate the manufacturing ahead and conclude that it matches in its entirety up till index
After the parse desk of a question is crammed, Syntex must find appearances of all interpretations of the question within the listed supply code. This course of begins with all entries within the desk at index 0 whose subsequent index is the size of the enter; these entries correspond to parses masking all the enter string.
Syntex’s matching algorithm operates on a proprietary Clang AST serialization format. Serialized ASTs are deserialized into in-memory timber. The deserialization course of builds an index of tree node sorts (similar to grammar symbols) to deserialized nodes, which allows Syntex to rapidly uncover candidates that might match in opposition to a question’s root grammar image. A recursive unification algorithm is utilized pairwise to every match candidate and every attainable parse of the question. The algorithm descends the timber, checking node sorts and the order wherein they seem, and bottoms out on the precise tokens themselves.
For the question “
++i” in determine 2, Syntex begins matching at an AST node with the image
UNARY_OPERATOR. On this case, we all know that the one option to produce such a node is to make use of the rule physique
PLUS_PLUS DECL_REF_EXPR. First, the matcher makes positive the aforementioned AST node has the proper construction: there are two baby nodes, a
PLUS_PLUS and a
DECL_REF_EXPR. Then, Syntex recursively repeats the identical course of for these nodes. For instance, for the
PLUS_PLUS baby, Syntex ensures that it’s a token node with the spelling “
Further options and instance makes use of
An vital characteristic of Syntex, briefly proven within the size truncation instance above, is that it helps “meta variables” in queries. For example, when a question reminiscent of “
++i” is specified, the matcher will discover solely expressions incrementing variables known as
i. Nevertheless, if we have been to specify “
++$” because the question, then Syntex will discover expressions that increment something. Meta variables can be named, reminiscent of within the question “
++$x”, permitting the consumer to retrieve the matching sub-expression by title (
x) for additional processing. Moreover, Syntex permits constraints to be utilized to meta variables, reminiscent of within the question “
++$x:DECL_REF_EXPR”; with these constraints, Syntex would match solely increments to expressions
x referencing a declaration. In-query constraints are restricted in expressivity, which is why the C++ API permits arbitrary C++ features to be hooked up as predicates that resolve to just accept or reject probably matching candidates.
One other vital characteristic, additionally proven within the size truncation instance above, is the globbing operator “
$...”. It permits customers to specify queries reminiscent of “
printf($...)”. The glob operator is beneficial when one needs to match an unknown variety of nodes. Glob operator semantics are neither grasping nor lazy. That is partially due to the non-traditional nature of Syntex-generated grammars: the place a hand-coded grammar may condense list-like repetition by way of recursive guidelines, Syntex grammars explicitly symbolize every noticed listing size by way of a special rule. Thus, a name to
printf with one argument is matched by a special rule than a name to printf with 5 arguments. As a result of Syntex can “see” all of those totally different guidelines of various lengths, it’s capable of specific attention-grabbing patterns with globbing, reminiscent of “
printf($... i $...)”, which can discover all calls to
i showing someplace within the argument listing.
Syntex’s method is exclusive amongst conventional syntactic sample looking instruments: the search engine accommodates little or no language-specific code and simply generalizes to different programming languages. The one requirement for utilizing Syntex is that it must have entry to the tokens that produced every AST node. In my prototype, the C and C++ ASTs are derived from
Syntex has already exceeded the capabilities of open-source and/or publicly out there alternate options, reminiscent of Semgrep and weggli. Syntex isn’t “completed,” although. The subsequent step is to develop Syntex in order that it searches via supply code textual content that doesn’t fairly exist. Some of the highly effective options of the C++ language is its templates: they permit programmers to explain the form of generic knowledge buildings and the computations involving them. These templates are configured with parameters which can be substituted with concrete sorts or values at compile time. This substitution, known as “template instantiation,” creates variants of code that have been by no means written. In future variations of Syntex, we plan to make C++ template instantiations syntax-searchable. Our imaginative and prescient for this characteristic depends on Clang’s capability to fairly print AST nodes again to supply code, which can present us with the code wanted for our grammar-building course of.
Final however not least, I want to thank Path of Bits for offering the chance to sort out such an attention-grabbing analysis mission throughout my internship. And I want to thank Peter Goodman for the mission concept and for mentoring me all through the method.
*** It is a Safety Bloggers Community syndicated weblog from Trail of Bits Blog authored by Trail of Bits. Learn the unique put up at: https://blog.trailofbits.com/2022/12/22/syntax-searching-c-c-clang-ast/
Leave a Reply