paint-brush
XPath Expression Generation: How We Generated XPath Queriesby@zpath
New Story

XPath Expression Generation: How We Generated XPath Queries

by XPathMarch 8th, 2025
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

In this section, we introduce how we generate XPath queries. We encountered two main challenges that we had to tackle when generating XPath expressions.

Companies Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - XPath Expression Generation: How We Generated XPath Queries
XPath HackerNoon profile picture
0-item

Abstract and 1 Introduction

2 Background

3 Approach and 3.1 Differential Testing for XML Processors

3.2 XPath Expression Generation

3.3 XML Generation

4 Evaluation

4.1 Effectiveness

4.2 Efficiency

4.3 Comparison to the State of the Art

4.4 Analysis of BaseX Historical Bug Reports

5 Related Work

6 Conclusion, Acknowledgments, and References

3.2 XPath Expression Generation

In this section, we introduce how we generate XPath queries. We encountered two main challenges that we had to tackle when generating XPath expressions.


Non-existent elements. Randomly generated queries could be semantically correct, but reference non-existent nodes or attributes. For the document in Figure 1, //Author[@id < 1] is a valid XPath expression. However, none of the Author nodes contain an id attribute. Thus, XPath returns an empty sequence for each node, causing the predicate @id < 1 to evaluate to false. We believe that queries, where only non-existent attributes or nodes are referenced, are less likely to exercise the logic of the processors under test, as subsequent operations are likely to evaluate to an empty sequence as well. Thus, we aim to avoid generating such queries.


Empty results. Randomly generated predicates might likely evaluate to false and cause queries to generate empty result sets. For the document in Figure 1, the XPath predicate starts-with(text(), x) identifies nodes whose text starts with x. If x is a randomly generated string, the possibility is high that no nodes in the current result set match the condition. Consequently, any use of the predicate would yield an empty result. Any subsequently added section would yield an empty result as well, meaning that such queries would be less likely to exercise the processor under test. Consequently, we want to avoid generating such predicates, in particular, when they involve multiple sections. This relates to the first problem, as non-existent nodes or attributes can also introduce empty results.


Approach overview. We designed the XPath generation process of XPress tackling the two aforementioned issues. To create XPath expressions that refer to valid nodes and attributes to trigger deeper logic of the system under test, we generate queries that reference existent context relative to the so-called targeted node, such as its tag name and attributes (steps 3 and 4). Since randomly generated predicates might miss the targeted node from the result set, we rectify the generated expressions to ensure the inclusion of the targeted node (step 5).


Iterative section generation. We create XPath expressions sectionby-section by executing step ○2 to step ○5 for each section, which allows us to ensure non-empty results after generating each section. In the example, we first generate section /Books by selecting as the targeted node, and after executing steps 2 to step , the result of /Books is non-empty—containing the node. Based on this, we further proceed to generate the next section /Book[count(Author) > 1] starting at step 2.


Section prefix. We randomly generate one of the applicable section prefixes. First, we randomly select the start of the section to be / or //. We then retrieve the current context node sequence by executing the expression—/Books in the example—on a processor. Based on the result, we include all possible axes that would not lead to an empty result set by simple conditional checks. We support all 11 axes described in the XPath 3.0 standard [4]. For example, applying the axis /descendants:: will lead to a non-empty result, if at least one non-leaf node exists in the current selection. From the possible axes, we select a random one and apply it. When generating the section prefix /Book, the axis step is implicit. It is equivalent to /Books/child::Book, which selects all child nodes of the previously selected nodes. We again execute the query and retrieve the result node-set. We use the result for the name test, for which we either select a tag name from the result node-set, or use the wildcard *. By doing so, we are again guaranteed a non-empty result set. In the example, the tag name Book is selected and applied, resulting in the selection of all three Book nodes. In our artifact, we include a table that details the conditional checks for all 11 axes.


Target node selection. To generate targeted queries that fetch at least one node, we select a so-called targeted node to guide the predicate generation process. We use information about the target node, such as its text content, the attributes it holds, and its relationship to other nodes during the predicate generation. This is similar to the concept of the pivot row in PQS [36], which is a technique that has been proposed to test relational DBMSs. After the generated predicate is applied, we expect the target node to be included in the result node-set. In step 3, we select node 1 as the target node for the predicate generation process. Constraining the context to exist for the targeted node does not affect the evaluation of the expression on other candidate nodes and, therefore, still allows finding bugs that are triggered only when referring to nodes’ non-existing attributes or child nodes.


Predicate generation. We use a tree structure to represent the predicate and take a bottom-up construction approach to enable tracking of expression results along tree construction. We start generating the predicate from a specific subject, which is either the targeted node or a node sequence derived from the targeted node with equal probability. In the example, we select the child node sequence from the targeted node as the subject. We then iteratively apply random function nodes and supply function parameters to construct the predicate, until the predicate reaches a desired length. We keep track of the data type and value of the current subexpression when constructing the predicate, by executing the subexpression on one randomly chosen XML processor—we use this XML processor also for predicate rectification and we subsequently refer to this XML processor as the designated XML processor. We use the value and data type of the current sub-expression in the following two ways: by (1) selecting function nodes of according data types and (2) supplying arguments to reference existent context and triggering corner cases. Specifically, we select a random function node from functions that could accept the value of the current data type as input. A function node can either represent a function or an operation. In the example, Author is a node sequence and count is a randomly selected function from functions that accept node sequence as input. For function nodes that require additional arguments, we supply arguments while taking the current result value into consideration. As an example, when selecting attribute values from node sequences, we use name tests referencing existent attributes. For the = operator, we choose an operand that is equal to the current value with a high probability of triggering the equal case which is of low probability under random generation. Aside from constants, we also set the possibility for operands to be other predicate trees. Through this, we support the generation of expressions with multiple subject occurrences. Besides boolean predicates, we also apply positional predicates to the XPath expression randomly.


Predicate rectification. Lastly, we rectify the generated predicate to guarantee that the targeted node is contained in the final result set. We first execute the generated predicate on the designated XML processor. If the result set misses the targeted node, we rectify the predicate. To negate the predicate’s result, we can always apply a not operator. However, as shown in Algorithm 1, we probabilistically apply more specific rectification for certain operators to uncover additional potential bugs. For logical operators such as and, both child expressions need to be modified to evaluate to true to contain the targeted node, while or needs only modification of one random child expression. For comparison operators, such as <=, we replace them with their opposite operators, which, in the example, is >. Thus, the targeted node is guaranteed to be contained in the result set.


This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Shuxin Li, Southern University of Science and Technology China and Work done during an internship at the National University of Singapore ([email protected]);

(2) Manuel Rigger, National University of Singapore Singapore ([email protected]).