Tatoo : the path forward

TYPEX meeting, Grenoble April 14-16 2014

XPath

W3C query (sub-)language. Allows one to denote a set of nodes in a document


   /descendant::M/ancestor::A[ descendant::P  or parent::Q ]/child::*

A path is made of several steps steps
A step is made of :
- An axis amongst : self, child, parent, descendant, ancestor, descendant-or-self, ancestor-or-self, following-sibling, preceding-sibling, …
- A node test : label name, *, text(), node(), …
- An optional filter : an XPath expression whose truth value determine is the node is kept

Filters (or predicates)

An arbitrary expression whose value is cast into a Boolean (true keeps the selected node, false discards it)

Example :


   //A[ child::B and .//* = count(/descendant::X) mod 12 ]

* is an alias for child::*
// is an alias for /descendant-or-self::node()/
. is an alias for self::node()

BTW: XPath semantics is crazy rich. In the example we want all A nodes (ok) that have a B child (ok) and that have a descendant (hum… ) for which the (string) concatenation of its text descendants (?) is equal to the number of X elements in the document mod 12 (whaaat ?)

SXSI

Succint XML Self-Index: (limited) XPath query engine developped during my postdoc at Nicta. It uses:

Self-Index developped by Mäkinen, Sirén, Valimaki : stores a text T of N characters (in an alphabet of size σ) in N·H_k(T) + o(N·log σ) bits.
Can answer queries like « first position after i that matches a constant string s » in O(|s|·log σ)
Succinct tree index of Arroyuello, Claude, Navarro : stores a tree of size M with labels in an alphabet of size θ in 4M(2 + log θ) + o(M) bits.
Full navigation + jumping from a node to the next one with a given label (and also LCA of two nodes) in O(1).
Non det. alternating tree-automata of Maneth and myself, on the fly determinization and approximation of the sets of relevant nodes (worst case O(M·|A|) to evaluate an automaton of size A).

Can answer downward (no ancestor, parent, …) XPath with text expressions of the form (f(., constant), with f ∈ { contains, starts-with, ends-with, =, <, > , <=, >= }). Translation from XPath to automata is linear in size

Pros and cons of SXSI

Pros

For the considered subset: much faster than state of the art optimized XML DB
Very small space overhead
Theoreticall sound formalism, proven worst-case bounds

Cons

The subset is ~~too restricted~~ (no upward axes, no arbitrary joins)
Encoding was ~~wrong~~ in subtle ways (e.g. in some corner cases with double negations in filters)

Tree Automata TOOlkit

Ré-implmentation of the automata part of SXSI

Supports full navigational XPath (upward axis)
Retain polynomial (combined complexity) running time, i.e. O(|Doc|·|Q|) for an XPath query Q.

How does it work ?

Sample document: alphabet.xml