Exploring a striped XML world
EXtensible Markup Language, XML, was designed as a markup language for structuring, storing and transporting data on the World Wide Web. The focus of XML is on data content; arbitrary markup is used to describe data. This versatile, self-describing data representation has established XML as the universal data format and the de facto standard for information exchange on the Web. This has gradually given rise to the need for efficient storage and querying of large XML repositories. To that end, we propose a new model for building a native XML store which is based on a generalisation of vertical decomposition. Nodes of a document satisfying the same label-path, are extracted and stored together in a single container, a Stripe. Stripes make use of a labelling scheme allowing us to maintain full structural information. Over this new representation, we introduce various evaluation techniques, which allow us to handle a large fragment of XPath 2.0. We also focus on the optimisation opportunities that arise from our decomposition model during any query evaluation phase. During query validation, we present an input minimisation process that exploits the proposed model for identifying input that is only relevant to the given query, in terms of Stripes. We also define query equivalence rules for query rewriting over our proposed model. Finally, during query optimisation, we deal with whether and under which circumstances certain evaluation algorithms can be replaced by others having lower I/O and/or CPU cost. We propose three storage schemes under our general decomposition technique. The schemes differ in the compression method imposed on the structural part of the XML document. The first storage scheme imposes no compression. The second storage scheme exploits structural regularities of the document to minimise storage and, thus, I/O cost during query evaluation. Finally, the third storage scheme performs structureagnostic compression of the document structure which results in minimised storage, regardless the actual XML structure. We experiment on XML repositories of varying size, recursion and structural regularity. We consider query input size, execution plan size and query response time as metrics for our experimental results. We process query workloads by applying each of the proposed optimisations in isolation and then all of their combinations. In addition, we apply the same execution pipeline for all proposed storage schemes. As a reference to our proposed query evaluation pipeline, we use the current state-of-the-art system for XML query processing. Our results demonstrate that: • Our proposed data model provides the infrastructure for efficiently selecting the parts of the document that are relevant to a given query. • The application of query rewriting, combined with input minimisation, reduces query input size as well as the number of physical operators used. In addition, when evaluation algorithms are specialised to the decomposition method, query response time is further reduced. • Query evaluation performance is largely affected by the storage schemes, which are closely related to the structural properties of the data. The achieved compression ratio greatly affects storage size and therefore, query response times.