Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability and quality. We first study the incremental problems for graph queries, which aim to compute the changes to the old query answer, in response to the updates to the input graph. The incremental problem is called bounded if its cost is decided by the sizes of the query and the changes only. No matter how desirable, however, our first results are negative: for common graph queries such as graph traversal, connectivity, keyword search and pattern matching, their incremental problems are unbounded. In light of the negative results, we propose two new characterizations for the effectiveness of incremental computation, and show that the incremental computations above can still be effectively conducted, by either reducing the computations on big graphs to small data, or incrementalizing batch algorithms by minimizing unnecessary recomputation. We next study the problems with regards to improving the quality of the graphs. To uniquely identify entities represented by vertices in a graph, we propose a class of keys that are recursively defined in terms of graph patterns, and are interpreted with subgraph isomorphism. As an application, we study the entity matching problem, which is to find all pairs of entities in a graph that are identified by a given set of keys. Although the problem is proved to be intractable, and cannot be parallelized in logarithmic rounds, we provide two parallel scalable algorithms for it. In addition, to catch numeric inconsistencies in real-life graphs, we extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity, since if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. A localizable incremental algorithm is developed to detect errors using NGDs, where the cost is determined by small neighbors of nodes in the updates instead of the entire graph. Finally, a rule-based method to clean graphs is proposed. We extend graph entity dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of ground truth, we fix violations of GEDs in the graph by combining data repairing and object identification. The method finds certain fixes to errors detected by GEDs, i.e., as long as the GEDs and the ground truth are correct, the fixes are assured correct as their logical consequences. Several fundamental results underlying the method are established, and an algorithm is developed to implement the method. We also parallelize the method and guarantee to reduce its running time with the increase of processors.
Showing items related by title, author, creator and subject.
Wu, Yinghui (The University of Edinburgh, 2011-06-30)Among the vital problems in a variety of emerging applications is the graph matching problem, which is to determine whether two graphs are similar, and if so, find all the valid matches in one graph for the other, based ...
Peternek, Fabian Hans Adolf (The University of Edinburgh, 2018-07-02)This thesis presents work done on compressed graph representations via hyperedge replacement grammars. It comprises two main parts. Firstly the RePair compression scheme, known for strings and trees, is generalized to ...
Basile, Raffaele (The University of Edinburgh, 2015-07-01)The chemical master equation (CME) represents the accepted stochastic description of chemical reaction kinetics in mesoscopic systems. As its exact solution – which gives the corresponding probability density function – ...