Towards effective analysis of big graphs: from scalability to quality
Abstract
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study the incremental problems for graph queries, which aim to compute
the changes to the old query answer, in response to the updates to the input graph.
The incremental problem is called bounded if its cost is decided by the sizes of the
query and the changes only. No matter how desirable, however, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
We next study the problems with regards to improving the quality of the graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved to be intractable, and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in reallife graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow nonlinear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighbors of nodes in the updates instead of the entire graph.
Finally, a rulebased method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are assured correct as
their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize
the method and guarantee to reduce its running time with the increase of processors.
Collections
Related items
Showing items related by title, author, creator and subject.

Extending graph homomorphism and simulation for real life graph matching
Wu, Yinghui (The University of Edinburgh, 20110630)Among the vital problems in a variety of emerging applications is the graph matching problem, which is to determine whether two graphs are similar, and if so, find all the valid matches in one graph for the other, based ... 
Graph compression using graph grammars
Peternek, Fabian Hans Adolf (The University of Edinburgh, 20180702)This thesis presents work done on compressed graph representations via hyperedge replacement grammars. It comprises two main parts. Firstly the RePair compression scheme, known for strings and trees, is generalized to ... 
Graphbased approach for the approximate solution of the chemical master equation
Basile, Raffaele (The University of Edinburgh, 20150701)The chemical master equation (CME) represents the accepted stochastic description of chemical reaction kinetics in mesoscopic systems. As its exact solution – which gives the corresponding probability density function – ...