Machine Learning for Scientific Environments
When machine learning is applied in other scientific disciplines, it is of utmost importance to be able to distinguish "random artifacts" from "true structure". We work on such theoretical guarantees, for example in the field of network analysis.
Two-sample tests for random networks. A key question in this field is whether two different networks are likely to have been generated by the same underlying process or not. This question is often formulated in terms of statistical hypothesis testing. Surprisingly, there has been little research on the theoretical aspects of network testing, and it is often not clear whether one can meaningfully test large networks by observing only a few replicates, which is the typical setting in applications of network analysis. In this project, we focus on the problem of two-sample hypothesis testing of networks and provide a clear understanding of the statistical complexity of testing small populations of networks, each defined on a large number of nodes. We formally define testing problems for different network models in the small sample, large graph regime, study different classes of tests, and derive minimax-optimal hypothesis tests [].
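To make the setting concrete, here is a minimal formalization of the two-sample problem in this regime; the notation and the choice of separation criterion are ours for illustration, and the papers consider several variants.

```latex
% A minimal formalization (notation ours): two populations of graphs on a
% common vertex set of n nodes, modelled as independent adjacency matrices
% drawn from inhomogeneous Erdos-Renyi models with edge-probability
% matrices P and Q.
\[
  A_1, \dots, A_m \sim P, \qquad B_1, \dots, B_m \sim Q,
\]
\[
  \text{test } H_0 : P = Q
  \quad \text{against} \quad
  H_1 : \delta(P, Q) \ge \varepsilon_n,
\]
% where delta is a matrix distance such as the Frobenius or operator norm.
% The minimax question: how fast may the separation eps_n shrink with n
% while the number of replicates m stays fixed and small (even m = 1 or 2),
% and which tests achieve this rate?
```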
We put theory into practice and develop practical two-sample tests based on asymptotic distributions []. These tests are theoretically consistent in the small sample regime and exhibit good empirical performance on moderately sized networks. We also study the more general problem of comparing two large graphs defined on different vertex sets. Using such testing principles, we provide a formal framework and demonstrate that this testing problem can be ill-posed: commonly used network statistics may fail to distinguish even highly dissimilar networks.
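As a rough illustration of such an asymptotics-based test, the sketch below standardizes the entrywise difference of the two samples and computes a spectral statistic whose null fluctuations are of Tracy-Widom type. The scaling constants and the smoothing term `eps` are our illustrative choices, not the exact procedure of the papers.

```python
import numpy as np

def spectral_two_sample_statistic(As, Bs, eps=1e-8):
    """As, Bs: lists of symmetric n x n 0/1 adjacency matrices, m per sample.
    Returns a rescaled largest-eigenvalue statistic that concentrates under
    H0 (equal edge-probability matrices) and grows under the alternative."""
    D = sum(As) - sum(Bs)          # entrywise difference of edge counts
    S = sum(As) + sum(Bs)          # entrywise variance proxy under H0
    C = D / np.sqrt(S + eps)       # standardized difference matrix
    n = C.shape[0]
    lam = np.max(np.abs(np.linalg.eigvalsh(C)))
    # Under H0, C behaves like a Wigner-type matrix, so lam is close to
    # 2 * sqrt(n), with fluctuations of Tracy-Widom type on the scale n^(-1/6).
    return n ** (2.0 / 3.0) * (lam / np.sqrt(n) - 2.0)
```

One would reject the null hypothesis when this statistic exceeds an upper quantile of the Tracy-Widom distribution, or a quantile simulated from the null.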
Inductive bias of graph algorithms. In applied scientific domains such as neuroscience or climate science, researchers often use a network sampling approach to test hypotheses about their networks. While most classical network sampling models are too simplistic to reproduce real-world graphs, a GAN-based approach, NetGAN, recently emerged as an attractive alternative: by training a GAN to learn the random walk distribution of the input graph, the algorithm reproduces a large number of important network patterns simultaneously, without explicitly specifying any of them. In one of our recent papers [], we investigate the implicit bias of NetGAN. We find that the root of its generalization properties does not lie in the GAN architecture, but in an inconspicuous low-rank approximation of the logits of the random walk transition matrix. Step by step, we strip NetGAN of all unnecessary parts, including the GAN, and obtain a highly simplified reformulation that achieves comparable generalization results, but is orders of magnitude faster and easier to adapt. Because the reformulation is conceptually much simpler, it reveals the implicit inductive bias of the algorithm, an important step towards increasing the interpretability, transparency and acceptance of machine learning systems.
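The following sketch illustrates this reformulation in spirit: compute the random walk transition matrix of the input graph, approximate its logits by a truncated SVD, map the low-rank logits back to a stochastic matrix, and sample a new graph from the resulting edge scores. The smoothing constant, the rank, and the sampling scheme are our illustrative choices, not the paper's exact training procedure.

```python
import numpy as np

def low_rank_logit_scores(A, rank=16, smooth=1e-3):
    """A: binary n x n adjacency matrix of an undirected graph.
    Returns an n x n matrix of symmetric edge scores."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    T = (A + smooth) / (deg + n * smooth)   # smoothed random walk transitions
    L = np.log(T)                           # logits of the transition matrix
    U, s, Vt = np.linalg.svd(L)             # low-rank approximation of logits
    L_k = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Row-wise softmax maps the low-rank logits back to a stochastic matrix.
    P = np.exp(L_k - L_k.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return (P + P.T) / 2                    # symmetrized edge scores

def sample_graph(scores, num_edges):
    """Sample an undirected graph with num_edges edges from the scores."""
    n = scores.shape[0]
    iu = np.triu_indices(n, k=1)
    p = scores[iu] / scores[iu].sum()
    chosen = np.random.choice(len(p), size=num_edges, replace=False, p=p)
    A = np.zeros((n, n), dtype=int)
    A[iu[0][chosen], iu[1][chosen]] = 1
    return A + A.T
```

In this simplified view, the rank of the factorization is the knob that controls how much the generated graphs generalize beyond the edges of the input graph.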