Robust Machine Learning

Mechanistic Interpretability

In a typical psychophysics setup, we show participants two sets of feature visualizations for a given unit in a deep neural network, one set maximally activating and one set minimally activating that unit, and then ask them to decide which of two query images strongly activates the unit.
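To make the trial structure concrete, here is a minimal sketch of how one such two-alternative trial might be assembled and scored. It rests on several assumptions that are not stated above: the reference sets are taken from images ranked by a unit's recorded activation (rather than synthesized visualizations), the two queries are the next most extreme images, and the names `build_trial` and `accuracy` are hypothetical helpers invented for illustration.

```python
import random


def build_trial(activations, k=3, rng=None):
    """Assemble one two-alternative forced-choice trial for a single unit.

    `activations` maps an image id to that unit's activation
    (hypothetical data; the real stimuli may be generated differently).
    """
    rng = rng or random.Random(0)
    if len(activations) < 2 * (k + 1):
        raise ValueError("need at least 2 * (k + 1) images")

    # Rank image ids from least to most activating.
    ranked = sorted(activations, key=activations.get)

    min_refs = ranked[:k]    # minimally activating reference set
    max_refs = ranked[-k:]   # maximally activating reference set

    # Assumption: queries are the next most extreme images,
    # so they never overlap with the references.
    low_query = ranked[k]
    high_query = ranked[-k - 1]

    # Randomize presentation order so position carries no information.
    queries = [low_query, high_query]
    rng.shuffle(queries)

    return {
        "min_refs": min_refs,
        "max_refs": max_refs,
        "queries": queries,
        "correct": queries.index(high_query),  # index of the strongly activating query
    }


def accuracy(trials, responses):
    """Fraction of trials where the participant picked the correct query."""
    hits = sum(t["correct"] == r for t, r in zip(trials, responses))
    return hits / len(trials)
```

With two options per trial, chance performance is 50%, so accuracy above that level suggests the visualizations helped participants anticipate the unit's behavior.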

Projects