Towards an interpretable understanding of MLP behaviour in applied machine learning

Understanding physical features in toy precipitation models using sparse autoencoders, maintained by Fraser King

Overview

Building on insights from Toy Models of Superposition, we conduct a top-down analysis of commonly used single-hidden-layer physical models for precipitation, aiming to identify what physical features (if any) these models learn. While we do not claim these findings are fully interpretable, we highlight this technique as a valuable step toward understanding the internal decision-making processes of such models and pinpointing their strengths, weaknesses, and uncertainties for broader application across the geosciences. By clarifying these processes, we can also bolster trust in applied machine learning systems, an increasingly critical goal as these models are used to inform high-stakes decisions. Overall, this work examines the core structure of neural networks, their sensitivities to observation-based inputs and initial conditions, and the ways in which they self-organize to address complex predictive tasks.

(Figure: project overview)
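For readers unfamiliar with the setup, the sketch below shows the kind of single-hidden-layer MLP whose activations we analyze. It is a minimal PyTorch illustration only; the layer sizes, variable names, and ReLU choice are assumptions for clarity, not the repository's actual configuration.

```python
# Minimal sketch of a single-hidden-layer precipitation MLP (PyTorch assumed;
# input/hidden sizes are illustrative, not the repository's configuration).
import torch
import torch.nn as nn

class ToyPrecipMLP(nn.Module):
    def __init__(self, n_inputs: int = 8, n_hidden: int = 32, n_outputs: int = 1):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.act = nn.ReLU()
        self.out = nn.Linear(n_hidden, n_outputs)

    def forward(self, x: torch.Tensor, return_activations: bool = False):
        h = self.act(self.hidden(x))  # hidden-layer activations probed by the SAE step below
        y = self.out(h)
        return (y, h) if return_activations else y
```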

The overarching structure of our approach, adapted from the work of Nelson Elhage and Christopher Olah, is illustrated below. In this model-agnostic framework, we first train sparse autoencoders on MLP activations, and then examine the bottleneck layer activations to uncover potential monosemantic physical features. These features may otherwise be obscured by the polysemantic nature of the default MLP activations—an effect commonly attributed to network superposition. We anticipate this approach could generalize well to more complex, multi-layer networks in future work.

(Figure: schematic of the analysis framework)
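As a concrete, deliberately simplified illustration of this step, the sketch below fits a sparse autoencoder to hidden activations from a toy MLP like the one above and exposes the encoded features for inspection. The L1 sparsity penalty, layer widths, optimizer, and training loop are assumptions for illustration, not the repository's implementation.

```python
# Minimal sketch: fit a sparse autoencoder to MLP hidden activations and read
# off the encoded (bottleneck) features. PyTorch and an L1 penalty are assumed.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_acts: int = 32, n_features: int = 64):
        super().__init__()
        self.encoder = nn.Linear(n_acts, n_features)
        self.decoder = nn.Linear(n_features, n_acts)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))  # candidate monosemantic feature activations
        h_hat = self.decoder(f)          # reconstruction of the MLP activations
        return h_hat, f

def train_sae(sae: SparseAutoencoder, acts: torch.Tensor,
              l1_coeff: float = 1e-3, epochs: int = 200, lr: float = 1e-3):
    """Minimize reconstruction error plus an L1 sparsity penalty on the features."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        h_hat, f = sae(acts)
        loss = ((h_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Hypothetical usage: collect activations from a trained MLP, fit the SAE,
# then inspect which sparse features fire for which physical inputs.
# _, h = mlp(x_batch, return_activations=True)
# sae = train_sae(SparseAutoencoder(n_acts=h.shape[1]), h.detach())
```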

In this repository, we provide a selection of Jupyter notebooks and processing scripts that can be used to reproduce our results or be adapted to new problems.

Installation

git clone https://github.com/frasertheking/TowardsInterpretability.git
conda env create -f env.yml
conda activate circuits

Google Colab Examples

Classifier: Open In Colab

Regressor: Open In Colab

NN Inspector

If you'd like to experiment with these models in real time, check out our work-in-progress interactive neural network inspector application repository.

(Figure: training)

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Note that, as a living project, the code is not as clean as it could (or should) be, and unit tests will need to be added in future iterations to maintain stability.

Authors & Contact

  • Fraser King, University of Michigan (kingfr@umich.edu)
  • Claire Pettersen, University of Michigan
  • Derek Posselt, NASA Jet Propulsion Laboratory
  • Sarah Ringerud, NASA Goddard Space Flight Center
  • Yan Xie, University of Michigan

Funding

This project was primarily funded by a NASA New (Early Career) Investigator Program (NIP) grant at the University of Michigan.
