Mechanistic Explanation in Deep Learning (Millière)

Raphaël Millière, Philosophy, Macquarie University, 14 September 2024



Abstract: Deep neural networks such as large language models (LLMs) have achieved impressive performance across almost every domain of natural language processing, but there remains substantial debate about which cognitive capabilities can be ascribed to these models. Drawing inspiration from mechanistic explanations in the life sciences, the nascent field of “mechanistic interpretability” seeks to reverse-engineer human-interpretable features to explain how LLMs process information. This raises two questions: (1) Are causal claims about neural network components, based on coarse intervention methods (such as “activation patching”), genuine mechanistic explanations? (2) Does the focus on human-interpretable features risk imposing anthropomorphic assumptions? My answer will be “yes” to (1) and “no” to (2), closing with a discussion of some ongoing challenges.

Raphaël Millière is Lecturer in Philosophy of Artificial Intelligence at Macquarie University in Sydney, Australia. His interests lie in the philosophy of artificial intelligence, cognitive science, and mind, particularly in understanding artificial neural networks based on deep learning architectures such as large language models. He has investigated syntactic knowledge, semantic competence, compositionality, variable binding, and grounding.

Elhage, N., et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.

Machamer, P., Darden, L., & Craver, C. F. (2000). Thinking about mechanisms. Philosophy of Science, 67(1), 1–25.

Millière, R. (2023). The alignment problem in context. arXiv preprint arXiv:2311.02147.

Mollo, D. C., & Millière, R. (2023). The vector grounding problem. arXiv preprint arXiv:2304.01481.

Yousefi, S., et al. (2023). In-context learning in large language models: A neuroscience-inspired analysis of representations. arXiv preprint arXiv:2310.00313.
