Mechanistic Explanation in Deep Learning (Millière)

Raphaël Millière, Philosophy, Macquarie University, 14 September 2024



Abstract: Deep neural networks such as large language models (LLMs) have achieved impressive performance across almost every domain of natural language processing, but there remains substantial debate about which cognitive capabilities can be ascribed to these models. Drawing inspiration from mechanistic explanations in the life sciences, the nascent field of “mechanistic interpretability” seeks to reverse-engineer human-interpretable features to explain how LLMs process information. This raises two questions: (1) Are causal claims about neural network components, based on coarse intervention methods (such as “activation patching”), genuine mechanistic explanations? (2) Does the focus on human-interpretable features risk imposing anthropomorphic assumptions? My answer will be “yes” to (1) and “no” to (2), closing with a discussion of some ongoing challenges.

Raphaël Millière is Lecturer in Philosophy of Artificial Intelligence at Macquarie University in Sydney, Australia. His interests lie in the philosophy of artificial intelligence, cognitive science, and mind, particularly in understanding artificial neural networks based on deep learning architectures, such as large language models. He has investigated syntactic knowledge, semantic competence, compositionality, variable binding, and grounding.

Elhage, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.

Machamer, P., Darden, L., & Craver, C. F. (2000). Thinking about Mechanisms. Philosophy of Science, 67(1), 1–25.

Millière, R. (2023). The Alignment Problem in Context. arXiv preprint arXiv:2311.02147.

Mollo, D. C., & Millière, R. (2023). The Vector Grounding Problem. arXiv preprint arXiv:2304.01481.

Yousefi, S., et al. (2023). In-Context Learning in Large Language Models: A Neuroscience-inspired Analysis of Representations. arXiv preprint arXiv:2310.00313.

LLMs: Indication or Representation? (Søgaard)

Anders Søgaard, Computer Science & Philosophy, University of Copenhagen, 23 November 2023


Abstract: People talk to LLMs – their new assistants, tutors, or partners – about the world they live in, but are LLMs parroting, or do they (also) have internal representations of the world? There are five popular views, it seems:

  • (i) LLMs are all syntax, no semantics. 
  • (ii) LLMs have inferential semantics, no referential semantics. 
  • (iii) LLMs (also) have referential semantics through picturing. 
  • (iv) LLMs (also) have referential semantics through causal chains. 
  • (v) Only chatbots have referential semantics (through causal chains). 

I present three sets of experiments suggesting that LLMs induce both inferential and referential semantics, and that they do so by inducing human-like representations, lending some support to view (iii). I briefly compare the representations that seem to fall out of these experiments with the representations to which others have appealed in the past. 

Anders Søgaard is University Professor of Computer Science and Philosophy and leads the newly established Center for Philosophy of Artificial Intelligence at the University of Copenhagen. Known primarily for work on multilingual NLP, multi-task learning, and using cognitive and behavioral data to bias NLP models, Søgaard is an ERC Starting Grant and Google Focused Research Award recipient and the author of Semi-Supervised Learning and Domain Adaptation for NLP (2013), Cross-Lingual Word Embeddings (2019), and Explainable Natural Language Processing (2021). 

Søgaard, A. (2023). Grounding the Vector Space of an Octopus. Minds and Machines, 33, 33–54.

Li, J., et al. (2023). Large Language Models Converge on Brain-Like Representations. arXiv preprint arXiv:2306.01930.

Abdou, M., et al. (2021). Can Language Models Encode Perceptual Structure Without Grounding? CoNLL.

Garneau, N., et al. (2021). Analogy Training Multilingual Encoders. AAAI.