A Vector Symbolic Algebra for Language Models


Description: Sparse autoencoders have been shown to find meaningful interpretations of internal features in Language Models (LMs) [1]. Specifically, Cunningham et al. used sparse autoencoders to reconstruct internal representations, i.e. neuron activations, of an LM, isolating the features that cause counterfactual behaviour on a simple NLP task. A key challenge is the polysemanticity of neurons: a single neuron can activate in multiple, semantically different contexts and therefore cannot easily be identified as belonging to a specific feature. Polysemanticity is believed to be caused by superposition of neuron activations, which will be analysed in detail in this project.
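To make the sparse autoencoder setup concrete, a minimal sketch is shown below (PyTorch). The dimensions, variable names, and loss weighting are illustrative assumptions, not the exact configuration used in [1]:

```python
# Minimal sparse autoencoder sketch for LM activations (PyTorch).
# Hyperparameters and names are illustrative assumptions, not the setup of [1].
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> feature space
        self.decoder = nn.Linear(d_hidden, d_model)   # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstructed LM activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return torch.mean((x - x_hat) ** 2) + l1_coeff * torch.mean(f.abs().sum(dim=-1))
```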

Vector Symbolic Algebras (VSAs), which can similarly be considered generative models [2], use superposition deliberately to represent complex concepts. More specifically, a VSA uses hyper-dimensional vectors, called symbols, to represent single concepts. Operations on these symbols are defined as superposition (a form of addition) and binding (a form of multiplication); together they form a full algebra that can represent complex concepts by composition of symbols. For example, a screen with a red triangle and a blue square can be represented as RED * TRIANGLE + BLUE * SQUARE. VSAs have desirable properties, such as near-lossless decoding of the symbols that form complex concepts, as well as vector sparsity and fixed dimensionality.
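As an illustration, the sketch below implements a simple VSA with random bipolar vectors, where binding is elementwise multiplication and superposition is addition; the concrete algebra in [2] differs, so this only demonstrates the encode/decode idea for the red-triangle / blue-square example:

```python
# Minimal VSA sketch with random bipolar vectors (a MAP-style algebra).
# Binding = elementwise multiplication (self-inverse), superposition = addition.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hyper-dimensional vectors

def symbol():
    return rng.choice([-1.0, 1.0], size=D)

codebook = {name: symbol() for name in ["RED", "BLUE", "TRIANGLE", "SQUARE"]}

# "A red triangle and a blue square":
scene = codebook["RED"] * codebook["TRIANGLE"] + codebook["BLUE"] * codebook["SQUARE"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Decode: unbind with RED (RED * RED = 1 elementwise) and clean up via the codebook.
query = scene * codebook["RED"]  # approximately TRIANGLE + noise
best = max(codebook, key=lambda name: cosine(query, codebook[name]))
print(best)  # -> "TRIANGLE" with high probability for large D
```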

In this project, you will analyse the similarities and differences between the two representations, i.e. align the VSA representation with the neuron activation patterns found in LMs on a simple natural language task called indirect object identification (IOI) [3].

You can follow these steps:

  • Select a suitable NLP task, e.g. IOI [3]: "When Matteo and Anna went to the store, Matteo gave a drink to ___?"
  • Analyse the features of an LM (Pythia-70m) obtained via sparse autoencoders [1]
  • Explicitly model the task with a VSA [2]
  • Analyse and align both representations (a minimal sketch of steps 2 and 4 follows this list)
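As a starting point for steps 2 and 4, the sketch below extracts Pythia-70m hidden states for the IOI prompt via the standard Hugging Face API and compares vectors by cosine similarity. The SAE feature direction and the projected VSA symbol are placeholders: in the project they would come from the trained sparse autoencoder and the VSA model.

```python
# Sketch: get Pythia-70m activations for an IOI prompt and compare candidate
# feature directions via cosine similarity. The two "directions" at the end
# are random placeholders standing in for SAE and VSA outputs.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "When Matteo and Anna went to the store, Matteo gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Residual-stream activation of the last token at some layer (here: layer 3).
layer = 3
h = out.hidden_states[layer][0, -1].numpy()  # shape (d_model,), 512 for Pythia-70m

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholders for an SAE feature direction and a VSA symbol projected to d_model.
sae_feature_direction = np.random.default_rng(0).standard_normal(h.shape)
vsa_symbol_projected = np.random.default_rng(1).standard_normal(h.shape)

print("activation vs. SAE feature:", cosine(h, sae_feature_direction))
print("SAE feature vs. VSA symbol:", cosine(sae_feature_direction, vsa_symbol_projected))
```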

Goal: Analyse the similarities between features found in LMs via sparse autoencoders and the representations of a VSA

Supervisors: Matteo Bortoletto and Anna Penzkofer

Distribution: 20% literature review, 40% implementation, 40% analysis

Requirements: Programming proficiency in Python, experience with deep learning, and interest in coming up with your own ideas.

Literature: [1] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse Autoencoders Find Highly Interpretable Features in Language Models, Oct. 2023.

[2] P. M. Furlong and C. Eliasmith. Bridging Cognitive Architectures and Generative Models with Vector Symbolic Algebras. Proceedings of the AAAI Symposium Series, 2(1):262–271, 2023. doi: 10.1609/aaaiss.v2i1.27686.

[3] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, Nov. 2022.