Lectures & Readings

The course is divided into two main parts. The first part consists of instructor-led lectures to establish foundational concepts. The second part is a student-led seminar where we will critically analyze key research papers.


Part I: Foundations (Instructor-Led)

Week 1: Introduction & Transformer Basics (Completed)

  • Topics: Course administration, autoregressive probabilistic models, and the basics of the Transformer architecture. A short illustrative sketch of autoregressive generation follows the readings below.
  • Readings:
    • “Attention Is All You Need” (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, & Polosukhin, 2017) [Link]
    • “Formal Algorithms for Transformers” (Phuong and Hutter, 2022) [Link]
    • “An Overview of Large Language Models for Statisticians” (Ji et al., 2025) [Link]
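
To make the autoregressive modeling topic concrete, here is a minimal sketch (my own illustration, not code from the readings): an autoregressive model factorizes p(x_1, ..., x_T) as the product over t of p(x_t | x_<t) and generates by repeatedly sampling the next token. The next_token_probs function below is a hypothetical stand-in for a trained Transformer.

```python
# Minimal sketch of autoregressive generation. The "model" is a random stand-in
# for a trained network (it ignores the prefix); only the sampling loop is the point.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]

def next_token_probs(prefix):
    """Hypothetical stand-in for a trained model: returns p(x_t | x_<t)."""
    logits = rng.normal(size=len(VOCAB))     # pretend these are the network's logits
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                   # softmax over the vocabulary

def generate(max_len=10):
    tokens = ["<bos>"]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tokens.append(VOCAB[rng.choice(len(VOCAB), p=probs)])  # sample x_t ~ p(. | x_<t)
        if tokens[-1] == "<eos>":
            break
    return tokens

print(generate())
```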

Week 2: A Language for Transformers: RASP & RASP-L

  • Topics: A detailed look at the Transformer architecture and its training, including training with RL to improve reasoning; Modeling Transformer computation with the RASP language; The RASP-L conjecture for predicting length generalization. A toy RASP-style sketch follows the readings below.
  • Readings:
    • “Thinking Like Transformers” (Weiss, Goldberg, & Yahav, 2021) [Link]
    • “What Algorithms can Transformers Learn? A Study in Length Generalization” (Zhou et al., 2024) [Link]
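
As a pointer into Weiss et al.'s "Thinking Like Transformers", the toy Python emulation below approximates RASP's two core primitives, select and aggregate (my own sketch, not the actual RASP interpreter): select builds an attention-like boolean selector from a pairwise predicate, and aggregate averages the selected values. The example reproduces the classic reverse program.

```python
# Toy emulation (an approximation, not the real RASP) of select/aggregate.
def select(keys, queries, predicate):
    # selector[i][j] is True iff predicate(keys[j], queries[i]) holds
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    # average the values whose keys each row selects (0 if nothing is selected)
    out = []
    for row in selector:
        picked = [v for v, keep in zip(values, row) if keep]
        out.append(sum(picked) / len(picked) if picked else 0)
    return out

# The classic RASP "reverse" program: position i attends to position n - 1 - i.
tokens = [3, 1, 4, 1, 5]
indices = list(range(len(tokens)))
flip = select(indices, [len(tokens) - 1 - i for i in indices], lambda k, q: k == q)
print(aggregate(flip, tokens))   # [5.0, 1.0, 4.0, 1.0, 3.0] -- the tokens reversed
```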

Week 3: Computational Limits of Transformers

  • Topics: Computational universality of Transformers with scaffolding; The Transformer as a circuit and its relation to the complexity class TC0. A small scratchpad sketch follows the readings below.
  • Readings:
    • “Memory Augmented Large Language Models are Computationally Universal” (Schuurmans, 2023) [Link]
    • “Autoregressive Large Language Models are Computationally Universal” (Schuurmans, Dai, & Zanini, 2024) [Link]
    • “Saturated Transformers are Constant-Depth Threshold Circuits” (Merrill, Sabharwal, & Smith, 2022) [Link]
    • “What Formal Languages Can Transformers Express? A Survey” (Strobl, Merrill, Weiss, Chiang, & Angluin, 2024) [Link]
    • “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems” (Li, Liu, Zhou, & Ma, 2024) [Link]
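
The last two readings above concern how intermediate tokens add serial computation to a fixed-depth model. The sketch below is my own illustration, not a construction from those papers: a single fixed-size step, iterated over a growing scratchpad that feeds each output back in as input, computes parity step by step, the kind of iterated update analyzed in the chain-of-thought literature.

```python
# Illustrative scratchpad/chain-of-thought loop: one fixed-size "step" applied
# repeatedly, with each step reading the previous output it generated.
def one_step(carry, bit):
    """Stand-in for a single forward pass: update a running parity bit."""
    return carry ^ bit

def parity_with_scratchpad(bits):
    scratchpad = [0]                                    # the "generated tokens" so far
    for b in bits:
        scratchpad.append(one_step(scratchpad[-1], b))  # feed prior output back in
    return scratchpad[-1], scratchpad

result, trace = parity_with_scratchpad([1, 0, 1, 1])
print(result, trace)   # 1 [0, 1, 1, 0, 1]
```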

Week 4: Language Learnability

  • Topics: Formal learnability of languages and its connection to natural language acquisition. A toy identification-in-the-limit sketch follows the readings below.
  • Readings:
    • “Learning theory and natural language” (Osherson, Stob, & Weinstein, 1984) [Link]
    • “Density Measures for Language Generation” (Kleinberg & Wei, 2025) [Link]
    • “Language Generation in the Limit” (Kleinberg & Mullainathan, 2024) [Link]
    • “Generation through the lens of learning theory” (Li, Raman, & Tewari, 2024) [Link]
    • “On Union-Closedness of Language Generation” (Hanneke, Karbasi, Mehrotra, & Velegkas, 2025) [Link]
    • “Language Generation in the Limit: Noise, Loss, and Feedback” (Bai, Panigrahi, & Zhang, 2025) [Link]
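
For orientation on the Gold-style setting behind several of these papers, here is a toy sketch (my own example; the hypothesis class and learner are invented for illustration): a learner that always outputs the first hypothesis in a fixed enumeration consistent with the text seen so far changes its mind once on this tiny class and then stabilizes on the target language, which is the identification-in-the-limit pattern.

```python
# Toy learning-by-enumeration over an invented, finite class of languages over {a, b}.
HYPOTHESES = {
    "{ab}":  lambda s: s == "ab",                    # only the single string "ab"
    "a*":    lambda s: set(s) <= {"a"},              # strings of a's (incl. empty)
    "(ab)*": lambda s: s == "ab" * (len(s) // 2),    # repetitions of "ab"
}

def learner(examples):
    """Guess the first hypothesis (in enumeration order) containing every example."""
    for name, member in HYPOTHESES.items():
        if all(member(x) for x in examples):
            return name
    return None

# A text for the target language (ab)*: the guess changes once, then stabilizes.
text = ["ab", "abab", "ababab"]
for t in range(1, len(text) + 1):
    print(text[:t], "->", learner(text[:t]))
# ['ab'] -> {ab}
# ['ab', 'abab'] -> (ab)*
# ['ab', 'abab', 'ababab'] -> (ab)*
```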

Week 5 (short): From Reasoning to Exact Learning

  • Topics: The fundamental misalignment between statistical learning and sound reasoning; Why exact learning is essential for general intelligence; A case study in learning algorithmic instructions exactly. A small numerical sketch follows the readings below.
  • Readings:
    • “Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence” (György, Lattimore, Lazić, & Szepesvári, 2025) [Link]
    • “Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks” (Back de Luca, Giapitzakis, & Fountoulakis, 2025) [Link]
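
To illustrate the gap the Week 5 topics point at, here is a small numerical sketch (my own example, not a construction from either paper): a "learned" adder with a rare systematic bug has near-zero error under random testing, yet it is not the exact algorithm, as an exhaustive check over this tiny domain shows.

```python
# Illustration: low statistical error does not imply exact correctness.
import itertools, random

def buggy_add(a, b):
    """Stand-in for a learned adder with a rare, systematic failure mode."""
    if (a & 0xF) == 0xF and (b & 0xF) == 0xF:    # fails on ~0.4% of uniform 8-bit pairs
        return a + b - 16
    return a + b

# Statistical check: estimate the error rate on random 8-bit inputs.
random.seed(0)
samples = [(random.randrange(256), random.randrange(256)) for _ in range(10_000)]
est_error = sum(buggy_add(a, b) != a + b for a, b in samples) / len(samples)

# Exact check: verify every 8-bit pair (feasible only because the domain is tiny;
# exact learning demands correctness on all inputs, not just typical ones).
exact = all(buggy_add(a, b) == a + b
            for a, b in itertools.product(range(256), repeat=2))

print(f"estimated error rate: {est_error:.4f}, exactly correct: {exact}")
```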

Part II: Student-Led Research Seminar Readings

Student pairs will select from the following papers for their presentations, beginning in Week 6.

  • Theme 1: Provable Reasoning & Algorithmic Solutions
    • “From Reasoning to Super-Intelligence: A Search-Theoretic Perspective” (Shalev-Shwartz & Shashua, 2025) [Link]
    • “The Expressive Power of Transformers with Chain of Thought” (Feng, Li, & Ma, 2024) [Link]
    • “Transformers Provably Solve Parity Efficiently with Chain of Thought” (Kim & Suzuki, 2024) [Link]
    • “Multi-Head Transformers Provably Learn Symbolic Multi-Step Reasoning via Gradient Descent” (Yang, Huang, Liang, & Chi, 2025) [Link]
    • “Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation” (Kim, Wu, Lee, & Suzuki, 2025) [Link]
    • “Learning Compositional Functions with Transformers from Easy-to-Hard Data” (Wang et al., 2025) [Link]
    • “A Theory of Learning with Autoregressive Chain of Thought” (Joshi et al., 2025) [Link]
    • “Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks” (Wies, Levine, & Shashua, 2022) [Link]
    • “How Transformers Learn Causal Structure with Gradient Descent” (Nichani, Damian, & Lee, 2024) [Link]
    • “Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers” (Chen, Sheen, Wang, & Yang, 2024) [Link]
    • “Repeat After Me: Transformers are Better than State Space Models at Copying” (Jelassi, Brandfonbrener, Kakade, & Malach, 2024) [Link]
  • Theme 2: Hardness, Limitations & Trade-offs
    • “Hardness of Learning Fixed Parities with Neural Networks” (Shoshani & Shamir, 2025) [Link]
    • “How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad” (Abbe et al., 2024) [Link]
    • “Theoretical Limitations of Multi-Layer Transformer” (Chen, Peng, & Wu, 2024) [Link]
    • “Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification” (Rohatgi et al., 2025) [Link]
    • “Trade-Offs in Data Memorization via Strong Data Processing Inequalities” (Feldman, Kornowski, & Lyu, 2025) [Link]
    • “Why Cannot Large Language Models Ever Make True Correct Reasoning?” (Cheng, 2025) [Link]
    • “On Limitations of the Transformer Architecture” (Peng, Narayanan, & Papadimitriou, 2024) [Link]
    • “Quantitative Bounds for Length Generalization in Transformers” (Izzo, Nichani, & Lee, 2025) [Link]
    • “The Pitfalls of Next-token Prediction” (Bachmann & Nagarajan, 2024) [Link] This is an empirical paper, but hardness is proved for the task considered in Theorem 1 of Hu et al. [Link].
  • Theme 3: Generalization & Robustness
    • “Understanding the Failure Modes of Out-of-Distribution Generalization” (Nagarajan, Andreassen, & Neyshabur, 2020) [Link]
    • “Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges” (Lee et al., 2025) [Link]
    • “Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization” (Montasser, Shao, & Abbe, 2024) [Link]
    • “Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs” (Abbe, Cornacchia, & Lotfi, 2023) [Link]
  • Theme 4: Representation & Semantics
    • “Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces” (Saglam et al., 2025) [Link]
    • “The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets” (Marks & Tegmark, 2023) [Link]
    • “On the Minimal Degree Bias in Generalization on the Unseen for Non-Boolean Functions” (Pushkin, Berthier, & Abbe, 2024) [Link]

Background & Supplementary Readings

This section contains materials for context, including historical perspectives, key empirical results, and related theoretical work. These are not required readings for presentations but are highly recommended for a deeper understanding.

  • Historical & Philosophical Context
    • “Connectionism and Cognitive Architecture: A Critical Analysis” (Fodor & Pylyshyn, 1988)
    • “Linguistics and Natural Logic” (Lakoff, 1970) [Link]
    • “Mathematics, Word Problems, Common Sense, and Artificial Intelligence” (Davis, 2024) [Link]
  • Key Empirical Papers & Surveys
    • “Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!” (Kambhampati, Stechly, & Valmeekam, 2025) [Link]
    • “Natural Language Reasoning, A Survey” (Yu, Zhang, Tiwari, & Wang, 2024) [Link]
    • “Universal Transformers” (Dehghani et al., 2018) [Link]
    • “REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards” (Stojanovski et al., 2025) [Link]
    • “Comment on The Illusion of Thinking…” (Lawsen, 2025) [Link]
    • “(How) Do Reasoning Models Reason?” (Kambhampati, Stechly, & Valmeekam, 2025) [Link]
    • “Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving” (Lin, .., Arora, Jin, et al., 2025) [Link]
    • “Transformers Struggle to Learn to Search” (Saparov, Pawar, .., Kim, & He, 2024) [Link]
    • “Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model” (Khona, .., & Tanaka, 2024) [Link]
    • “General Intelligence Requires Reward-based Pretraining” (Han, Pari, Gershman, & Agrawal, 2025) [Link]
  • Related Theoretical/Empirical Work
    • “Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures” (Abbe et al., 2022) [Link]
    • “The Surprising Agreement between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training” (Schaipp et al., 2025) [Link]
    • “Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations” (Boix-Adsera, Mallinar, Simon, & Belkin) [Link]
    • “On the Role of Initialization in the Training of Recurrent Neural Networks” (Sutskever, Martens, Dahl, & Hinton, 2013) [Link]
    • “Dynamic Chunking for End-to-End Hierarchical Sequence Modeling” (Hwang, Wang, & Gu, 2025) [Link]