schedule | CS 4804

Note: Since this is the first time the class is being taught, the schedule may change if we need more or less time on certain topics.

Date	Lecture	Readings
8/26	Introduction [ slides ]	Russell and Norvig, Chapter 1
8/28	Language Modeling [ slides ]	Jurafsky and Martin, Chapter 3.1-3.5
9/2	Neural networks [ slides ]	Jurafsky and Martin, Chapter 6.1-6.5; Bengio et al. (2003) A Neural Probabilistic Language Model
9/4	Backpropagation [ slides ]	Jurafsky and Martin, Chapter 6.6
9/9	Embeddings [ slides ]	Jurafsky and Martin, Chapter 5
9/11	Transformers [ slides ]	Vaswani et al. (2017) Attention Is All You Need
9/16	Transformers (cont.) [ slides ]	Devlin et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; Raffel et al. (2019) Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer; Radford et al. (2019) Language Models are Unsupervised Multitask Learners
9/18	Pretraining scaling [ slides ]	Kaplan et al. (2020) Scaling Laws for Neural Language Models; Hoffmann et al. (2022) Training Compute-Optimal Large Language Models; Li et al. (2025) (Mis)Fitting: A Survey of Scaling Laws
9/23	Multimodal models [ slides ]	Wang et al. (2024) Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution; McKinzie et al. (2024) MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training; Sebastian Raschka, Understanding Multimodal LLMs; [optional] Deitke et al. (2024) Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models; [optional] Wang et al. (2024) Emu3: Next-Token Prediction is All You Need; [optional] Jiang et al. (2025) Token-Efficient Long Video Understanding for Multimodal LLMs
9/25	No classes (Tu is out of office)
9/30	Prompting [ slides ]	Brown et al. (2020) Language Models are Few-Shot Learners; Wei et al. (2022) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
10/2	Decoding strategies [ slides ]	Holtzman et al. (2019) The Curious Case of Neural Text Degeneration
10/7	Instruction tuning [ slides ]	Wei et al. (2021) Finetuned Language Models Are Zero-Shot Learners; Chung et al. (2022) Scaling Instruction-Finetuned Language Models; [optional] Sanh et al. (2021) Multitask Prompted Training Enables Zero-Shot Task Generalization; [optional] Longpre et al. (2023) The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
10/9	Alignment [ slides ]	Ouyang et al. (2022) Training language models to follow instructions with human feedback; Bai et al. (2022) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback; Rafailov et al. (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
10/14	Large reasoning models & Test-time scaling [ slides ]	DeepSeek-AI (2025) DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning; Muennighoff et al. (2025) s1: Simple test-time scaling; Brown et al. (2024) Large Language Monkeys: Scaling Inference Compute with Repeated Sampling; [optional] Geiping et al. (2025) Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
10/16	Large reasoning models & Test-time scaling (cont'd) [ slides ]	Ye et al. (2025) LIMO: Less is More for Reasoning; Yu et al. (2025) Z1: Efficient Test-time Scaling with Code; [optional] Xiang et al. (2025) Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
10/21	Evaluation [ slides ]	Zheng et al. (2023) Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena; [optional] Vu et al. (2024) Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
10/23	Mixture-of-Experts [ slides ]	Fedus et al. (2021) Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity; Shen et al. (2023) Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models; [optional] Zoph et al. (2022) ST-MoE: Designing Stable and Transferable Sparse Expert Models; [optional] Lepikhin et al. (2020) GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
10/28	No classes (Tu is out of office)
10/30	Efficient attention [ slides ]	Dao et al. (2022) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness; Gao et al. (2024) How to Train Long-Context Language Models (Effectively)
11/4	Parameter-efficient fine-tuning [ slides ]	Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models; Lester et al. (2021) The Power of Scale for Parameter-Efficient Prompt Tuning; [optional] Vu et al. (2022) SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer
11/6	Efficient training and inference [ slides ]	Hinton et al. (2015) Distilling the Knowledge in a Neural Network; Ilharco et al. (2022) Editing Models with Task Arithmetic; Maarten Grootendorst's blog, A Visual Guide to Quantization
11/11	Retrieval-augmented generation & Tool-use models [ slides ]	Lewis et al. (2020) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks; Schick et al. (2023) Toolformer: Language Models Can Teach Themselves to Use Tools; [optional] Jin et al. (2025) Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
11/13	LLM Agents [ slides ]	Andrew Ng, Agentic Design Patterns Part 1-5; [optional] Khattab et al. (2023) DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines; [optional] Agrawal et al. (2025) GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
11/18	Diffusion models [ slides ]
11/20	Ethics and safety [ slides ]
11/25	No classes (Thanksgiving break)
11/27	No classes (Thanksgiving break)
12/2	No classes
12/4	Project presentations [ slides ]
12/9	Project presentations [ slides ]