17 Reading Research Papers

Sooner or later, every serious AI practitioner hits the same wall: the blog posts and tutorials run out, and the only way forward is to read the actual research papers. This is the moment where many people give up. Papers are dense, full of notation, packed with unexplained assumptions, and written for an audience that already knows the field. Reading your first machine learning paper can feel like reading a legal document written in a foreign language.

But it gets easier, and this chapter will show you how. By the end, you will have a systematic method for extracting value from papers quickly, a list of common traps to avoid, and a reading list to get you started.

Why Read Papers?

Blog posts, tutorials, and videos are great for getting started, but they are second-hand sources. The authors have already decided what to include and what to omit, what to emphasize and what to downplay. Papers are the primary source: they contain the full method, the exact experimental setup, the ablation studies, and the failure modes. If you want to truly understand a technique (not just use it), you need to read the paper. More importantly, the gap between “published” and “explained in a blog post” is often 6 to 12 months. If you can read papers, you stay at least six months ahead of anyone who waits for the explainer.

17.1 The Three-Pass Method

Srinivasan Keshav's “How to Read a Paper” describes a three-pass approach that works remarkably well for ML papers:

First pass (5 to 10 minutes): Read the title, abstract, introduction, section headings, and conclusion. Look at the figures and tables (especially results tables). After this pass, you should know: What problem does the paper solve? What is the claimed contribution? Is the result significant? Do you need to read further?

Second pass (30 to 60 minutes): Read the entire paper, but skip detailed proofs and dense mathematical derivations. Highlight key claims, understand the method at a high level, and note how the experiments are structured. After this pass, you should be able to summarize the paper to someone else and identify its strengths and weaknesses.

Third pass (2 to 5 hours): Read the paper in complete detail. Work through every equation. Verify that the experimental setup supports the claims. Try to mentally re-derive the key results. After this pass, you should be able to reimplement the method from scratch.

You Do Not Need the Third Pass for Every Paper

Most papers only deserve the first pass. A smaller fraction deserve the second. The third pass should be reserved for papers that are directly relevant to your work or that introduce fundamental techniques you plan to use. A good researcher reads dozens of abstracts, skims a handful of papers, and deeply studies a few each month.

The Paper Reading Superpower

Here is a secret that experienced researchers know: the ability to efficiently read papers is a career multiplier. A researcher who can deeply engage with three papers per week compounds knowledge faster than one who skims thirty. After a year, the deep reader has internalized 150 papers and can synthesize ideas across them. The skimmer has a vague awareness of trends but cannot implement or build on anything. Invest in reading quality, not quantity.

17.2 Anatomy of an ML Paper

Understanding the typical structure helps you navigate papers efficiently:

Abstract: A one-paragraph summary. Often the only part most people read. Good abstracts state the problem, the method, and the key result.
Introduction: Motivates the problem, positions the contribution relative to prior work, and previews the results. This is where you learn why the paper exists.
Related Work: A survey of prior approaches. This section is gold for building your own understanding of the field and finding other papers to read.
Method: The technical contribution. This is usually the densest section. Read it carefully if you plan to implement or build on the work.
Experiments: How the method was evaluated. Pay attention to: Which baselines were compared? What datasets and metrics were used? Are the improvements statistically significant? What ablation studies were performed?
Conclusion: Summarizes findings and often suggests future work. The future work section can inspire your own research directions.
Appendix: Contains additional details, proofs, hyperparameters, and extended results that did not fit in the main text. Often essential for reproduction.

17.3 How to Take Paper Notes

Reading papers without taking notes is like attending lectures without writing anything down: you will forget most of it within a week. Here is a system that works:

The one-page summary: After reading a paper, write a one-page summary with four sections: (1) What is the problem? (2) What is the key idea? (3) What are the main results? (4) What are the limitations? Force yourself to write this from memory, then check against the paper. The gaps between your memory and the paper reveal what you did not truly understand.

Keep a paper log: Maintain a simple spreadsheet or note file with one row per paper: title, date read, one-sentence summary, and a 1-to-5 rating of relevance to your work. After six months, this log becomes an invaluable personal database of the field.

Draw the architecture: For papers that introduce new models or methods, redraw the architecture diagram from scratch. Do not copy it; reconstruct it from your understanding. If you cannot draw it, you do not understand it.

Write the equation: Similarly, re-derive key equations without looking. The act of reconstruction forces deep processing that passive reading does not.
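The paper log described above can live in any spreadsheet, but a minimal Python sketch shows how little structure it needs. The field names, the `PaperEntry` class, the helper functions, and the sample rows (including the dates) are all illustrative assumptions, not a prescribed format.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class PaperEntry:
    """One row of the paper log; the field names are illustrative."""
    title: str
    date_read: str   # ISO date string, so plain string sorting is chronological
    summary: str     # one-sentence summary in your own words
    relevance: int   # 1 (tangential) to 5 (directly relevant to your work)

    def __post_init__(self):
        if not 1 <= self.relevance <= 5:
            raise ValueError("relevance must be between 1 and 5")


def write_log(entries, handle):
    """Write entries as CSV, with a header row, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(PaperEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


def most_relevant(entries, minimum=4):
    """Return entries rated at least `minimum`, most recently read first."""
    keep = [e for e in entries if e.relevance >= minimum]
    return sorted(keep, key=lambda e: e.date_read, reverse=True)


# Made-up sample rows, for illustration only.
log = [
    PaperEntry("Attention Is All You Need", "2025-01-10",
               "Introduces the transformer, built on attention without recurrence.", 5),
    PaperEntry("Scaling Laws for Neural Language Models", "2025-02-03",
               "Loss follows power laws in model size, data, and compute.", 4),
    PaperEntry("An unrelated paper", "2025-02-20", "Placeholder entry.", 2),
]

buffer = io.StringIO()  # stands in for open("paper_log.csv", "w", newline="")
write_log(log, buffer)
print(buffer.getvalue())
print([e.title for e in most_relevant(log)])
```

Storing the date as an ISO string keeps the file readable in any spreadsheet while still sorting correctly.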

The Zettelkasten Method for Papers

The Zettelkasten (slip-box) method, popularized by the sociologist Niklas Luhmann, works beautifully for academic reading. For each paper, write a short “atomic” note in your own words (not a summary, but a single insight). Then link it to related notes from other papers. Over time, you build a web of connected ideas that reveals patterns, contradictions, and research opportunities. Tools like Obsidian and Logseq make this easy.
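Underneath any note-taking tool, a slip-box is just a set of notes plus undirected links between them. The sketch below is one possible way to model that, assuming hypothetical note ids (`n1`, `n2`, `n3`) and hand-written insights. It is not how Obsidian or Logseq store their data.

```python
from collections import defaultdict

# Hypothetical atomic notes: id -> (source paper, one insight in your own words).
notes = {
    "n1": ("Vaswani et al. 2017",
           "Attention lets every token read every other token in one step."),
    "n2": ("Devlin et al. 2019",
           "Masking tokens lets a model use context from both directions."),
    "n3": ("Hu et al. 2021",
           "A fine-tuning update can be approximated by a low-rank matrix."),
}

links = defaultdict(set)


def link(a, b):
    """Record an undirected link between two existing note ids."""
    if a not in notes or b not in notes:
        raise KeyError("both notes must exist before linking")
    links[a].add(b)
    links[b].add(a)


def neighbors(note_id):
    """Return the ids linked to a note, sorted for stable output."""
    return sorted(links[note_id])


link("n1", "n2")  # BERT builds on the transformer encoder
link("n1", "n3")  # LoRA adapts transformer weight matrices

for other in neighbors("n1"):
    source, insight = notes[other]
    print(f"{other} ({source}): {insight}")
```

Notes with many neighbors are the hubs of your reading. Notes with none are candidates for a second look.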

17.4 Reading Critically

Not all papers are created equal, and even great papers have weaknesses. Here is what to watch for:

Cherry-picked results: Does the paper report the best run out of many, or the average across runs? Are the baselines truly the strongest available? A paper whose strongest baseline in 2025 is GPT-2 is not proving much.

Benchmark gaming: Some papers are optimized for benchmarks rather than real-world performance. A model that achieves state-of-the-art on MMLU by memorizing data that closely resembles the test set is not genuinely more capable.

Missing ablations: If a method has five components and no ablation study, you cannot tell which components actually matter. The paper might be 80% unnecessary complexity.

Overclaiming: Watch for the gap between what the results actually show and what the abstract claims. “Our method improves accuracy by 0.3% on one dataset” sometimes becomes “We present a revolutionary new approach that significantly advances the state of the art” in the abstract.

Reproducibility: Is the code available? Are all hyperparameters reported? Can you actually run this? Papers without code should be treated with extra skepticism.
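The cherry-picking check is easy to run yourself when a paper, or your own reproduction, reports scores for several random seeds. The sketch below compares a best-run gap with a mean gap. Every number in it is made up for illustration, and the `summarize` helper is an assumption, not a standard API.

```python
import statistics

# Made-up accuracies from five random seeds, for illustration only.
baseline_runs = [0.812, 0.805, 0.819, 0.808, 0.814]
method_runs = [0.816, 0.803, 0.829, 0.810, 0.812]


def summarize(runs):
    """Return (best, mean, sample standard deviation) for a list of scores."""
    return max(runs), statistics.mean(runs), statistics.stdev(runs)


base_best, base_mean, base_std = summarize(baseline_runs)
meth_best, meth_mean, meth_std = summarize(method_runs)

print(f"best-vs-best gap: {meth_best - base_best:+.3f}")
print(f"mean-vs-mean gap: {meth_mean - base_mean:+.3f}")
print(f"baseline std: {base_std:.3f}, method std: {meth_std:.3f}")
```

With these invented numbers, the best-run gap is about 0.010, while the mean gap is about 0.002, which is smaller than either standard deviation. A results table that reported only the best runs would make the method look far stronger than the seeds justify.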

The Reviewer's Mindset

The fastest way to become a better paper reader is to think like a reviewer. For every paper you read, ask: Would I accept this for a top conference? What are the three strongest criticisms? What experiments are missing? This adversarial mindset forces you to engage critically rather than passively absorbing claims. Eventually, you will start spotting weaknesses automatically.

17.5 Common Traps and How to Avoid Them

Getting stuck on notation: Every paper uses slightly different notation. Do not let unfamiliar symbols block you. Write down what each symbol means in your own notation as you encounter it. Build a “Rosetta Stone” for the paper.

Assuming everything is correct: Papers have errors. Equations have typos. Experimental setups have questionable choices. Read critically: does the claim follow from the evidence? Are there confounding factors? Would the result hold under different conditions?

Skipping the ablations: Ablation studies (“what happens if we remove this component?”) tell you which parts of the method actually matter. A model that achieves 95% of its improvement from a single trick and 5% from five additional tricks is really about one trick.

Confusing “state-of-the-art” with “useful”: A paper may report state-of-the-art performance on a benchmark while requiring \(100\times\) more compute than the runner-up. Always look at the efficiency-performance tradeoff.
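When a paper does include an ablation table, a few lines of arithmetic show how much of the gain each component carries. The sketch below assumes a hypothetical table in which each row is the full method with one component removed. The component names and scores are invented, and `improvement_shares` is an illustrative helper.

```python
def improvement_shares(baseline, full, without):
    """Share of the total gain that is lost when each component is removed.

    `without` maps a component name to the score of the full method with
    that component removed. Shares need not sum to 1, because components
    can interact.
    """
    total_gain = full - baseline
    if total_gain <= 0:
        raise ValueError("full method does not improve on the baseline")
    return {name: (full - score) / total_gain for name, score in without.items()}


# Hypothetical ablation table (accuracy in percent), for illustration only.
shares = improvement_shares(
    baseline=70.0,
    full=74.0,
    without={"new_loss": 70.2, "extra_aug": 73.9, "lr_schedule": 73.8},
)

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%} of the gain")
```

In this invented example, removing `new_loss` erases roughly 95% of the improvement, so the method is really about that one component.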

The Notation Barrier

Every ML paper uses slightly different notation: \(\theta\) for parameters in one paper, \(\phi\) in another; \(\mathcal{D}\) for the dataset here, \(\mathcal{X}\) there. Do not let this stop you. On your first read, jot down each symbol's meaning in the margin. After a few months of reading papers, you will have internalized the most common conventions and notation will feel natural.

Reading alone: Join a paper reading group (many are available online). Discussing papers with others catches blind spots, accelerates understanding, and is far more enjoyable than reading solo.

The arXiv Workflow

New ML papers appear on arXiv daily (sometimes dozens per day in popular areas). To stay current without drowning: follow curated feeds like Papers With Code, Hugging Face Daily Papers, or the feeds of AK (Ahsen Khaliq) and Aran Komatsuzaki on X (formerly Twitter). Use tools like Semantic Scholar's “Research Feed” or Connected Papers to discover related work. Set up keyword alerts for your specific interests. Read titles and abstracts daily; commit to deeply reading one or two papers per week.
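One way to build your own keyword alert is the public arXiv API, which returns search results as an Atom feed. The sketch below is a minimal version under stated assumptions: the default category `cs.LG`, the helper names, and the choice to combine all keywords with AND are illustrative. Check the arXiv API documentation and its usage guidelines before running anything on a schedule.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(keywords, category="cs.LG", max_results=20):
    """Build an arXiv API URL for the newest papers matching all keywords."""
    terms = [f'all:"{kw}"' for kw in keywords]
    query = f"cat:{category} AND " + " AND ".join(terms)
    params = {
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Extract (title, link) pairs from an Atom feed given as str or bytes."""
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.findall(ATOM + "entry"):
        # Titles can span several lines in the feed; collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        link = entry.findtext(ATOM + "id", default="")
        results.append((title, link))
    return results


def fetch_recent(keywords, **kwargs):
    """Fetch and parse matching entries (requires network access)."""
    with urllib.request.urlopen(build_query_url(keywords, **kwargs)) as response:
        return parse_feed(response.read())


# Building the URL needs no network access, so it is safe to inspect first.
print(build_query_url(["low-rank adaptation"]))
```

Calling `fetch_recent(["low-rank adaptation"])` once a day and reading only the returned titles fits the daily skim described above.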

17.6 Essential AI Papers

Here is a curated list of papers that are foundational to modern AI. You do not need to read all of them immediately, but having them on your reading list will serve you well:

“Attention Is All You Need” (Vaswani et al. 2017): The transformer architecture. One of the most cited ML papers of the decade, for good reason.
“BERT: Pre-training of Deep Bidirectional Transformers” (Devlin et al. 2019): Showed that pre-training bidirectional transformers on masked language modeling produces powerful representations.
“Language Models are Few-Shot Learners” (Brown et al. 2020): The GPT-3 paper. Demonstrated that large language models can perform tasks from a few examples without fine-tuning (in-context learning).
“Training Language Models to Follow Instructions with Human Feedback” (Ouyang et al. 2022): The InstructGPT/RLHF paper. Showed how to align LLMs with human preferences.
“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al. 2022): Demonstrated that prompting with a few worked examples that spell out intermediate reasoning steps substantially improves the reasoning performance of sufficiently large models.
“LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al. 2021): Made fine-tuning accessible by showing that freezing the pre-trained weights and training only small low-rank update matrices works surprisingly well.
“Scaling Laws for Neural Language Models” (Kaplan et al. 2020): Revealed the power-law relationships between model size, data, compute, and performance.
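As a preview of the notation, here are the central equations of three of these papers as they are usually stated. Symbols follow the respective papers, and the papers themselves remain the authoritative source. Treat this as a reading aid, not a substitute.

Scaled dot-product attention (Vaswani et al. 2017), with query, key, and value matrices \(Q\), \(K\), \(V\) and key dimension \(d_k\):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

The LoRA update (Hu et al. 2021), where the pre-trained weight \(W_0 \in \mathbb{R}^{d \times k}\) stays frozen and only \(B\) and \(A\) are trained:

\[
h = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
\]

The scaling law for model size (Kaplan et al. 2020), where \(L\) is the test loss, \(N\) is the number of non-embedding parameters, and \(N_c\) and \(\alpha_N\) are constants fitted empirically:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
\]

Writing down what each symbol means, as done here, is the same “Rosetta Stone” exercise recommended in Section 17.5.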

17.7 Tools for Reading Papers

Semantic Scholar: Free academic search engine with AI-generated summaries, citation graphs, and alerts. Often better than Google Scholar for discovering related papers.
Connected Papers: Visualizes the citation graph around a paper, making it easy to find seminal works and follow-up research.
Papers With Code: Links papers to their official code implementations and benchmark results. Essential for reproduction.
Zotero / Mendeley: Reference managers for organizing your paper library, annotating PDFs, and generating bibliographies.
Explain Paper / ChatPDF: AI tools that let you ask questions about a paper's content. Useful for quick clarification, but no substitute for careful reading.

17.8 Exercises

Pick a paper from the essential reading list above and apply the three-pass method. After each pass, write a one-paragraph summary. How does your understanding evolve across the three passes?
Find a recent paper on arXiv in your area of interest. Read the abstract and introduction. Write down: (a) What problem does it solve? (b) What is the key idea? (c) How is it evaluated? (d) What are the limitations?
Join an online paper reading group (ML Collective, Yannic Kilcher's Discord, or a university group) and participate in at least one discussion. What did you learn from the discussion that you missed in your solo reading?
Use Connected Papers to explore the citation graph around the “Attention Is All You Need” paper. Identify three important follow-up papers and three important precursor papers. How does the narrative of the field emerge from the citation structure?
Pick a paper with available code on Papers With Code. Reproduce the main result. Document any discrepancies between the paper's description and the actual implementation.

References

Brown, Tom, Benjamin Mann, Nick Ryder, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877-901.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805.

Hu, Edward J, Yelong Shen, Phillip Wallis, et al. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685.

Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.

Ouyang, Long, Jeffrey Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems 35: 27730-44.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.

Wei, Jason, Xuezhi Wang, Dale Schuurmans, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv Preprint arXiv:2201.11903.