---
name: read-paper
description: Read an academic paper PDF safely by splitting into 4-page chunks. Converts full text to searchable markdown and produces curated notes. ALWAYS use this skill when reading papers — never bypass it with ad-hoc agents. Triggers on paper reading, paper summaries, ingesting papers.
disable-model-invocation: true
argument-hint: "[path/to/paper.pdf or search query]"
---

# Read Academic Paper

Read an academic paper PDF by splitting it into 4-page chunks and extracting structured notes. Also converts the full paper to searchable markdown for detailed queries later. This prevents context window overflow and produces better extraction than reading a full PDF at once.

## Gotchas

1. **NEVER bypass this skill with ad-hoc agents.** When reading a paper, always invoke `/read-paper` — do not launch a general-purpose agent with custom "read this paper" instructions. The 4-page chunking approach produces dramatically better summaries than having an agent skim the full text.

2. **Always check for the published version** before reading a working paper. Author websites usually have the journal version freely available. Published versions have better figures, are authoritative, and produce cleaner extractions.

3. **The sub-agent must read PDF splits, not just the pymupdf4llm markdown.** The markdown loses table formatting, figures, and layout. The splits preserve these. The markdown is a supplement for searching, not a replacement for reading.

4. **Always extract limitations and open questions.** Check the conclusion/discussion section carefully — this is where authors identify what they couldn't do and what future work should address. This content is high-value for lecture slides.

5. **Don't skip the 4-page splitting step.** Reading 50+ pages at once causes the agent to skim. The 4-page chunking forces careful, sequential reading with notes updated after each batch. The only exceptions are the cases listed under "When NOT to Split".

**Two outputs per paper:**
1. `paper_summaries/Author_Year.md` — curated notes for slide-building (from sub-agent)
2. `paper_markdown/Author_Year.md` — full-text searchable markdown (from pymupdf4llm)

**IMPORTANT: The reading and extraction step runs inside a sub-agent** to protect the main conversation's context window from being filled with dense paper text. The orchestrator (you) should:
1. Acquire the PDF, convert full text, and split (Steps 1-3) in the main context
2. Delegate the reading and extraction (Step 4) to a sub-agent via the Agent tool
3. After the sub-agent finishes, read the resulting summary and present a brief summary to the user

## Step 1: Acquire the PDF (main context)

**If a local file path is provided:**
- Verify the file exists
- Copy it into the lecture's `reference_papers/` folder if it's not already there
- **Check if this is a working paper** (NBER, SSRN, double-spaced, "Working Paper" in header). If so, attempt to find the published version (see below) before proceeding.
- Proceed to Step 2
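
The local-file checks above can be sketched as a small helper. This is only an illustration: the function name `stage_pdf` and the choice to leave an existing copy untouched are assumptions, not part of the skill.

```python
import shutil
from pathlib import Path


def stage_pdf(src: str, lecture_dir: str) -> Path:
    """Copy a local PDF into <lecture_dir>/reference_papers/ (hypothetical helper).

    The source file is copied, never moved, so the original is preserved.
    """
    source = Path(src)
    if not source.is_file():
        raise FileNotFoundError(f"PDF not found: {source}")

    dest_dir = Path(lecture_dir) / "reference_papers"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / source.name

    # Already in reference_papers/, or a copy with this name exists:
    # leave it alone rather than overwrite anything.
    if dest.exists():
        return dest

    shutil.copy2(source, dest)
    return dest
```

Calling `stage_pdf("~/Downloads/Author_2020.pdf", "lectures/class_03")` would need the `~` expanded first (for example with `Path.expanduser`), since the sketch does not do that.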

**If a search query is provided:**
1. Use WebSearch to find the paper
2. **Always prefer the published version** over working papers. Search in this order:
   a. **Authors' personal/academic websites** — most economists post final PDFs. These are usually the published journal version, freely available.
   b. **Journal website** — if the paper is published, try the DOI link. Many are open-access or have free PDFs.
   c. **NBER/SSRN** — use only as a fallback if no published version is freely available.
3. If the published version requires a journal login, inform the user and ask if they can provide it. Do not silently fall back to the working paper without noting this.
4. Download the PDF using `curl` (preferred) or WebFetch
5. Save to the appropriate `lectures/class_NN/reference_papers/` directory with a clear filename (e.g., `Author_Year_Journal.pdf`)
6. Proceed to Step 2
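
A download from a journal site sometimes returns an HTML login page saved with a `.pdf` extension. A quick sanity check before Step 2 is to look for the `%PDF-` signature at the start of the file. The helper names below are hypothetical, and the check is a heuristic rather than a full validation.

```python
def looks_like_pdf(head: bytes) -> bool:
    """Return True if the leading bytes carry the PDF signature."""
    # PDF files begin with "%PDF-<version>"; tolerate leading whitespace.
    return head.lstrip().startswith(b"%PDF-")


def file_looks_like_pdf(path: str) -> bool:
    """Read the first kilobyte of a file and apply the signature check."""
    with open(path, "rb") as f:
        return looks_like_pdf(f.read(1024))
```

If the check fails, the file is most likely an error or login page, and the user should be told rather than proceeding with the conversion.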

**Why prefer published versions:** Published papers have better-formatted figures (journal typesetting vs. double-spaced manuscripts), are the authoritative version, and produce cleaner figure extractions for slides.

**CRITICAL: Always preserve the original PDF.** Never delete or overwrite the source file.

## Step 2: Convert Full Text to Markdown (main context)

Convert the entire paper to searchable markdown using `pymupdf4llm`. For a text-based PDF this typically takes only a few seconds and produces a complete text file for ad-hoc queries later (e.g., "did this paper mention monitoring?").

```bash
python3 -c "
import pymupdf4llm, os

pdf_path = '$PDF_PATH'
lecture_dir = os.path.dirname(os.path.dirname(pdf_path))  # up from reference_papers/
md_dir = os.path.join(lecture_dir, 'paper_markdown')
os.makedirs(md_dir, exist_ok=True)

basename = os.path.splitext(os.path.basename(pdf_path))[0]
md_path = os.path.join(md_dir, f'{basename}.md')

md_text = pymupdf4llm.to_markdown(pdf_path)

# Check if conversion produced actual text (not just image placeholders)
text_lines = [l for l in md_text.split('\n') if l.strip() and 'intentionally omitted' not in l]
text_ratio = len('\n'.join(text_lines)) / max(len(md_text), 1)

if text_ratio > 0.2:
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(md_text)
    print(f'Full markdown saved: {md_path} ({len(md_text):,} chars)')
else:
    print(f'SCANNED PDF — text extraction failed (text ratio: {text_ratio:.0%}). Full-text markdown not available; use PDF splits for reading.')
"
```

If pymupdf4llm is not installed: `pip install pymupdf4llm`
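
The scanned-PDF check in the script above can be read in isolation as follows. This sketch restates the same logic as a standalone function; the names `text_ratio` and `is_scanned` are illustrative, and the 0.2 threshold is simply the value used in the script.

```python
def text_ratio(md_text: str, placeholder: str = "intentionally omitted") -> float:
    """Share of the markdown that is real text rather than image placeholders."""
    text_lines = [
        line
        for line in md_text.split("\n")
        if line.strip() and placeholder not in line
    ]
    # max(..., 1) avoids division by zero for an empty conversion.
    return len("\n".join(text_lines)) / max(len(md_text), 1)


def is_scanned(md_text: str, threshold: float = 0.2) -> bool:
    """Treat the PDF as scanned when too little of the output is text."""
    return text_ratio(md_text) <= threshold
```

A conversion made up entirely of placeholder lines gives a ratio of 0, so the script skips writing the markdown file and the sub-agent relies on the PDF splits alone.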

## Step 3: Split the PDF (main context)

Run this Python script to split into 4-page chunks:

```bash
python3 -c "
from PyPDF2 import PdfReader, PdfWriter
import os, sys

input_path = '$PDF_PATH'
output_dir = os.path.join(os.path.dirname(input_path), 'splits_' + os.path.splitext(os.path.basename(input_path))[0])
os.makedirs(output_dir, exist_ok=True)

reader = PdfReader(input_path)
total = len(reader.pages)
prefix = os.path.splitext(os.path.basename(input_path))[0]

for start in range(0, total, 4):
    end = min(start + 4, total)
    writer = PdfWriter()
    for i in range(start, end):
        writer.add_page(reader.pages[i])
    out_path = os.path.join(output_dir, f'{prefix}_pp{start+1}-{end}.pdf')
    with open(out_path, 'wb') as f:
        writer.write(f)

print(f'Split {total} pages into {-(-total // 4)} chunks in {output_dir}')
"
```

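If PyPDF2 is not installed: `pip install PyPDF2`. Note that PyPDF2 is no longer developed under that name; its successor `pypdf` provides the same `PdfReader` and `PdfWriter` classes, so `from pypdf import PdfReader, PdfWriter` should work as a drop-in replacement.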
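
To preview which files the split will produce without touching a PDF, the page arithmetic from the script can be sketched on its own. The function names are illustrative; the filename pattern mirrors the `_pp{start}-{end}.pdf` naming used above.

```python
def chunk_ranges(total_pages: int, chunk_size: int = 4) -> list[tuple[int, int]]:
    """Return 1-indexed, inclusive (first_page, last_page) pairs for each chunk."""
    return [
        (start + 1, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_filename(prefix: str, first: int, last: int) -> str:
    """Build the split filename in the same form the script writes."""
    return f"{prefix}_pp{first}-{last}.pdf"
```

For a 10-page paper, `chunk_ranges(10)` gives three chunks, the last of which holds only pages 9 and 10.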

## Step 4: Delegate Reading to Sub-Agent

Launch a sub-agent with the Agent tool. Pass it:
- The path to the splits directory
- The total number of splits
- The notes output path (in `paper_summaries/`)
- The full markdown path (from Step 2), if created
- The structured extraction template (below)

**Sub-agent prompt template:**

```
Read all PDF splits in [splits_dir] and produce structured notes at [paper_summaries/Author_Year.md].

There are [N] splits. Read them in batches of 3 (up to 12 pages at a time).
For each batch, read the 3 split PDFs using the Read tool, then update the notes file.
Do NOT read all splits at once — read 3, write notes, then read the next 3.

For papers under ~15 pages, you may read all splits in one batch.

A full-text markdown conversion is also available at [paper_markdown/Author_Year.md] (if it exists).
You can use Grep to search it for specific terms, but prefer the PDF splits for primary reading
since the markdown may lose table formatting and figures.

**STRICT RULE — NO SHORTCUTS:** You MUST read every PDF split via the Read tool, in
the batched cadence described above. Do NOT skim the first split, decide the markdown
"looks clean," and use markdown grepping for the rest. The markdown is a search
supplement only — it is NEVER a substitute for reading the PDFs. If you find yourself
about to write "no need to open additional PDF splits because the markdown contained
[anything] cleanly," stop and read the splits instead. Token cost is not your concern;
faithful extraction is. Your final report must explicitly state how many splits you
read with the Read tool — if that number is less than [N], the task is incomplete.

Extract information along these dimensions:

1. Research question — What is the paper asking and why does it matter?
2. Setting & data — What data, time period, sample size, unit of observation?
3. Method — Identification strategy, econometric approach, key specifications
4. Main findings — Key results, coefficient estimates, effect sizes
5. Key figures & tables — Which figures/tables are most important for lectures? Note figure numbers and what they show
6. Quotable passages — Sentences worth putting on a slide verbatim (with page numbers)
7. Lecture relevance — How does this connect to behavioral economics topics?
8. Limitations and open questions — What do the authors themselves identify as limitations or future work? (Check the conclusion carefully)

Format the final notes file with this header:

# Paper Notes: [Author(s) (Year)]

**Full citation:** [APA format]
**DOI/URL:** [link]
**Pages:** [total]
**Read date:** [YYYY-MM-DD]
**Full-text markdown:** [path to paper_markdown/ file, or "N/A (scanned PDF)"]

## One-paragraph summary
[Plain English summary suitable for a slide motivation]

## Key takeaway for lectures
[The single most important thing from this paper for teaching]

[... structured extraction sections ...]
```
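
When filling in the template, it may help to hand the sub-agent an explicit reading order. Split filenames sort incorrectly as plain strings (`_pp13-16` comes before `_pp5-8`), so the sketch below sorts by the parsed starting page and groups the files into batches of 3. The helper names are hypothetical, and the filename pattern is assumed to match the Step 3 script.

```python
import re

# Matches the "_pp<first>-<last>.pdf" suffix written by the split script.
_PAGES = re.compile(r"_pp(\d+)-(\d+)\.pdf$")


def first_page(name: str) -> int:
    """Extract the starting page number from a split filename."""
    match = _PAGES.search(name)
    if match is None:
        raise ValueError(f"Not a split filename: {name}")
    return int(match.group(1))


def batches(names: list[str], size: int = 3) -> list[list[str]]:
    """Order splits by starting page, then group them into reading batches."""
    ordered = sorted(names, key=first_page)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

The total number of splits, `[N]` in the template, is then just `len(names)`.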

## Step 5: Present Results (main context)

After the sub-agent completes:
1. Read the summary from `paper_summaries/`
2. Present a **brief summary** (5-10 lines) to the user: citation, one-paragraph summary, key takeaway, the number of pages read, and the number of key figures noted
3. Tell the user where both files are saved
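
Before presenting results, the orchestrator could check that the notes file actually contains the header fields requested in the Step 4 template. The sketch below is one possible check; the function name and the decision to match on literal marker strings are assumptions.

```python
# Header fields and section titles taken from the Step 4 notes template.
REQUIRED_MARKERS = [
    "# Paper Notes:",
    "**Full citation:**",
    "**DOI/URL:**",
    "**Pages:**",
    "**Read date:**",
    "**Full-text markdown:**",
    "## One-paragraph summary",
    "## Key takeaway for lectures",
]


def missing_markers(notes_text: str) -> list[str]:
    """Return the template markers that do not appear in the notes."""
    return [marker for marker in REQUIRED_MARKERS if marker not in notes_text]
```

An empty result means every header field is present; it says nothing about the quality of the content underneath.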

## Querying Papers Later

When the user asks a specific question about a paper:
1. First check `paper_summaries/` — the answer may already be in the curated notes
2. If not found, search `paper_markdown/` using Grep — this has the complete text
3. Only re-read the PDF as a last resort
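
Outside the agent's Grep tool, the full-text lookup in step 2 can be approximated in a few lines of Python. This is a sketch of a case-insensitive search with surrounding context; `search_markdown` is a hypothetical name.

```python
def search_markdown(md_text: str, term: str, context: int = 1) -> list[tuple[int, str]]:
    """Return (line_number, snippet) pairs for lines containing the term.

    Matching is case-insensitive; each snippet includes `context` lines
    before and after the hit.
    """
    lines = md_text.splitlines()
    needle = term.lower()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            lo = max(i - context, 0)
            hi = min(i + context + 1, len(lines))
            hits.append((i + 1, "\n".join(lines[lo:hi])))
    return hits
```

An empty list means the term does not appear in the converted text, which for a scanned PDF may only reflect a failed conversion rather than the paper's content.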

## When NOT to Split

- Papers under ~15 pages: the sub-agent can read directly with the Read tool
- Quick triage: read just the first split (pages 1-4) for abstract + intro
- Slides/presentations: read directly in main context

## Quick Reference

| Step | Where | Action |
|------|-------|--------|
| **Acquire** | Main | Download or locate PDF, copy to `reference_papers/` |
| **Convert** | Main | `pymupdf4llm` → full text to `paper_markdown/` |
| **Split** | Main | 4-page chunks into `splits_<name>/` |
| **Read** | Sub-agent | 3 splits at a time, update notes after each batch |
| **Write** | Sub-agent | Build curated notes in `paper_summaries/` |
| **Present** | Main | Read summary, show brief results to user |
