---
name: first-look
description: Diagnostic protocol for newly-pulled datasets and puzzling results. Run BEFORE interpreting any pattern, correlation, or regression — look at distributions, mass points, missing-value codes, and read the package/data documentation. Use proactively when (a) loading any new dataset, (b) seeing a correlation, coefficient, or sign that doesn't match expectations, or (c) the user says something like "why is X" about a quantitative result.
argument-hint: "[path to data file OR description of puzzle]"
---

# First-Look Diagnostics

Run this BEFORE interpreting any quantitative result. The single biggest failure mode is jumping to substantive explanations ("maybe it's tariffs in 1880-1950!") when the real answer is distributional ("56% of the column is literally zero"). Always look at the data before theorizing.

## When to trigger (proactively, without being asked)

1. **User loads or references a new dataset** — even if they just want to "check" something, do the diagnostic pass first and surface anything notable.
2. **A result is puzzling** — unexpected sign, unexpected magnitude, correlations that contradict theory, "why is X higher than Y." Before reaching for historical/substantive stories, verify the data.
3. **Before running a regression or correlation on a column you haven't inspected.** If you haven't seen `.describe()` on that column in this session, you haven't earned the right to interpret a coefficient on it.
4. **When using an unfamiliar package** — read its source/prompts/config before trusting your mental model of what it does.

## The protocol

### Step 1 — Describe every variable you're about to use

```python
df.info()                      # dtypes and non-null counts per column
df.describe(include='all').T   # one row per column: count, mean, std, min/max, quartiles
```

Note: dtype, count (is there missingness?), mean vs. median (skewness), min/max (are they the expected scale bounds?), std (any near-zero variance?).

For each numeric column you care about, also check the **share of observations sitting at mass points** (zero, the maximum, missing):

```python
# cols_of_interest: list of the numeric column names you plan to use
for col in cols_of_interest:
    s = df[col]
    print(f"{col}: zero={(s==0).mean():.1%}  max={(s==s.max()).mean():.1%}  "
          f"nan={s.isna().mean():.1%}  median={s.median()}")
```

A column with >5% mass at 0 or at the scale maximum is a **structural feature** you must diagnose before using the column — it usually means a categorical decision ("absent" vs. "present," "censored" vs. "observed") was folded into a continuous-looking variable.
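
The 5% rule above can be turned into a small check that returns only the columns worth a closer look. This is a minimal sketch, not a fixed part of the protocol: the function name `flag_mass_points` and the toy data are hypothetical, and it computes shares over non-missing values (the loop above includes missing rows in the denominator).

```python
import pandas as pd

def flag_mass_points(df: pd.DataFrame, cols, threshold: float = 0.05) -> dict:
    """Return {column: {"zero": share, "max": share}} for shares above threshold.

    Assumption: shares are computed over non-missing values only.
    """
    flags = {}
    for col in cols:
        s = df[col].dropna()
        if s.empty:
            continue
        shares = {
            "zero": float((s == 0).mean()),
            "max": float((s == s.max()).mean()),
        }
        hits = {name: share for name, share in shares.items() if share > threshold}
        if hits:
            flags[col] = hits
    return flags

# Toy data (made up): 40 of 100 "score" values are exactly zero.
toy = pd.DataFrame({
    "score": [0] * 40 + list(range(1, 61)),
    "age": list(range(20, 120)),
})
print(flag_mass_points(toy, ["score", "age"]))
```

A flagged column is a prompt to read the codebook, not a verdict. On very small samples the maximum will trip the threshold trivially, so treat the check as meaningful only when there are enough rows.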

### Step 2 — Plot the distribution (or at least look at quantiles)

One histogram per column of interest. If matplotlib isn't available, print quantiles:

```python
df[col].quantile([0, .01, .05, .25, .5, .75, .95, .99, 1])
```

Look for: mass points, bimodality, long tails, implausible values, unit errors (is that 1000 dollars or 1 million?).
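
When plotting isn't an option, a text histogram is a middle ground between a full plot and a quantile table. The sketch below assumes NumPy is installed alongside pandas; `text_histogram` is a hypothetical helper name, and the bin count and bar width are arbitrary defaults.

```python
import numpy as np
import pandas as pd

def text_histogram(s: pd.Series, bins: int = 10, width: int = 40) -> str:
    """Render equal-width bins as rows of '#' scaled to the fullest bin."""
    values = s.dropna().to_numpy(dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    peak = max(int(counts.max()), 1)  # avoid dividing by zero on empty input
    lines = []
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        bar = "#" * int(round(width * int(count) / peak))
        lines.append(f"[{lo:>10.2f}, {hi:>10.2f}) {int(count):>6d} {bar}")
    return "\n".join(lines)

# Toy series (made up): a pile-up at zero plus one far-out value.
toy = pd.Series([0] * 50 + [1, 2, 3, 4, 5, 100])
print(text_histogram(toy, bins=5))
```

A single dominant first bar with a nearly empty last bin is the textual signature of a mass point plus a long tail or a unit error.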

### Step 3 — Hunt for sentinel missing codes

Check for common "missing" encodings masquerading as data:
- `-999`, `-99`, `99`, `9999`
- Empty strings `""` and whitespace
- Strings `"NA"`, `"N/A"`, `"null"`, `"."`
- Year `1900` or `1901` as "unknown" in historical data
- `0` when the variable cannot legitimately be 0 (e.g., age, count of something required)

```python
df[col].value_counts(dropna=False).head(20)
```
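
The list above can also be checked mechanically. The following is a rough sketch: the sentinel sets simply mirror the bullets, `sentinel_counts` is a hypothetical name, and a hit is only a candidate, since `99` can be a perfectly valid value in many columns.

```python
import pandas as pd

# Candidate codes taken from the list above; extend them per dataset.
NUMERIC_SENTINELS = (-999, -99, 99, 9999)
STRING_SENTINELS = ("", ".", "n/a", "na", "null")

def sentinel_counts(s: pd.Series) -> dict:
    """Count values in a column that match common missing-value codes."""
    counts = {}
    if pd.api.types.is_numeric_dtype(s):
        for code in NUMERIC_SENTINELS:
            n = int((s == code).sum())
            if n:
                counts[code] = n
    else:
        # Strip whitespace and lowercase so " NA " and "na" count together;
        # whitespace-only strings collapse to "".
        cleaned = s.dropna().astype(str).str.strip().str.lower()
        for code in STRING_SENTINELS:
            n = int((cleaned == code).sum())
            if n:
                counts[code] = n
    return counts

print(sentinel_counts(pd.Series([1, -999, 5, -999, 99])))
print(sentinel_counts(pd.Series(["a", " ", "NA", "n/a", None])))
```

The year codes and the "impossible zero" case are left out on purpose: both depend on what the variable means, which only the codebook can tell you.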

### Step 4 — Read the documentation for any package or data source

**This is the step I keep skipping. Don't skip it.**

- **Python packages**: find the installed location (`pkg.__file__`), then read the actual source — especially config dataclasses, prompt templates, and `README`/docstrings. For LLM-wrapper packages, read the prompt template literally.
- **Data files**: look for a codebook, README, or data dictionary in the same directory. Search for `.md`, `.txt`, and `.pdf` files near the data.
- **Published datasets**: check if the source paper documents coding conventions (e.g., what does a 0 mean — "no" or "not asked"?).
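
For the data-file case, the search can be sketched with the standard library alone. The helper name `find_nearby_docs`, the filename hints, and the choice to look in the data file's directory plus its parent are all assumptions made for illustration.

```python
from pathlib import Path

DOC_SUFFIXES = {".md", ".txt", ".pdf"}
DOC_HINTS = ("readme", "codebook", "dictionary")  # assumed naming conventions

def find_nearby_docs(data_path: str) -> list:
    """List likely documentation files next to a data file and one level up."""
    data_file = Path(data_path).resolve()
    # dict.fromkeys de-duplicates while keeping order (parent of root is root).
    folders = dict.fromkeys([data_file.parent, data_file.parent.parent])
    found = []
    for folder in folders:
        for p in sorted(folder.iterdir()):
            if not p.is_file() or p == data_file:
                continue
            name = p.name.lower()
            if p.suffix.lower() in DOC_SUFFIXES or any(h in name for h in DOC_HINTS):
                found.append(p)
    return found
```

Calling `find_nearby_docs("path/to/data.csv")` returns a list of `Path` objects to read before touching the data. An empty list means no documentation was found nearby, not that none exists.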

If there's a prompt template or config that the package uses, read it. Don't assume you know what a scale means — confirm it.
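
To find what a package actually ships, one option is to list the readable files under its install location. This is a sketch using only `importlib` and `pathlib`; `package_source_files` is a hypothetical helper, and the suffix list is a guess at where configs and prompt templates tend to live.

```python
import importlib.util
from pathlib import Path

# Assumed set of file types worth reading; adjust per package.
READABLE = (".py", ".md", ".txt", ".yaml", ".yml", ".json", ".toml", ".j2")

def package_source_files(pkg_name: str, suffixes=READABLE) -> list:
    """Return source, config, and template files shipped with an installed package."""
    spec = importlib.util.find_spec(pkg_name)
    if spec is None or not spec.has_location or spec.origin is None:
        return []  # not installed, or a built-in with no file on disk
    origin = Path(spec.origin)
    if origin.name != "__init__.py":
        return [origin]  # single-file module
    return sorted(
        p for p in origin.parent.rglob("*")
        if p.is_file() and p.suffix in suffixes
    )

# Example with a standard-library package; substitute the package in question.
for path in package_source_files("json")[:5]:
    print(path)
```

The listing only tells you where to look. The step itself is still opening those files and reading them.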

### Step 5 — Sanity-check against expectations

Write down (to the user, in one sentence) what you expected to see and what you actually saw. If they match, say so briefly and move on. If they don't, **do not theorize until you've explained the mismatch distributionally**:

- Is the mismatch concentrated in a subset of rows?
- Does it disappear if you drop mass points or missing codes?
- Is one group systematically over/under-represented?
- Is there a data-entry or encoding issue?

Only after the distributional story is clear should you reach for substantive explanations.
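
The second question in the list ("does it disappear if you drop mass points?") is cheap to answer directly. Below is a minimal sketch with made-up data in which the pooled correlation and the correlation among non-zero rows have opposite signs; `corr_with_and_without` is a hypothetical helper, and dropping rows is a diagnostic here, not a recommended estimator.

```python
import pandas as pd

def corr_with_and_without(df: pd.DataFrame, x: str, y: str, drop_value=0) -> dict:
    """Compare the x-y correlation on all rows vs. rows where x != drop_value."""
    pair = df[[x, y]].dropna()
    kept = pair[pair[x] != drop_value]
    return {
        "n_all": len(pair),
        "n_kept": len(kept),
        "corr_all": pair[x].corr(pair[y]),
        "corr_kept": kept[x].corr(kept[y]),
    }

# Toy data (made up): six rows sit at x == 0 with y == 0; among the rest, y falls as x rises.
toy = pd.DataFrame({
    "x": [0] * 6 + [1, 2, 3, 4],
    "y": [0] * 6 + [18, 16, 14, 12],
})
print(corr_with_and_without(toy, "x", "y"))
```

If the two correlations disagree this sharply, the pooled number is mostly measuring "zero vs. non-zero", and that is the story to report before any substantive one.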

## The cardinal rule

**When a result surprises you, your first move is `.describe()`, not a theory.**

If you find yourself writing "one possible explanation is..." before you've looked at the distribution of the variables involved, stop and go back to Step 1.

## Common failure modes (things I have actually done)

- Interpreted a correlation matrix without checking that one column was 56% zeros.
- Wrote a historical explanation for a sign reversal without checking how many observations were on each side of zero.
- Trusted my mental model of a package's scale convention without reading the prompt template that was literally shipped with the package.
- Suggested "demean the residuals" as a workaround when the real issue was a mass point that needed a zero-inflated model or a selection-on-presence analysis.
- Claimed a pipeline was correct because a log line said "100% match" without checking what the match rate was measuring.

## Output format when invoked explicitly

If the user runs `/first-look <path>`, produce a short report:

```
## First-Look Report: <filename>

**Shape**: N rows × K columns
**Key columns inspected**: [list]

**Flags** (things worth knowing before analysis):
- <column>: <issue, e.g. "34% mass at 0 — likely 'absent' category per codebook">
- <column>: <issue>

**Documentation found**: <paths to codebooks, READMEs, prompt templates>

**Expected vs. observed**: <one sentence on whether the data matches naive expectations>

**Recommended next step**: <one concrete action>
```

Keep it under 20 lines unless the user asks for more detail.
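
The mechanical parts of the report (shape, flags, documentation paths) can be assembled in code. This is a sketch under stated assumptions: `first_look_report` is a hypothetical helper, it reuses the 5% threshold from Step 1, and it leaves the "Expected vs. observed" and "Recommended next step" lines to be written by hand because they require judgment.

```python
import pandas as pd

def first_look_report(df: pd.DataFrame, filename: str, cols, docs=(), threshold=0.05) -> str:
    """Build the header, flags, and documentation lines of a first-look report."""
    flags = []
    for col in cols:
        s = df[col]
        nan_share = s.isna().mean()
        if nan_share > threshold:
            flags.append(f"- {col}: {nan_share:.0%} missing")
        if pd.api.types.is_numeric_dtype(s):
            zero_share = (s == 0).mean()
            if zero_share > threshold:
                flags.append(f"- {col}: {zero_share:.0%} mass at 0")
    lines = [
        f"## First-Look Report: {filename}",
        "",
        f"**Shape**: {df.shape[0]} rows × {df.shape[1]} columns",
        f"**Key columns inspected**: {', '.join(cols)}",
        "",
        "**Flags** (things worth knowing before analysis):",
        *(flags or ["- none found"]),
        "",
        f"**Documentation found**: {', '.join(map(str, docs)) or 'none found'}",
    ]
    return "\n".join(lines)

# Toy data (made up) to show the shape of the output.
toy = pd.DataFrame({"score": [0] * 40 + list(range(1, 61))})
print(first_look_report(toy, "toy.csv", ["score"]))
```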