01 · Hereditary lineage society · founded 1890
Daughters of the American Revolution
A complete admission record of the National Society Daughters of the American Revolution from its founding in 1890 through the eve of WWII — every member's name, birthplace, parents, husband, the qualifying Revolutionary War Patriot ancestor, and the multi-generation lineage chain proving that descent. Scraped from Ancestry collection 3174, which indexes the original 166-volume DAR-published Lineage Books set.
Background
The National Society Daughters of the American Revolution (DAR) is a hereditary lineage society founded in 1890. Membership requires documenting unbroken descent from a person who provided "patriotic service" to the Continental cause during the American Revolution (1775–1783). Each successful applicant submits a multi-generation lineage chain back to that Patriot ancestor. From 1890 onward the DAR published these submissions as the Lineage Books: a 166-volume set, with roughly 1,000 members per volume, transcribing each application as a paragraph of names, dates, places, and a brief summary of the Patriot's service.
For an economic-history project on social capital, civic identity, and inherited status (Chyn & Haggag), the Lineage Books are valuable because they encode both ends of a 4–6 generation family chain: a 1900s/1910s/1920s American woman on one end, and a verified 18th-century Patriot ancestor on the other. The Patriot service descriptions also include years, units, and place of death — research-quality detail you would otherwise pull from the DAR Genealogical Research System (which we also have, separately, as 132,840 Patriot records).
Field-level coverage
129,719 members, admitted between 1890 and roughly 1939. Approximately 1,000 members per published volume. Coverage of the underlying schema:
| Field | Coverage | Notes |
|---|---|---|
| member_name | 100.0% | Member's full name including title prefix |
| qualifying_ancestor | 91.4% | Patriot ancestor name(s); empty when Ancestry's parser failed on the Relative line |
| residence | 99.4% | Member's birthplace (the field is mis-labeled "residence" by the standard schema but the DAR Lineage Books actually transcribe birth place) |
| member_id | 98.5% | DAR national identification number, sequential by admission |
| lifespan_dates (derived) | 78.9% | All (YYYY-YYYY) spans found in the lineage narrative |
| service_years (derived) | 66.7% | Distinct years 1750–1799 mentioned in the Patriot service blurb |
| patriot_birth_year (derived) | 19.5% | Only when the record names a single Patriot AND the name is followed by a parenthetical lifespan in Comments (a conservative attribution heuristic) |
| patriot_death_year (derived) | 19.5% | Same as above |
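Coverage in the table above is the share of rows with a non-empty value. A minimal sketch of that calculation, assuming rows are loaded as dictionaries (for example via `csv.DictReader`); the function name and the sample rows are illustrative and not taken from the project pipeline:

```python
def field_coverage(rows, fields):
    """Return {field: percent of rows with a non-blank value}, rounded to 0.1."""
    total = len(rows)
    coverage = {}
    for field in fields:
        # Whitespace-only values count as empty.
        filled = sum(1 for row in rows if (row.get(field) or "").strip())
        coverage[field] = round(100.0 * filled / total, 1) if total else 0.0
    return coverage


# Made-up sample rows, only to show the shape of the input.
sample = [
    {"member_name": "Mrs. A", "qualifying_ancestor": "John Hart"},
    {"member_name": "Miss B", "qualifying_ancestor": ""},
    {"member_name": "Mrs. C", "qualifying_ancestor": "  "},
    {"member_name": "Mrs. D", "qualifying_ancestor": "James Moore"},
]
print(field_coverage(sample, ["member_name", "qualifying_ancestor"]))
```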
The DAR has had ~1.4 M members in its history and currently has ~185,000 active members. Our 130 K is the pre-WWII admission cohort — every person admitted while the Lineage Books series was being published.
02 · Every column in the table
What each row contains
Each row in the final dataset is one DAR member. Eight fields come straight from Ancestry's structured detail-page table; four more are derived in our Python pipeline by parsing the freeform Comments narrative for date patterns.
| Field | Description | Example |
|---|---|---|
| Identity | | |
| member_name | Member's full name with title prefix. 79% Mrs., 20% Miss, 0.6% no prefix. | Mrs. Sarah Vacher Williams |
| member_id | DAR national number; sequential by admission order. Useful as a human-readable cross-reference and as a proxy for admission date (low IDs = admitted earlier). Not guaranteed unique (see Caveats), so prefer source_id as the primary key. | 17148 |
| residence | Member's birthplace. Free-form transcription with inconsistent state spellings ("N. Y.", "New York", "NY"). Normalize before grouping. | Hudson County, New York. |
| Patriot ancestor | | |
| qualifying_ancestor | The Revolutionary War Patriot the membership is claimed through. ~14% of records list multiple Patriots ("X and of Y"). | Dr. John Francis Vacher, and of Capt. Robert Cochran. |
| Source & provenance | | |
| source_id | Ancestry record ID for this DAR entry. Stable; safe primary key. | 10741 |
| source_url | Direct link to the Ancestry detail page. | …/3174/records/10741 |
| society_slug | Standardized society identifier across the lineage-society datasets in this project. | national-society-daughters-of-the-american-revolution |
| raw_text_excerpt | The full Comments narrative (lineage chain + Patriot service description), truncated to ~400 chars for spreadsheet sanity. Use the JSONL for full text. | gender=Female | spouse= | comments=… |
| Date fields derived from Comments | | |
| patriot_birth_year | Patriot's birth year. Only set when the Relative names a single Patriot AND that name is followed by a (YYYY-YYYY) span within ~80 chars in the Comments narrative. Rejects matches that would imply a Patriot born after 1770. | 1751 |
| patriot_death_year | Same attribution rule. | 1807 |
| lifespan_dates | All (YYYY-YYYY) spans found in Comments, semicolon-delimited. May include the Patriot, his spouse, the great-grandparents, and the member's parents. Use for full lifespan recovery. | 1739-1829;1751-1807;1735-1824 |
| service_years | Distinct years 1750–1799 mentioned in Comments, semicolon-delimited. Strongly concentrated in 1775–1783 (the war years) — see the histogram in Distributions. | 1776;1777 |
| Standard-schema columns left blank for this collection | | |
| admission_date | Not exposed by Ancestry for 3174; left empty. | — |
| admission_year | Not exposed. Use member_id as a rough proxy (sequential). | — |
| source_year | Volume publication year. We do not currently maintain an ID-to-volume map, but the Internet Archive identifiers in the Sources tab let you reconstruct it. | — |
03 · One real row, picked because it's rich
Example record — DAR ID 17148
One real record from our scrape, picked because it's unusually rich: a multi-Patriot member with a full 5-generation lineage chain and detailed military-service descriptions for both Patriots.
Mrs. Sarah Vacher Williams
Daughter of John Van Vorst and Emily Harrimond Bacot, his wife. Granddaughter of John Van Vorst and Sarah Vacher, his wife; Peter Bacot and Mary Eugenia Cochran, his wife. Gr.-granddaughter of Cornelius Van Vorst and Anna Van Horne, his wife; John Francis Vacher and Sarah Potter, his wife; Col. Charles B. Cochran and Mary Thompson, his wife. Gr.-gr.-granddaughter of Robert Cochran and Mary Elliott (1739–1829), his wife.
John Francis Vacher (1751–1807) was appointed surgeon, 1777, and served through the war. He was an original member of the New York Society of the Cincinnati. He was born near Toulon, France, died in New York City, and was buried in St. Paul's churchyard.
Robert Cochran (1735–1824) served in the naval forces of South Carolina under Commodore Gillon. He was appointed, 1776, captain of the armed cruiser "Notre Dame," to procure materials of war and clothing for the army. In his cruises to France he captured many English vessels with valuable cargoes and gained much important information for the commander-in-chief. When Charleston surrendered he was captured, exiled to St. Augustine, Fla., and was a prisoner until the close of the war. For meritorious and important service he received the unanimous thanks of the South Carolina Legislature. He was born in Massachusetts and died in Charleston, S. C.
04 · Book → HTTP → JSONL → CSV
From source to row
What does the path look like end to end, from a typeset book page to one row in the final CSV? It has four hops, each illustrated with the example record above (Mrs. Sarah Vacher Williams, DAR ID 17148).
1. Ancestry's pre-built index gives us a recordId
Ancestry parsed the 166-volume set into a structured index in collection 3174, assigning each member a stable Ancestry recordId. Our scraper iterates recordId 1..150,000 sequentially. Member 17148 happens to live at:
GET https://www.ancestry.com/search/collections/3174/records/10741
Note that recordId (10741) is not the same as member_id (17148). The recordId is Ancestry's internal ID; the member_id is the DAR national number visible on the typeset page. The mapping is messy — recordIds are not contiguous (~13% are 404s with no underlying record) and the relationship between recordId and member_id is not monotone.
2. The detail page returns server-rendered HTML
Each detail page is ~150 KB of HTML. The 8 structured fields are encoded as <dt>Label</dt><dd>Value</dd> pairs in the page body. There is no JSON island, no batch endpoint for this collection — verified via Playwright network capture. Every record requires its own HTTP request:
→ HTTP 200, ~150 KB text/html
<dt>Source Name</dt><dd>Mrs. Sarah Vacher Williams</dd>
<dt>Gender</dt><dd>Female</dd>
<dt>Birth Place</dt><dd>Hudson County, New York.</dd>
<dt>Spouse</dt><dd>George Herbert Williams</dd>
<dt>Relative</dt><dd>Dr. John Francis Vacher, and of Capt. Robert Cochran.</dd>
<dt>Father</dt><dd>John Van Vorst</dd>
<dt>Mother</dt><dd>Emily Harrimond Bacot</dd>
<dt>Comments</dt><dd>Mrs. Sarah Vacher Williams.DAR ID Number: 17148; …</dd>
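Pulling the label/value pairs out of that HTML can be sketched with the standard library alone. This is an illustration rather than the project's parser: the class and function names are hypothetical, and it assumes each `<dt>` is followed by the `<dd>` it labels.

```python
from html.parser import HTMLParser


class DetailFieldParser(HTMLParser):
    """Collect <dt>Label</dt><dd>Value</dd> pairs into a dict."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._tag = None      # "dt" or "dd" while inside one, else None
        self._label = None    # most recent <dt> text awaiting its <dd>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing tag of something nested inside dt/dd
        # Normalize whitespace: collapse runs of spaces and newlines.
        text = " ".join("".join(self._buf).split())
        if tag == "dt":
            self._label = text
        elif self._label is not None:
            self.fields[self._label] = text
            self._label = None
        self._tag = None


def parse_detail_fields(html):
    parser = DetailFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


page = "<dl><dt>Source Name</dt><dd>Mrs. Sarah  Vacher\n Williams</dd><dt>Gender</dt><dd>Female</dd></dl>"
print(parse_detail_fields(page))
```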
3. We parse to JSONL with one immediate write per record
The parser pulls each <dt>/<dd> pair, normalizes whitespace, and pulls the DAR ID out of the Comments narrative via regex. One JSONL line per record, written immediately:
{
"Name": "Mrs. Sarah Vacher Williams",
"Gender": "Female",
"Birth Place": "Hudson County, New York.",
"Relative": "Dr. John Francis Vacher, and of Capt. Robert Cochran.",
"Father": "John Van Vorst",
"Mother": "Emily Harrimond Bacot",
"Spouse": "George Herbert Williams",
"Comments": "…full lineage chain + both Patriot biographies…",
"DAR ID Number": "17148",
"recordId": 10741,
"_status": 200,
"_url": "https://www.ancestry.com/search/collections/3174/records/10741"
}
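The two mechanical pieces of this step, the DAR ID regex and the write-immediately JSONL append, might look roughly like the following. It is a sketch under assumptions: the regex is inferred from the "DAR ID Number: 17148;" text in the sample Comments value, and the function names are hypothetical. The open-time check pads a file whose last line was cut off mid-write, so the next record always starts on its own line.

```python
import json
import os
import re

# Assumed pattern, based on the "DAR ID Number: 17148;" text in Comments.
DAR_ID_RE = re.compile(r"DAR ID Number:\s*(\d+)")


def extract_dar_id(comments):
    """Return the DAR national number as a string, or None if absent."""
    match = DAR_ID_RE.search(comments or "")
    return match.group(1) if match else None


def open_jsonl_for_append(path):
    """Open for append; if the file ends mid-line, terminate that line first."""
    needs_newline = False
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as existing:
            existing.seek(-1, os.SEEK_END)
            needs_newline = existing.read(1) != b"\n"
    handle = open(path, "a", encoding="utf-8")
    if needs_newline:
        handle.write("\n")
    return handle


def write_record(handle, record):
    """Write one JSON line and push it to disk before returning."""
    handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    handle.flush()
    os.fsync(handle.fileno())
```

A truncated final line left by an interrupted run stays in the file as its own unparseable line, so a reader should skip lines that fail `json.loads` rather than abort.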
4. JSONL → CSV adds derived date columns
The Comments narrative is the only place dates appear in this collection — no structured year field is exposed. Our converter (05_to_csv.py) post-processes Comments with regex to surface four useful date columns (patriot_birth_year, patriot_death_year, lifespan_dates, service_years). The Vacher Williams record yields:
patriot_birth_year: (blank; multi-Patriot)
patriot_death_year: (blank; multi-Patriot)
lifespan_dates: 1739-1829;1751-1807;1735-1824
service_years: 1776;1777
For multi-Patriot records like this one, the per-Patriot birth/death-year columns are deliberately left empty — there is no reliable way to attribute a parenthetical span like (1751-1807) to one particular Patriot when the Relative field names two of them. The full set of spans is preserved in lifespan_dates so a researcher can refine downstream.
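The four derived columns can be approximated with a few regular expressions. The sketch below is not the code in 05_to_csv.py; the patterns, the title-stripping rule, and the multi-Patriot test (the word "and" or a semicolon in Relative) are assumptions. Parenthetical lifespans are removed before service years are collected, which is inferred from the example output above, where 1751 is absent from service_years.

```python
import re

# (YYYY-YYYY) with either a hyphen or an en dash between the years.
SPAN_RE = re.compile(r"\((\d{4})\s*[-\u2013]\s*(\d{4})\)")
SERVICE_YEAR_RE = re.compile(r"\b17[5-9]\d\b")       # years 1750-1799
TITLE_RE = re.compile(r"^(?:[A-Z][a-z]*\.\s*)+")     # "Dr. ", "Capt. ", ...


def lifespan_dates(comments):
    """All (YYYY-YYYY) spans, in order of appearance, semicolon-delimited."""
    return ";".join(f"{b}-{d}" for b, d in SPAN_RE.findall(comments))


def service_years(comments):
    """Distinct years 1750-1799 outside parenthetical lifespans."""
    text = SPAN_RE.sub(" ", comments)
    return ";".join(sorted(set(SERVICE_YEAR_RE.findall(text))))


def patriot_years(relative, comments, window=80, max_birth=1770):
    """(birth, death) for a single named Patriot, else (None, None)."""
    if re.search(r"\band\b|;", relative):
        return None, None                    # multi-Patriot: do not guess
    name = TITLE_RE.sub("", relative.strip()).rstrip(". ")
    if not name:
        return None, None
    for hit in re.finditer(re.escape(name), comments):
        span = SPAN_RE.search(comments, hit.end(), hit.end() + window)
        if span:
            birth, death = int(span.group(1)), int(span.group(2))
            if birth <= max_birth:
                return birth, death
    return None, None


comments = (
    "Gr.-gr.-granddaughter of Robert Cochran and Mary Elliott (1739\u20131829), his wife. "
    "John Francis Vacher (1751\u20131807) was appointed surgeon, 1777, and served through the war. "
    "Robert Cochran (1735\u20131824) served in the naval forces of South Carolina. "
    "He was appointed, 1776, captain of the armed cruiser."
)
print(lifespan_dates(comments))   # 1739-1829;1751-1807;1735-1824
print(service_years(comments))    # 1776;1777
print(patriot_years("Dr. John Francis Vacher, and of Capt. Robert Cochran.", comments))  # (None, None)
```

A window-based match can still pick up a span that belongs to a nearby spouse: in the sample text, "Robert Cochran and Mary Elliott (1739–1829)" would match for a single-Patriot Cochran record. That is one reason to treat the attributed years as approximate.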
05 · Geography · service years · birth decades · top patriots
Distributions across the corpus
Four views of the data: where members were born, when their Patriot ancestors served, when those Patriots were born, and which Patriot names recur most. Together they sketch the demographic shape of the collection.
Where members were born
Top 20 birthplaces of DAR members, after normalizing state-name conventions ("N. Y.", "New York", "NY" all map to NY):
Top six states (NY, MA, PA, IL, OH, CT) account for ~47% of all members. The Northeast plus the early-Midwest (IL, OH, IN, MI, IA) dominate. Southern colonial states (VA, SC, NC, GA) appear in moderate counts. The South's lower share reflects both Confederate-era disruption to state-society organization and the demographic facts of who the granddaughters and great-granddaughters of Patriots actually were when the DAR was forming members in the 1890s–1910s.
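State normalization of the free-form birthplace strings could be sketched as follows. The alias table is deliberately tiny and hypothetical (only the spellings quoted in this document); a real pass would need all states plus whatever abbreviations turn up in the data.

```python
import re

# Illustrative aliases only; keys are lowercase with punctuation removed.
STATE_ALIASES = {
    "n y": "NY",
    "ny": "NY",
    "new york": "NY",
    "illinois": "IL",
}


def birthplace_state(birthplace):
    """Map a free-form birthplace to a state code, or None if unrecognized."""
    # The state, when present, is assumed to be the text after the last comma.
    tail = birthplace.rsplit(",", 1)[-1]
    # Drop periods and other punctuation, then collapse whitespace.
    key = " ".join(re.sub(r"[^a-z ]", " ", tail.lower()).split())
    return STATE_ALIASES.get(key)


for raw in ["Hudson County, New York.", "N. Y.", "NY", "Illinois.", "Toulon, France"]:
    print(raw, "->", birthplace_state(raw))
```

Returning None for unrecognized strings keeps unmapped birthplaces visible, so they can be reviewed rather than silently grouped.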
Patriot service years
Each Patriot biography in the Comments narrative typically names one or more years: the year of commission, the year of capture, the year of regimental service. We extracted every distinct year 1750–1799 from the Comments column. The histogram matches what the history predicts: a low 1750–1774 baseline, a sharp spike in 1775–1783 (the war), and a gradual drop-off through the 1790s.
Selected years; full histogram in the first-look JSON. The 1776 peak (20,716 mentions) is roughly 60% above the second-most-common year (1777). The pre-1775 mentions are mostly birth-year-of-grandparents or "served in French and Indian War" references; the post-1783 tail is "received pension in 1791," "moved to Ohio in 1795," etc.
Patriot birth decades
For the 25,329 records (19.5% of the corpus) where we could confidently attribute a single Patriot's birth/death years, here is the distribution of birth decades. The shape is exactly what you would expect for a Revolutionary War cohort: a sharp peak in the 1750s (15-25 years old at the start of the war), with the 1740s and 1760s as the secondary peaks.
The thin 1700s, 1710s, 1720s tail captures the older officer corps (people who were ~50+ years old at war start). The handful of 1770s births are typically drummer-boys or late-war militia.
Most-claimed Patriots
Top single-Patriot names by descendant count: the top 20, plus #21 for context. Common 18th-century names dominate (John, William, James), and many of these "John Brown"s are different people across multiple states. A future research pass could disambiguate via the residence + service-description blob.
| Rank | Patriot name | Descendant count |
|---|---|---|
| 1 | John Hart | 83 |
| 2 | John Brown | 69 |
| 3 | John Williams | 56 |
| 3 | John Davis | 56 |
| 3 | James Moore | 56 |
| 6 | Peter Norton | 53 |
| 7 | John Reed | 52 |
| 8 | James Smith | 51 |
| 9 | James Williams | 48 |
| 10 | William White | 46 |
| 10 | John White | 46 |
| 10 | William Brown | 46 |
| 10 | William Smith | 46 |
| 14 | John Thompson | 45 |
| 14 | John Smith | 45 |
| 14 | William Henshaw | 45 |
| 17 | Thomas Marshall | 44 |
| 18 | Samuel Smith | 43 |
| 18 | John Allen | 43 |
| 20 | John Harris | 42 |
| 21 | Samuel Adams | 41 |
Samuel Adams (the Boston Tea Party Patriot) appears at #21 — but this is almost certainly a mix of the Samuel Adams plus other unrelated Samuel Adamses. Name-collision is the dominant story in this leaderboard.
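The leaderboard uses tied ranks (three names share rank 3, and the next rank is 6), which is standard competition ranking. A minimal sketch of producing such a table from a list of single-Patriot names; the function name is hypothetical, and names are compared as plain strings, so it does nothing to resolve the name collisions described above:

```python
from collections import Counter


def leaderboard(names, top=None):
    """Return [(rank, name, count)] with ties sharing the same rank."""
    counts = Counter(names)
    # Highest count first; alphabetical order within a tie for stable output.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows, rank, previous = [], 0, None
    for position, (name, count) in enumerate(ordered, start=1):
        if count != previous:
            rank = position      # a new count starts a new rank
            previous = count
        rows.append((rank, name, count))
    return rows[:top] if top else rows


demo = ["A"] * 3 + ["B"] * 2 + ["C"] * 2 + ["D"]
print(leaderboard(demo))
```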
06 · ~6 days · 1 worker · Cloudflare-aware
How we collected it
Unlike most Ancestry collections, 3174 has no batch endpoint — the imageviewer JSON API that gave us 80× speedups on collections 2204 (SAR Applications) and 2221 (Great Registers) returns empty results for 3174. The collection is index-only: there are no scanned images, only typeset book transcriptions. Verification was definitive — three independent tests:
- Every 3174 record's "report-issue" link encodes imageId="" and indexOnly=true. Compare to 2204, where the same link encodes a real imageId like 32596_242028-00010.
- Calling /imageviewer/api/record/index-panel-data?dbId=3174&imageId=X returns HTTP 200 with {"records": []} for every imageId tried: empty string, raw recordIds, even known-good 2204 imageIds.
- A Playwright headful capture of every XHR fired by Ancestry's own UI on a 3174 detail page shows no batch endpoint. The image-viewer sidebar that powers 2204's batch path doesn't exist for 3174 because the collection has no images.
So the only path is one HTTP request per record, fetching the ~150 KB detail HTML and parsing the <dt>/<dd> fields. To stay under Cloudflare's bot threshold:
- 1 worker (vs 3 for image-backed collections — detail HTML is more aggressively rate-limited)
- 0.3–0.6 s jitter per request via polite_get
- Connection pool reset every 100 fetches to defeat per-connection bot scoring (__cf_bm, TLS session tickets, HTTP/2 stream history). Cost per reset is one fresh TLS handshake, ~100 ms, basically free.
- 24-hour cookie freshness guard: the scraper auto-exits cleanly when cookies pass that age, before Cloudflare's clearance token can drift far enough to trigger a soft block. A macOS launchd daemon watches the local cookies.json and rsyncs to UCLA on every refresh; the running scraper picks up new cookies at the next session reset (~5–7 min) without restart.
- Append-only JSONL with immediate flush, plus a defensive trailing-newline guard at writer open-time, which defends against losing records to a SIGTERM mid-write.
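The control flow behind the first four items can be sketched as a single loop. Everything here is hypothetical scaffolding: `make_session`, `fetch`, and `cookie_age_s` stand in for project code that is not shown (including the internals of polite_get), and they are passed in as callables so the loop can be exercised without any network access.

```python
import random
import time


def polite_sleep(lo=0.3, hi=0.6, sleep=time.sleep):
    """Sleep for a random 0.3-0.6 s so requests are not evenly spaced."""
    delay = random.uniform(lo, hi)
    sleep(delay)
    return delay


def crawl(record_ids, make_session, fetch, cookie_age_s,
          reset_every=100, max_cookie_age_s=24 * 3600, sleep=time.sleep):
    """Fetch records one at a time; return how many were fetched."""
    session = make_session()          # assumed to load the current cookies
    done = 0
    for record_id in record_ids:
        if cookie_age_s() > max_cookie_age_s:
            break                     # cookie guard: exit cleanly, resume later
        if done and done % reset_every == 0:
            session = make_session()  # fresh connection pool and cookies
        fetch(session, record_id)
        polite_sleep(sleep=sleep)
        done += 1
    return done


# Dry run with stand-ins: no network, no real sleeping.
sessions = []
fetched = crawl(
    range(1, 251),
    make_session=lambda: sessions.append(object()) or sessions[-1],
    fetch=lambda session, record_id: None,
    cookie_age_s=lambda: 0,
    sleep=lambda seconds: None,
)
print(fetched, len(sessions))
```

Because the loop returns a count instead of raising when the cookie guard trips, a supervising script can tell a clean guard exit from a crash and restart from the next unfetched recordId.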
Sustained rate: ~0.27 records/sec (~16 records/min, ~970/hour). Total wall-clock for 150,000 recordIds: ~6 days, with one ~5-min cookie-refresh interruption per day.
07 · 4 done · 2 future · 7 caveats
Status & caveats
Phase status
| Phase | State | Outcome / notes |
|---|---|---|
| 1. Endpoint probe | done | Detail-HTML mode confirmed for 3174. Imageviewer batch endpoint definitively dead via 3-test verification. |
| 2. Detail-page scrape (UCLA) | done | ~6 days at 0.27 r/s, 1 worker, 7 cookie-refresh cycles. 0 soft blocks, 0 IP cooldowns, 6 clean cookie-guard exits. |
| 3. JSONL → CSV | done | 129,719 rows; 4 derived date columns added. |
| 4. First-look analysis | done | Distributions, missing-value patterns, attribution coverage. See the Distributions tab. |
| 5. ID-to-volume map | future | For approximating admission_year from member_id. Each volume's title page records its publication year (e.g., vol 18 = 1905). Reconstruct via the IA scans listed in Sources. |
| 6. Patriot disambiguation | future | "John Brown" appears 69 times. Within-state + service-description hashing should split most ambiguous names. Useful for cross-walk to DAR Genealogical Research System (132,840 Patriots already collected). |
Caveats
- No structured dates. Ancestry's parser for collection 3174 captures Name, Gender, Birth Place, Father, Mother, Spouse, Relative, and a long Comments narrative — but no birth year, death year, marriage year, or admission year as discrete fields. All dates here are extracted from Comments via regex; treat them as approximate and verify on the source for any individual case.
- Birth-place column is a free-form transcription. The same state appears as "N. Y.", "New York", "NY". The Distributions tab presents canonicalized counts; the raw CSV preserves Ancestry's spelling. Some entries give a county or city plus a state ("Hudson County, New York"); others are just a state ("Illinois."). Strip trailing periods before parsing.
- Multi-Patriot membership. 14% of records list multiple Patriot ancestors ("X and of Y", "X, Y, and Z"). For these, our heuristic intentionally leaves patriot_birth_year / patriot_death_year empty rather than guessing which Patriot a given (YYYY-YYYY) span belongs to. The raw spans are preserved in lifespan_dates.
- member_id is mostly sequential, but not perfectly so. 1,172 records share an ID with at least one other record: a mix of multi-volume reprints, parser hiccups on Comments where the regex pulled the wrong digits, and a small tail of bogus 7-digit values. For analysis purposes, consider source_id (Ancestry recordId) the safer key and member_id the human-meaningful but imperfect cross-reference.
- Patriot-name leaderboard is name-collision-heavy. "John Brown" appears 69 times, and these are mostly different John Browns from different states, not 69 descendants of one man. State + service-description hashing would be required to disambiguate before treating any leaderboard row as an actual Patriot.
- Approximately 9% of records have no Patriot recorded. Ancestry's parser failed on ~11,000 entries, typically because the original typeset paragraph used unusual punctuation that broke their column extraction. The Comments narrative for these records is intact; a researcher could extract the Relative manually if needed.
- Books are typeset, not scanned-then-OCR'd. Unlike microfilm-based Ancestry collections, the underlying source for 3174 is the originally printed and indexed DAR Lineage Book set. Field values are 100% clean text from the printed page — no OCR errors, no handwriting transcription noise. The trade-off: no microfilm-style image to inspect on dispute. Use the Internet Archive scans (Sources tab) for visual verification.
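For the member_id caveat, a quick screening pass before analysis could flag the problem rows. This is a sketch under assumptions: rows are dictionaries with source_id and member_id as strings, and the six-digit ceiling is a guess based on the "bogus 7-digit values" mentioned above, not a documented DAR limit.

```python
from collections import Counter


def flag_member_ids(rows, max_digits=6):
    """Return {source_id: [reasons]} for rows whose member_id looks unreliable."""
    counts = Counter(row["member_id"] for row in rows if row.get("member_id"))
    flags = {}
    for row in rows:
        member_id = row.get("member_id") or ""
        reasons = []
        if not member_id:
            reasons.append("missing")
        else:
            if counts[member_id] > 1:
                reasons.append("duplicate")
            if not member_id.isdigit() or len(member_id) > max_digits:
                reasons.append("implausible")
        if reasons:
            flags[row["source_id"]] = reasons
    return flags


# Made-up rows, apart from the documented 10741 / 17148 pair.
rows = [
    {"source_id": "10741", "member_id": "17148"},
    {"source_id": "20001", "member_id": "17148"},
    {"source_id": "20002", "member_id": "1234567"},
    {"source_id": "20003", "member_id": ""},
    {"source_id": "20004", "member_id": "555"},
]
print(flag_member_ids(rows))
```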
08 · Books · archives · cross-walks
Sources & provenance
- National Society Daughters of the American Revolution. Lineage Book of the National Society of the Daughters of the American Revolution. Vols 1–166, published 1890–1939.
- Ancestry.com collection 3174, U.S., Daughters of the American Revolution Lineage Books, 1890-1939. Pre-indexed; no batch endpoint.
- Internet Archive, lineagebookNNdaug series: complete scans of vols 1–67, full-text OCR available. Sample identifier: lineagebook1817daug (vol 18, 1905; pictured in Source → row). Search prefix: archive.org → DAR Lineage Books.
- Internet Archive, lineagebookNNrevogoog series: 28 Google-Books-sourced scans, vols 1–28+.
- Internet Archive, lineagebooknatioNNNdaug series: 8 later-volume scans (vols 28, 31, 33, 34, 35, 48, 134, 136).
- DAR Genealogical Research System (Patriot side, separately scraped in this project): 132,840 records covering ~110K verified Patriots and their service descriptions. Cross-walks to the Patriot names in this collection; see data/lineage_societies/dar_grs/extracted/members.csv.