Methodology
VRscores links employee records from Revelio Labs with individual records from the L2 voter file to estimate employer partisanship. The sections below outline the data workflow, from raw data through ensemble matching to public dataset export.
- As of: 2024
- Linked workers: 24.5M
- Employers published: 534k
What VRscores measure
By default, the explorer highlights the two‑party Republican share among matched workers at each employer (Republicans divided by Democrats plus Republicans). We provide both Registered and Imputed two‑party variants; the latter assigns lean to unaffiliated registrants using primary participation, precinct election returns, and demographics. In addition to these two‑party views, the public datasets also include counts and shares for workers who are neither Democratic nor Republican (“Other”/non‑partisan), enabling analyses of overall partisan composition that include non‑partisans when appropriate.
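The two-party share described above can be illustrated with a minimal sketch. The function name and the example counts are hypothetical and are not drawn from the published datasets; the same arithmetic would apply to either the Registered or the Imputed party counts.

```python
from typing import Optional


def two_party_republican_share(republicans: int, democrats: int) -> Optional[float]:
    """Republicans / (Democrats + Republicans); None when there are no two-party workers."""
    total = democrats + republicans
    if total == 0:
        return None  # share is undefined without any Democrats or Republicans
    return republicans / total


# Hypothetical employer: 30 Republican, 50 Democratic, and 20 "Other" matched workers.
rep, dem, other = 30, 50, 20

share = two_party_republican_share(rep, dem)   # 30 / 80
other_share = other / (rep + dem + other)      # 20 / 100, includes non-partisans

print(share, other_share)
```

Note that the two-party share ignores "Other" workers entirely, which is why the overall composition shares use all matched workers as the denominator.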
Pipeline overview
We process the data in four steps.
Step 1: Acquire & scope data
Load the Revelio Labs and L2 datasets, then limit both to individuals living or working in metropolitan statistical areas (MSAs).
- Revelio Labs (Apr 2025): ≈129 million positions representing ≈103 million workers; we assign VRIDs (VRscores employer identifiers) for aggregation.
- L2 voter file (Nov 2024): ≈185 million registrants with official or modeled party, turnout history, and home address.
- Restrict both sources to records in valid MSAs because we attempt to match only within MSA commuting zones.
Step 2: Standardize variables
Parse and clean variables in preparation for matching Revelio Labs positions to L2 voter records.
- Normalize names with the nominally library (lowercase, remove punctuation, split tokens); drop single-character names unless paired with longer tokens.
- Map L2 addresses to MSAs using the HUD USPS crosswalk.
- Partition both datasets by MSA to keep pairwise comparisons tractable.
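The name-cleaning rules in this step can be sketched with the Python standard library alone. This is a stand-in for illustration and does not use or reproduce the nominally library. The function name is hypothetical, and the handling of single-character names reflects one reading of the rule above: a name made up only of single-character tokens is dropped, while an initial that accompanies a longer token is kept.

```python
import string
from typing import List, Optional

# Translation table that deletes ASCII punctuation (assumption: non-ASCII marks are left as-is).
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_name(raw: str) -> Optional[List[str]]:
    """Lowercase, remove punctuation, and split into tokens.

    Returns None when nothing usable remains, i.e. the name is empty or
    consists only of single-character tokens.
    """
    tokens = raw.lower().translate(_DELETE_PUNCT).split()
    if not tokens or all(len(tok) == 1 for tok in tokens):
        return None
    return tokens


print(normalize_name("O'Brien, Mary"))  # punctuation removed, two tokens kept
print(normalize_name("J. Smith"))       # initial kept because it is paired with a longer token
print(normalize_name("J. R."))          # only initials, so the name is dropped
```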
Step 3: Match workers across datasets using two approaches
Run two complementary linkage algorithms inside each MSA and then resolve to one-to-one matches.
- Splink probabilistic matching (Fellegi–Sunter): exact + Damerau-Levenshtein + Jaro-Winkler comparators for names, birth-year windows (±1/±5/±10), and gender (imputed from first name), with term-frequency adjustments. Term-frequency adjustments down-weight matches on common names (e.g., Joe Smith).
- Expectation–maximization estimates m/u parameters (blocking on first and last name); candidate pairs with match probability ≥0.1 are retained.
- Fuzzylink semantic linker: block on derived gender and last name, embed names with OpenAI's text-embedding-3-large model, and use GPT-4o-mini to adjudicate uncertain matches.
- Both algorithms run per MSA, with records split into alphabet bins when needed to reduce the number of pairwise comparisons.
- Choose the single highest-probability link per Revelio Labs user and per L2 voter.
- The full linkage specification is documented in our SSRN working paper.
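The one-to-one resolution step can be illustrated with a greedy sketch: sort candidate pairs by match probability and accept a pair only if neither side has already been linked. This is one plausible reading of "single highest-probability link per Revelio Labs user and per L2 voter", not the documented procedure. The identifiers, the tuple layout, and the reuse of the 0.1 retention threshold here are assumptions for illustration.

```python
from typing import List, Tuple

# (revelio_user_id, l2_voter_id, match_probability); a hypothetical layout
Pair = Tuple[str, str, float]


def resolve_one_to_one(pairs: List[Pair], min_prob: float = 0.1) -> List[Pair]:
    """Greedily keep the highest-probability link per user and per voter."""
    kept: List[Pair] = []
    used_users, used_voters = set(), set()
    for user, voter, prob in sorted(pairs, key=lambda p: p[2], reverse=True):
        if prob < min_prob:
            break  # pairs are sorted, so every remaining pair is below the threshold
        if user in used_users or voter in used_voters:
            continue  # one side is already linked to a higher-probability partner
        kept.append((user, voter, prob))
        used_users.add(user)
        used_voters.add(voter)
    return kept


candidates = [
    ("u1", "v1", 0.95),
    ("u1", "v2", 0.60),
    ("u2", "v1", 0.90),
    ("u2", "v3", 0.40),
    ("u3", "v4", 0.05),
]
matches = resolve_one_to_one(candidates)
print(matches)  # [('u1', 'v1', 0.95), ('u2', 'v3', 0.4)]
```

In this example u2 loses its best candidate (v1) to u1 and falls back to v3, while u3 is left unmatched because its only candidate falls below the threshold. A stricter alternative would keep only mutual best matches, which would leave u2 unmatched as well.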
Step 4: Publish & document
Collapse to Parquet, filter out small employers, and export the public dataset.
- Collapse to the employer (VRID) level and retain parent metadata; we also remove ≈1,500 invalid employers (e.g., student, unemployed, retired).
- Drop VRID-years with fewer than 5 matched workers, then roll up to MSA, NAICS, and occupation panels, retaining the ≥50-worker threshold noted in the Appendix tables.
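The collapse-and-filter logic for employer-level cells might look like the following sketch. The row layout (VRID, year, party label) and the function name are hypothetical; only the minimum of 5 matched workers per VRID-year comes from the description above. The same pattern would apply to the MSA, NAICS, and occupation panels with a threshold of 50.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_MATCHED_WORKERS = 5  # VRID-year cells below this count are not published

Row = Tuple[str, int, str]   # (vrid, year, party); a hypothetical layout
CellKey = Tuple[str, int]    # (vrid, year)


def collapse_vrid_years(rows: Iterable[Row],
                        min_workers: int = MIN_MATCHED_WORKERS) -> Dict[CellKey, Counter]:
    """Count matched workers by party within each VRID-year, then drop small cells."""
    cells: Dict[CellKey, Counter] = {}
    for vrid, year, party in rows:
        cells.setdefault((vrid, year), Counter())[party] += 1
    return {key: counts for key, counts in cells.items()
            if sum(counts.values()) >= min_workers}


# Employer A has 5 matched workers in 2024; employer B has only 4.
rows = [("A", 2024, "R")] * 3 + [("A", 2024, "D")] * 2 + [("B", 2024, "D")] * 4
published = collapse_vrid_years(rows)
print(published)
```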