VRscores 2024 Methodology

Methodology

VRscores links employee records from Revelio Labs with individual records from the L2 voter file to estimate employer partisanship. The sections below outline the data workflow, from raw data through ensemble matching to public dataset export. For a complete technical description, please see our working paper on SSRN.

As of: 2024
Linked workers: 24.5M
Employers published: 534k

What VRscores measure

Our methodology combines large-scale voter registration data with employment records to create the first comprehensive measure of workplace partisanship. We call this measure "VRscores".

By default, the explorer highlights the two‑party Republican share among matched workers at each employer (Republicans divided by the sum of Democrats and Republicans). We provide both Registered and Imputed two‑party variants; the latter assigns a partisan lean to unaffiliated registrants using primary participation, precinct election returns, and demographics. In addition to these two‑party views, the public datasets include counts and shares for workers who are neither Democratic nor Republican (“Other”/non‑partisan), so analyses of overall partisan composition can include non‑partisans when appropriate.
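As a minimal sketch of the arithmetic described above, the two‑party Republican share and the "Other" share for one employer can be computed as follows. The function names and the example counts are hypothetical and are not taken from the public datasets.

```python
from typing import Optional


def two_party_republican_share(republicans: int, democrats: int) -> Optional[float]:
    """Republicans divided by (Democrats + Republicans); None if neither is present."""
    two_party_total = republicans + democrats
    if two_party_total == 0:
        return None  # share is undefined without any two-party workers
    return republicans / two_party_total


def other_share(republicans: int, democrats: int, other: int) -> Optional[float]:
    """Share of all matched workers who are neither Democratic nor Republican."""
    total = republicans + democrats + other
    if total == 0:
        return None
    return other / total


# Hypothetical employer: 30 Republicans, 50 Democrats, 20 Other/non-partisan.
rep_share = two_party_republican_share(30, 50)
non_partisan_share = other_share(30, 50, 20)
```

Note that the two‑party share ignores "Other" workers entirely, which is why the two measures use different denominators.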

Pipeline overview

We process the data in four steps to ensure accuracy and privacy.

1
Acquire & scope data
Load Revelio Labs and L2 datasets, then limit to individuals living/working in metro areas (MSAs).
- Revelio Labs (Apr 2025): ≈129 million positions representing ≈103 million workers; we assign VRIDs (VRscores employer identifiers) for aggregation.
- L2 voter file (Nov 2024): ≈185 million registrants with official or modeled party, turnout history, and home address.
- Restrict both sources to records in valid MSAs because we attempt to match only within MSA commuting zones.
2
Standardize variables
Parse and clean variables so that Revelio Labs positions can be matched with L2 voter records.
- Normalize names with the nominally library (lowercase, remove punctuation, split tokens); drop single-character names unless paired with longer tokens.
- Map L2 addresses to MSAs using the HUD USPS crosswalk.
- Partition both datasets by MSA to keep pairwise comparisons tractable.
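The name-cleaning rules above can be illustrated with a short standard-library sketch. It does not call the nominally library; the function name is hypothetical, and the reading of the single-character rule (drop a name only when every token is one character long) is an assumption.

```python
import string
from typing import List

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize_name(raw_name: str) -> List[str]:
    """Lowercase, remove punctuation, and split a name into tokens.

    Assumption: a name is dropped (empty list) only when all of its
    tokens are single characters; initials next to longer tokens are kept.
    """
    tokens = raw_name.lower().translate(_PUNCTUATION_TABLE).split()
    if tokens and all(len(token) == 1 for token in tokens):
        return []
    return tokens


full_name_tokens = normalize_name("Mary J. O'Brien")
initial_only_tokens = normalize_name("J.")
```

Here the initial in "Mary J. O'Brien" survives because it is paired with longer tokens, while a bare "J." is discarded.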
3
Match workers across datasets using two approaches
Run two complementary linkage algorithms inside each MSA and then resolve to one-to-one matches.
- Splink probabilistic matching (Fellegi–Sunter): exact + Damerau-Levenshtein + Jaro-Winkler comparators for names, birth-year windows (±1/±5/±10), and gender (imputed from first name), with term-frequency adjustments that down-weight matches on common names (e.g., Joe Smith).
- Expectation–maximization estimates m/u parameters (blocking on first and last name); candidate pairs with match probability ≥0.1 are retained.
- Fuzzylink semantic linker: block on derived gender and last name, embed names with OpenAI's text-embedding-3-large model, and use GPT-4o-mini to adjudicate uncertain matches.
- Both matching algorithms run separately for each MSA, with records split into alphabet bins when needed to reduce pairwise comparisons.
- Choose the single highest-probability link per Revelio Labs user and per L2 voter.
- The full linkage specification is documented in our SSRN working paper.
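One way to picture the one-to-one resolution step is a greedy pass over candidate pairs sorted by match probability. This is a sketch under assumptions: the greedy strategy, the function name, and the tuple layout are illustrative and may differ from the procedure in the working paper; only the 0.1 retention threshold comes from the text above.

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, str, float]  # (revelio_user_id, l2_voter_id, match_probability)


def resolve_one_to_one(candidates: List[Candidate],
                       min_probability: float = 0.1) -> List[Candidate]:
    """Keep at most one link per Revelio user and per L2 voter.

    Assumption: ties and conflicts are resolved greedily, highest
    probability first.
    """
    kept: List[Candidate] = []
    used_users: Set[str] = set()
    used_voters: Set[str] = set()
    for user_id, voter_id, probability in sorted(
        candidates, key=lambda c: c[2], reverse=True
    ):
        if probability < min_probability:
            break  # remaining pairs are all below the retention threshold
        if user_id in used_users or voter_id in used_voters:
            continue  # one side already has a stronger link
        kept.append((user_id, voter_id, probability))
        used_users.add(user_id)
        used_voters.add(voter_id)
    return kept


# Hypothetical candidate pairs for illustration only.
example_pairs = [
    ("u1", "v1", 0.95),
    ("u1", "v2", 0.60),
    ("u2", "v1", 0.80),
    ("u2", "v3", 0.40),
    ("u3", "v4", 0.05),
]
resolved_links = resolve_one_to_one(example_pairs)
```

In this example, u2 loses v1 to the stronger u1 link and falls back to v3, while u3 is left unmatched because its only candidate is below the threshold.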
4
Publish & document
Collapse to parquet, filter small employers, and export the public dataset.
- Collapse to the employer (VRID) level and retain parent metadata; we also remove ≈1,500 invalid employers (e.g., student, unemployed, retired).
- Drop VRID-years with fewer than 5 matched workers, then roll up to MSA, NAICS, and occupation panels, retaining the ≥50 worker threshold noted in Appendix tables.
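The small-employer filter can be sketched as a simple threshold on matched-worker counts per VRID-year. The row layout and field names (`vrid`, `year`, `matched_workers`) are hypothetical; only the minimum of 5 matched workers comes from the text above.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

MIN_MATCHED_WORKERS = 5  # employer-level threshold stated in the methodology


def drop_small_vrid_years(rows: List[Row],
                          min_workers: int = MIN_MATCHED_WORKERS) -> List[Row]:
    """Keep only VRID-years with at least `min_workers` matched workers."""
    return [row for row in rows if int(row["matched_workers"]) >= min_workers]


# Hypothetical VRID-year rows for illustration only.
vrid_years: List[Row] = [
    {"vrid": "A", "year": 2024, "matched_workers": 12},
    {"vrid": "B", "year": 2024, "matched_workers": 4},
    {"vrid": "C", "year": 2024, "matched_workers": 5},
]
published_rows = drop_small_vrid_years(vrid_years)
```

The same helper could be reused with `min_workers=50` for the MSA, NAICS, and occupation panels, assuming those panels carry an equivalent worker-count field.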
