Methodology
What is this?
JaffeCast is a statistical model that estimates the outcome of an election. Rather than predicting a single result, it simulates the race 1,000 times using a technique called Monte Carlo simulation: each simulation randomly varies turnout, candidate support, and voting patterns within plausible ranges, producing a distribution of possible outcomes. The win probabilities and confidence intervals you see on the dashboard summarize that distribution; they are not a single forecast.
Data Sources
The model ingests three categories of data:
- Early vote returns from official state and county election offices, updated daily during the early voting period. Because state-level reporting can lag one to two days behind actual figures, we supplement with direct scrapes of major county election board websites to capture the most recent available data.
- Demographic data at the precinct level from the U.S. Census Bureau’s American Community Survey, matched to voting precincts across the relevant jurisdiction. Counties without precinct-level matches use county-level composition estimates.
- Public polling from available surveys of the race, each weighted by sample size and recency.
Poll Averaging
Each poll’s weight is the product of two terms: the square root of its effective sample size and a recency decay factor with a 30-day half-life. Polls older than 60 days are dropped entirely. Effective sample size is capped at 1,500 to prevent any single large-sample poll from dominating the average; the statistical precision gained beyond that threshold is marginal relative to other sources of uncertainty. The relatively short half-life and the hard 60-day cutoff reflect the speed at which opinion shifts in a primary race, consistent with standard polling aggregation practice.
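As a minimal sketch of this weighting rule, assuming the weight is the square root of the capped sample size times the decay factor and that poll age is measured in days from some reference date (the function and constant names are illustrative, not taken from the model's code):

```python
import math

HALF_LIFE_DAYS = 30   # recency half-life stated in the text
MAX_AGE_DAYS = 60     # polls older than this are dropped
SAMPLE_CAP = 1500     # cap on effective sample size


def poll_weight(effective_n: float, age_days: float) -> float:
    """Return the averaging weight for one poll, or 0.0 if it is too old."""
    if age_days > MAX_AGE_DAYS:
        return 0.0
    capped_n = min(effective_n, SAMPLE_CAP)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every 30 days
    return math.sqrt(capped_n) * recency
```

Under these assumptions a 900-person poll released today has weight 30, the same poll at 30 days old has weight 15, and a 5,000-person poll is treated exactly like a 1,500-person one.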
Polls released in the same time window capture overlapping snapshots of the race and are partially redundant. To account for this, a cluster dampening factor is applied: each poll’s weight is divided by the square root of the number of polls, itself included, with field dates within seven days of its own. A poll that lands in isolation keeps its full weight; polls that cluster together each carry reduced weight.
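The dampening step might look like the following sketch. It assumes each poll is summarized by a single representative field date and that "within seven days" is inclusive; both are assumptions, as is the function name.

```python
import math
from datetime import date

WINDOW_DAYS = 7  # clustering window stated in the text


def cluster_dampened_weights(field_dates, base_weights):
    """Divide each weight by sqrt(number of polls within the window, itself included)."""
    dampened = []
    for poll_date, weight in zip(field_dates, base_weights):
        neighbors = sum(
            1 for other in field_dates
            if abs((other - poll_date).days) <= WINDOW_DAYS
        )
        dampened.append(weight / math.sqrt(neighbors))
    return dampened


# Hypothetical field dates: two polls close together, one isolated.
example_dates = [date(2026, 1, 1), date(2026, 1, 3), date(2026, 2, 20)]
example_weights = cluster_dampened_weights(example_dates, [1.0, 1.0, 1.0])
```

In this example the two January polls are each divided by the square root of 2, while the isolated February poll keeps its full weight.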
Some polls report only topline candidate numbers without demographic crosstabs. For these, we construct modeled crosstabs by taking the weighted average of demographic breakdowns from polls conducted in the same period and adjusting them to match the disclosed topline results. These crosstabs are synthetic: the demographic patterns are inferred from contemporaneous surveys, not observed in the original poll. We accept that tradeoff so that every poll can contribute to the demographic model.
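One simple way to adjust pooled crosstabs to a disclosed topline is proportional scaling, sketched below for a single candidate. The scaling rule, the function name, and the group labels are assumptions for illustration; the text does not specify how the adjustment is performed.

```python
def modeled_crosstab(pooled_support, group_shares, topline):
    """Scale pooled group-level support so it reproduces the poll's topline.

    pooled_support: {group: candidate support in that group, from other polls}
    group_shares:   {group: share of the poll's sample in that group}
    topline:        the candidate's disclosed overall support
    """
    implied = sum(group_shares[g] * pooled_support[g] for g in group_shares)
    if implied <= 0:
        raise ValueError("pooled crosstabs imply zero support")
    factor = topline / implied
    return {g: pooled_support[g] * factor for g in group_shares}


# Hypothetical inputs: pooled crosstabs imply 40% overall, the poll reports 44%.
example_shares = {"Group 1": 0.5, "Group 2": 0.5}
example_crosstab = modeled_crosstab(
    {"Group 1": 0.50, "Group 2": 0.30}, example_shares, topline=0.44
)
```

Every group is scaled by the same factor (1.1 here), so the relative demographic pattern comes entirely from the other polls while the overall level comes from the topline-only poll.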
Polls with small demographic subsamples produce noisy crosstabs that can distort the model. An empirical Bayes shrinkage step blends small-subgroup estimates toward the cross-poll mean, weighted by subgroup sample size. Groups with large samples (e.g., white voters in a 1,600-person poll) are barely affected; groups with few respondents (e.g., Asian voters in a 400-person poll) are pulled substantially toward the consensus of other surveys. This prevents any single poll’s noisy subsample from driving the demographic model.
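A common form of this kind of shrinkage is a reliability-weighted blend, sketched here. The `prior_strength` parameter (the subgroup size at which the poll and the cross-poll mean get equal weight) and its default of 100 are hypothetical; the model's actual tuning is not stated.

```python
def shrink_toward_mean(estimate, subgroup_n, cross_poll_mean, prior_strength=100.0):
    """Blend a subgroup estimate with the cross-poll mean, weighted by subgroup size."""
    reliability = subgroup_n / (subgroup_n + prior_strength)
    return reliability * estimate + (1.0 - reliability) * cross_poll_mean
```

With these settings a subgroup of 100 respondents lands halfway between its own estimate and the cross-poll mean, while a subgroup of several thousand is left almost unchanged.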
When multiple internal polls come from the same campaign or affiliated organization, they are composited into a single entry — averaged together and treated as one poll in the weighted average. This prevents a campaign that releases several internals from disproportionately pulling the model toward its candidate. Internal and partisan polls also receive a house-effect adjustment: the sponsoring candidate’s margin is penalized by 2.4 points, consistent with the average partisan lean observed in historical internal polling.
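The compositing and house-effect steps can be sketched as follows, assuming each internal poll is reduced to the sponsoring candidate's margin in percentage points and that polls from one sponsor are averaged with equal weight. The data layout and names are illustrative only.

```python
from collections import defaultdict
from statistics import mean

HOUSE_EFFECT_POINTS = 2.4  # penalty on the sponsor's margin, from the text


def composite_internal_margins(polls):
    """polls: list of (sponsor, sponsor_margin_points) pairs.

    Returns one adjusted margin per sponsor: the average of that sponsor's
    polls, minus the house-effect penalty.
    """
    by_sponsor = defaultdict(list)
    for sponsor, margin in polls:
        by_sponsor[sponsor].append(margin)
    return {s: mean(m) - HOUSE_EFFECT_POINTS for s, m in by_sponsor.items()}


# Hypothetical internals: two from one campaign, one from another.
example_margins = composite_internal_margins(
    [("Campaign A", 8.0), ("Campaign A", 4.0), ("Campaign B", 1.0)]
)
```

Campaign A's two polls collapse into a single entry with a margin of 6.0 points, which the penalty lowers to 3.6; Campaign B's single poll moves from 1.0 to -1.4.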
Undecided voters are redistributed proportionally among candidates by default. The simulator allows users to adjust this allocation.
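Proportional redistribution gives each candidate a slice of the undecided vote equal to that candidate's share of the decided vote. A minimal sketch, assuming the undecided share is stored under the key "Undecided" (an illustrative convention):

```python
def redistribute_undecided(support):
    """Allocate the 'Undecided' share to candidates in proportion to their support."""
    decided = {name: value for name, value in support.items() if name != "Undecided"}
    total_decided = sum(decided.values())
    if total_decided <= 0:
        raise ValueError("no decided support to allocate against")
    undecided = support.get("Undecided", 0.0)
    return {
        name: value + undecided * value / total_decided
        for name, value in decided.items()
    }


# Hypothetical topline in percentage points.
example_allocation = redistribute_undecided(
    {"A": 40.0, "B": 30.0, "C": 10.0, "Undecided": 20.0}
)
```

Here candidate A holds half of the decided vote and so receives half of the 20 undecided points, ending at 50; B ends at 37.5 and C at 12.5.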
Turnout Projection
The central modeling challenge is that early votes are observed but Election Day turnout is not. The core projection for each county is:
Projected total = Early votes ÷ Estimated early vote share
But the estimated early vote share is not a single number — it’s a composite built from several inputs:
- A historical baseline anchored to official canvass data from the most recent comparable election for the largest counties, with the statewide average as the fallback for remaining counties.
- A cannibalization adjustment that accounts for the structural shift toward early voting observed in recent election cycles.
- Demographic composition coefficients that shift the share based on each county’s racial makeup — historically, whiter counties show higher early vote shares, while counties with larger Black and Hispanic populations see proportionally larger Election Day surges.
- A global turnout adjustment that links overall turnout levels to the early/Election Day split — higher turnout means early votes represent a smaller share of the total.
The resulting share is clamped to a plausible range to prevent extreme projections. Critically, it is not treated as a known quantity — Monte Carlo noise is added each iteration, reflecting genuine uncertainty about how large the Election Day vote will be.
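The projection for a single county in a single iteration can be sketched as below. The clamp bounds (0.30 and 0.85) are hypothetical placeholders, since the text does not give the range, and applying the noise before the clamp is also an assumption. The default noise level of 0.06 corresponds to a six-percentage-point standard deviation.

```python
import random


def project_total_votes(early_votes, early_share, rng,
                        noise_sd=0.06, lower=0.30, upper=0.85):
    """Project total turnout from early votes and a noisy early vote share."""
    noisy_share = early_share + rng.gauss(0.0, noise_sd)
    clamped_share = min(max(noisy_share, lower), upper)  # keep the share plausible
    return early_votes / clamped_share


# A seeded generator makes the draw repeatable.
example_projection = project_total_votes(59_000, 0.59, random.Random(42))
```

With the noise switched off, 59,000 early votes at a 59% early share project to 100,000 total votes; with noise on, each iteration produces a somewhat different total.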
Candidate Support Estimation
Candidate support is estimated per county by applying the poll-derived support matrix to local demographic composition from Census data, with regional adjustments to capture geographic variation not explained by demographics alone. Where the state does not register voters by party, all partisan composition estimates are modeled from precinct-level Census data and observed primary voting patterns.
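Applying a support matrix to local composition amounts to a weighted sum over demographic groups. The sketch below assumes the matrix is stored as nested dictionaries and leaves out the regional adjustments mentioned above; the group and candidate labels are placeholders.

```python
def county_support(support_matrix, electorate_shares):
    """Estimate candidate support in one county.

    support_matrix:    {group: {candidate: support within that group}}
    electorate_shares: {group: share of the county's electorate in that group}
    """
    candidates = list(next(iter(support_matrix.values())))
    return {
        candidate: sum(
            electorate_shares[group] * support_matrix[group][candidate]
            for group in electorate_shares
        )
        for candidate in candidates
    }


# Hypothetical two-group, two-candidate example.
example_support = county_support(
    {"Group 1": {"A": 0.6, "B": 0.4}, "Group 2": {"A": 0.3, "B": 0.7}},
    {"Group 1": 0.5, "Group 2": 0.5},
)
```

A county split evenly between the two groups gives candidate A 45% and candidate B 55% in this example.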
Monte Carlo Simulation
Each forecast update runs 1,000 independent simulations. Each iteration randomly varies:
- Demographic turnout levels (standard deviation of 1.5 percentage points, scaled by group)
- Candidate support multipliers (±15% for major candidates, ±20% for minor candidates)
- Early vote share (standard deviation of 6 percentage points)
All noise parameters are scaled by data availability: uncertainty is widest at the start of early voting and narrows as more days of data arrive. However, uncertainty never reaches zero — residual uncertainty about Election Day turnout persists until actual results are reported.
Win probabilities are the fraction of simulations won by each candidate. A variance ceiling mechanically prevents any candidate’s win probability from exceeding approximately 75% unless the projected vote lead is large relative to the confidence interval width. This guards against false precision in a race with limited polling.
Confidence intervals span the 5th to 95th percentiles of simulated outcomes, so each covers the central 90% of simulations. Where applicable, the model also tracks the probability of a runoff, which occurs when no candidate exceeds the majority threshold required to win outright.
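A stripped-down version of the simulation loop is sketched below. It varies only the candidate support multipliers, treats the 15% figure as a standard deviation for every candidate, counts a "win" as finishing first, and omits the turnout noise and the variance ceiling; all of these are simplifying assumptions, and the function name and seed are illustrative.

```python
import random
import statistics


def run_simulations(base_support, n_sims=1000, seed=2026,
                    support_sd=0.15, majority=0.5):
    """Simulate vote shares and summarize wins, intervals, and runoff odds."""
    rng = random.Random(seed)  # seeded, so identical inputs give identical outputs
    wins = {c: 0 for c in base_support}
    shares = {c: [] for c in base_support}
    runoffs = 0
    for _ in range(n_sims):
        # Multiply each candidate's support by a noisy factor, floored above zero.
        raw = {c: s * max(1.0 + rng.gauss(0.0, support_sd), 0.01)
               for c, s in base_support.items()}
        total = sum(raw.values())
        sim = {c: v / total for c, v in raw.items()}  # renormalize to shares
        leader = max(sim, key=sim.get)
        wins[leader] += 1
        if sim[leader] <= majority:  # no one exceeds the threshold
            runoffs += 1
        for c, v in sim.items():
            shares[c].append(v)
    intervals = {}
    for c, values in shares.items():
        cuts = statistics.quantiles(values, n=20)  # cut points at 5%, 10%, ..., 95%
        intervals[c] = (cuts[0], cuts[-1])
    return {
        "win_prob": {c: w / n_sims for c, w in wins.items()},
        "interval_5_95": intervals,
        "runoff_prob": runoffs / n_sims,
    }


# Hypothetical three-candidate race.
example_result = run_simulations({"A": 0.45, "B": 0.40, "C": 0.15})
```

Because the generator is seeded, calling `run_simulations` twice with the same inputs returns the same summary, which is the reproducibility property described under Bias Mitigation.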
Bias Mitigation
- The model contains no priors favoring any candidate. Support estimates flow entirely from polling data and demographic composition.
- All random number generation uses seeded pseudorandom number generators, ensuring identical inputs produce identical outputs. Results are fully reproducible and auditable.
- Monte Carlo noise is applied symmetrically across all candidates. No candidate receives tighter or wider confidence intervals by design.
- The variance ceiling constrains win probabilities mechanically, regardless of which candidate leads.
- Early vote share baselines are drawn from official canvass data, not assumptions or estimates.
Known Limitations
- Where the state does not provide individual voter party affiliation or ballot selection data, all partisan composition estimates are modeled.
- Early-to-Election-Day turnout splits are inferred from historical primary patterns, which may not perfectly reflect current behavior.
- Some voting precincts lack direct Census matches and use county-level demographic averages.
- Polling may be limited; the support matrix can rely on a small number of surveys with varying methodological quality.
- The momentum adjustment in the simulator reflects a user-specified hypothesis about late-deciding voters, not a predictive model component.
Data Pipeline
Early vote data is fetched daily via automated scraper from official state election APIs and county election board websites. The pipeline validates each day’s reporting (rejecting days where the statewide cumulative total appears to decrease, indicating incomplete uploads), computes monotonic cumulative totals per county, and patches in county-level data where state reporting lags behind direct county data. All data processing is deterministic and logged.
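The validation and patching rules can be sketched as three small helpers. Two interpretations here are assumptions: that "monotonic cumulative totals" means carrying forward the highest cumulative figure seen so far, and that patching means taking the county's own figure when it is ahead of the state's. The function names are illustrative.

```python
def accept_day(previous_statewide_total, new_statewide_total):
    """Reject a day whose statewide cumulative total went down."""
    return new_statewide_total >= previous_statewide_total


def monotonic_cumulative(reported_totals):
    """Carry forward the running maximum so a county's total never decreases."""
    cleaned = []
    running = 0
    for value in reported_totals:
        running = max(running, value)
        cleaned.append(running)
    return cleaned


def patch_with_county(state_value, county_value):
    """Prefer the county's own figure when state reporting lags behind it."""
    return max(state_value, county_value)


# Hypothetical county series with one incomplete upload on day three.
example_series = monotonic_cumulative([100, 250, 240, 400])
```

In the example the dip to 240 is replaced by the previous value of 250, so the cleaned series never decreases.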
Texas U.S. Senate Democratic Primary, 2026
Early Vote Sources
Early vote returns are sourced from the Texas Secretary of State, covering all 254 counties. Because the SOS can lag one to two days behind actual reporting, we supplement with direct scrapes of 12 major county election board websites — Bexar, Dallas, El Paso, Fort Bend, Harris, Hays, Hidalgo, Jefferson, Tarrant, Travis, Webb, and Williamson — to capture the most recent available figures.
Demographic Coverage
Precinct-level demographic data from the U.S. Census Bureau’s American Community Survey (2020–2024 five-year estimates) covers 6,284 voting precincts across 40 counties. Of these, 87.6% have a direct Census match; the remaining 778 precincts (about 12%) use county-level demographic averages.
Polling
The model incorporates all published surveys of the race.
Turnout Baseline
Early vote share baselines are anchored to verified results from the 2018 Texas Democratic primary — the most recent competitive statewide Democratic primary before 2026 — using official canvass data from the Texas Secretary of State for the 15 largest counties, with the statewide average of 54% as the fallback for remaining counties. A five-percentage-point upward adjustment accounts for the structural shift toward early voting observed in post-2020 Texas elections.
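As a worked example of the fallback case, assuming the five-point adjustment is simply added to the 54% statewide average and that no demographic or turnout adjustment applies, the arithmetic is as follows. The county vote count is hypothetical.

```python
STATEWIDE_2018_EARLY_SHARE = 0.54  # fallback baseline stated in the text
POST_2020_SHIFT = 0.05             # five-percentage-point upward adjustment

fallback_share = STATEWIDE_2018_EARLY_SHARE + POST_2020_SHIFT  # about 0.59


def projected_total(early_votes, early_share=fallback_share):
    """Projected total = early votes divided by the estimated early vote share."""
    return early_votes / early_share


# Hypothetical small county with 11,800 early votes.
example_total = projected_total(11_800)
```

Under these assumptions the fallback early vote share is about 59%, so 11,800 early votes project to roughly 20,000 total votes before any further adjustment or simulation noise.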
Demographic adjustments reflect observed patterns in Texas primary data: whiter counties historically show higher early vote shares, while counties with larger Black and Hispanic populations tend to see proportionally larger Election Day surges.
Runoff Rules
Under Texas primary law, if no candidate exceeds 50% of the vote, the top two advance to a May 26 runoff. The model tracks this probability.
State-Specific Limitations
- Texas does not register voters by party or disclose ballot selections; all partisan composition estimates are modeled.
- The support matrix relies on a limited number of surveys with varying methodological quality.