General Election Poll Analysis



In the spirit of the election season, I will be analyzing US presidential election polling data from 2016 and 2020. I will be answering many questions, including who might win in 2020.

This notebook will be updated until the conclusion of the 2020 election.

Importing and Cleaning the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as r
from datetime import datetime
from scipy import stats

Three unique datasets will be utilized in this analysis:

  • 2016 polling data (from Kaggle)
  • 2020 polling data (from FiveThirtyEight)
  • 2016 Election Results by State and County (from Data World)

2016 Polling Data (Trump vs. Clinton)

In [2]:
data_2016 = pd.read_csv("../input/2016-election-polls/presidential_polls.csv")
data_2016 = data_2016[["startdate", "enddate", "state", "pollster", "grade", "samplesize", "population", "adjpoll_clinton", "adjpoll_trump"]]
trump_clinton = data_2016.rename(columns = {"startdate": "start_date", "enddate": "end_date", "grade":"fte_grade", "samplesize":"sample_size", "adjpoll_clinton":"Clinton", "adjpoll_trump":"Trump"})
trump_clinton["start_date"] = pd.to_datetime(trump_clinton["start_date"])
trump_clinton["end_date"] = pd.to_datetime(trump_clinton["end_date"])
trump_clinton = trump_clinton.sort_values(by = ["end_date", "start_date"]) #Arranging the polls from most to least recent
trump_clinton["dem_lead"] = trump_clinton["Clinton"] - trump_clinton["Trump"] #lead of the democratic candidate (negative if they are losing)
In [3]:
trump_clinton.head()
Out[3]:
|      | start_date | end_date   | state          | pollster              | fte_grade | sample_size | population | Clinton  | Trump    | dem_lead |
| 3077 | 2015-11-07 | 2015-11-08 | South Carolina | Public Policy Polling | B+        | 1290.0      | v          | 40.63694 | 45.19826 | -4.56132 |
| 6520 | 2015-11-07 | 2015-11-08 | South Carolina | Public Policy Polling | B+        | 1290.0      | v          | 40.48820 | 45.19637 | -4.70817 |
| 9901 | 2015-11-07 | 2015-11-08 | South Carolina | Public Policy Polling | B+        | 1290.0      | v          | 40.69941 | 45.18212 | -4.48271 |
| 3206 | 2015-11-09 | 2015-11-13 | Virginia       | Roanoke College       | C+        | 555.0       | rv         | 49.81532 | 36.21050 | 13.60482 |
| 6638 | 2015-11-09 | 2015-11-13 | Virginia       | Roanoke College       | C+        | 555.0       | rv         | 49.00173 | 37.02524 | 11.97649 |

Only the important columns were extracted, and after renaming a few of them, this data is already in the format we want.

In [4]:
trump_clinton.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10236 entries, 3077 to 6833
Data columns (total 10 columns):
start_date     10236 non-null datetime64[ns]
end_date       10236 non-null datetime64[ns]
state          10236 non-null object
pollster       10236 non-null object
fte_grade      9102 non-null object
sample_size    10233 non-null float64
population     10236 non-null object
Clinton        10236 non-null float64
Trump          10236 non-null float64
dem_lead       10236 non-null float64
dtypes: datetime64[ns](2), float64(4), object(4)
memory usage: 879.7+ KB

The missing values in the "fte_grade" column represent polls of unknown accuracy.
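As a quick sanity check, the number of ungraded polls can be counted with isna() — a toy sketch on invented rows rather than the real file:

```python
import pandas as pd

# Toy rows standing in for the real poll data (values invented);
# a missing fte_grade means FiveThirtyEight has no accuracy rating
# for that pollster.
polls = pd.DataFrame({
    "pollster": ["Pollster A", "Pollster B", "Pollster C"],
    "fte_grade": ["A-", None, "B+"],
})

print(polls["fte_grade"].isna().sum())  # 1
```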

2020 Polling Data

In [5]:
data_2020 = pd.read_csv("../input/2020-general-election-polls/president_polls.csv")
data_2020 = data_2020[["poll_id","start_date", "end_date", "state", "pollster", "fte_grade", "sample_size", "population", "answer", "pct"]]
In [6]:
data_2020.head()
Out[6]:
|   | poll_id | start_date | end_date | state    | pollster | fte_grade | sample_size | population | answer  | pct  |
| 0 | 63277   | 1/9/20     | 1/12/20  | Michigan | EPIC-MRA | B+        | 600         | lv         | Biden   | 50.0 |
| 1 | 63277   | 1/9/20     | 1/12/20  | Michigan | EPIC-MRA | B+        | 600         | lv         | Trump   | 44.0 |
| 2 | 63277   | 1/9/20     | 1/12/20  | Michigan | EPIC-MRA | B+        | 600         | lv         | Sanders | 50.0 |
| 3 | 63277   | 1/9/20     | 1/12/20  | Michigan | EPIC-MRA | B+        | 600         | lv         | Trump   | 45.0 |
| 4 | 63277   | 1/9/20     | 1/12/20  | Michigan | EPIC-MRA | B+        | 600         | lv         | Warren  | 48.0 |
In [7]:
data_2020.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2735 entries, 0 to 2734
Data columns (total 10 columns):
poll_id        2735 non-null int64
start_date     2735 non-null object
end_date       2735 non-null object
state          1435 non-null object
pollster       2735 non-null object
fte_grade      2583 non-null object
sample_size    2735 non-null int64
population     2735 non-null object
answer         2735 non-null object
pct            2735 non-null float64
dtypes: float64(1), int64(2), object(7)
memory usage: 213.8+ KB

Again, some missing values exist in the "fte_grade" column. The missing values in the "state" column represent national polls:

In [8]:
data_2020["state"] = data_2020.state.fillna("U.S.")

We will clean this data by spreading the candidates into their own separate columns and removing irrelevant candidates. First, each individual matchup involving Trump must be separated into its own data table.

In [9]:
def trump_opponent(data_2020, opp):
    trump_vs = data_2020[(data_2020["answer"] == opp) | (data_2020["answer"] == "Trump")]
    trump_vs = trump_vs.pivot_table(values = "pct", index = ["poll_id", "start_date", "end_date", "state", "pollster", "fte_grade", "sample_size", "population"], columns = "answer")
    trump_vs = trump_vs.dropna(axis = 0, how = "any") #Drops polls that didn't include both Trump and the opp candidate
    trump_vs = trump_vs.reset_index().drop(columns = ["poll_id"])
    trump_vs["start_date"] = pd.to_datetime(trump_vs["start_date"])
    trump_vs["end_date"] = pd.to_datetime(trump_vs["end_date"]) 
    trump_vs["dem_lead"] = trump_vs[opp] - trump_vs["Trump"] 
    trump_vs = trump_vs.sort_values(by = ["end_date", "start_date"]) #Arranging the polls from most to least recent
    return trump_vs

Trump vs. Biden

In [10]:
trump_biden = trump_opponent(data_2020, "Biden")

trump_biden.head()
Out[10]:
|   | start_date | end_date   | state          | pollster                    | fte_grade | sample_size | population | Biden | Trump     | dem_lead  |
| 0 | 2018-11-27 | 2018-11-28 | Ohio           | Public Policy Polling       | B         | 648         | v          | 48.0  | 45.250000 | 2.750000  |
| 1 | 2018-12-16 | 2018-12-17 | U.S.           | Harris Insights & Analytics | C+        | 1001        | rv         | 42.0  | 36.666667 | 5.333333  |
| 2 | 2019-01-04 | 2019-01-07 | North Carolina | Public Policy Polling       | B         | 750         | rv         | 49.0  | 45.333333 | 3.666667  |
| 3 | 2019-01-19 | 2019-01-21 | U.S.           | Public Policy Polling       | B         | 760         | rv         | 53.0  | 41.666667 | 11.333333 |
| 4 | 2019-01-24 | 2019-01-26 | Michigan       | Glengariff Group Inc.       | B/C       | 600         | lv         | 53.3  | 41.600000 | 11.700000 |

Trump vs. Sanders

In [11]:
trump_sanders = trump_opponent(data_2020, "Sanders")

trump_sanders.head()
Out[11]:
|   | start_date | end_date   | state          | pollster                    | fte_grade | sample_size | population | Sanders | Trump     | dem_lead  |
| 0 | 2018-11-27 | 2018-11-28 | Ohio           | Public Policy Polling       | B         | 648         | v          | 47.0    | 45.250000 | 1.750000  |
| 1 | 2018-12-16 | 2018-12-17 | U.S.           | Harris Insights & Analytics | C+        | 1001        | rv         | 38.0    | 36.666667 | 1.333333  |
| 2 | 2019-01-04 | 2019-01-07 | North Carolina | Public Policy Polling       | B         | 750         | rv         | 48.0    | 45.333333 | 2.666667  |
| 4 | 2019-01-16 | 2019-01-21 | New Hampshire  | Praecones Analytica         | B/C       | 593         | rv         | 54.1    | 41.033333 | 13.066667 |
| 3 | 2019-01-19 | 2019-01-21 | U.S.           | Public Policy Polling       | B         | 760         | rv         | 51.0    | 41.666667 | 9.333333  |

Trump vs. Warren

In [12]:
trump_warren = trump_opponent(data_2020, "Warren")

trump_warren.head()
Out[12]:
|   | start_date | end_date   | state          | pollster              | fte_grade | sample_size | population | Trump     | Warren | dem_lead  |
| 0 | 2018-11-27 | 2018-11-28 | Ohio           | Public Policy Polling | B         | 648         | v          | 45.250000 | 43.0   | -2.250000 |
| 1 | 2019-01-04 | 2019-01-07 | North Carolina | Public Policy Polling | B         | 750         | rv         | 45.333333 | 46.0   | 0.666667  |
| 3 | 2019-01-16 | 2019-01-21 | New Hampshire  | Praecones Analytica   | B/C       | 593         | rv         | 41.033333 | 53.5   | 12.466667 |
| 2 | 2019-01-19 | 2019-01-21 | U.S.           | Public Policy Polling | B         | 760         | rv         | 41.666667 | 48.0   | 6.333333  |
| 4 | 2019-01-24 | 2019-01-26 | Michigan       | Glengariff Group Inc. | B/C       | 600         | lv         | 41.600000 | 45.8   | 4.200000  |

Trump vs. Buttigieg

In [13]:
trump_buttigieg = trump_opponent(data_2020, "Buttigieg")

trump_buttigieg.head()
Out[13]:
|   | start_date | end_date   | state  | pollster                    | fte_grade | sample_size | population | Buttigieg | Trump | dem_lead |
| 0 | 2019-03-19 | 2019-03-20 | U.S.   | Harris Insights & Analytics | C+        | 1001        | rv         | 27.0      | 35.75 | -8.75    |
| 2 | 2019-03-27 | 2019-03-28 | U.S.   | Public Policy Polling       | B         | 846         | rv         | 45.0      | 41.00 | 4.00     |
| 1 | 2019-03-28 | 2019-03-30 | Nevada | Emerson College             | A-        | 719         | rv         | 48.0      | 51.22 | -3.22    |
| 3 | 2019-04-03 | 2019-04-04 | U.S.   | Harris Insights & Analytics | C+        | 1002        | rv         | 28.0      | 37.00 | -9.00    |
| 4 | 2019-04-11 | 2019-04-14 | U.S.   | Emerson College             | A-        | 914         | rv         | 48.6      | 49.55 | -0.95    |

Now, we have 2020 polling data in the exact same format as our 2016 polling data.

2016 Election Results

Finally, let's take a look at the results of the 2016 election. We will get the percentage of Trump and Clinton voters for each state (not including third-party candidates), as well as manually add a row displaying the national results.

In [14]:
results_2016 = pd.read_csv("../input/2020-general-election-polls/nytimes_presidential_elections_2016_results_county.csv")
results_2016 = results_2016.groupby("State").sum()[["Clinton", "Trump"]]
results_2016.loc["U.S."] = [65853514, 62984828] #Adding a row for the national result
results_2016["Clinton_pct"] = 100 * results_2016["Clinton"] / (results_2016["Clinton"] + results_2016["Trump"])
results_2016["Trump_pct"] = 100 * results_2016["Trump"] / (results_2016["Clinton"] + results_2016["Trump"]) #percentages
results_2016["dem_lead"] = results_2016["Clinton_pct"] - results_2016["Trump_pct"]
results_2016["index"] = list(range(0,50))
results_2016["state"] = results_2016.index
results_2016 = results_2016.set_index("index")
In [15]:
results_2016.head()
Out[15]:
| index | Clinton   | Trump     | Clinton_pct | Trump_pct | dem_lead   | state      |
| 0     | 718084.0  | 1306925.0 | 35.460781   | 64.539219 | -29.078439 | Alabama    |
| 1     | 888374.0  | 972900.0  | 47.729351   | 52.270649 | -4.541298  | Arizona    |
| 2     | 378729.0  | 677904.0  | 35.843003   | 64.156997 | -28.313994 | Arkansas   |
| 3     | 5488776.0 | 2970470.0 | 64.884932   | 35.115068 | 29.769864  | California |
| 4     | 1208095.0 | 1136354.0 | 51.530018   | 48.469982 | 3.060037   | Colorado   |
In [16]:
results_2016.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 6 columns):
Clinton        50 non-null float64
Trump          50 non-null float64
Clinton_pct    50 non-null float64
Trump_pct      50 non-null float64
dem_lead       50 non-null float64
state          50 non-null object
dtypes: float64(5), object(1)
memory usage: 2.7+ KB

Note that these numbers differ slightly from the actual results because we have no data for third-party and independent candidates. The Trump and Clinton figures are two-party shares that sum to 100%, which slightly overstates both candidates.
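A toy illustration (with invented vote counts, not the real results) of why the two-party shares are slightly misleading: once a third-party column is included, each candidate's true share drops below the two-party figure.

```python
import pandas as pd

# Invented state: 60 Clinton votes, 35 Trump, 5 third-party.
votes = pd.Series({"Clinton": 60, "Trump": 35, "Other": 5})

two_party_share = 100 * votes["Clinton"] / (votes["Clinton"] + votes["Trump"])
true_share = 100 * votes["Clinton"] / votes.sum()
print(round(two_party_share, 2), round(true_share, 2))  # 63.16 60.0
```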

What Happened in 2016? (The Polling Averages)

This section will analyze both national and battleground state polls in 2016 by comparing them to the actual results to determine how we should treat the 2020 polls this far into the election cycle. We will compare the averages of all polls, historically reliable polls only, and polls of likely voters for each state.

In [17]:
def trump_vs_clinton(trump_clinton, state, results_2016, reliable = False, likely_voters = False):
    
    #getting polls for the specified state / U.S. and filtering if necessary
    match_up = trump_clinton
    match_up = match_up[match_up["state"] == state]
    
    if reliable == True:
        match_up = match_up[match_up["fte_grade"].isin(["A+", "A", "A-"])]
    if likely_voters == True:
        match_up = match_up[match_up["population"] == "lv"]
    
    #Accounting for repeated polls which have the same end date
    
    match_up = match_up.groupby(["end_date", "pollster", "fte_grade", "population"]).mean().reset_index()
    match_up.index = match_up["end_date"]
    
    #A rolling average of the Democratic lead/deficit (past 14 days for national polls, past 30 days for state polls)
    
    if state == "U.S.":
        match_up["average_lead"] = match_up["dem_lead"].rolling("14D", min_periods = 0).mean()
    else:
        match_up["average_lead"] = match_up["dem_lead"].rolling("30D", min_periods = 0).mean()
    
    #Plotting the time series
    
    polls_vs_final =  [match_up.iloc[-1]["average_lead"], results_2016[results_2016["state"] == state].iloc[0]["dem_lead"]]
    polls_df = pd.DataFrame(polls_vs_final)
    polls_df[1] = ["Final/Current Polling Average", "Actual Results"]
    
    plt.subplots(figsize = (9,6))
    plt.subplot(1,2,1)
    plt.plot(match_up["end_date"], match_up["average_lead"])
    plt.xlabel("Date")
    plt.xticks(rotation = 90)
    plt.ylabel("Lead or Deficit vs. Trump (%)")
    plt.title(f"Trump vs. Clinton in {state}")  
    plt.subplot(1,2,2)
    plt.bar(polls_df[1], polls_df[0])
    plt.xticks(rotation = 90)
    plt.title("Poll Accuracy Chart")
    plt.show()
    
    if reliable == True:
        return f"Percentage Points of Trump Underestimation in {state} From Historically Reliable Polls: {round(polls_vs_final[0] - polls_vs_final[1], 2)}%"
    if likely_voters == True:
        return f"Percentage Points of Trump Underestimation in {state} From Polls of Likely Voters: {round(polls_vs_final[0] - polls_vs_final[1], 2)}%"
    return f"Percentage Points of Trump Underestimation in {state} From All Polls: {round(polls_vs_final[0] - polls_vs_final[1], 2)}%"

Nationally

In [18]:
trump_vs_clinton(trump_clinton, "U.S.", results_2016)
/opt/conda/lib/python3.6/site-packages/pandas/plotting/_matplotlib/converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
Out[18]:
'Percentage Points of Trump Underestimation in U.S. From All Polls: 0.03%'
In [19]:
trump_vs_clinton(trump_clinton, "U.S.", results_2016, reliable = True)
Out[19]:
'Percentage Points of Trump Underestimation in U.S. From Historically Reliable Polls: 0.06%'
In [20]:
trump_vs_clinton(trump_clinton, "U.S.", results_2016, likely_voters = True)
Out[20]:
'Percentage Points of Trump Underestimation in U.S. From Polls of Likely Voters: 0.04%'

In 2016, the national polling was right on the money: the average of all polls in the last two weeks before election day underestimated Trump's margin by only 0.03 percentage points. We can also see that election day coincided with one of Clinton's weakest leads, at just over 2%.

Of course, the US presidential election is decided by the electoral college, not the national popular vote, so it would be helpful to look at battleground states (states that may go either way and decide the election) and other states where Trump performed better than expected.

Florida

In [21]:
trump_vs_clinton(trump_clinton, "Florida", results_2016)
Out[21]:
'Percentage Points of Trump Underestimation in Florida From All Polls: 1.1%'
In [22]:
trump_vs_clinton(trump_clinton, "Florida", results_2016, reliable = True)
Out[22]:
'Percentage Points of Trump Underestimation in Florida From Historically Reliable Polls: 1.81%'
In [23]:
trump_vs_clinton(trump_clinton, "Florida", results_2016, likely_voters = True)
Out[23]:
'Percentage Points of Trump Underestimation in Florida From Polls of Likely Voters: 1.0%'

The polling averages only barely underestimated Trump's chances in Florida, with historically reliable polls being least accurate by a slight margin and polls of likely voters being the most accurate. Only the historically reliable polls had Clinton winning Florida, though we can see her numbers went down quickly in those polls right before the election.

Pennsylvania

In [24]:
trump_vs_clinton(trump_clinton, "Pennsylvania", results_2016)
Out[24]:
'Percentage Points of Trump Underestimation in Pennsylvania From All Polls: 5.07%'
In [25]:
trump_vs_clinton(trump_clinton, "Pennsylvania", results_2016, reliable = True)
Out[25]:
'Percentage Points of Trump Underestimation in Pennsylvania From Historically Reliable Polls: 5.97%'
In [26]:
trump_vs_clinton(trump_clinton, "Pennsylvania", results_2016, likely_voters = True)
Out[26]:
'Percentage Points of Trump Underestimation in Pennsylvania From Polls of Likely Voters: 5.09%'

Trump's chances in Pennsylvania were underestimated by huge margins, especially by historically reliable polls, despite Clinton having mostly positive momentum towards the end.

Ohio

In [27]:
trump_vs_clinton(trump_clinton, "Ohio", results_2016)
Out[27]:
'Percentage Points of Trump Underestimation in Ohio From All Polls: 6.07%'
In [28]:
trump_vs_clinton(trump_clinton, "Ohio", results_2016, reliable = True)
Out[28]:
'Percentage Points of Trump Underestimation in Ohio From Historically Reliable Polls: 6.7%'
In [29]:
trump_vs_clinton(trump_clinton, "Ohio", results_2016, likely_voters = True)
Out[29]:
'Percentage Points of Trump Underestimation in Ohio From Polls of Likely Voters: 6.07%'

Trump was already projected to win Ohio by a small margin, but he ended up winning by over 8%! Again, the historically reliable polls were the least accurate. We can see, however, that Clinton had a lot of negative momentum towards the end of the Ohio polling.

Michigan

In [30]:
trump_vs_clinton(trump_clinton, "Michigan", results_2016)
Out[30]:
'Percentage Points of Trump Underestimation in Michigan From All Polls: 4.45%'
In [31]:
trump_vs_clinton(trump_clinton, "Michigan", results_2016, reliable = True)
Out[31]:
'Percentage Points of Trump Underestimation in Michigan From Historically Reliable Polls: 4.0%'
In [32]:
trump_vs_clinton(trump_clinton, "Michigan", results_2016, likely_voters = True)
Out[32]:
'Percentage Points of Trump Underestimation in Michigan From Polls of Likely Voters: 4.45%'

Michigan is yet another state that polls failed to predict accurately, though the historically reliable polls did a better job here than the averages of all polls.

Wisconsin

In [33]:
trump_vs_clinton(trump_clinton, "Wisconsin", results_2016)
Out[33]:
'Percentage Points of Trump Underestimation in Wisconsin From All Polls: 6.1%'
In [34]:
trump_vs_clinton(trump_clinton, "Wisconsin", results_2016, reliable = True)
Out[34]:
'Percentage Points of Trump Underestimation in Wisconsin From Historically Reliable Polls: 3.44%'
In [35]:
trump_vs_clinton(trump_clinton, "Wisconsin", results_2016, likely_voters = True)
Out[35]:
'Percentage Points of Trump Underestimation in Wisconsin From Polls of Likely Voters: 6.03%'

Wisconsin was one of the most unexpected states that Trump won, though the historically reliable polls did much better here than the average of all polls. Interestingly, Clinton had positive momentum in the historically reliable polls but negative momentum in polls of likely voters and in the average of all polls.

Minnesota

In [36]:
trump_vs_clinton(trump_clinton, "Minnesota", results_2016)
Out[36]:
'Percentage Points of Trump Underestimation in Minnesota From All Polls: 4.54%'
In [37]:
trump_vs_clinton(trump_clinton, "Minnesota", results_2016, reliable = True)
Out[37]:
'Percentage Points of Trump Underestimation in Minnesota From Historically Reliable Polls: 3.86%'
In [38]:
trump_vs_clinton(trump_clinton, "Minnesota", results_2016, likely_voters = True)
Out[38]:
'Percentage Points of Trump Underestimation in Minnesota From Polls of Likely Voters: 4.54%'

Clinton did manage to win Minnesota, but by much closer margins than expected. Again, the historically reliable polls did better here and there was negative momentum for Clinton across the board.

Since the national polls estimated the popular vote almost perfectly, yet state polls underestimated Trump's chances in the battleground states, there must have been states where polls overestimated him. Note that these states don't play a huge role in the election because they reliably vote in one direction, which also means we have a smaller sample of polls for them. Let's take a look at the polls of the most populated states in the country.

California

In [39]:
trump_vs_clinton(trump_clinton, "California", results_2016)
Out[39]:
'Percentage Points of Trump Underestimation in California From All Polls: -5.31%'
In [40]:
trump_vs_clinton(trump_clinton, "California", results_2016, reliable = True)
Out[40]:
'Percentage Points of Trump Underestimation in California From Historically Reliable Polls: 2.54%'
In [41]:
trump_vs_clinton(trump_clinton, "California", results_2016, likely_voters = True)
Out[41]:
'Percentage Points of Trump Underestimation in California From Polls of Likely Voters: -5.31%'

Trump's chances were overestimated in this state by the average of all polls, yet interestingly underestimated by historically reliable polls.

Illinois

In [42]:
trump_vs_clinton(trump_clinton, "Illinois", results_2016)
Out[42]:
'Percentage Points of Trump Underestimation in Illinois From All Polls: -1.98%'
In [43]:
trump_vs_clinton(trump_clinton, "Illinois", results_2016, reliable = True)
Out[43]:
'Percentage Points of Trump Underestimation in Illinois From Historically Reliable Polls: 7.03%'
In [44]:
trump_vs_clinton(trump_clinton, "Illinois", results_2016, likely_voters = True)
Out[44]:
'Percentage Points of Trump Underestimation in Illinois From Polls of Likely Voters: -1.98%'

Similar patterns appear here as in the California polls. Again, note that the sample size of polls is very small.

New York

In [45]:
trump_vs_clinton(trump_clinton, "New York", results_2016)
Out[45]:
'Percentage Points of Trump Underestimation in New York From All Polls: -3.92%'
In [46]:
trump_vs_clinton(trump_clinton, "New York", results_2016, reliable = True)
Out[46]:
'Percentage Points of Trump Underestimation in New York From Historically Reliable Polls: -1.26%'
In [47]:
trump_vs_clinton(trump_clinton, "New York", results_2016, likely_voters = True)
Out[47]:
'Percentage Points of Trump Underestimation in New York From Polls of Likely Voters: -3.92%'

Trump was overestimated across the board in New York polls.

Texas

In [48]:
trump_vs_clinton(trump_clinton, "Texas", results_2016)
Out[48]:
'Percentage Points of Trump Underestimation in Texas From All Polls: -1.24%'
In [49]:
trump_vs_clinton(trump_clinton, "Texas", results_2016, reliable = True)
Out[49]:
'Percentage Points of Trump Underestimation in Texas From Historically Reliable Polls: -7.29%'
In [50]:
trump_vs_clinton(trump_clinton, "Texas", results_2016, likely_voters = True)
Out[50]:
'Percentage Points of Trump Underestimation in Texas From Polls of Likely Voters: -1.24%'

Though Trump carried Texas pretty easily, he was overestimated by 7.29% in historically reliable polls! Clinton also seemed to pick up positive momentum towards the end of polling in this state. As we will later see, Texas could be a closer race in 2020, so this information is important.

What Happened in 2016? (The Best Pollsters)

Now, which individual pollsters did the best job calling the election nationally and in each battleground state the two weeks leading up to the election? Furthermore, did pollsters who released more polls predict the national and battleground states more accurately?

In [51]:
def get_best_pollsters(trump_clinton, state, results_2016):
    
    #Getting polls only from the final two weeks in a given state
    
    final_polls = trump_clinton[(trump_clinton["end_date"] >= "2016-10-17") & (trump_clinton["state"] == state)]
    
    #Getting a variable to represent the true results in each state
    
    final_polls = pd.merge(final_polls, results_2016, on = "state", how = "inner")
    
    #Getting the average results in the state for each pollster and the difference from actual results
    
    by_pollster = final_polls.groupby(["pollster", "dem_lead_y"]).mean().reset_index()[["pollster", "dem_lead_x", "dem_lead_y"]]
    by_pollster["trump_underestimation"] = by_pollster["dem_lead_x"] - by_pollster["dem_lead_y"]
    graph_pollsters = by_pollster.sort_values("trump_underestimation", ascending = False)
    
    #Getting the number of polls from each pollster
    
    num_polls = final_polls.groupby("pollster").size().reset_index()
    
    #Finally, plotting our results
    
    plt.subplots(figsize = (10,10))
    sns.barplot(x = graph_pollsters["trump_underestimation"], y = graph_pollsters["pollster"])
    plt.xlabel("Percentage Point Difference From Actual Result (more positive number = more Trump underestimation)")
    plt.ylabel(None)
    plt.show()
    
    #Table of most to least accurate pollsters in terms of magnitude of inaccuracy
    
    best_pollsters = by_pollster
    best_pollsters["trump_underestimation"] = abs(by_pollster["dem_lead_x"] - by_pollster["dem_lead_y"])
    best_pollsters = pd.merge(best_pollsters, num_polls, on = "pollster") 
    best_pollsters["Number of Polls"] = best_pollsters[0]
    best_pollsters["Pct Pts Inaccuracy"] = best_pollsters["trump_underestimation"]
    best_pollsters["Pollster"] = best_pollsters["pollster"]
    best_pollsters = best_pollsters.sort_values("trump_underestimation")[["Pollster", "Pct Pts Inaccuracy", "Number of Polls"]]
    
    #Linear Regression of polls released the final two weeks vs final inaccuracy
    
    sns.lmplot(data = best_pollsters, x = "Number of Polls", y = "Pct Pts Inaccuracy")
    plt.title(f"Number of Polls Released vs. Inaccuracy in {state}")
    print(stats.linregress(best_pollsters["Number of Polls"], best_pollsters["Pct Pts Inaccuracy"]))

Nationally

In [52]:
get_best_pollsters(trump_clinton, "U.S.", results_2016)
LinregressResult(slope=-0.06573106522415247, intercept=3.439767617022699, rvalue=-0.4427447646920583, pvalue=0.020738758210948508, stderr=0.026623740019566387)

The Times-Picayune/Lucid, which tied for the most polls released in the final two weeks before the election, produced an average that predicted the popular-vote margin exactly.

There was a moderate negative correlation between the number of polls a pollster released in the final two weeks and its inaccuracy (r = -0.44). If there were truly no relationship, a correlation this strong would arise by chance only about 2.1% of the time (p = 0.021).
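To make the linregress output above easier to read, here is a small self-contained example on synthetic data (not the notebook's polls): rvalue is the Pearson correlation, and pvalue is the probability of seeing a correlation at least this strong if the true slope were zero.

```python
import numpy as np
from scipy import stats

# Synthetic data with a clear negative trend plus noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = -0.5 * x + rng.normal(scale=2.0, size=20)

result = stats.linregress(x, y)
# A strong negative trend gives a negative rvalue and a small pvalue.
print(result.rvalue < 0, result.pvalue < 0.05)  # True True
```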

Florida

In [53]:
get_best_pollsters(trump_clinton, "Florida", results_2016)
LinregressResult(slope=-0.13147978698989995, intercept=3.469762870475614, rvalue=-0.3146283867304236, pvalue=0.16480262845076044, stderr=0.09100156207579228)

Rasmussen predicted the Florida results very precisely and released more polls than any other pollster. We can also see a balance of Trump underestimation and overestimation, which makes sense given our earlier Florida findings. Also, it would be best to never listen to Saint Leo's polling!

There was a moderate negative correlation between the number of polls released in the final two weeks and inaccuracy (r = -0.31, p = 0.165).

Pennsylvania

In [54]:
get_best_pollsters(trump_clinton, "Pennsylvania", results_2016)
LinregressResult(slope=-0.099129985051439, intercept=6.32755812516366, rvalue=-0.2561272889451045, pvalue=0.4471278447280237, stderr=0.12470793046078664)

Google Consumer Surveys was the only pollster that overestimated Trump's chances in this state, and only did so by a very small margin.

There was a small negative correlation between the number of polls released in the final two weeks and inaccuracy (r = -0.26, p = 0.447).

Ohio

In [55]:
get_best_pollsters(trump_clinton, "Ohio", results_2016)
LinregressResult(slope=-0.08017725291005297, intercept=6.523640526235356, rvalue=-0.20400494703097885, pvalue=0.6608320049705139, stderr=0.17206589226208563)

Again, Google Consumer Surveys gave the best prediction for Ohio, though SurveyMonkey was a very close second.

There was a small negative correlation between the number of polls released in the final two weeks and inaccuracy (r = -0.20, p = 0.661).

Michigan

In [56]:
get_best_pollsters(trump_clinton, "Michigan", results_2016)
LinregressResult(slope=-0.20194192626262614, intercept=5.849446780818, rvalue=-0.4194809137866781, pvalue=0.34883172949273195, stderr=0.19543498832070583)

Another state where every pollster underestimated Trump's chances, though SurveyMonkey was very close.

There was a moderate negative correlation between the number of polls released in the final two weeks and inaccuracy (r = -0.42, p = 0.349).

Wisconsin

In [57]:
get_best_pollsters(trump_clinton, "Wisconsin", results_2016)
LinregressResult(slope=-0.2955132222222221, intercept=7.773837035240288, rvalue=-0.35037889678811374, pvalue=0.39482730839639807, stderr=0.3224936999037691)

This time, Google Consumer Surveys was way off, while SurveyMonkey was right on the money and also conducted the most polls.

There was a moderate negative correlation between the number of polls released in the final two weeks and inaccuracy (r = -0.35, p = 0.395).

Minnesota

In [58]:
get_best_pollsters(trump_clinton, "Minnesota", results_2016)
LinregressResult(slope=0.11080276587301588, intercept=3.6443364768603694, rvalue=0.28955936462226245, pvalue=0.6365403351291186, stderr=0.2114642351523154)

Another good prediction by Google Consumer Surveys.

There was a small positive correlation between the number of polls released in the final two weeks and inaccuracy (r = 0.29, p = 0.637).

Conclusions

Now, from the polling information above, what did we learn? How might we use this to interpret the 2020 polls if we assume voter and polling behavior follows similar patterns?
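One possible way to apply these findings, sketched here with hypothetical 2020 averages (only the 2016 underestimation figures come from the all-polls numbers above): subtract each state's 2016 polling error from its 2020 average to get an error-adjusted Democratic lead. This assumes, of course, that 2020 polls miss in the same direction and by similar amounts as 2016 polls did.

```python
import pandas as pd

# Hypothetical 2020 polling averages (invented for illustration).
avg_2020 = pd.Series({"Pennsylvania": 6.0, "Wisconsin": 7.0, "Florida": 2.0})

# 2016 Trump underestimation from the all-polls averages computed earlier.
underestimation_2016 = pd.Series({"Pennsylvania": 5.07, "Wisconsin": 6.10, "Florida": 1.10})

# If 2020 polls err like 2016 polls did, the adjusted lead shrinks sharply.
adjusted_lead = (avg_2020 - underestimation_2016).round(2)
print(adjusted_lead.to_dict())  # {'Pennsylvania': 0.93, 'Wisconsin': 0.9, 'Florida': 0.9}
```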
