Kyle Waters, August 2023
The Pennsylvania State Capitol in Harrisburg, PA
The "Surprisingly Popular" Algorithm (SPA) comes from the 2017 paper "A solution to the single-question crowd wisdom problem" by Drazen Prelec, H. Sebastian Seung, and John McCoy, published in Nature.
SPA is a novel method for extracting the truth from a crowd even when the majority opinion is wrong. The key insight is that experts often possess an additional signal beyond the right answer itself: they know not only the correct answer but also what the layperson will believe. In the authors' own words: "the genius is you let a more knowledgeable minority reveal itself through predictions that the majority of people will disagree with them."
The paper showcases SPA's performance across diverse domains and a range of question types, but the easiest way to grok SPA is to consider a Yes/No question. The leading example in the paper is: "Is Philadelphia the capital of Pennsylvania?" More often than not, people get this question wrong: they incorrectly believe Philly is the capital (it is Harrisburg). In other words, a democratic vote will most likely yield the wrong answer of "Yes". But by applying SPA, under certain conditions we can expect to recover the right answer of "No".
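To make the mechanism concrete before simulating it, here is a minimal sketch with made-up poll numbers (these percentages are illustrative assumptions, not from the paper): each answer's actual vote share is compared to its predicted share, and the answer that beats its prediction wins.

```python
# Hypothetical poll on "Is Philadelphia the capital of Pennsylvania?" (truth: No)
# These shares are made up for illustration.
actual = {'Yes': 0.65, 'No': 0.35}      # how people actually answered
predicted = {'Yes': 0.75, 'No': 0.25}   # how people predicted others would answer

# The "surprisingly popular" answer is the one whose actual share
# most exceeds its predicted share.
surprise = {k: round(actual[k] - predicted[k], 2) for k in actual}
spa_answer = max(surprise, key=surprise.get)
print(surprise)    # {'Yes': -0.1, 'No': 0.1}
print(spa_answer)  # No
```

Even though "Yes" wins the raw vote 65–35, "No" outperforms its predicted share, so SPA returns "No".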
In this notebook, we simulate the SPA for this verifiable binary question (Y/N) under certain assumptions of the probability distributions of respondents' beliefs. We will see when SPA works, and when it can break down.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
# Assumed underlying probabilities of the population
# Let's assume this is the question: Is Philadelphia the capital of Pennsylvania? (correct answer = No)
# p_yes is the percentage of people who will vote 'Yes'
# In other words, let's assume only 30% of people know the right answer
p_yes = 0.70
# Now we enter the conditional world: what % of people would agree with an answer, given that Philly truly is or isn't the capital
# p_yes_given_yes is the percentage of people who will agree with an answer 'Yes' in the world where Philly really is the capital
# After all, what else would it be? It's the rational Schelling point.
# Let's say this is 90%
p_yes_given_yes = 0.90
# p_yes_given_no is the percentage of people who will agree with an answer of 'No' in the world where Philly is _not_ the capital
# This is less believable because Philadelphia is a historically significant city, has a big economy, and is much bigger than Pittsburgh or
# other cities in the state
# Let's say this is 60%
p_yes_given_no = 0.60
# Number of respondents
n = 100
def generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n):
    # Generate the 'Answer' column
    answers = np.random.choice(['Yes', 'No'], size=n, p=[p_yes, 1 - p_yes])
    # Generate the 'Prediction' column based on the given probabilities
    predictions = []
    for answer in answers:
        if answer == 'No':
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_no, 1 - p_yes_given_no])
        else:
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_yes, 1 - p_yes_given_yes])
        predictions.append(prediction)
    # Turn into a numpy array
    crowd_agreement = np.array(predictions)
    # Create the DataFrame
    df = pd.DataFrame({'Answer': answers, 'Agreement with Answer': crowd_agreement})
    return df
def surprisingly_popular(df):
    # Calculate the actual percentage of people who gave each answer
    actual_percentages = df['Answer'].value_counts(normalize=True)
    # Calculate the predicted percentage for each answer
    agreement_percentages = df['Agreement with Answer'].value_counts(normalize=True)
    # Calculate the difference between actual and agreement percentages
    differences = actual_percentages - agreement_percentages
    # Return the answer with the highest positive difference
    return differences.idxmax()
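As a quick sanity check, we can run `surprisingly_popular` on a tiny hand-built sample where the outcome is known in advance (the function is restated here so the snippet runs standalone):

```python
import pandas as pd

def surprisingly_popular(df):
    # Answer with the largest (actual share - predicted share) gap wins
    actual_percentages = df['Answer'].value_counts(normalize=True)
    agreement_percentages = df['Agreement with Answer'].value_counts(normalize=True)
    differences = actual_percentages - agreement_percentages
    return differences.idxmax()

# 10 respondents: 6 answer Yes and 4 answer No,
# but 8 of the 10 predict agreement with Yes.
toy = pd.DataFrame({
    'Answer':                ['Yes'] * 6 + ['No'] * 4,
    'Agreement with Answer': ['Yes'] * 8 + ['No'] * 2,
})
print(surprisingly_popular(toy))  # No: 40% actual vs 20% predicted
```

"No" is answered by 40% of the sample but only predicted at 20%, so it is the surprisingly popular answer despite "Yes" winning the raw vote.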
# Generate a sample and get the Surprisingly Popular answer
df = generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n)
result = surprisingly_popular(df)
print(f"The Surprisingly Popular answer is: {result}")
aggregation_answer = df["Answer"].value_counts(normalize=True)
aggregation_answer.plot(kind='bar', title="Personal Votes from the Responders")
plt.xlabel('Actual Answer/Response')
plt.ylabel('Percentage')
plt.tight_layout()
print(f"{aggregation_answer['Yes']*100}% of voters said Yes, {aggregation_answer['No']*100}% said No")
aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
aggregation_agreement.plot(kind='bar', title="Predicted Agreement with Each Answer")
plt.xlabel('Agreement with Answer')
plt.ylabel('Percentage')
plt.tight_layout()
print(f"{aggregation_agreement['Yes']*100}% of voters predicted agreement with Yes, {aggregation_agreement['No']*100}% with No")
spa_answer = (aggregation_answer - aggregation_agreement).idxmax()
if spa_answer == 'No':
    print(f"The SPA says the answer is 'No' because {aggregation_answer['No']*100}% of voters actually answered 'No' while the predicted share for 'No' was {aggregation_agreement['No']*100}%, so 'No' is surprisingly more popular than predicted.")
else:
    print(f"The SPA says the answer is 'Yes' because {aggregation_answer['Yes']*100}% of voters actually answered 'Yes' while the predicted share for 'Yes' was {aggregation_agreement['Yes']*100}%, so 'Yes' is surprisingly more popular than predicted.")
Here, we alter the assumptions of the underlying distributions to see how this changes the odds of getting the right answer.
# First, let's vary the percentage of people who actually know the right answer (we assumed 30% above)
# Let's assume the true world is 'No'
n = 100
p_yes_given_yes = 0.90
p_yes_given_no = 0.60
percent_knowing_truth = [0.01, 0.05, 0.1, 0.15, 0.2, 0.25]
for p_truth in percent_knowing_truth:
    df = generate_sample_data(1 - p_truth, p_yes_given_no, p_yes_given_yes, n)
    result = surprisingly_popular(df)
    print(f"The Surprisingly Popular answer is: {result}")
    aggregation_answer = df["Answer"].value_counts(normalize=True)
    aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
    spa_answer = (aggregation_answer - aggregation_agreement).idxmax()
    # Use label-based lookups with a default of 0 in case one answer never appears in a small sample
    if spa_answer == 'No':
        print(f"The SPA says the answer is 'No' because {aggregation_answer.get('No', 0)*100}% of voters actually answered 'No' while the predicted share for 'No' was {aggregation_agreement.get('No', 0)*100}%, so 'No' is surprisingly more popular than predicted.")
    else:
        print(f"The SPA says the answer is 'Yes' because {aggregation_answer.get('Yes', 0)*100}% of voters actually answered 'Yes' while the predicted share for 'Yes' was {aggregation_agreement.get('Yes', 0)*100}%, so 'Yes' is surprisingly more popular than predicted.")
As we see above, if too few people know the truth, the SPA fails.
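A back-of-the-envelope calculation (my own sketch, not from the paper) shows why, using expected shares instead of simulated ones. With the baseline conditionals, the expected SPA margin for 'No' is the expected share answering 'No' minus the expected share predicting agreement with 'No':

```python
# Expected SPA margin for 'No' as a function of p_truth,
# assuming the baseline conditionals used above.
p_yes_given_yes, p_yes_given_no = 0.90, 0.60

def expected_no_margin(p_truth):
    # E[share answering 'No'] is simply p_truth
    actual_no = p_truth
    # E[share predicting agreement with 'No'], mixing over the two answer groups
    predicted_no = (1 - p_truth) * (1 - p_yes_given_yes) + p_truth * (1 - p_yes_given_no)
    return actual_no - predicted_no

for p_truth in [0.05, 0.10, 0.15, 0.20, 0.25]:
    print(p_truth, round(expected_no_margin(p_truth), 3))
# The margin crosses zero at p_truth = 1/7 ~ 0.143: below that threshold,
# even an arbitrarily large crowd is expected to return 'Yes'.
```

This matches the simulations: the runs with 1-10% knowing the truth fail, while 15% and above succeed increasingly often.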
# Now, let's assume that the conditional worlds are murkier
# What happens when there is no strong reason to believe one answer over the other?
# Put differently, what happens when the "inside knowledge" that people draw on goes away?
# Here, that inside knowledge is the fact that Philadelphia is historically important and big
# Imagine instead that there are no exceptionally large cities in Pennsylvania
# Then any given city is effectively a toss-up; there is no other information to draw on
n = 100
p_yes = 0.70
p_yes_given_yes = [0.80, 0.70, 0.60, 0.50]
p_yes_given_no = [0.60, 0.50, 0.50, 0.50]
for i in range(len(p_yes_given_yes)):
    df = generate_sample_data(p_yes, p_yes_given_no[i], p_yes_given_yes[i], n)
    result = surprisingly_popular(df)
    print(f"The Surprisingly Popular answer is: {result}")
    aggregation_answer = df["Answer"].value_counts(normalize=True)
    aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
    spa_answer = (aggregation_answer - aggregation_agreement).idxmax()
    if spa_answer == 'No':
        print(f"The SPA says the answer is 'No' because {aggregation_answer['No']*100}% of voters actually answered 'No' while the predicted share for 'No' was {aggregation_agreement['No']*100}%, so 'No' is surprisingly more popular than predicted.")
    else:
        print(f"The SPA says the answer is 'Yes' because {aggregation_answer['Yes']*100}% of voters actually answered 'Yes' while the predicted share for 'Yes' was {aggregation_agreement['Yes']*100}%, so 'Yes' is surprisingly more popular than predicted.")
When there is no information about the possible counterfactual worlds to draw on (the probabilities of agreement are essentially a coin toss at 50%), the answer just reverts to the majority answer most of the time.
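This reduction to majority rule can be seen directly with a small sketch (illustrative numbers, not simulated data): when every predicted share is 0.5, the SPA differences are just `actual - 0.5`, so whichever answer holds the majority also has the largest gap.

```python
# When everyone predicts a 50/50 split, SPA reduces to majority rule:
# differences = actual - 0.5, so the majority answer always wins.
for actual_yes in [0.3, 0.7]:
    diffs = {'Yes': actual_yes - 0.5, 'No': (1 - actual_yes) - 0.5}
    print(actual_yes, max(diffs, key=diffs.get))
# 0.3 -> No (the majority), 0.7 -> Yes (the majority)
```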
# Now, let's see how the number of respondents changes things.
# We'll run 100 simulations for each n and see what % come back correctly as "No"
p_yes = 0.70
p_yes_given_yes = 0.90
p_yes_given_no = 0.60
responders = range(5, 101)
results = pd.DataFrame()
for n in tqdm(responders):
    number_no = 0
    for simulation in range(100):
        df = generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n)
        result = surprisingly_popular(df)
        # Record the result
        if result == "No":
            number_no += 1
    # Record result of all simulations (out of 100, so the count doubles as a %)
    results.loc[n, "Number of Correct Simulations"] = number_no
results["Number of Correct Simulations"].plot(
    title="% of Correct Simulations with Varying Size of Responder Pool",
    xlabel="Responders",
    ylabel="% with SPA correctly finding 'No' Answer")
As we see above, with around 50 respondents we can expect very good results most of the time.
# Finally, let's flex the number of respondents and the underlying probability distributions
# We'll run 100 simulations for each n-probability pairing and see what % come back correctly as "No"
# We'll do 5 different "states of the world"
# 1. Baseline (original assumptions)
# 2. Not too many people know the right answer but there are strong beliefs about the counterfactuals
# 3. Not too many people know the right answer and there aren't strong beliefs about the counterfactuals
# 4. More people know the right answer but there aren't strong beliefs about the counterfactuals
# 5. More people know the right answer and there are strong beliefs about the counterfactuals
p_yes = [0.70, 0.85, 0.85, 0.60, 0.60]
p_yes_given_yes = [0.90, 0.90, 0.50, 0.50, 0.90]
p_yes_given_no = [0.60, 0.60, 0.50, 0.50, 0.60]
responders = range(5, 101)
results = pd.DataFrame()
for n in tqdm(responders):
    for world_state_i in range(len(p_yes)):
        number_no = 0
        for simulation in range(100):
            df = generate_sample_data(p_yes[world_state_i],
                                      p_yes_given_no[world_state_i],
                                      p_yes_given_yes[world_state_i],
                                      n)
            result = surprisingly_popular(df)
            # Record the result
            if result == "No":
                number_no += 1
        # Record result of all simulations for that n and state-of-the-world pairing
        column_name = f"{(1-p_yes[world_state_i])*100}% Knowing Truth, {p_yes_given_no[world_state_i]*100}% Believe Yes given Truth is No, {p_yes_given_yes[world_state_i]*100}% Believe Yes given Truth is Yes"
        results.loc[n, column_name] = number_no
results.columns = [f"world state {i+1}" for i in range(5)]
results.plot(figsize=(10, 10),
             title="Number of Correct Simulations (answer='No') with Varying Size of Responder Pool & Probabilities",
             xlabel="Responders")
The plot above reveals some fascinating trends.
If the conditions are good, i.e. enough people know the truth and beliefs about the different states of the world are clear, then we see quick convergence around 40 respondents (world states 1 and 5).
The main driving factor is whether respondents can find any good reason to believe one state of the world over the other; if there is no inside knowledge to pull from, then the majority vote eventually just takes over (world states 3 and 4). The Supplementary Information of the Nature paper points this out: "Note that if all respondents simply predict that 50% of the sample will endorse each of the two possible answers, then the surprisingly popular answer is the same as that obtained by majority rule."
A sufficient number of people do need to actually know the truth, though. Even when there is variation in the counterfactual distributions, if only a small % of people know the truth, the SPA converges on the right answer slowly, or may never converge at all (world state 2).
We've also critically assumed that respondents' answers are independent of one another. If their answers are correlated, this can degrade the performance of SPA.
Essentially, there needs to be a good answer to the question, "Why would someone believe that?"
Q: Is Philadelphia the capital of Pennsylvania? -> someone would believe Yes because they are drawing on local knowledge of history and culture.
"Do you think people will agree with your answer?" needs to be answerable with some underlying signals.
The SPA assumes respondents arrive at answers independently; what happens if this doesn't hold?
# Here, let's try to mess with the assumption of independent answers
# We'll give respondent i's answer a chance of depending on the majority vote up until that point
def generate_sample_data_not_independent(p_yes, p_yes_given_no, p_yes_given_yes, n, p_influenced):
    # The first person makes a guess based on the true distribution
    answers = np.random.choice(['Yes', 'No'], size=1, p=[p_yes, 1 - p_yes])
    # Each subsequent answer has a chance (p_influenced) of being
    # strongly influenced by the answers that came before it
    for _ in range(n - 1):
        if np.random.rand() < p_influenced:
            # Influenced by previous answers (pick whatever is most frequent so far)
            majority_answer = 'Yes' if (answers == 'Yes').mean() > 0.5 else 'No'
            answers = np.append(answers, majority_answer)
        else:
            # Uninfluenced answer
            answers = np.append(answers, np.random.choice(['Yes', 'No'], size=1, p=[p_yes, 1 - p_yes]))
    # Generate the 'Prediction' column based on the given probabilities
    predictions = []
    for answer in answers:
        if answer == 'No':
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_no, 1 - p_yes_given_no])
        else:
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_yes, 1 - p_yes_given_yes])
        predictions.append(prediction)
    # Turn into a numpy array
    crowd_agreement = np.array(predictions)
    # Create the DataFrame
    df = pd.DataFrame({'Answer': answers, 'Agreement with Answer': crowd_agreement})
    return df
# Now, let's run simulations with this data-sampling issue for various probabilities of being influenced
# We'll assume the baseline probability distributions
p_yes = 0.70
p_yes_given_yes = 0.90
p_yes_given_no = 0.60
responders = range(5, 101)
results = pd.DataFrame()
# Chances of being influenced, from low (almost independent) to high (strong dependency)
p_influenced = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75]
for n in tqdm(responders):
    for i in range(len(p_influenced)):
        number_no = 0
        for simulation in range(100):
            df = generate_sample_data_not_independent(p_yes, p_yes_given_no, p_yes_given_yes, n, p_influenced[i])
            result = surprisingly_popular(df)
            # Record the result
            if result == "No":
                number_no += 1
        # Record result of all simulations
        results.loc[n, f"Chance of answer being influenced: {p_influenced[i]*100}%"] = number_no
results.plot(title="% of Correct Simulations with Varying Size of Responder Pool & Chance of Dependent Answers",
             xlabel="Responders",
             ylabel="% with SPA correctly finding 'No' Answer",
             figsize=(10, 10))