Simulating the "Surprisingly Popular" Algorithm

Kyle Waters, August 2023

The Pennsylvania State Capitol in Harrisburg, PA

The "Surprisingly Popular" Algorithm (SPA) comes from the 2017 paper "A Solution to the single-question crowd wisdom problem" published in Nature by Drazen Prelec, H. Sebastian Seung, and John McCoy.

SPA is a method for extracting the truth from a crowd even when the majority opinion is wrong. The key insight is that experts often carry an additional signal beyond the right answer: they know not only the correct answer but also what the layperson will believe. In the authors' own words: "the genius is you let a more knowledgeable minority reveal itself through predictions that the majority of people will disagree with them."

The paper showcases SPA's performance across diverse domains and a range of question types, but the easiest way to grok SPA is to consider a Yes/No question. The leading example in the paper is: "Is Philadelphia the capital of Pennsylvania?" It is a question that, more often than not, people get wrong: they incorrectly believe Philly is the capital (it is Harrisburg). In other words, a democratic vote will most likely yield the wrong answer, "Yes". By applying the SPA, however, we can expect to recover the right answer, "No", under certain conditions.

In this notebook, we simulate the SPA for this verifiable binary question (Y/N) under certain assumptions of the probability distributions of respondents' beliefs. We will see when SPA works, and when it can break down.
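Before any simulation, the decision rule itself fits in a few lines. Here is a hand-worked sketch with illustrative shares (the numbers are made up for the example, not taken from the paper):

```python
# Illustrative shares for "Is Philadelphia the capital of Pennsylvania?"
actual = {"Yes": 0.70, "No": 0.30}      # how respondents actually voted
predicted = {"Yes": 0.77, "No": 0.23}   # average predicted agreement shares

# The surprisingly popular answer is the one whose actual share exceeds
# its predicted share by the most
spa_answer = max(actual, key=lambda a: actual[a] - predicted[a])
print(spa_answer)  # 'No': 30% actually said No, but only 23% were predicted to
```

Even though "Yes" wins the raw vote, "No" beats its predicted share, so SPA picks "No".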

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

Defining the underlying probability distributions

In [ ]:
# Assumed underlying probabilities of the population
# Let's assume this is the question: Is Philadelphia the capital of Pennsylvania? (correct answer = No)

# p_yes is the percentage of people who will vote 'Yes'
# In other words, let's assume only 30% of people know the right answer
p_yes = 0.70

# Now we enter the conditional world: the chance a respondent predicts the crowd will agree with 'Yes',
# conditional on their own answer
# p_yes_given_yes is the probability that someone who answered 'Yes' predicts the crowd will also say 'Yes'
# After all, what else would the capital be? It's the rational Schelling point.
# Let's say this is 90%
p_yes_given_yes = 0.90 

# p_yes_given_no is the probability that someone who answered 'No' (i.e., knows the truth)
# still predicts the crowd will say 'Yes'
# This is high because Philadelphia is a historically significant city, has a big economy, and is much
# bigger than Pittsburgh or any other city in the state
# Let's say this is 60%
p_yes_given_no = 0.60

# Number of respondents
n = 100

Generate the sample dataset of responses

In [ ]:
def generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n):

    # Generate the 'Answer' column
    answers = np.random.choice(['Yes', 'No'], size=n, p=[p_yes, 1-p_yes])

    # Generate the 'Prediction' column based on the given probabilities
    predictions = []
    for answer in answers:
        if answer == 'No':
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_no, 1-p_yes_given_no])
        else:
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_yes, 1-p_yes_given_yes])
        predictions.append(prediction)

    # Turn into a numpy array
    crowd_agreement  = np.array(predictions)

    # Create the DataFrame
    df = pd.DataFrame({'Answer': answers, 'Agreement with Answer': crowd_agreement})

    return df

Define the SPA Algorithm

In [ ]:
def surprisingly_popular(df):
    # Actual percentage of respondents who gave each answer
    actual_percentages = df['Answer'].value_counts(normalize=True)

    # Predicted (agreement) percentage for each answer
    agreement_percentages = df['Agreement with Answer'].value_counts(normalize=True)

    # Difference between actual and predicted percentages
    # (fill_value=0 guards against an answer missing from one of the counts)
    differences = actual_percentages.sub(agreement_percentages, fill_value=0)

    # The answer with the largest positive difference is "surprisingly popular"
    return differences.idxmax()

Aggregate the Results

In [ ]:
# Generate a sample and get the Surprisingly Popular answer
df = generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n)
result = surprisingly_popular(df)
print(f"The Surprisingly Popular answer is: {result}")
The Surprisingly Popular answer is: No

Visualizations to help understand the responses

In [ ]:
aggregation_answer = df["Answer"].value_counts(normalize=True)
aggregation_answer.plot(kind='bar',title="Personal Votes from the Responders")
plt.xlabel('Actual Answer/Response')
plt.ylabel('Percentage')
plt.tight_layout()
print(f"{aggregation_answer[0]*100}% percentage of voters said Yes, {aggregation_answer[1]*100}% said No")
66.0% percentage of voters said Yes, 34.0% said No
In [ ]:
aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
aggregation_agreement.plot(kind='bar',title="People will agree with your answer")
plt.xlabel('Agreement with Answer')
plt.ylabel('Percentage')
plt.tight_layout()
print(f"{aggregation_agreement[0]*100}% percentage of voters said Yes, {aggregation_agreement[1]*100}% said No")
77.0% percentage of voters said Yes, 23.0% said No
In [ ]:
spa_answer = aggregation_answer.sub(aggregation_agreement, fill_value=0).idxmax()
if spa_answer == 'No':
    print(f"The SPA says the answer is 'No' because {aggregation_answer['No']*100:.1f}% of voters actually answered 'No' when the predicted share for 'No' was {aggregation_agreement['No']*100:.1f}%, so it is surprisingly more popular than predicted.")
else:
    print(f"The SPA says the answer is 'Yes' because {aggregation_answer['Yes']*100:.1f}% of voters actually answered 'Yes' when the predicted share for 'Yes' was {aggregation_agreement['Yes']*100:.1f}%, so it is surprisingly more popular than predicted.")
The SPA says the answer is 'No' because 34.0% of voters actually answered 'No' when the predicted share for 'No' was 23.0%, so it is surprisingly more popular than predicted.

Running Simulations to see when the SPA holds

Here, we alter the assumptions of the underlying distributions to see how this changes the odds of getting the right answer.

In [ ]:
# First, let's vary the percentage of people who actually know the right answer (we assumed 30% above)
# The true answer remains 'No'

n = 100
p_yes_given_yes = 0.90 
p_yes_given_no  = 0.60 

percent_knowing_truth = [0.01, 0.05, 0.10, 0.15, 0.20, 0.25]

for p_truth in percent_knowing_truth:
    df = generate_sample_data(1-p_truth, p_yes_given_no, p_yes_given_yes, n)
    result = surprisingly_popular(df)
    print(f"The Surprisingly Popular answer is: {result}")

    aggregation_answer = df["Answer"].value_counts(normalize=True)
    aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
    spa_answer = aggregation_answer.sub(aggregation_agreement, fill_value=0).idxmax()
    # .get(..., 0) guards against an answer that nobody gave
    if spa_answer == 'No':
        print(f"The SPA says the answer is 'No' because {aggregation_answer.get('No', 0)*100:.1f}% of voters actually answered 'No' when the predicted share for 'No' was {aggregation_agreement.get('No', 0)*100:.1f}%, so it is surprisingly more popular than predicted.")
    else:
        print(f"The SPA says the answer is 'Yes' because {aggregation_answer.get('Yes', 0)*100:.1f}% of voters actually answered 'Yes' when the predicted share for 'Yes' was {aggregation_agreement.get('Yes', 0)*100:.1f}%, so it is surprisingly more popular than predicted.")
The Surprisingly Popular answer is: Yes
The SPA says the answer is 'Yes' because 100.0% of voters actually answered 'Yes' when the predicted share for 'Yes' was 92.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: Yes
The SPA says the answer is 'Yes' because 97.0% of voters actually answered 'Yes' when the predicted share for 'Yes' was 84.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 13.0% of voters actually answered 'No' when the predicted share for 'No' was 11.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 16.0% of voters actually answered 'No' when the predicted share for 'No' was 14.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 25.0% of voters actually answered 'No' when the predicted share for 'No' was 16.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 26.0% of voters actually answered 'No' when the predicted share for 'No' was 20.0%, so it is surprisingly more popular than predicted.

As we see above, if too few people know the truth, the surprise signal is swamped and SPA fails.
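A quick expected-value check (my own back-of-envelope, not from the paper) makes the failure point precise. Under the notebook's sampling model, the expected 'No' vote share is p_truth, while the expected predicted 'No' share is p_truth*(1 - p_yes_given_no) + (1 - p_truth)*(1 - p_yes_given_yes); SPA favours 'No' only when the former exceeds the latter:

```python
# Back-of-envelope (not from the paper): expected SPA gap for 'No'
# under this notebook's sampling model, with the baseline conditionals
p_yes_given_yes = 0.90
p_yes_given_no = 0.60

def expected_no_gap(p_truth):
    """Expected (actual 'No' share) minus (predicted 'No' share)."""
    actual_no = p_truth
    predicted_no = (p_truth * (1 - p_yes_given_no)
                    + (1 - p_truth) * (1 - p_yes_given_yes))
    return actual_no - predicted_no

# Setting the gap to zero and solving for p_truth gives the flip point:
# t = (1 - p_yes_given_yes) / (p_yes_given_no + 1 - p_yes_given_yes) = 0.1 / 0.7
threshold = (1 - p_yes_given_yes) / (p_yes_given_no + 1 - p_yes_given_yes)
print(f"SPA favours 'No' in expectation once p_truth > {threshold:.3f}")  # ~0.143
```

This lines up with the simulation: runs with 1% or 5% truth-knowers fail, runs at 15% and above succeed, and 10% sits near the boundary where sampling noise decides.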

In [ ]:
# Now, let's assume that the conditional worlds are murkier
# What happens when there is no strong reason to believe one answer over the other?
# Put differently, what happens when the "inside knowledge" that people draw on goes away?
# Here that inside knowledge is the fact that Philadelphia is historically important and big
# Imagine instead that there are no exceptionally large cities in Pennsylvania
# Then any given city is effectively a toss-up; there is no other information to draw on

n = 100
p_yes = 0.70

p_yes_given_yes = [0.80, 0.70, 0.60, 0.50]
p_yes_given_no  = [0.60, 0.50, 0.50, 0.50]

for i in range(len(p_yes_given_yes)):
    df = generate_sample_data(p_yes, p_yes_given_no[i], p_yes_given_yes[i], n)
    result = surprisingly_popular(df)
    print(f"The Surprisingly Popular answer is: {result}")

    aggregation_answer = df["Answer"].value_counts(normalize=True)
    aggregation_agreement = df["Agreement with Answer"].value_counts(normalize=True)
    spa_answer = aggregation_answer.sub(aggregation_agreement, fill_value=0).idxmax()
    if spa_answer == 'No':
        print(f"The SPA says the answer is 'No' because {aggregation_answer.get('No', 0)*100:.1f}% of voters actually answered 'No' when the predicted share for 'No' was {aggregation_agreement.get('No', 0)*100:.1f}%, so it is surprisingly more popular than predicted.")
    else:
        print(f"The SPA says the answer is 'Yes' because {aggregation_answer.get('Yes', 0)*100:.1f}% of voters actually answered 'Yes' when the predicted share for 'Yes' was {aggregation_agreement.get('Yes', 0)*100:.1f}%, so it is surprisingly more popular than predicted.")
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 34.0% of voters actually answered 'No' when the predicted share for 'No' was 30.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: No
The SPA says the answer is 'No' because 31.0% of voters actually answered 'No' when the predicted share for 'No' was 28.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: Yes
The SPA says the answer is 'Yes' because 70.0% of voters actually answered 'Yes' when the predicted share for 'Yes' was 57.0%, so it is surprisingly more popular than predicted.
The Surprisingly Popular answer is: Yes
The SPA says the answer is 'Yes' because 78.0% of voters actually answered 'Yes' when the predicted share for 'Yes' was 50.0%, so it is surprisingly more popular than predicted.

When there is no information about the possible counterfactual worlds to draw on (probabilities of agreement essentially a coin toss at 50%), the answer mostly reverts to the majority answer.
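The degenerate case can be checked directly in expectation (again a back-of-envelope, consistent with the note in the paper's Supplementary information that uniform 50% predictions reduce SPA to majority rule):

```python
# When predictions carry no signal (both conditionals at 0.5), the expected
# predicted share of each answer is 50%, so SPA's score reduces to
# (actual share - 0.5): exactly the majority-rule ranking
p_yes = 0.70

expected_actual = {"Yes": p_yes, "No": 1 - p_yes}
expected_predicted = {"Yes": 0.5, "No": 0.5}  # 50/50 regardless of answers

diff = {a: expected_actual[a] - expected_predicted[a] for a in expected_actual}
spa_in_expectation = max(diff, key=diff.get)
majority = max(expected_actual, key=expected_actual.get)
print(spa_in_expectation, majority)  # Yes Yes -> SPA collapses to majority rule
```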

In [ ]:
# Now, let's see how the number of respondents changes things.
# We'll run 100 simulations for each n and see what % come back as "No" correctly

p_yes = 0.70
p_yes_given_yes = 0.90 
p_yes_given_no  = 0.60 

responders = range(5,101)
results = pd.DataFrame()

for n in tqdm(responders):
    number_no = 0
    for simulation in range(100):
        df = generate_sample_data(p_yes, p_yes_given_no, p_yes_given_yes, n)
        result = surprisingly_popular(df)
        # Record the result
        if result == "No":
            number_no +=1
    # Record result of all simulations
    results.loc[n,"Number of Correct Simulations"]=number_no
100%|██████████| 96/96 [00:25<00:00,  3.70it/s]
In [ ]:
results["Number of Correct Simulations"].plot(title="% of Correct Simulations with Varying Size of Responder Pool", xlabel="Responders",y="% with SPA correctly finding 'No' Answer")
Out[ ]:
<AxesSubplot:title={'center':'% of Correct Simulations with Varying Size of Responder Pool'}, xlabel='Responders'>

As we see above, with around 50 respondents SPA finds the correct answer in the large majority of simulations.
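A rough noise argument (my own estimate, treating the two observed shares as independent binomial proportions, which they are not quite) suggests why roughly 50 respondents suffice at the baseline settings: the expected SPA gap for 'No' is 0.30 - 0.19 = 0.11, and its sampling noise shrinks like 1/sqrt(n):

```python
import math

# Expected shares under the baseline: p_yes = 0.70,
# p_yes_given_yes = 0.90, p_yes_given_no = 0.60
actual_no = 0.30
predicted_no = 0.30 * (1 - 0.60) + 0.70 * (1 - 0.90)  # = 0.19
gap = actual_no - predicted_no                         # = 0.11

def gap_std(n):
    # Crude std of the observed gap, treating the two shares as
    # independent binomial proportions (an approximation)
    var = (actual_no * (1 - actual_no) + predicted_no * (1 - predicted_no)) / n
    return math.sqrt(var)

for n in (10, 50, 100):
    # "z-score" of the expected gap: how many noise-stds it clears
    print(n, round(gap / gap_std(n), 2))
```

At n = 50 the expected gap is a bit over one standard deviation of noise, which lines up with the simulation finding the correct answer in most runs by that pool size.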

In [ ]:
# Finally, let's flex the number of respondents and the underlying probability distributions
# We'll run 100 simulations for each n - probability pairing and see what % come back as "No" correctly
# We'll do 5 different "states of the world"

# 1. Baseline (original assumptions)
# 2. Not too many people know the right answer but there are strong beliefs about the counterfactuals
# 3. Not too many people know the right answer and there aren't strong beliefs about the counterfactuals
# 4. More people know the right answer but there aren't strong beliefs about the counterfactuals
# 5. More people know the right answer and there are strong beliefs about the counterfactuals

p_yes           = [0.70,0.85,0.85,0.60,0.60]
p_yes_given_yes = [0.90,0.90,0.50,0.50,0.90] 
p_yes_given_no  = [0.60,0.60,0.50,0.50,0.60] 

responders = range(5,101)
results = pd.DataFrame()

for n in tqdm(responders):
    for world_state_i in range(len(p_yes)):
        number_no = 0
        for simulation in range(100):
            df = generate_sample_data(p_yes[world_state_i], 
                                      p_yes_given_no[world_state_i], 
                                      p_yes_given_yes[world_state_i], 
                                      n)
            result = surprisingly_popular(df)
            # Record the result
            if result == "No":
                number_no +=1

        # Record result of all simulations for that n and state of the world pairing
        column_name = f"{(1-p_yes[world_state_i])*100}% Knowing Truth, {p_yes_given_no[world_state_i]*100}% Believe in No given Truth is Yes, {p_yes_given_yes[world_state_i]*100}% Believe in Yes given Truth is Yes"
        results.loc[n,column_name]=number_no
100%|██████████| 96/96 [02:20<00:00,  1.46s/it]
In [ ]:
results.columns = [f"world state {i+1}" for i in range(5)]
results.plot(figsize=(10,10),title="Number of Correct Simulations (answer='No') with Varying Size of Responder Pool & Probabilities", xlabel="Responders")
Out[ ]:
<AxesSubplot:title={'center':"Number of Correct Simulations (answer='No') with Varying Size of Responder Pool & Probabilities"}, xlabel='Responders'>

The plot above reveals some fascinating trends.

  1. If conditions are good (enough people know the truth and beliefs about the different states of the world are clear), we see quick convergence, around 40 respondents (world states 1, 5).

  2. The main driving factor is whether respondents have any good reason to believe one state of the world over the other. If there is no inside knowledge to pull from, the majority vote eventually takes over (world states 3, 4). The Supplementary information of the Nature paper points this out: "Note that if all respondents simply predict that 50% of the sample will endorse each of the two possible answers, then the surprisingly popular answer is the same as that obtained by majority rule."

  3. A sufficient number of people do need to know the truth, though. Even with variation in the counterfactual distributions, a small percentage of truth-knowers makes it hard, or impossible, to converge on the right answer (world state 2).

We have also assumed, critically, that respondents' answers are independent of one another. If answers are correlated, SPA's performance can degrade.

Essentially, there needs to be a good answer to the question, "Why would someone believe that?"

Q: Is Philadelphia the capital of Pennsylvania? -> Someone would believe "Yes" because they are drawing on local knowledge of history and culture.

"Do you think people will agree with your answer?" needs to be answerable with some underlying signals.

What happens when the independence assumption breaks down?

The SPA assumes respondents arrive at answers independently; what happens if this doesn't hold?

In [ ]:
# Here, let's mess with the assumption of independent answers
# We'll give respondent i's answer a chance of depending on the majority vote up to that point

def generate_sample_data_not_independent(p_yes, p_yes_given_no, p_yes_given_yes, n, p_influenced):

    # The first person makes a guess based on the true distribution
    answers = [np.random.choice(['Yes', 'No'], p=[p_yes, 1-p_yes])]

    # Each subsequent answer has probability p_influenced of simply copying
    # the majority answer so far; otherwise it is drawn independently
    for _ in range(n-1):
        if np.random.rand() < p_influenced:
            # Influenced: pick whatever is most frequent up to this point
            yes_share = answers.count('Yes') / len(answers)
            answers.append('Yes' if yes_share > 0.5 else 'No')
        else:
            # Uninfluenced: independent draw from the true distribution
            answers.append(np.random.choice(['Yes', 'No'], p=[p_yes, 1-p_yes]))

    answers = np.array(answers)

    # Generate the 'Prediction' column based on the given probabilities
    predictions = []
    for answer in answers:
        if answer == 'No':
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_no, 1-p_yes_given_no])
        else:
            prediction = np.random.choice(['Yes', 'No'], p=[p_yes_given_yes, 1-p_yes_given_yes])
        predictions.append(prediction)

    # Turn into a numpy array
    crowd_agreement = np.array(predictions)

    # Create the DataFrame
    df = pd.DataFrame({'Answer': answers, 'Agreement with Answer': crowd_agreement})

    return df
In [ ]:
# Now, let's run simulations with this data sampling issue for various probabilities of being influenced
# We'll assume the baseline assumptions of the probability distributions

p_yes = 0.70
p_yes_given_yes = 0.90 
p_yes_given_no  = 0.60 

responders = range(5,101)
results = pd.DataFrame()
# Chance of being influenced, from low (nearly independent) to high (strong dependence)
p_influenced = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75]

for n in tqdm(responders):
    for i in range(len(p_influenced)):
        number_no = 0
        for simulation in range(100):
            df = generate_sample_data_not_independent(p_yes, p_yes_given_no, p_yes_given_yes, n, p_influenced[i])
            result = surprisingly_popular(df)
            # Record the result
            if result == "No":
                number_no +=1
        # Record result of all simulations
        results.loc[n,f"Chance of answer being influenced: {p_influenced[i]*100}%"]=number_no
100%|██████████| 96/96 [04:22<00:00,  2.73s/it]
In [ ]:
results.plot(title="% of Correct Simulations with Varying Size of Responder Pool & Chance of Dependent Answers", 
             xlabel="Responders",
             ylabel="% with SPA correctly finding 'No' Answer",
             figsize=(10,10))
Out[ ]:
<AxesSubplot:title={'center':'% of Correct Simulations with Varying Size of Responder Pool & Chance of Dependent Answers'}, xlabel='Responders', ylabel="% with SPA correctly finding 'No' Answer">