Competent Jurors and Weak Learners
The Condorcet Jury Theorem, a political science result from 1785, states that a majority vote is more likely to be correct than any single vote, provided each vote is cast independently and each voter is correct with probability greater than 0.5 (the requirement for a “competent” juror).
As an example, say there’s a jury of three where each juror votes correctly with probability 0.8. The probability that the majority vote is correct is then
1 - CDF[BinomialDistribution[3, .8], 1] = 0.896. Just stick that expression into WolframAlpha if you want to see for yourself.
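If you’d rather check it in plain Python, the same number falls out of the binomial formula directly (the function name here is my own, not from any library):

```python
from math import comb

def majority_prob(n, p):
    """Probability that a majority of n independent voters,
    each correct with probability p, reaches the right verdict."""
    # Sum P(exactly k correct) over every k that forms a majority.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(round(majority_prob(3, 0.8), 3))  # 0.896
```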
Of course, there’s a proof for this as well, but what’s interesting is its connection to the AdaBoost Algorithm.
The AdaBoost Algorithm (Adaptive Boosting) is a type of boosting that combines many weak learners, such as depth-one decision trees (sometimes called decision stumps), into a single strong learner.
The algorithm is as follows:
- Take a uniform resample of your training data
- Fit the first weak learner on this resample, then assign a weight to each observation: a lower weight for observations it predicted correctly and a higher weight for those it got wrong. A weight is also assigned to the weak learner itself based on how well it performed
- Now resample your training data again, this time with sampling probabilities driven by the observation weights from the previous step. Higher-weighted observations are more likely to be drawn.
- Now fit a second weak learner to this resample and repeat step 2: update the observation weights and assign the new learner a weight based on its performance.
- Repeat this process until you’ve reached the maximum number of weak learners you want to use.
- The final model will be a weighted sum over all the weak learners (remember that we assigned weights to each weak learner based on their performance).
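The steps above can be sketched from scratch in Python. This is a minimal sketch of the resampling variant described here, using numpy and depth-one stumps; the function names and the weight-update formulas (the standard discrete-AdaBoost ones) are my own choices, not taken from any particular library:

```python
import numpy as np

def fit_stump(X, y):
    """Fit a depth-one decision tree (stump) on labels y in {-1, +1}.

    Returns (feature index, threshold, polarity) of the best split."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)  # feature, threshold, polarity, error count
    for j in range(d):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) > 0, 1, -1)
                err = np.sum(pred != y)
                if err < best[3]:
                    best = (j, t, pol, err)
    return best[:3]

def stump_predict(stump, X):
    j, t, pol = stump
    return np.where(pol * (X[:, j] - t) > 0, 1, -1)

def adaboost_resample(X, y, n_learners=10, rng=None):
    """AdaBoost via weighted resampling, mirroring the steps above."""
    rng = np.random.default_rng(rng)
    n = len(y)
    w = np.full(n, 1.0 / n)              # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(n_learners):
        # Resample the training data according to the current weights.
        idx = rng.choice(n, size=n, p=w)
        stump = fit_stump(X[idx], y[idx])
        pred = stump_predict(stump, X)   # evaluate on the full training set
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight for this weak learner
        # Upweight misclassified observations, downweight correct ones.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    # Final model: a weighted vote over all the weak learners.
    score = sum(a * stump_predict(s, X) for s, a in zip(learners, alphas))
    return np.sign(score)
```

No single stump can classify an interval like “positive when 3 ≤ x ≤ 7” on a one-dimensional feature, but a weighted vote of stumps on both boundaries can, which is a nice way to see the boosting effect.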
Can you spot the similarities between the Condorcet Jury Theorem and the AdaBoost Algorithm yourself? In AdaBoost, the weak learners that are slightly better than random guessing are the jury members who are “competent”, meaning the probability they vote correctly is greater than 0.5. And just as the theorem assumes jury members vote independently, boosting works best when the weak learners’ errors are roughly uncorrelated; the reweighting step pushes each new learner to make different mistakes than the last. More generally, the relationship holds for any ensemble of weak learners.
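Under the theorem’s idealized assumption of fully independent errors, you can watch the Condorcet effect directly: even barely competent jurors become a near-certain majority as the jury grows. A small sketch (function name my own):

```python
from math import comb

def majority_prob(n, p):
    """Chance that a majority of n independent voters, each correct
    with probability p, gets the right answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Each "juror" (weak learner) is only slightly better than a coin flip,
# yet the majority vote keeps improving as jurors are added.
for n in (1, 11, 101, 501):
    print(n, round(majority_prob(n, 0.55), 3))
```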
I don’t know about you, but I thought this relationship between an eighteenth-century political science theorem and a modern machine learning method was fascinating.