Notes on the book Superforecasting by Tetlock and Gardner

They ran studies to see how good people are at forecasting, and what makes someone a good forecaster. The first study ("EPJ") included only experts; a later one (the "Good Judgment Project") included all sorts of people. Some people, the 'superforecasters', did really well.

The Good Judgment Project won a 5-team contest, sponsored by the US's IARPA, to forecast geopolitical events (p 17).

Chapter 1: "An optimistic skeptic"

p. 1: an example of a superforecaster in their study: "For years, Bill worked for the US Department of Agriculture in Arizona--“part pick-and-shovel work, part spreadsheet”--but now he lives in Kearney, Nebraska. Bill is a native Cornhusker. He grew up in Madison, Nebraska, a farm town where his parents owned and published the Madison Star-Mail, a newspaper with lots of stories about local sports and county fairs. He was a good student in high school and he went on to get a bachelor of science degree from the University of Nebraska. From there, he went to the University of Arizona. He was aiming for a PhD in math, but he realized it was beyond his abilities--“I had my nose rubbed in my limitations” is how he puts it--and he dropped out. It wasn’t wasted time, however. Classes in ornithology made Bill an avid bird-watcher, and because Arizona is a great place to see birds, he did fieldwork part-time for scientists, then got a job with the Department of Agriculture and stayed for a while. Bill is fifty-five and retired, although he says if someone offered him a job he would consider it. So he has free time. And he spends some of it forecasting. Bill has answered roughly three hundred questions like “Will Russia officially annex additional Ukrainian territory in the next three months?” and “In the next year, will any country withdraw from the eurozone?” They are questions that matter. And they’re difficult. Corporations, banks, embassies, and intelligence agencies struggle to answer such questions all the time. “Will North Korea detonate a nuclear device before the end of this year?” “How many additional countries will report cases of the Ebola virus in the next eight months?” “Will India or Brazil become a permanent member of the UN Security Council in the next two years?” Some of the questions are downright obscure, at least for most of us. 
“Will NATO invite new countries to join the Membership Action Plan (MAP) in the next nine months?” “Will the Kurdistan Regional Government hold a referendum on national independence this year?” “If a non-Chinese telecommunications firm wins a contract to provide Internet services in the Shanghai Free Trade Zone in the next two years, will Chinese citizens have access to Facebook and/or Twitter?” When Bill first sees one of these questions, he may have no clue how to answer it. “What on earth is the Shanghai Free Trade Zone?” he may think. But he does his homework. He gathers facts, balances clashing arguments, and settles on an answer."

p.3 superforecasters are about the top 2% in their study

p.3 superforecasters include "engineers and lawyers, artists and scientists, Wall Streeters and Main Streeters, professors and students. We will meet many of them, including a mathematician, a filmmaker, and some retirees eager to share their underused talents."

p.4 superforecasting is "a skill that can be cultivated. This book will show you how."

p.4-5 Tetlock's earlier study (EPJ) showed that "the average expert was roughly as accurate as a dart-throwing chimpanzee." But this is only the AVERAGE expert in the study -- some forecasters still did better. He regrets that many people misrepresent the study as claiming that "all expert forecasts are useless".

p.5 "It was easiest to beat chance on the shortest-range questions that only required looking one year out, and accuracy fell off the further out experts tried to forecast—approaching the dart-throwing-chimpanzee level three to five years out. That was an important finding. It tells us something about the limits of expertise in a complex world—and the limits on what it might be possible for even superforecasters to achieve."

p.6-13 some things can't be predicted, such as the long-term future of chaotic dynamical systems, eg long-term predictions about whether it will rain on some particular day. Note that even in that case, short-term predictions can be reliable.

p. 18 superforecasters "aren’t gurus or oracles with the power to peer decades into the future, but they do have a real, measurable skill at judging how high-stakes events are likely to unfold three months, six months, a year, or a year and a half in advance."

p. 18 superforecaster "habits of thought can be learned and cultivated by any intelligent, thoughtful, determined person. It may not even be all that hard to get started. One result that particularly surprised me was the effect of a tutorial covering some basic concepts that we’ll explore in this book and are summarized in the Ten Commandments appendix. It took only about sixty minutes to read and yet it improved accuracy by roughly 10% through the entire tournament year. Yes, 10% may sound modest, but it was achieved at so little cost."

p.18 a modest improvement in forecasting is valuable; "A world-class poker player we will meet soon...said...the difference between heavyweights and amateurs...is that the heavyweights know the difference between a 60⁄40 bet and a 40⁄60 bet."

p.19 the experts who forecast well had "modest but real foresight"

p. 20 "Why are they so good? That question runs through chapters 5 through 9. When you meet them it’s hard not to be struck by how smart they are, so you might suspect it’s intelligence that makes all the difference. It’s not. They’re also remarkably numerate. Like Bill Flack, many have advanced degrees in mathematics and science. So is the secret arcane math? No. Even superforecasters who are card-carrying mathematicians rarely use much math. They also tend to be newsjunkies who stay on top of the latest developments and regularly update their forecasts, so you might be tempted to attribute their success to spending endless hours on the job. Yet that too would be a mistake. Superforecasting does require minimum levels of intelligence, numeracy, and knowledge of the world, but anyone who reads serious books about psychological research probably has those prerequisites. So what is it that elevates forecasting to superforecasting? As with the experts who had real foresight in my earlier research, what matters most is how the forecaster thinks. I’ll describe this in detail, but broadly speaking, superforecasting demands thinking that is open-minded, careful, curious, and—above all—self-critical. It also demands focus. The kind of thinking that produces superior judgment does not come effortlessly. Only the determined can deliver it reasonably consistently, which is why our analyses have consistently found commitment to self-improvement to be the strongest predictor of performance. "

footnote 7 to chapter 1: "For islands of professionalism in a sea of malpractice, see the forecasting concepts and tools reviewed in Nate Silver, The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t (New York: Penguin Press, 2012); J. Scott Armstrong, ed., Principles of Forecasting: A Handbook for Researchers and Practitioners (Boston: Kluwer, 2001); and Bruce Bueno de Mesquita, The Predictioneer’s Game (New York: Random House, 2009). Expanding these islands has proven hard. There is often little transfer of classroom statistical concepts, like regression toward the mean, to problems that students later encounter in life. See D. Kahneman and A. Tversky, “On the Study of Statistical Intuitions,” Cognition 11 (1982): 123–41. This poses a big challenge for the efforts of the Good Judgment Project to train people to think like superforecasters. "

Chapter 2: Illusions of Knowledge ("the psychology that convinces us we know things we really don’t")

p. 24-41: even experts, including people who are experts both in their subject matter and in human irrationality, are fooled by Kahneman and Tversky's System 1 ('blink') irrationality.

p. 35: one source of irrationality: the availability heuristic

p. 36-37: one source of irrationality: the urge to explain

p. 39: "...declarations of high confidence mainly tell you that an individual has constructed a coherent story in their mind, not necessarily that the story is true." -- Daniel Kahneman, Thinking, Fast and Slow, p. 212

p. 41: '"blink" versus "think"... is a false dichotomy'. The real question is how "to blend them in evolving situations".

p. 43: intuition is more reliable in some contexts than others. When "you work in a world full of valid cues you can unconsciously register for future use", intuition is more reliable. Example: a firefighter guessing whether a building is about to collapse. Example of when intuition is unreliable: predicting the stock market.

p. 43: respect your intuition, but then double-check it, and be prepared to accept the opposite result if you are wrong

p. 44: today, forecasting is like medicine was before randomized controlled trials, and statistical analysis of them, were applied to check whether treatments actually had any efficacy

footnote 14 to chapter 2: "If you know cognitive psychology, you know that the heuristics-and-biases school of thought has not gone unchallenged. Skeptics are impressed by how stunningly accurately System 1 can perform. People automatically and seemingly optimally synthesize meaningless photons and sound waves into language we infuse with meaning (Steven Pinker, How the Mind Works, New York: Norton, 1997). There is dispute over how often System 1 heuristics lead us astray (Gerd Gigerenzer and Peter Todd, Simple Heuristics that Make Us Smart, New York: Oxford University Press, 1999) and how hard it is to overcome WYSIATI illusions via training or incentives (Philip Tetlock and Barbara Mellers, “The Great Rationality Debate: The Impact of the Kahneman and Tversky Research Program,” Psychological Science 13, no. 5 [2002]: 94–99). Psychology has yet to piece together the mosaic. It is, however, my view that the heuristics-and-biases perspective still provides the best first-order approximation of the errors that real-world forecasters make and the most useful guidance on how to help forecasters bring their error rates down."

Chapter 3: Keeping score ("what it takes to test forecasting as rigorously as modern medicine tests treatments")

p. 55: a US CIA analyst had a team that came to consensus on a forecast of a 'serious possibility' of something. Then one of them discovered that the people reading their report thought that meant a low probability, whereas they thought it meant a high probability. Then they asked the others on their team and found out that they disagreed amongst themselves on what 'serious possibility' meant! A similar miscommunication happened when US President Kennedy asked for an assessment of whether the Bay of Pigs invasion would work; the military said it had a 'fair chance' of success, by which they meant a low probability but which Kennedy read as a high probability (the mission failed, to disastrous effect). This shows that you need to use numerical probabilities, not English words.

p. 57: but it's hard to get people to use numerical probabilities, because most people will judge a forecaster as 'wrong' whenever the forecast ends up on the wrong side of 50% (eg if they say something has a 40% chance and then it happens). This is illogical because you can't judge a probabilistic forecaster on a single instance; you need many instances. But it happens. As a result, there is a high political cost to making numerical forecasts, even if you are a good forecaster. Using English words allows people to maintain ambiguity about which side of 50% they are actually predicting.
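(my note: a small sketch of why one outcome can't show a 40% forecast to be 'wrong', while a track record can. The function name `calibration_table` and the sample record are made up for illustration; they are not from the book.)

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability and report how often the event happened.

    forecasts: probabilities (0..1) that the event happens
    outcomes:  0/1 flags, 1 if the event happened
    Returns {stated probability: (number of forecasts, observed frequency)}.
    """
    buckets = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        buckets[p].append(happened)
    return {p: (len(hits), sum(hits) / len(hits))
            for p, hits in sorted(buckets.items())}

# Hypothetical record: ten 40% forecasts, four of which came true,
# and five 90% forecasts, four of which came true.
forecasts = [0.4] * 10 + [0.9] * 5
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0]

table = calibration_table(forecasts, outcomes)
for p, (n, freq) in table.items():
    print(f"said {p:.0%} on {n} questions; event happened {freq:.0%} of the time")
```

In this made-up record the 40% forecasts 'missed' four times, yet they are exactly as reliable as advertised; it is the 90% forecasts that look slightly overconfident.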

p. 52, 53, 59, 60, 62, 64, 65, 68: in order to judge forecasters, we need each forecaster to make forecasts on many assertions, all of the forecasters we are comparing must forecast on the same assertions, the assertions must have explicitly defined terms and explicit timeframes, and we must use numbers for probability instead of weasel words like 'serious possibility'. In addition to comparing against other forecasters, we should compare against very simple algorithms, such as 'always predict no change' or 'predict the recent rate of change'. We want to assess both 'calibration' (how well do the probabilities given by the forecasters match the actual frequencies with which things happen?) and 'resolution' (someone who predicts an 80% chance that something will happen, and is right 80% of the time, has more resolution than someone who predicts a 60% chance that something will happen, and is right 60% of the time, even though they are both perfectly calibrated). The Brier score is a good way to do this. The Brier score is just the average squared error of the predictions (eg if you predict an 80% chance that something happens and it does, the prediction is 0.8 and the target is 1, so the squared error for that prediction is (.8 - 1)^2 = 0.04). The book's scale runs from 0 (perfect) to 2 (perfectly wrong), with 0.5 for random guessing, which corresponds to adding up the squared error over both outcomes (happens and doesn't happen); on that scale the example above scores 0.04 + 0.04 = 0.08. Note that some things are inherently more difficult to forecast than others (eg the result of a series of penny flips vs the result of a series of questions like 'what is 1+1'); so in order to know if a Brier score is good or bad you need to compare forecasters on the same set of assertions.
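(my note: a minimal sketch of the Brier score arithmetic. The function names are mine. `brier_single` is the one-outcome version used in the worked example in these notes; `brier_two_sided` sums over both outcomes, which matches the 0-to-2 scale the book quotes. The forecast records are invented.)

```python
def brier_single(forecasts, outcomes):
    """Mean squared error between forecast probability and outcome (range 0..1)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def brier_two_sided(forecasts, outcomes):
    """Squared error summed over both outcomes ("happens" and "does not happen"),
    then averaged over questions (range 0..2)."""
    return sum((p - o) ** 2 + ((1 - p) - (1 - o)) ** 2
               for p, o in zip(forecasts, outcomes)) / len(forecasts)

# The worked example from the notes: an 80% forecast on something that happened.
one_question_single = brier_single([0.8], [1])       # about 0.04
one_question_two_sided = brier_two_sided([0.8], [1])  # about 0.08

# Reference points on the two-sided scale.
coin_flip = brier_two_sided([0.5], [1])        # always saying 50%
perfectly_wrong = brier_two_sided([0.0], [1])  # 0% on something that happened

# Two perfectly calibrated forecasters (invented records): the one who can
# justify 80% forecasts scores better than the one stuck at 60%.
bold = brier_single([0.8] * 10, [1] * 8 + [0] * 2)
cautious = brier_single([0.6] * 10, [1] * 6 + [0] * 4)
print(one_question_single, one_question_two_sided, coin_flip, perfectly_wrong, bold, cautious)
```

The last two numbers show how a single score rewards resolution as well as calibration: both forecasters are calibrated, but the bolder one gets the lower (better) score.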

p. 68: "...the average expert was roughly as accurate as a dart-throwing chimpanzee. But...averages can obscure...In the EPJ results, there were two statistically distinguishable groups of experts. The first failed to do better than random guessing, and in their longer-range forecasts even managed to lose to the chimp. The second group beat the chimp, though not by a wide margin, and they still had plenty of reason to be humble. Indeed, they only barely beat simple algorithms like “always predict no change” or “predict the recent rate of change.”...why did one group do better than the other? It wasn’t whether they had PhDs or access to classified information. Nor was it what they thought—whether they were liberals or conservatives, optimists or pessimists. The critical factor was how they thought. One group tended to organize their thinking around Big Ideas, although they didn’t agree on which Big Ideas were true or false. Some were environmental doomsters (“We’re running out of everything”); others were cornucopian boomsters (“We can find cost-effective substitutes for everything”). Some were socialists (who favored state control of the commanding heights of the economy); others were free-market fundamentalists (who wanted to minimize regulation). As ideologically diverse as they were, they were united by the fact that their thinking was so ideological. They sought to squeeze complex problems into the preferred cause-effect templates and treated what did not fit as irrelevant distractions. Allergic to wishy-washy answers, they kept pushing their analyses to the limit (and then some), using terms like “furthermore” and “moreover” while piling up reasons why they were right and others wrong. As a result, they were unusually confident and likelier to declare things “impossible” or “certain.” Committed to their conclusions, they were reluctant to change their minds even when their predictions clearly failed. 
They would tell us, “Just wait.” The other group consisted of more pragmatic experts who drew on many analytical tools, with the choice of tool hinging on the particular problem they faced. These experts gathered as much information from as many sources as they could. When thinking, they often shifted mental gears, sprinkling their speech with transition markers such as “however,” “but,” “although,” and “on the other hand.” They talked about possibilities and probabilities, not certainties. And while no one likes to say “I was wrong,” these experts more readily admitted it and changed their minds....I dubbed the Big Idea experts “hedgehogs” and the more eclectic experts “foxes.” Foxes beat hedgehogs. And the foxes didn’t just win by acting like chickens, playing it safe with 60% and 70% forecasts where hedgehogs boldly went with 90% and 100%. Foxes beat hedgehogs on both calibration and resolution. Foxes had real foresight. Hedgehogs didn’t.”

(my note: given his examples, a better term for hedgehog might just be 'ideologue'; perhaps he is avoiding this because he means this to apply beyond politics and thinks that 'ideologue' has too much of a political connotation, but I still like it better than 'hedgehog' because I can never remember which is which in the fox vs hedgehog story)

p. 71 "the hedgehog’s one Big Idea doesn’t improve his foresight. It distorts it. And more information doesn’t help because it’s all seen through the same tinted glasses. It may increase the hedgehog’s confidence, but not his accuracy. That’s a bad combination. The predictable result? When hedgehogs in the EPJ research made forecasts on the subjects they knew the most about—their own specialties—their accuracy declined."

p. 72. "...the EPJ data...revealed an inverse correlation between fame and accuracy: the more famous an expert was, the less accurate he was.. editors, producers, and the public...go looking for hedgehogs, who just happen to be bad forecasters. Animated by a Big Idea, hedgehogs tell tight, simple, clear stories that grab and hold audiences. As anyone who has done media training knows, the first rule is “keep it simple, stupid”. Better still, hedgehogs are confident. With their one-perspective analysis, hedgehogs can pile up reasons why they are right – “furthermore”, “moreover” – without considering other perspectives and the pesky doubts and caveats they raise. And so, as EPJ showed, hedgehogs are likelier to say something definitely will or won’t happen. For many audiences, that’s satisfying...Foxes don’t fare so well in the media. They’re less confident, less likely to say something is “certain” or “impossible”, and are likelier to settle on shades of “maybe”. And their stories are complex, full of “howevers” and “on the other hands”, because they look at problems one way, then another, and another. This aggregation of many perspectives is bad TV. But it’s good forecasting. Indeed, it’s essential.”

p. 73 "In 1906 the legendary British scientist Sir Francis Galton went to a country fair and watched as hundreds of people individually guessed the weight that a live ox would be after it was “slaughtered and dressed.” Their average guess—their collective judgment—was 1,197 pounds, one pound short of the correct answer, 1,198 pounds. It was the earliest demonstration of a phenomenon popularized by—and now named for—James Surowiecki’s bestseller The Wisdom of Crowds. Aggregating the judgment of many consistently beats the accuracy of the average member of the group, and is often as startlingly accurate as Galton’s weight-guessers. The collective judgment isn’t always more accurate than any individual guess, however. In fact, in any group there are likely to be individuals who beat the group. But those bull’s-eye guesses typically say more about the power of luck—chimps who throw a lot of darts will get occasional bull’s-eyes—than about the skill of the guesser. That becomes clear when the exercise is repeated many times. There will be individuals who beat the group in each repetition, but they will tend to be different individuals. Beating the average consistently requires rare skill."

p.73 "Some reverently call it the miracle of aggregation but it is easy to demystify. The key is recognizing that useful information is often dispersed widely, with one person possessing a scrap, another holding a more important piece, a third having a few bits, and so on. ...Hundreds of people added valid information, creating a collective pool far greater than any one of them possessed. Of course they also contributed myths and mistakes, creating a pool of misleading clues as big as the pool of useful clues. But there was an important difference between the two pools. All the valid information pointed in one direction—toward 1,198 pounds—but the errors had different sources and pointed in different directions. Some suggested the correct answer was higher, some lower. So they canceled each other out."
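(my note: a toy simulation of the cancelling-errors argument. Everything here is an assumption for illustration: I model each guess as the true weight plus independent, unbiased Gaussian noise, and the crowd size and noise level are made up. If the errors shared a common bias, averaging would not remove it.)

```python
import random
import statistics

def simulate_crowd(true_value=1198, n_guessers=800, noise_sd=75, seed=0):
    """Return (error of the crowd's average guess, average error of an individual guess)."""
    rng = random.Random(seed)
    # Each guess = the truth plus that person's own independent error.
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(n_guessers)]
    crowd_error = abs(statistics.mean(guesses) - true_value)
    individual_error = statistics.mean([abs(g - true_value) for g in guesses])
    return crowd_error, individual_error

crowd_error, individual_error = simulate_crowd()
print(f"crowd average is off by {crowd_error:.1f} lb; "
      f"a typical individual is off by {individual_error:.1f} lb")
```

Because the individual errors point in different directions, they mostly cancel in the average, so the crowd's error comes out far smaller than the typical individual's.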

p 74 "How well aggregation works depends on what you are aggregating. Aggregating the judgments of many people who know nothing produces a lot of nothing...aggregating the judgments of an equal number of people who know lots about lots of different things is most effective... Aggregations of aggregations can also yield impressive results. A well-conducted opinion survey aggregates a lot of information about voter intentions, but combining surveys—a “poll of polls”—turns many information pools into one big pool. That’s the core of what Nate Silver, Sam Wang, and other statisticians did in the presidential election of 2012. And a poll of polls can be further aggregated with other data sources. PollyVote is a project of an academic consortium that forecasts presidential elections by aggregating diverse sources, including election polls, the judgments of a panel of political experts, and quantitative models developed by political scientists. In operation since the 1990s, it has a strong record, often sticking with the eventual winner when the polls turn and the experts change their minds."

p 74 "Now look at how foxes approach forecasting. They deploy not one analytical idea but many and seek out information not from one source but many... they aggregate."

p 75 "Consider a guess-the-number game in which players must guess a number between 0 and 100. The person whose guess comes closest to two-thirds of the average guess of all contestants wins. That’s it. And imagine there is a prize: the reader who comes closest to the correct answer wins a pair of business-class tickets for a flight between London and New York. The Financial Times actually held this contest in 1997, at the urging of Richard Thaler, a pioneer of behavioral economics. If I were reading the Financial Times in 1997, how would I win those tickets? I might start by thinking that because anyone can guess anything between 0 and 100 the guesses will be scattered randomly. That would make the average guess 50. And two-thirds of 50 is 33. So I should guess 33. At this point, I’m feeling pretty pleased with myself. I’m sure I’ve nailed it. But before I say “final answer,” I pause, think about the other contestants, and it dawns on me that they went through the same thought process as I did. Which means they all guessed 33 too. Which means the average guess is not 50. It’s 33. And two-thirds of 33 is 22. So my first conclusion was actually wrong. I should guess 22. Now I’m feeling very clever indeed. But wait! The other contestants also thought about the other contestants, just as I did. Which means they would have all guessed 22. Which means the average guess is actually 22. And two-thirds of 22 is about 15. So I should … See where this is going? Because the contestants are aware of each other, and aware that they are aware, the number is going to keep shrinking until it hits the point where it can no longer shrink. That point is 0. So that’s my final answer. And I will surely win. My logic is airtight. And I happen to be one of those highly educated people who is familiar with game theory, so I know 0 is called the Nash equilibrium solution. QED. The only question is who will come with me to London. Guess what? I’m wrong. 
In the actual contest, some people did guess 0, but not many, and 0 was not the right answer. It wasn’t even close to right. The average guess of all the contestants was 18.91, so the winning guess was 13. How did I get this so wrong? It wasn’t my logic, which was sound. I failed because I only looked at the problem from one perspective—the perspective of logic. Who are the other contestants? Are they all the sort of people who would think about this carefully, spot the logic, and pursue it relentlessly to the final answer of 0? If they were Vulcans, certainly. But they are humans. Maybe we can assume that Financial Times readers are a tad smarter than the general public, and better puzzle solvers, but they can’t all be perfectly rational. Surely some of them will be cognitively lazy and fail to realize that the other contestants are working through the problem just as they are. They will settle on 33 as their final answer. Maybe some others will spot the logic and get to 22, but they may not keep thinking, so they will stop there. And that’s just what happened—33 and 22 were the most popular answers."
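(my note: the chain of reasoning in the quote, written out as arithmetic. The function name is mine; the starting point of 50 and the actual average of 18.91 come from the quote.)

```python
def levels_of_reasoning(start=50.0, steps=5):
    """Each level assumes everyone else stopped one level earlier,
    so the best guess is two-thirds of the previous level's guess."""
    guesses = [start]
    for _ in range(steps):
        guesses.append(guesses[-1] * 2 / 3)
    return guesses

levels = levels_of_reasoning()
rounded_levels = [round(g) for g in levels]
print(rounded_levels)  # [50, 33, 22, 15, 10, 7], heading toward 0

# What actually won: two-thirds of the real average guess.
actual_average = 18.91
winning_target = actual_average * 2 / 3
print(round(winning_target))  # 13
```

The iteration only reaches 0 if every contestant follows it all the way down; the real average of 18.91 sits between the second and third levels, which is the point of the story.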

p79 "Stepping outside ourselves and really getting a different view of reality is a struggle. But foxes are likelier to give it a try. Whether by virtue of temperament or habit or conscious effort, they tend to engage in the hard work of consulting other perspectives."

In other words, across several quotes he seems to be saying that superforecasters consider multiple points of view and multiple sources of information. This seems to include directly considering multiple points of view, trying not to assume that all of the relevant actors think like they do, and possibly seeking guesses from multiple ideological camps and aggregating them.

footnote 22 to Chapter 3: "... If a student is told to speak in support of a Republican candidate, an observer will tend to see the student as pro-Republican even if the student only did what she was told to do—and even if the observer is the one who gave the order! Stepping outside ourselves and seeing things as others do is that hard. See Lee Ross, “The Intuitive Psychologist and His Shortcomings: Distortions in the Attribution Process,” in Advances in Experimental Social Psychology, ed. Leonard Berkowitz, vol. 10 (New York: Academic Press, 1977), 173–220; Daniel T. Gilbert, “Ordinary Personology,” in The Handbook of Social Psychology, vol. 2, ed. Daniel T. Gilbert, Susan T. Fiske, and Gardner Lindzey (New York: Oxford University Press, 1998): 89–150. "

footnote 23 to Chapter 3: "...we observed markedly less overconfidence among forecasters in the IARPA tournaments with open competition and public leaderboards than we did in the earlier EPJ research that guaranteed all forecasters anonymity. One result: the hedgehog-fox distinction mattered far less in the IARPA tournaments."

Chapter 4: Superforecasters (IARPA's forecasting tournament)

p 90 "get a couple of hundred ordinary people to forecast geopolitical events. You see how often they revise their forecasts and how accurate those forecasts prove to be and use that information to identify the forty or so who are the best. Then you have everyone make lots more forecasts. This time, you calculate the average forecast of the whole group—“the wisdom of the crowd”—but with extra weight given to those forty top forecasters. Then you give the forecast a final tweak: You “extremize” it, meaning you push it closer to 100% or zero. If the forecast is 70% you might bump it up to, say, 85%. If it’s 30%, you might reduce it to 15% (9)" (footnote 9: "Credit for this insight goes to two colleagues at the University of Pennsylvania, Lyle Ungar and Jonathan Baron. Lyle is responsible for all algorithms deployed by our project, with the exception of “L2E” developed by David Scott at Rice University.")

p 91 "the extremizing tweak is based on a pretty simple insight: When you combine the judgments of a large group of people to calculate the “wisdom of the crowd” you collect all the relevant information that is dispersed among all those people. But none of those people has access to all that information. One person knows only some of it, another knows some more, and so on. What would happen if every one of those people were given all the information? They would become more confident— raising their forecasts closer to 100% or zero. If you then calculated the “wisdom of the crowd” it too would be more extreme. Of course it’s impossible to give every person all the relevant information—so we extremize to simulate what would happen if we could."
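(my note: a sketch of "weighted average, then extremize". The book gives only the 70% -> 85% and 30% -> 15% examples, not a formula. The power transform below and the exponent a=2 are my assumptions, chosen because they roughly reproduce those examples; the forecasts and weights are invented, and this is not necessarily the algorithm the project deployed.)

```python
def weighted_average(forecasts, weights):
    """Average the forecasts, giving some forecasters more weight than others."""
    return sum(p * w for p, w in zip(forecasts, weights)) / sum(weights)

def extremize(p, a=2.0):
    """Push a probability away from 50%, toward 0 or 1.

    a > 1 makes the forecast more extreme; a = 1 leaves it unchanged.
    """
    return p ** a / (p ** a + (1 - p) ** a)

# Invented example: four forecasters, the third is a top performer with triple weight.
forecasts = [0.60, 0.70, 0.80, 0.65]
weights = [1, 1, 3, 1]

crowd = weighted_average(forecasts, weights)
final = extremize(crowd)
print(f"weighted crowd forecast {crowd:.3f}, extremized {final:.3f}")

# The book's illustrative numbers, for comparison.
print(f"70% becomes {extremize(0.70):.1%}, 30% becomes {extremize(0.30):.1%}")
```

The transform is symmetric around 50%, so a forecast of exactly 50% stays put and forecasts on either side move outward by matching amounts.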

p 92 "In my EPJ research, I had asked experts to make only one forecast per question and scored it later. By contrast, the IARPA tournament encouraged forecasters to update their forecasts in real time. So if a question with a closing date six months in the future opened, a forecaster could make her initial judgment—say, a 60% chance the event will happen by the six-month deadline—then read something in the news the next day that convinces her to move her forecast to 75%. For scoring purposes, those will later be counted as separate forecasts. ... Over four years, nearly five hundred questions about international affairs were asked of thousands of GJP’s forecasters, generating well over one million judgments about the future. "

p 91 the forecasts "...beat those of every other group and method available, often by large margins."

p 91 the forecasts "...even beat those of professional intelligence analysts inside the government who have access to classified information..." (the authors of Superforecasting don't claim to have been told this directly; rather, this assertion is based on an article by a journalist who says a source told him this: p 95 "in November 2013, the Washington Post editor David Ignatius reported that “a participant in the project” had told him that the superforecasters “performed about 30 percent better than the average for intelligence community analysts who could read intercepts and other secret data.”")

p 91 another example of a superforecaster: "Doug Lorch doesn’t look like a threat to anyone. He looks like a computer programmer, which he was, for IBM. He is retired now. He lives in a quiet neighborhood in Santa Barbara with his wife, an artist who paints lovely watercolors. His Facebook avatar is a duck. Doug likes to drive his little red convertible Miata around the sunny streets, enjoying the California breeze, but that can only occupy so many hours in the day. Doug has no special expertise in international affairs, but he has a healthy curiosity about what’s happening. He reads the New York Times. He can find Kazakhstan on a map. So he volunteered for the Good Judgment Project. Once a day, for an hour or so, his dining room table became his forecasting center, where he opened his laptop, read the news, and tried to anticipate the fate of the world. In the first year, Doug answered 104 questions like “Will Serbia be officially granted European Union candidacy by 31 December 2011” and “Will the London Gold Market Fixing price of gold (USD per ounce) exceed $1,850 on 30 September 2011?”"

p 93 how Lorch scored: "At the end of the first year, Doug’s overall Brier score was 0.22, putting him in fifth spot among the 2,800 competitors in the Good Judgment Project. Remember that the Brier score measures the gap between forecasts and reality, where 2.0 is the result if your forecasts are the perfect opposite of reality, 0.5 is what you would get by random guessing, and 0 is the center of the bull’s-eye. So 0.22 is prima facie impressive, given the difficulty of the questions. Consider this one, which was first asked on January 9, 2011: “Will Italy restructure or default on its debt by 31 December 2011?” We now know the correct answer is no. To get a 0.22, Doug’s average judgment across the eleven-month duration of the question had to be no at roughly 68% confidence—not bad given the wave of financial panics rocking the eurozone during this period. And Doug had to be that accurate, on average, on all the questions. In year 2, Doug joined a superforecaster team and did even better, with a final Brier score of 0.14, making him the best forecaster of the 2,800 GJP volunteers. He also beat by 40% a prediction market in which traders bought and sold futures contracts on the outcomes of the same questions. He was the only person to beat the extremizing algorithm. And Doug not only beat the control group’s “wisdom of the crowd,” he surpassed it by more than 60%..."
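(my note: checking the numbers in this quote. I assume the 0-to-2 Brier scale described in the quote and, as a simplification, a forecaster who holds one constant confidence on the outcome that actually occurs. The function names are mine.)

```python
import math

def brier_for_constant_confidence(p_correct):
    """Two-outcome Brier score when you always put p_correct on what actually happens."""
    return 2 * (1 - p_correct) ** 2

def confidence_needed(brier):
    """Invert the formula above: constant confidence that yields a given Brier score."""
    return 1 - math.sqrt(brier / 2)

year1 = confidence_needed(0.22)  # Doug's year-1 score
year2 = confidence_needed(0.14)  # Doug's year-2 score
print(f"0.22 needs about {year1:.0%} on the right answer; 0.14 needs about {year2:.0%}")

# The reference points given in the quote.
print(brier_for_constant_confidence(0.5))  # random guessing
print(brier_for_constant_confidence(0.0))  # perfect opposite of reality
```

Under this simplification 0.22 corresponds to about 67% confidence on the right answer, in line with the "roughly 68%" in the quote; a forecast that moves around over the life of a question would need a somewhat higher average confidence to reach the same score.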

p 94 how other superforecasters scored: in addition to Doug Lorch and Bill Flack, "There were 58 others among the 2,800 volunteers who scored at the top of the charts in year 1. They were our first class of superforecasters. At the end of year 1, their collective Brier score was 0.25, compared with 0.37 for all the other forecasters--and that gap grew in later years so that by the end of the four-year tournament, superforecasters had outperformed regulars by over 60%. Another gauge of how good superforecasters were is how much further they could see into the future. Across all four years of the tournament, superforecasters looking out three hundred days were more accurate than regular forecasters looking out one hundred days. In other words, regular forecasters needed to triple their foresight to see as far as superforecasters."

p 96 through p 104 then ask whether the superforecasters were simply lucky, since in any large population some people will do well by chance alone. The authors reason that if the results were due to chance, the set of top performers would change from year to year (which they label 'regression to the mean'); that is, forecast scores would show a low correlation across years.
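
(My own illustration, not from the book.) A toy simulation of the luck-versus-skill argument, assuming each forecaster's yearly score is a mix of a fixed "skill" term and a freshly drawn "luck" term, both standard normal. The model, the function name, and the parameters are all my assumptions. It is not the authors' analysis and is not meant to reproduce their numbers. It only shows the qualitative point: with pure luck almost nobody stays in the top 2% the following year, while with a substantial skill component many more do.

```python
import random

def retention_rate(skill_share, n=20000, top_fraction=0.02, seed=0):
    """Fraction of year-1 top performers who are still on top in year 2.

    Each simulated forecaster's yearly score is
        sqrt(skill_share) * skill + sqrt(1 - skill_share) * luck
    where skill is fixed across years and luck is redrawn each year,
    so the year-to-year correlation of scores is about skill_share.
    Higher scores are better in this toy model (unlike Brier scores).
    """
    rng = random.Random(seed)
    w_skill = skill_share ** 0.5
    w_luck = (1.0 - skill_share) ** 0.5
    skill = [rng.gauss(0, 1) for _ in range(n)]
    year1 = [w_skill * s + w_luck * rng.gauss(0, 1) for s in skill]
    year2 = [w_skill * s + w_luck * rng.gauss(0, 1) for s in skill]

    k = int(n * top_fraction)

    def top(scores):
        # Indices of the k best-scoring forecasters.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        return set(ranked[:k])

    return len(top(year1) & top(year2)) / k

pure_luck = retention_rate(0.0)      # year-to-year correlation about 0
mostly_skill = retention_rate(0.65)  # year-to-year correlation about 0.65
print(f"pure luck: {pure_luck:.0%} stay on top; correlation 0.65: {mostly_skill:.0%}")
```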

p 101 "...regression to the mean is an indispensable tool for testing the role of luck in performance. “Slow reversion is consistent with activities dominated by skill,” Mauboussin noted, “while rapid reversion comes from luck being the more dominant influence.” (15)" (footnote 15: "Michael J. Mauboussin, The Success Equation: Untangling Skill and Luck in Business, Sports, and Investing (Boston: Harvard Business Review Press, 2012), p. 73.")

p 103 "[In] years 2 and 3 we saw the opposite of regression to the mean: the superforecasters as a whole, including Doug Lorch, actually increased their lead over all other forecasters."

p 103 "But that result should make attentive readers suspicious. It suggests there was little or no luck behind the superforecasters’ results.... If chance is playing a significant role, why aren’t we observing significant regression of superforecasters as a whole toward the overall mean? An offsetting process must be pushing up superforecasters’ performance numbers. And it’s not hard to guess what that was: after year 1, when the first cohort of superforecasters was identified, we congratulated them, anointed them “super,” and put them on teams with fellow superforecasters. Instead of regressing toward the mean, their scores got even better. This suggests that being recognized as “super” and placed on teams of intellectually stimulating colleagues improved their performance enough to erase the regression to the mean we would otherwise have seen. In years 3 and 4, we harvested fresh crops of superforecasters and put them to work in elite teams. That gave us more apples-to-apples comparisons. The new cohorts continued to do as well or better than they did in the previous year, again contrary to the regression hypothesis."

p 104 "The correlation between how well individuals do from one year to the next is about 0.65, modestly higher than that between the heights of fathers and sons. So we should still expect considerable regression toward the mean. And we observe just that. Each year, roughly 30% of the individual superforecasters fall from the ranks of the top 2% next year. But that also implies a good deal of consistency over time: 70% of superforecasters remain superforecasters. The chances of such consistency arising among coin-flip guessers (where the year-to-year correlation is 0) is less than 1 in 100,000,000, but the chances of such consistency arising among forecasters (where year-to-year correlation is 0.65) is far higher, about 1 in 3. (17)" Footnote 17: "And approximately 90% of all “active” superforecasters, those answering at least fifty questions per year, landed in the top 20% performance category--so, when they did fall, they rarely fell far. This suggests that the skill/luck ratio for superforecasters may well be greater than that for regular forecasters. Exactly estimating skill/luck ratios is tricky, however. The values shift across samples of forecasters, periods of history, and types of questions. If I had to hazard a guess for just the active superforecasters across all four years, it would be at minimum 60/40 and possibly as high as 90/10."
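
(My own illustration, not from the book.) A short sketch of what a year-to-year correlation of 0.65 implies for regression toward the mean, under the textbook assumption that standardized scores in the two years are jointly normal, so the expected score next year is the correlation times this year's score. The function name and the choice of the top-2% boundary as the starting point are mine. In this idealized model, someone sitting exactly at the top-2% boundary is expected to land at roughly the 90th percentile the next year: a real fall, but not a far one.

```python
from statistics import NormalDist

def expected_next_year_z(z_this_year, correlation):
    """Expected standardized score next year: E[z2 | z1] = r * z1.

    Assumes standardized scores in both years are jointly normal.
    """
    return correlation * z_this_year

std_normal = NormalDist()                  # mean 0, standard deviation 1
cutoff_z = std_normal.inv_cdf(0.98)        # z-score at the top-2% boundary
next_z = expected_next_year_z(cutoff_z, 0.65)
next_percentile = std_normal.cdf(next_z)   # where that expected score ranks

print(f"this year z = {cutoff_z:.2f}, expected next year z = {next_z:.2f}, "
      f"percentile = {next_percentile:.0%}")
```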

p 104 "we can conclude that the superforecasters were not just lucky. Mostly, their results reflected skill."

Chapter 5: Supersmart? (no)

p109 "Regular forecasters scored higher on intelligence and knowledge tests than about 70% of the population. Superforecasters did better, placing higher than about 80% of the population. Note three things. First, the big jumps in intelligence and knowledge are from the public to the forecasters, not from forecasters to superforecasters. Second, although superforecasters are well above average, they did not score off-the-charts high and most fall well short of so-called genius territory, a problematic concept often arbitrarily defined as the top 1%, or an IQ of 135 and up. So it seems intelligence and knowledge help but they add little beyond a certain threshold.."

p 114 After explaining what Fermi estimation is and how to do it (express the unknown quantity as a formula in terms of other unknown quantities that are probably easier to guess, then guess those), the authors assert, without much evidence, that it is part of superforecasting: "I shared Levitin’s discussion of Fermi estimation with a group of superforecasters and it drew a chorus of approval. Sandy Sillman told me Fermi estimation was so critical to his job as a scientist working with atmospheric models that it became “a part of my natural way of thinking.” That’s a huge advantage for a forecaster, as we shall see. (7)" Footnote 7: "Superforecasters see Fermi-izing--daring to be wrong--as essential to what they do. Consider the superforecaster who goes by the screen name Cobbler. He is a software engineer in Virginia who knew little about Nigeria when he was asked in 2012 whether its government would enter into formal negotiations with the jihadist group Boko Haram. He began with the outside view and estimated the success rates of past efforts to negotiate with terrorist groups in general as well as with Boko Haram in particular. He averaged his two estimates (0% success rate of negotiating with Boko Haram and a 40% guess for negotiations with insurgencies in general). He then shifted to the inside view and assessed each side’s options. The government wants to stay on good terms with moderate Islamists who want to be power brokers between the government and the terrorists. Boko Haram might also have an interest at least in appearing to negotiate. He also noted many rumors of pending talks. But he balanced all that against the ferocity of Boko Haram--and guesstimated 30%. He then averaged the outside and inside views, yielding a 25% estimate, and scheduled that number to decline as the deadline approached. Net result of all these dare-to-be-wrong guesstimates: a top 10% Brier score on a question that triggered a lot of false-positive spiking in response to rumors of pending talks. Or consider superforecaster Regina Josephs, who tackled a question about the risks of another lethal outbreak of bird flu in China, a stretch for a political risk analyst whose free-spirited career includes stints in digital media and training the US Olympic women’s fencing team but no background in epidemiology. She also began with the outside view: how often has the casualty toll from bird flu exceeded the threshold value (about 80%)? But the flu season was already one-quarter over--so she cut that to 60%. She then took the inside view, noting improved public health policies and better warning indicators. All that brought her down to 40%, a number she lowered with time. Net result: not a spectacular score but better than 85% of forecasters. Or consider Welton Chang, a former military officer with combat experience in Iraq, who estimated the likelihood of Aleppo falling to the Free Syrian Army in 2013 by first taking the outside view: how long does it take even a clearly militarily superior attacker to take urban areas as large as Aleppo? Short answer: 10% to 20% base rate chance of success. Welton then took the inside view and found that the Free Syrian Army did not remotely qualify as a superior force, so he ratcheted the chances down. Net result: one of the top 5% of Brier scores on that question. It is amazing how many arbitrary assumptions underlie pretty darn good forecasts. Our choice is not whether to engage in crude guesswork; it is whether to do it overtly or covertly."
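
(My own illustration, not from the book.) The arithmetic in Cobbler's example, written out as a small Python sketch. The numbers are the ones quoted in the footnote; the variable names and the plain unweighted average are my reading of "averaged". The footnote does not say how the estimate was scheduled to decline toward the deadline, so that step is left out.

```python
def average(estimates):
    """Plain unweighted mean of a list of probability estimates (in percent)."""
    return sum(estimates) / len(estimates)

# Outside view: base rates quoted in the footnote.
boko_haram_base_rate = 0.0    # past negotiations with Boko Haram in particular
insurgency_base_rate = 40.0   # guess for negotiations with insurgencies in general
outside_view = average([boko_haram_base_rate, insurgency_base_rate])

# Inside view: Cobbler's case-specific guesstimate.
inside_view = 30.0

# Final forecast: average of the outside and inside views.
final_estimate = average([outside_view, inside_view])

print(outside_view, inside_view, final_estimate)  # 20.0 30.0 25.0
```

The point of writing it out is that every input is a rough guess, yet the combination is explicit and can be revised one piece at a time.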

Chapter 6: Superquants? (no)

Chapter 7: Supernewsjunkies? (no)

Chapter 8: Perpetual Beta

Chapter 9: Superteams

Chapter 10: The Leader's Dilemma ("an apparent contradiction between the demands of good judgment and effective leadership")

Chapter 11: Are They Really So Super? (response "to what I think are the two strongest challenges to my research")

Chapter 12: What's Next?