
Can the Wisdom of Crowds Help Fix Social Media’s Trust Issue?

A new study finds that small groups of laypeople can match or surpass the work of professional fact checkers—and they can do it at scale.

Social media misinformation outrage cycles tend to go through familiar phases. There’s the initial controversy over some misleading story that goes viral, then the platform’s response. Then someone asks “What about Fox News?” Finally, someone points out that the real problem, as far as social media is concerned, is the algorithms that determine who sees what. Those algorithms are primarily optimized for engagement, not accuracy. False and misleading stories can be more engaging than true ones, so absent some intervention by the platform, that’s what people are going to see. Fixing the algorithm, the argument goes, would be a better way to deal with the problem than taking down viral misinformation after the fact.

But fix it how? To change the ranking to favor true stories over false ones, say, the platforms would need a way to systematically judge everything that gets shared, or at least everything that gets shared a nontrivial amount. The current prevailing approach to false material involves punting the judgment to some outside party. Facebook, for example, partners with organizations like Factcheck.org to determine whether a given link merits a warning label. Twitter builds its fact-checks by linking to external sources. That could never be scaled up to the level of the algorithm. There aren’t enough professional fact checkers in the world to go over every article that might get posted on social media. Research has found that this creates an “implied truth effect”: If you only check a subset of content, some users will assume any article that isn’t labeled must therefore be accurate, even if it simply was never checked.

A new paper published in Science Advances suggests a promising solution to these issues: fact-checking by the crowd. In the study, a team of researchers led by David Rand, a professor at MIT, set out to test whether groups of random laypeople could approximate the results of professional fact checkers. Using a set of 207 articles that had been flagged for fact-checking by Facebook’s AI, they had three professional fact checkers score them on several dimensions to produce an overall score from 1 (totally false) to 7 (totally trustworthy). Then they recruited about 1,100 ordinary people from Amazon Mechanical Turk, divided them into groups equally balanced between self-identified Democrats and Republicans, and had them do the same thing, but with a twist: While the fact checkers read the entire article and did their own research to verify its claims, the laypeople saw only the headline and first sentence of each story.

Amazingly, that was enough for the crowd to match and even exceed the fact checkers’ performance.

To measure the crowd’s performance, the researchers first measured the correlation between the scores assigned by the three fact checkers themselves. (The correlation came out to .62—high, but far from uniform agreement. When judging stories on a binary true/false scale, however, at least two out of three fact checkers agreed with each other more than 90 percent of the time.) Then they measured the correlation between the crowd-assigned scores, on the one hand, and the average of the three fact checkers’ scores, on the other. The basic idea was that the average of the professionals’ ratings represents a better benchmark of accuracy than any one fact checker alone. And so if the laypeople ratings correlated with the average fact checker score as closely as the individual fact checkers agreed with each other, it would be fair to say that the crowd was performing as well as or better than a professional. The question: How many laypeople would you need to assemble to hit that threshold?

The study found that with a group of just eight laypeople, there was no statistically significant difference between the crowd’s performance and an individual fact checker’s. Once the groups grew to 22 people, they actually started significantly outperforming the fact checkers. (These numbers describe the results when the laypeople were told the source of the article; when they didn’t know the source, the crowd did slightly worse.) Perhaps most important, the lay crowds outperformed the fact checkers most dramatically on stories categorized as “political,” because those are the stories where the fact checkers were most likely to disagree with each other. Political fact-checking is really hard.
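To make that benchmark concrete, here is a minimal sketch of the comparison logic in Python. It is not the paper’s analysis code: the array shapes echo the setup (207 articles, three professionals, a crowd of 22), but every rating below is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ratings: 207 articles scored 1-7 by 3 professional fact checkers
# and by a crowd of 22 laypeople. Shapes mirror the study; values do not.
n_articles, n_checkers, crowd_size = 207, 3, 22
checker = rng.uniform(1, 7, size=(n_articles, n_checkers))
crowd = rng.uniform(1, 7, size=(n_articles, crowd_size))

def pearson(a, b):
    """Pearson correlation between two rating vectors."""
    return np.corrcoef(a, b)[0, 1]

# Benchmark: how well do the professionals agree with one another?
pairwise = [pearson(checker[:, i], checker[:, j])
            for i in range(n_checkers) for j in range(i + 1, n_checkers)]
checker_agreement = np.mean(pairwise)  # the paper reports roughly .62

# Test: how well does the crowd's average track the checkers' average?
crowd_vs_checkers = pearson(crowd.mean(axis=1), checker.mean(axis=1))

# If crowd_vs_checkers is at least as high as checker_agreement, the crowd
# is doing as well as an individual professional by this yardstick.
print(round(checker_agreement, 2), round(crowd_vs_checkers, 2))
```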

It might seem impossible that random groups of people could surpass the work of trained fact checkers—especially based on nothing more than knowing the headline, first sentence, and publication. But that’s the whole idea behind the wisdom of the crowd: get enough people together, acting independently, and their results will beat the experts’.

“Our sense of what is happening is people are reading this and asking themselves, ‘How well does this line up with everything else I know?’” said Rand. “This is where the wisdom of crowds comes in. You don’t need all the people to know what’s up. By averaging the ratings, the noise cancels out and you get a much higher resolution signal than you would for any individual person.”
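A toy simulation, entirely separate from the paper’s data, illustrates the averaging effect Rand describes: assume each article has some true trustworthiness score and each rater reports it with independent noise, and then watch what happens as the group grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented setup: each article has a "true" trustworthiness score on the
# 1-7 scale, and each layperson reports that score plus independent noise.
n_articles = 207
truth = rng.uniform(1, 7, size=n_articles)

def crowd_correlation(group_size, noise_sd=2.0):
    """Correlation between the crowd's average rating and the true score."""
    noise = rng.normal(0, noise_sd, size=(n_articles, group_size))
    ratings = np.clip(truth[:, None] + noise, 1, 7)
    return np.corrcoef(ratings.mean(axis=1), truth)[0, 1]

# Individual raters are noisy; averaging more of them tracks the truth
# more closely, which is the "higher resolution signal" Rand describes.
for size in (1, 8, 22, 100):
    print(size, round(crowd_correlation(size), 2))
```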

This isn’t the same thing as a Reddit-style system of upvotes and downvotes, nor is it the Wikipedia model of citizen-editors. In those cases, small, nonrepresentative subsets of users self-select to curate material, and each one can see what the others are doing. The wisdom of crowds only materializes when groups are diverse and the individuals are making their judgments independently. And relying on randomly assembled, politically balanced groups, rather than a corps of volunteers, makes the researchers’ approach much harder to game. (This also explains why the experiment’s approach is different from Twitter’s Birdwatch, a pilot program that enlists users to write notes explaining why a given tweet is misleading.)

The paper’s main conclusion is straightforward: Social media platforms like Facebook and Twitter could use a crowd-based system to dramatically and cheaply scale up their fact-checking operations without sacrificing accuracy. (The laypeople in the study were paid $9 per hour, which translated to a cost of about $.90 per article.) The crowd-sourcing approach, the researchers argue, would also help increase trust in the process, since it’s easy to assemble groups of laypeople that are politically balanced and thus harder to accuse of partisan bias. (According to a 2019 Pew survey, Republicans overwhelmingly believe fact checkers “tend to favor one side.”) Facebook has already debuted something similar, paying groups of users to “work as researchers to find information that can contradict the most obvious online hoaxes or corroborate other claims.” But that effort is designed to inform the work of the official fact-checking partners, not augment it.

Scaled-up fact-checking is one thing. The far more interesting question is how platforms should use it. Should stories labeled false be banned? What about stories that might not have any objectively false information in them, but that are nonetheless misleading or manipulative?

The researchers argue that platforms should move away from both the true/false binary and the leave-it-alone/flag-it binary. Instead, they suggest that platforms incorporate “continuous crowdsourced accuracy ratings” into their ranking algorithms. Rather than drawing a single true/false cutoff and treating everything above it one way and everything below it another, platforms would factor the crowd-assigned score into how prominently a given link appears in user feeds. In other words, the less accurate the crowd judges a story to be, the more the algorithm downranks it.

“You want to be assigning content some score on this continuous slider of totally accurate to pants-on-fire false,” Rand said. “What I would do if I was them is, the worse it is the more you demote it. Rather than just flagging a few items and saying, ‘These things are false so we push them to the bottom of the pile and we leave everything else alone.’”
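The paper doesn’t prescribe a formula for that demotion, but a minimal sketch of what a proportional version could look like, with the function name, scaling, and weight all invented for illustration, might be:

```python
def rank_score(engagement: float, crowd_accuracy: float,
               accuracy_weight: float = 1.0) -> float:
    """Blend engagement with a crowd accuracy rating on the study's 1-7
    scale (1 = totally false, 7 = totally trustworthy). The function name,
    scaling, and weight are invented; the paper only argues for continuous
    demotion rather than a single flag/no-flag cutoff."""
    accuracy = (crowd_accuracy - 1) / 6          # normalize 1-7 to 0-1
    return engagement * accuracy ** accuracy_weight

# Two equally engaging links: one the crowd rates 6.5, one it rates 2.0.
# The second is pushed down the feed rather than removed outright.
print(rank_score(100.0, 6.5))  # ~91.7
print(rank_score(100.0, 2.0))  # ~16.7
```

The particular numbers don’t matter; the point is only that the crowd score enters the ranking continuously instead of triggering a single flag.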

Perhaps the most appealing part of this proposal, after its scalability, is its potential to tackle the enormous category of material shared online that isn’t technically false but is nonetheless misleading. In the experiment, participants didn’t just say whether an article was true or false; they were asked to score it along seven dimensions, including reliability, objectivity, and bias. That creates room for more subtle judgments that can place content on a spectrum of trustworthiness, rather than trying to police an elusive boundary between information and misinformation.

This approach would have its limitations. Because the experiment only looks at articles, it’s not clear how well the same approach would work for video content, which can be a major vector for viral falsehoods. It also doesn’t apply to posts that don’t include a link. The crowds in the study appear to match the results of professional fact checkers, but professional fact checkers make plenty of mistakes. Flawless ratings are impossible, but perhaps there’s an even better, scalable approach out there that hasn’t yet been tried. Plus, incorporating any metric of accuracy into ranking algorithms might look like giving unaccountable social platforms even more power over public discourse.

The thing is, platforms are already in the business of deciding which content to show. The criticism of the status quo, in which algorithms appear to prioritize engagement too heavily, implies that those algorithms should instead turn the dial up on some other metric. Presumably, that metric would be some version of quality. Accuracy is only one of many ways to measure quality, of course, but it’s an important one. It’s also a rare area of broad agreement: almost everyone agrees that it’s better for people to be exposed to trustworthy material; they just disagree over where the boundaries of each category lie. Which could be another argument for letting users be the ones to decide.

