Ever since President Donald Trump accused Google of rigging its search results against him, the company has denied having any political bias in its system.
Although individual Google employees lean liberal on the political spectrum, there is no proof that the search engine’s results are purposely skewed toward any particular ideology.
Other bias complaints have surfaced. Regulators and competitors like Yelp have criticized Google for surfacing its own services, like maps, job postings, business reviews and travel information, over information from other websites. Last year, the EU slapped Google with a $2.7 billion antitrust fine for its shopping results, while U.S. Attorney General Jeff Sessions is reportedly open to investigating whether tech companies, including Google, are stifling competition.
In an effort to demystify how it runs its search engine, the company invited CNBC to sit in on an internal meeting where search executives weighed one particular proposed change: whether to put images next to some kinds of search results.
The proposed change was small and hyper-specific, and Google’s decision was predictably data-driven. Ultimately, the meeting revealed both the grand complexity and the incremental simplicity of how Google shapes its search product.
Here’s what we learned.
People sometimes anthropomorphize Google search, believing that it “understands” their query, like when they type “movie volleyball island” and Google delivers results about Tom Hanks’ “Cast Away.”
But the search engine has no idea what those words mean: It’s merely looking for pages where those words, their synonyms, or even common misspellings appear, and surfacing the most relevant ones.
Google programs, known as web crawlers, scour the internet to gather information from hundreds of billions of webpages. Google then stores that data in a massive, constantly changing index, taking note of signals like freshness and where the page was created. When you enter something into the search bar, it’s fed through a set of rules and processes casually known as Google’s search “algorithm.” This process compares your query against information in the index and decides which pages to place at the top, all in a fraction of a second.
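At toy scale, the index-and-match step described above can be sketched as an inverted index: a map from each word to the pages that contain it. This is a drastic simplification for illustration only — Google's real index also weighs freshness, synonyms, misspellings and many other signals.

```python
from collections import defaultdict

def build_index(pages):
    """Toy inverted index: maps each word to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return pages containing every query term (a simple AND match)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical two-page "web" for the "movie volleyball island" example.
pages = {
    "castaway": "tom hanks survives on a desert island with a volleyball",
    "topgun": "tom cruise flies jets in this movie",
}
index = build_index(pages)
print(search(index, "volleyball island"))  # {'castaway'}
```

The point of the sketch is that no understanding of meaning is involved: the query matches "Cast Away" only because those literal words appear on its page.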
One of the ranking factors that set Google apart when it launched 20 years ago is PageRank, named for co-founder Larry Page. PageRank judges a page’s relevance by how many others link to it — the idea is if a lot of people on the web find a page useful enough to link to it, it’s probably more relevant than a page that everybody ignores. PageRank is still a factor that Google’s algorithm uses today, but there are many others as well.
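The textbook version of PageRank can be written in a few lines of power iteration over a link graph. This is the published 1998 algorithm, not Google's current production system, and the tiny graph below is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank via power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank score (scores sum to 1).
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly across all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny made-up web: 'c' is linked to by both 'a' and 'b', so it ranks highest.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Note how the score flows along links: a page's rank rises when pages that are themselves well-ranked link to it.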
The company intentionally doesn’t reveal all the factors steering its ranking system, in part because it doesn’t want people to try to use that information to game the system — there’s a lot of traffic and money at stake in appearing at the top of search results. This secrecy also helps Google stay ahead of potential competitors.
When Google is considering a search algorithm change, a team tests the proposed adjustment with a small percentage of real users to see how they interact with it in the wild, as well as with a group of contractors called “search quality raters.”
Google contracts about 10,000 of these raters around the world, and while they cannot directly affect search results, their opinions help Google’s search team evaluate whether a given tweak should go through or not. Raters typically see old and new results side by side, and determine which are better.
“Better” is not a purely subjective term. It’s defined by a published document of search quality rater guidelines, which describe how raters should judge a page that shows up in their results. Particular attention is paid to a page’s expertise, authoritativeness and trustworthiness.
“You can view the rater guidelines as where we want the search algorithm to go,” Ben Gomes, Google’s vice president of search, assistant and news, told CNBC. “They don’t tell you how the algorithm is ranking results, but they fundamentally show what the algorithm should do.”
Google in July made some significant changes to the guidelines that, among other things, required raters to consider the reputation of a page’s author. As a result, pages with no clear author may now be ranked as lower quality.
In 2017, Google ran 31,584 side-by-side experiments with its raters and subsequently launched 2,453 search changes. While changes can have enormous effects on how any given website is ranked — and thus the number of people who see it — regular Google search users often don’t notice the changes at all.
In the meeting that CNBC observed, the team had tested a new format for mobile searches that would display a photo from a webpage alongside its link. The hypothesis was that images like the one below would help users better determine which link would get them to the page most relevant to what they were looking for.
Google wouldn’t reveal how many queries it showed raters or real users but said it was enough to achieve “statistical significance.” For side-by-side experiments for raters, that typically means hundreds of queries. In this case, Google’s rater evaluation asked people if they thought that the images added next to the links helped them understand the results.
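One standard way to judge significance in a side-by-side test like this is a normal-approximation test of the raters' preference rate against the 50/50 "no preference" null. Google has not published the test it actually uses, so the function and numbers below are an illustrative assumption, not its method.

```python
import math

def preference_z_score(prefer_new, total):
    """Z-score of the observed preference rate against the 50/50 null.

    Uses the normal approximation to the binomial; the standard error
    under the null is sqrt(0.5 * 0.5 / total).
    """
    p_hat = prefer_new / total
    se = math.sqrt(0.25 / total)
    return (p_hat - 0.5) / se

# Hypothetical example: 330 of 600 side-by-side comparisons prefer
# the new results — a 55% preference rate.
z = preference_z_score(330, 600)
significant = abs(z) > 1.96  # 5% two-sided threshold
```

With a few hundred queries, even a modest preference like 55% clears the conventional 5% threshold — which is consistent with "hundreds of queries" being enough for statistical significance.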
The team presented their data to Gomes and Google fellow Pandu Nayak, who leads ranking. The whole process, from the introduction of the idea to the conclusion of the meeting, took roughly 20 minutes.
The team ran through various data points, like what percent of users clicked through a picture-link and then quickly clicked back (a bad sign), or whether there was a significant increase in the time until they made their first interaction with the results (also bad).
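The two data points described above can be sketched as simple aggregates over logged search sessions. The field names and the five-second threshold below are invented for illustration; Google's real instrumentation is not public.

```python
def bounce_rate(sessions, threshold_seconds=5.0):
    """Share of clicks where the user quickly returned to the results
    page (a "pogo-stick" back — a bad sign for the clicked result)."""
    clicks = [s for s in sessions if s.get("clicked")]
    if not clicks:
        return 0.0
    bounces = [s for s in clicks
               if s.get("time_on_page", float("inf")) < threshold_seconds]
    return len(bounces) / len(clicks)

def mean_time_to_first_interaction(sessions):
    """Average seconds before the user first interacted with the results
    (a significant increase here is also a bad sign)."""
    times = [s["time_to_first_interaction"]
             for s in sessions if "time_to_first_interaction" in s]
    return sum(times) / len(times) if times else 0.0

# Hypothetical logged sessions.
sessions = [
    {"clicked": True, "time_on_page": 2.1, "time_to_first_interaction": 1.0},
    {"clicked": True, "time_on_page": 45.0, "time_to_first_interaction": 0.8},
    {"clicked": False, "time_to_first_interaction": 3.0},
]
```

An experiment arm would then be compared against the control arm on metrics like these, with a rise in either one counting against the change.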
They showed some examples of queries where the pictures weren’t helpful. The results for “Pomona College,” for example, provided generic pictures of students.
However, the data ultimately showed that 91 percent of the time, raters found that image results were useful. In the live experiment, real users clicked through the pictures, too. Weighing that positive feedback against a slight increase in latency (how long it took the results page to load), Gomes and Nayak approved the tweak.
“For any change, the question is always, on balance, is it more useful than not?” Gomes said. In this case, it was.
The meeting wasn’t exciting. There were no passionate debates or philosophical explorations of whether Google should be showing users more images. The data drove the decision.
That’s by design.
“We’ve got a rigorous process of testing things out,” Gomes said. “And we’re really driven by metrics — that’s the core of how we operate.”
Google has hard data to justify the changes it approves.
But its process for choosing which experiments to run in the first place is less straightforward.
Google listens to user feedback, including from big, ugly screw-ups, like when people discovered that Google was surfacing a white supremacist website as the first result for “Did the Holocaust happen?” When there’s a glaring problem, Google doesn’t just eliminate the bad search result and consider its work done. More often, it tries to figure out how to change both its algorithm and its rater guidelines to avoid similar mistakes.
Other times, ideas for algorithm changes come from broad company directives or priorities. For example, some employees have long argued that Google search results should be more personalized, Nayak said. Right now, there is very little search personalization and what exists is focused on a user’s location or immediate context from a prior search. (If you Googled something related to baseball followed by “The Giants,” the results wouldn’t surface the football team, for example.)
But after a lot of effort to test personalization, Google has found that it seldom actually improves results.
“A query a user comes with usually has so much context that the opportunity for personalization is just very limited,” Nayak says.
By not personalizing search results, Google has been able to escape a lot of the criticism that Facebook and Twitter have received for creating “filter bubbles,” where people see only information they were already predisposed to believe or like. (Google’s video product, YouTube, has not been able to avoid this criticism, particularly in how it recommends related videos. The two algorithms are totally separate and not created or maintained by the same team.)
Personalization could cause people to lose trust in Google, too. While Google doesn’t personalize most of its search rankings, its advertisements are extremely personalized because of the vast swathes of data it collects (Google allows users to manage privacy settings around what data it collects, but its methods have been misleading in the past).
In all, Google’s search results have come a long way from its original “10 blue links” to other websites. As voice search becomes increasingly important for Google and other big tech companies, it has relied more on its Knowledge Graph, a database of more than a billion entities with 70 billion connections between them, and on “featured snippets,” which surface answers extracted from webpages at the top of search results. There are much higher stakes to getting those answers wrong.
For all its user testing, Google knows mistakes will still appear, sometimes because of intentional vandalism, sometimes because of a problem with the algorithm, sometimes because results reflect societal biases.
“We are under no illusion that search is perfect,” Nayak said. “But we have an absolute commitment to addressing the challenges that we have and continuing to improve it. That’s what people are here to do.”