AI Can Pass Standardized Tests—But It Would Fail Preschool

Artificial intelligence researchers have long dreamed of building a computer as knowledgeable and communicative as the one in Star Trek, which could interact with humans in natural (i.e., human) language. Last week, we seemed to boldly go toward that ideal. The New York Times reported that a team at the Allen Institute for Artificial Intelligence (AI2) had achieved “an artificial-intelligence milestone.” AI2’s program, Aristo, not only passed, but excelled on a standardized eighth-grade science test. The machine, the Times heralded, “is ready for high school science. Maybe even college.”

Melanie Mitchell is professor of computer science at Portland State University and External Professor at the Santa Fe Institute. Her book, Artificial Intelligence: A Guide for Thinking Humans, will be published in October by Farrar, Straus, and Giroux.

Or maybe not. Aristo isn’t the first AI system to shine on a test designed to gauge human knowledge and reasoning abilities. In 2015 one system matched a four-year-old’s performance on an IQ test, prompting the BBC headline, “AI had IQ of four-year-old child.” Another group reported their system could solve SAT geometry questions “as well as the average American 11th-grade student.” More recently, Stanford researchers created a question-answering test that prompted the New York Post to announce that “AI systems are beating humans in reading comprehension.” The truth is that while these systems perform well on specific language-processing tests, taking those tests is all they can do. None comes anywhere close to matching humans in reading comprehension or the other general abilities the tests were designed to measure.

The problem is that today’s machines, which excel at certain narrow tasks, still lack what we might call common sense. This includes the vast, and mostly unconscious, background knowledge that we use to understand the situations we encounter and the language we communicate with. Common sense also includes our ability to apply this knowledge quickly and flexibly to new circumstances.

The goal of endowing machines with common sense is as old as the field of AI itself, and is, I would venture, AI’s hardest open problem. Beginning in the 1990s, research on common sense took a back seat to statistical, data-driven AI approaches—especially in the form of neural networks and “deep learning.” But researchers have recently found that deep learning systems lack the robustness and generality of human learning, primarily because they lack our broad knowledge and flexible reasoning capabilities. Giving machines humanlike common sense is now at the top of AI’s To-Do list.

Open-ended question-answering, like that of the Star Trek computer, is still too hard for current AI systems, so researchers make progress by creating programs that can perform well on “benchmarks”—particular datasets that represent a specific task. Aristo’s benchmark consists of a set of multiple-choice questions from the New York State Regents Exam in science. A sample question:

Which equipment will best separate a mixture of iron filings and black pepper?
(a) magnet (b) filter paper (c) triple-beam balance (d) voltmeter

Aristo’s creators believe that developing AI systems to answer such questions is one of the best ways to push the field forward. “While not a full test of machine intelligence,” they note, these questions “do explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge.”

Aristo is a complicated system that combines several AI methods. However, the component that accounts for almost all of the system’s success is a deep neural network that has been trained to be a so-called language model—a mechanism that, given a sequence of words, can predict what the next word will be. “I was driving way too fast when I was stopped by the …” What’s the next word? Maybe “police.” Probably not “grapefruit.” Given a sequence of words, a language model computes the probability that each of the hundreds of thousands of words in its vocabulary will be the next one in the sequence.
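To make the idea of next-word prediction concrete, here is a minimal sketch. It is not Aristo's model: Aristo uses a deep neural network that looks at the whole preceding sequence, while this toy counts only which word follows which in a four-sentence corpus invented for the illustration. The names `corpus`, `follow_counts`, and `next_word_probs` are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration); a real language
# model is trained on millions of documents.
corpus = [
    "i was stopped by the police",
    "she was stopped by the police",
    "he was stopped by the guard",
    "we ate the grapefruit",
]

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probs(prev_word):
    """Estimate P(next word | previous word) from raw counts."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs("the")
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

Even this crude counter ranks "police" above "grapefruit" after "the", simply because that pairing appears more often in its training text. No notion of traffic stops is involved.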

Aristo’s language model was trained on word sequences from millions of documents (including all of English Wikipedia). After training with this vast collection of English, the neural network has presumably learned some useful things about language in general. At this point the network can be “fine-tuned” to learn to answer multiple-choice questions. When it takes the Regents exam, its input is the question plus the four possible answers; the output is the probability that each answer is correct. The network returns the highest-probability answer as its guess.
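The final step, turning one score per answer into probabilities and returning the most probable option, can be sketched as follows. The use of a softmax is an assumption (a common choice, but not stated in the article), and the raw scores are made-up numbers standing in for whatever a fine-tuned network would produce.

```python
import math


def softmax(scores):
    """Convert real-valued scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_answer(options, scores):
    """Return the highest-probability option and all the probabilities."""
    probs = softmax(scores)
    best_index = max(range(len(options)), key=lambda i: probs[i])
    return options[best_index], probs


options = ["magnet", "filter paper", "triple-beam balance", "voltmeter"]
scores = [2.0, 0.5, -1.0, -1.5]  # hypothetical network outputs, not real data

answer, probs = pick_answer(options, scores)
print(answer)
```

Note that the selection step is only an argmax over four numbers. Everything interesting, or suspicious, happens in how those numbers are produced.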

Aristo was tested on 119 questions from the eighth-grade exam, and was correct on over 90 percent of them, a remarkable performance. It was also correct on over 83 percent of 12th-grade questions. While the Times reported that Aristo “passed the test,” the AI2 team noted that the actual tests New York students take include questions that refer to diagrams, as well as “direct answer” questions, neither of which Aristo was able to handle.

This is exciting progress, but we must keep in mind that a high score on a particular dataset does not always mean that a machine has actually learned the task its human programmers intended. Sometimes the data used to train and test a learning system has subtle statistical patterns—I’ll call these giveaways—that allow the system to perform well without any real understanding or reasoning.

For example, one neural-network language model, similar to the one Aristo uses, was reported in 2019 to capably determine whether one sentence logically implies another. However, the reason for the high performance was not that the network understood the sentences or their connecting logic; rather, it relied on superficial syntactic properties such as how much the words in one sentence overlapped those in the second sentence. When the network was given sentences for which it could not take advantage of these syntactic properties, its performance plummeted.
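A word-overlap giveaway of this kind is easy to sketch. The rule below is a hypothetical stand-in, not the network from that study: it guesses "implies" whenever nearly every word of the second sentence also appears in the first. The sentence pairs and the 0.9 threshold are invented for illustration.

```python
def word_overlap(premise, hypothesis):
    """Fraction of hypothesis words that also occur in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    shared = sum(1 for word in hypothesis_words if word in premise_words)
    return shared / len(hypothesis_words)


def overlap_heuristic(premise, hypothesis, threshold=0.9):
    """Guess that the premise implies the hypothesis if overlap is high."""
    return word_overlap(premise, hypothesis) >= threshold


# (premise, hypothesis, does the premise really imply the hypothesis?)
pairs = [
    ("the doctor visited the lawyer and the judge",
     "the doctor visited the lawyer", True),
    ("the doctor visited the lawyer",
     "the lawyer visited the doctor", False),
]

for premise, hypothesis, truth in pairs:
    guess = overlap_heuristic(premise, hypothesis)
    print(f"guess={guess} truth={truth} :: {hypothesis}")
```

The rule gets the first pair right and the second wrong: swapping "doctor" and "lawyer" leaves the overlap at 100 percent but reverses the meaning. A dataset made up mostly of pairs like the first would reward the shortcut with a high score.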

Dozens of papers have been published over the last few years revealing the existence of subtle giveaways in benchmark datasets used to evaluate machine-learning systems. This has led some researchers to question the extent to which deep learning systems are exhibiting “true understanding,” or merely responding to superficial cues in the data.

The Aristo team argued that its Regents Exam questions are less likely to be vulnerable to such giveaways than the more commonly used “crowdsourced” question-answering datasets. They note that “many of the benchmark questions intuitively appear to require reasoning to answer,” and that Aristo’s excellent performance “suggests that the machine has indeed learned something about language and the world, and how to manipulate that knowledge.”

But to what extent is reasoning, comprehension, or knowledge of science actually needed to answer these questions? For example, consider the sample question above. The Aristo team asserts, “To answer this kind of question robustly, it is not sufficient to understand magnetism. Aristo also needs to have some model of ‘black pepper’ and ‘mixture’ because the answer would be different if the iron filings were submerged in a bottle of water.”

I’ll offer a competing hypothesis: Given Aristo’s language model, no such knowledge or reasoning is needed to answer this specific question; instead, the language model will have captured statistical associations between words that allow it to answer the question without any real understanding whatsoever. To illustrate, consider the following four sentences.

1. Magnet will best separate a mixture of iron filings and black pepper.
2. Filter paper will best separate a mixture of iron filings and black pepper.
3. Triple-beam balance will best separate a mixture of iron filings and black pepper.
4. Voltmeter will best separate a mixture of iron filings and black pepper.

A language model can take each of these sentences as input, output the sentence’s “probability” (how well the sentence fits the word associations the model has learned), and choose the option with the highest probability. As a very rough simulation, I typed a version of each of these sentences into Google (making sure it found no exact matches) and looked at how many “hits” each received. Indeed, the sentence beginning with “magnet” got the most hits. My crude language model answered the question correctly without any intelligence other than word associations on the web.
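The same mechanism can be sketched in a few lines. The association strengths below are invented for illustration (they are neither search-engine hit counts nor values from Aristo), so the sketch shows only how association-based scoring picks an answer, not evidence that it would pick the right one. The names `association` and `association_score` are hypothetical.

```python
# Invented word-association strengths: (answer option, word) -> strength.
# These numbers are assumptions for illustration, not measurements.
association = {
    ("magnet", "iron"): 0.9,
    ("magnet", "filings"): 0.8,
    ("filter paper", "mixture"): 0.6,
    ("filter paper", "separate"): 0.5,
    ("triple-beam balance", "mixture"): 0.1,
    ("voltmeter", "iron"): 0.1,
}

sentence_rest = "will best separate a mixture of iron filings and black pepper"
options = ["magnet", "filter paper", "triple-beam balance", "voltmeter"]


def association_score(option, rest):
    """Sum the association between the option and each word that follows it."""
    return sum(association.get((option, word), 0.0) for word in rest.split())


scores = {option: association_score(option, sentence_rest) for option in options}
best = max(scores, key=scores.get)
print(best)
```

The scorer has no concept of magnetism, pepper, or mixtures. It selects "magnet" only because, in its table, that word is the one most strongly tied to "iron" and "filings".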
