AI Beat Humans at Reading! Maybe Not

Microsoft and Alibaba claimed software could read like a human. There's more to the story than that.

News spread Monday of a remarkable breakthrough in artificial intelligence. Microsoft and Chinese retailer Alibaba independently announced that they had made software that matched or outperformed humans on a reading-comprehension test devised at Stanford. Microsoft called it a “major milestone.” Media coverage amplified the claims, with Newsweek estimating “millions of jobs at risk.”

Those jobs seem safe for a while. Closer examination of the tech giants’ claims suggests their software hasn’t yet drawn level with humans, even within the narrow confines of the test used.

The companies based their boasts on scores for human performance provided by Stanford. But researchers who built the Stanford test, and other experts in the field, say that benchmark isn’t a good measure of how a native English speaker would score on the test; it was calculated in a way that favors machines over humans. A Microsoft researcher involved in the project says “people are still much better than machines” at understanding the nuances of language.

The milestone that wasn’t demonstrates the slipperiness of comparisons between human and machine intelligence. AI software is getting better all the time, spurring a surge of investment in research and commercialization. But claims from tech companies that they have beaten humans in areas such as understanding photos or speech come loaded with caveats.

In 2015, Google and Microsoft both announced that their algorithms had surpassed humans at classifying the content of images. The test used involves sorting photos into 1,000 categories, 120 of which are breeds of dog; that’s well suited to a computer, but tricky for humans. More generally, computers still lag adults and even small children at interpreting imagery, in part because they lack a common-sense understanding of the world. Google, for example, still blocks searches for “gorilla” in its Photos product to avoid the risk of again mislabeling photos of black people.

In 2016, Microsoft announced that its speech recognition was as good as humans’, calling it an “historic achievement.” A few months later, IBM reported that humans were better at the same test than Microsoft had initially measured. Microsoft made a new claim of human parity in 2017. So far, that claim still stands, but it is based on tests using hundreds of hours of telephone calls between strangers recorded in the 1990s, a relatively controlled environment. The best software still can’t match humans at understanding casual speech in noisy conditions, or when people speak indistinctly or with different accents.

In this week’s announcements, Microsoft and Alibaba said they had matched or beaten humans at reading and answering questions about a text. The claim was based on a challenge known as SQuAD, for Stanford Question Answering Dataset. One of its creators, professor Percy Liang, calls it a “fairly narrow” test of reading comprehension.

Machine-learning software that takes on SQuAD must answer 10,000 simple questions about excerpts from Wikipedia articles. Researchers build their software by analyzing 90,000 sample questions, with the answers attached.

Questions such as “Where do water droplets collide with ice crystals to form precipitation?” must be answered by highlighting words in the original text, in this case, “within a cloud.”
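
For readers curious about the mechanics, here is a minimal sketch of a single SQuAD record in Python, following the dataset’s published JSON layout; the passage is abridged for illustration, and the character offset is mine rather than copied from the dataset.

```python
# A sketch of one SQuAD record: a Wikipedia passage ("context") plus
# question/answer pairs, where each answer is a span of the passage
# identified by a character offset. Passage abridged for illustration.
record = {
    "context": ("Precipitation forms as smaller droplets coalesce via "
                "collision with other rain drops or ice crystals "
                "within a cloud."),
    "qas": [{
        "question": ("Where do water droplets collide with ice crystals "
                     "to form precipitation?"),
        "answers": [{"text": "within a cloud", "answer_start": 101}],
    }],
}

# Answers are literal spans of the text: recover one by slicing.
ans = record["qas"][0]["answers"][0]
start = ans["answer_start"]
print(record["context"][start:start + len(ans["text"])])  # within a cloud
```

Because every answer is a literal span of the passage, systems never have to compose a response in their own words, one reason Liang calls the test narrow.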

Early in January, Microsoft and Alibaba submitted models to Stanford that respectively got 82.65 and 82.44 percent of the highlighted segments exactly right. They were the first to edge ahead of the 82.304 percent score Stanford researchers had termed “human performance.”

But Liang and Pranav Rajpurkar, a grad student who helped create SQuAD, say the score assigned to humans wasn’t intended to be used for fine-grained or final comparisons between people and machines. And the benchmark is biased in favor of software, because humans and software are scored in different ways.

The test’s questions and answers were generated by showing Wikipedia excerpts to workers on Amazon’s Mechanical Turk crowdsourcing service. To be credited with a correct answer, a program has to exactly match one of the three answers crowd workers gave for each question.

The human-performance score Microsoft and Alibaba used as a benchmark was derived by assembling a kind of composite human from the Mechanical Turk answers. One of the three answers to each question was picked to play the test-taker; the other two were used as the “correct” responses it was checked against. Scoring human performance against two rather than three reference answers reduces the chance of a match, effectively handicapping humans compared to software.
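
To make that handicap concrete, here is a hedged sketch of SQuAD-style exact-match scoring; it is not Stanford’s official evaluation script (which also normalizes punctuation and articles before comparing), and the crowd answers are invented for illustration.

```python
# Sketch of SQuAD-style exact-match scoring, showing the asymmetry:
# machines are checked against all three crowd answers, while the
# composite human is one held-out answer checked against the other two.

def normalize(text):
    # Lowercasing alone keeps this illustration short; the official
    # metric also strips punctuation and articles.
    return text.lower().strip()

def exact_match(prediction, references):
    # Credit the prediction if it matches ANY reference answer.
    return any(normalize(prediction) == normalize(r) for r in references)

crowd_answers = ["within a cloud", "in a cloud", "within a cloud"]

# A machine's prediction competes against all three references...
print(exact_match("in a cloud", crowd_answers))    # True

# ...but the human stand-in is scored against only the other two,
# so a perfectly reasonable paraphrase can be marked wrong.
human_answer = crowd_answers[1]                    # "in a cloud"
remaining = crowd_answers[:1] + crowd_answers[2:]  # the other two
print(exact_match(human_answer, remaining))        # False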

Liang and Rajpurkar say one reason they designed SQuAD that way in 2016 was that, at the time, they didn’t intend to create a system to definitively adjudicate battles between humans and machines.

Nearly two years later, two multibillion-dollar companies chose to treat it like that anyway. Alibaba’s news release credited its software with “topping humans for the first time in one of the world’s most-challenging reading comprehension tests.” Microsoft’s said it had made “AI that can read a document and answer questions about it as well as a person.”

Using the Mechanical Turk workers as the standard for human performance also raises the question of how much workers paid the equivalent of $9 an hour care about getting the answers right.

Yoav Goldberg, a senior lecturer at Bar Ilan University in Israel, says the SQuAD human-performance scores substantially underestimate how a native English speaker likely would perform on a simple reading-comprehension test. The percentages are best thought of as a measure of the consistency of the crowdsourced questions and answers, he says. "This measures the quality of the dataset, not the humans," Goldberg says.

In response to questions from WIRED, Microsoft provided a statement from research manager Jianfeng Gao, saying that “with any industry standard, there are potential limitations and weaknesses implied.” He added that “overall, people are still much better than machines at comprehending the complexity and nuance of language.” Alibaba didn’t respond to a request for comment.

Rajpurkar of Stanford says Microsoft and Alibaba’s research teams should still be credited with impressive research results in a challenging area. He is also working on calculating a fairer version of the SQuAD human performance score. Even if machines come out on top now or in the future, mastering SQuAD would still fall a long way short of showing software can read like humans. The test is too simple, says Liang of Stanford. “Current methods are relying too much on superficial cues, and not understanding anything,” he says.

Software that defeats humans at games such as chess or Go can also be considered both impressive and limited. Go is staggeringly complex, with more valid board positions than there are atoms in the universe, yet its fixed rules make it a far more constrained arena than everyday language or vision; the best AI software still can’t beat humans at many popular videogames.

Oren Etzioni, CEO of the Allen Institute for AI, advises both excitement and sobriety about the prospects and capabilities of his field. “The good news is that on these narrow tasks, for the first time, we see learning systems in the neighborhood of humans,” he says. Narrowly talented systems can still be highly useful and profitable in areas such as ad targeting or home speakers. And humans are hopeless at many tasks that are easy for computers, such as searching large collections of text or performing numerical calculations.

For all that, AI still has a long way to go. “We also see results that show how narrow and brittle these systems are,” Etzioni says. “What we would naturally mean by reading, or language understanding, or vision is really much richer or broader.”
