GAMES PEOPLE PLAY

People are terrible judges of talent. Can algorithms do better?


On a recent visit to the Pymetrics headquarters in Manhattan’s Flatiron district, company co-founder and CEO Frida Polli was sitting on a couch, looking casual but chic in a puffy vest and long-sleeved shirt. Outside the glass doors of the meeting room was a standard startup scene: bright lights, open office, young people in hoodies and jeans, an Australian shepherd named Taco trotting around greeting visitors.

Polli has long blond hair and light blue eyes, which she tends to close periodically as she talks—a habit that makes her appear to be choosing her words so carefully that she needs to block out any distractions.

As a cognitive scientist turned entrepreneur who’s held fellowships with Harvard Medical School and the Massachusetts Institute of Technology, Polli possesses the kind of pedigree and gravitas that tends to open professional doors. She has the put-together but attainable look of a startup leader who is unafraid of hard work. But these are exactly the types of observations she wants to uproot from the frameworks we commonly use to evaluate people in the working world.

“Companies have a whole set of prejudices,” she says. “‘I want someone from Princeton, I want someone who’s worked in this industry before.’ Have you even reality-checked that those things are important?”

Polli has. And her conclusions helped turn her into a steadfast if unlikely messenger of the idea that resumes and personal polish are outdated ways of judging a person’s qualifications. Her findings also formed the basis of Pymetrics, the 140-employee, venture-backed company she co-founded in 2013 with Julie Yoo, a former colleague from MIT.

Pymetrics is one of a bevy of young businesses on a mission to overhaul the hiring process with the help of artificial intelligence. The AI piece of the elevator pitch is bound to raise eyebrows in an era rife with warnings about the inequalities that can be perpetuated by biased algorithms. Yet Polli argues that Pymetrics, which uses a combination of neuroscience-based tests and machine learning to match job-seekers with openings, can make hiring not only more efficient, but also more equitable.

“The whole idea behind Pymetrics is that instead of using a resume, you are looking at people’s cognitive, social, and emotional aptitudes,” Polli says. “It’s also much more future-facing and potential-oriented, rather than backwards-facing and sort of only talking about your past experiences. It’s a much more holistic, hopeful view of someone than, Oh, this is what you’ve done, and this is all you can do.”

Bias, in Polli’s view, is a human problem. True, it can pop up in the algorithms that humans create to sort through job applications. But it’s at least as much of a risk in the people who review resumes and conduct interviews, who are naturally prone to make unfair judgments based on everything from a person’s name (pdf) and gender to their appearance and speech patterns. Algorithms, Polli suggests, are at least more trainable.

“It’s hard to remove bias from algorithms, but it is possible,” she says. “It is not possible to remove bias from humans.”

Those assertions are up for debate. But at least one thing is clear: Companies love the idea of algorithms, which promise to evaluate talent at far greater speed and lower cost than regular flesh-and-blood recruiting and hiring processes allow. Pymetrics’ client list includes big names like Unilever, Nielsen, LinkedIn, Accenture, KraftHeinz, MasterCard, and Boston Consulting Group. Venture capital is betting on Pymetrics, too. The company’s primary backers are General Atlantic, Jazz Venture Partners, Khosla Ventures, Salesforce Ventures, and Workday Ventures, and it has thus far raised $56.6 million in funding. At the same time, the AI-driven hiring tools provided by companies like HireVue are also becoming increasingly widespread.

And so the salient question at the moment isn’t whether companies should use machine learning to filter job candidates. It’s already happening. More relevant is the matter of whether the talent revolution already underway is a fair one—and what more can be done to ensure that algorithms alleviate, rather than deepen, the longstanding problems in hiring.


Testing for potential

Polli’s decision to switch careers in her late 30s was partly about personal fulfillment. As a cognitive neuroscientist, she says, “I was really excited by the science we were doing, but not that excited about its lack of real-world application.”

Her decision to leave academia and enroll in Harvard’s MBA program was also born of necessity. “I was a postdoc at MIT, making $37,000 a year,” she recalls. After she and her husband divorced, he lost his job, which meant she had to figure out a way to support herself and her daughter on her own.

“I was in a position where I was like, I need to change,” she says. “I can’t do this for the long-term. Even though I was highly educated, I was in a situation where I was like, I need to adapt.”

Polli makes this point to underscore the fact that we’re all vulnerable to changes in circumstance that necessitate career shifts—whether that means a warehouse worker’s job getting automated or an executive position made redundant in a merger. Meanwhile, many big employers have long lists of job openings, in some cases seeking workers to do things that have only been made possible by recent advances in technology.

And so Polli argues that companies that discount job candidates because they lack a particular degree or haven’t held a certain title in the past are being both unjust and impractical. Resumes merely give a sense of what people have done in the past; job-seekers may indeed have other talents they simply haven’t had a chance to apply yet, or skills that are easily transferable to a new field, or a capacity to learn specific technical tasks on the job. But ferreting those things out is an inexact science at best.

That’s where Pymetrics comes in. When a company signs up as a client, its current employees take an online test meant to measure attributes like memory, attention span, altruism, skepticism, and risk appetite. Then Pymetrics identifies patterns among the top performers in various roles. People on the sales team might tend to be the high-risk, high-reward type, for example, while the web developers might be highly focused and detail-oriented.

That information provides a basis for Pymetrics’ algorithm to evaluate job-seekers who take the test as they apply for a given role. The employer gets a detailed overview explaining why Pymetrics has determined the candidate is or is not a good match for a particular job. The candidate, meanwhile, maintains ownership over their data—meaning that it won’t get shared elsewhere without their permission. If the candidate applies for another job with a company that also uses Pymetrics, however, they won’t have to take the same test over again.
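To make the mechanics concrete, here is a deliberately simplified sketch of the kind of matching the company describes: average the trait scores of a role’s current top performers, then score applicants by how closely their own profiles resemble that average. The trait names, numbers, and cosine-similarity measure below are illustrative assumptions, not Pymetrics’ actual model.

```python
# A simplified illustration of trait-profile matching as described above.
# Everything here (trait names, scores, the use of cosine similarity) is
# a hypothetical sketch, not Pymetrics' actual methodology.
import math

TRAITS = ["memory", "attention", "altruism", "skepticism", "risk_appetite"]

def role_profile(top_performers: list[list[float]]) -> list[float]:
    """Mean trait vector of a role's incumbent top performers."""
    return [sum(col) / len(col) for col in zip(*top_performers)]

def match_score(candidate: list[float], profile: list[float]) -> float:
    """Cosine similarity between a candidate's traits and the role profile."""
    dot = sum(c * p for c, p in zip(candidate, profile))
    return dot / (math.hypot(*candidate) * math.hypot(*profile))

# Hypothetical top salespeople: high risk appetite, middling elsewhere.
sales_team = [[0.5, 0.6, 0.4, 0.5, 0.9],
              [0.6, 0.5, 0.5, 0.4, 0.8]]
profile = role_profile(sales_team)

applicant = [0.55, 0.60, 0.45, 0.50, 0.85]  # one test-taker's results
print(f"match score: {match_score(applicant, profile):.2f}")  # ~0.99
```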


Pymetrics hasn’t entirely replaced resumes, Polli says. Most of the company’s clients still check out candidates’ job history during the early stages of the application process. “But what we hope to do is skew it so people are not over-indexing on resume factoids that, at the end of the day, don’t have a ton of predictive value,” she says.

Polli recalls one company that used Pymetrics to hire salespeople, and wound up bringing on a new employee who had previously worked as a hairdresser and had zero experience in their new area. The ex-hairdresser turned out to be a top performer. “It’s those types of diamonds in the rough, or people you wouldn’t be considering otherwise,” who stand to benefit most from a new process, Polli says. (Pymetrics is in the process of developing technology that would aim to reduce bias during interviews as well.)

Pymetrics is not alone in questioning the value of resumes. Google’s former senior vice president of people operations, Laszlo Bock, has said that GPA, academic test scores, and even obtaining a college degree had nothing to do with how well people performed at Google. And writing for the Harvard Business Review in 2014, three professors of finance and economics argued that because resumes highlight criteria like elite universities and previous work experience, “they’re biased toward applicants from more wealthy backgrounds. These families usually have better connections and networks, can provide better education opportunities, and can afford to pay reputable universities’ tuition fees.”

Ultimately, Pymetrics aims to have its product benefit companies, which get exposed to talented candidates who might not have otherwise caught their attention, and job-seekers, who stand a better chance of landing in a role that they’ll enjoy and succeed in. There’s no way, Polli argues, to game the Pymetrics test. That’s because there are no right and wrong answers; there’s only a spectrum on which candidates fall for different traits, and a knowledge of how specific bands on each spectrum are correlated with success in various roles.

“What we’re really trying to say to job candidates is Be yourself, and when you’re actually yourself, warts and all, that’s when you’ll find your best fit,” she says. She notes that when she witnessed the recruiting process in full swing while attending Harvard’s MBA program, she was struck by how many of her classmates seemed to have set their sights on jobs that had little to do with their own specific strengths and interests.

“All these smart kids who’d gotten it in their head, I want to be an investment banker, they’d read up on how to ace the interview, get it, and then be like, I hate my job,” she recalls. That kind of conflict might have been rooted out if anyone had stopped to examine their fundamental fitness for the role.


Building a character profile

The Pymetrics test consists of a dozen brief computer games, and takes about 25 minutes to complete. The games themselves are drawn from cognitive-science literature and have been used by researchers over the last several decades to measure things like memory and attention.

The games aren’t exactly fun—not to this test-taker, at least. One game flashes a string of numbers in front of me to test my memory, adding an extra digit each time. Another asks me to decide how much imaginary money I want to give to an imaginary partner. One exercise has me distinguish between smiley faces with bigger and smaller grins. Another has me click the space bar to inflate a balloon; the goal is to get the balloon as big as I can before it pops.

I can imagine the qualities that the tests are meant to measure; the last one, for example, seems related to users’ ability to learn from their mistakes. But I’m also bored—which means that I wind up just clicking the space bar impatiently to get through to the end of the exercise. I wonder whether the results might actually reveal more about how easily annoyed I am than about my talent for learning on the fly.

To be fair, since I’m not actually applying for a job, the test-taking process feels fairly low-stakes. Someone trying to land a position might well take the games more seriously. But the experience still makes me wonder whether the Pymetrics games would be effective across a range of contexts—a question I pose to Suresh Venkatasubramanian, a computer scientist and professor at the School of Computing at the University of Utah. His research focuses on the subject of algorithmic fairness.

Venkatasubramanian, who previously consulted with HireVue and served on its expert advisory board but is no longer connected with the company, wasn’t able to comment on Pymetrics’ tests specifically. But he says that in general, such tests assume that there are “universal properties of people that can be assessed by these tests that are relatively consistent across cultures, backgrounds, and across test conditions.”

The trouble with this kind of claim, he says, is that as standardized tests like the SAT have shown, socioeconomic factors can wind up influencing different people’s test outcomes. “Kids in certain households have access to puzzles all the time,” he notes. They might grow up to be the kind of people who are “more predisposed to do those puzzles,” and therefore fare better on problem-solving games.

“I struggle with AI in hiring because I think there might be a case to be made for being more thoughtful about how we do hiring,” says Venkatasubramanian, noting that the persistence of old-boy networks means there is “definitely an argument to be made that we need to reform the process.”

“But if the solution to reform is merely to automate it, it’s not clear that’s addressing the root problem,” he adds. “Strong claims need strong evidence. I haven’t seen that yet.”

Pymetrics has made an effort to make sure the test accommodates people of different abilities and backgrounds. There are versions of the test that work for people with dyslexia, ADHD, and colorblindness, and it’s available in 20 different languages. But there are still a lot of variations among test-takers that some employment-rights advocates worry could wind up skewing the results of the test. Older job applicants, for example, may be less familiar with computer games, as Bloomberg Law points out, and therefore less fluent in all that clicking, pointing, and tapping.

“There are factors that I don’t think Pymetrics is responsible for, but that nonetheless tend to correlate to other social structures that could undermine, to a degree, claims of being an equalizer,” Venkatasubramanian says. In other words: Designing a truly universal test of abilities is easier said than done.

One thing the language of the Pymetrics test certainly does emphasize is that there’s no wrong way to be. If the results suggest you’re not much of a planner, you’re not disorganized, you’re “improvisational.” If you slow down after making a mistake, you’re “contemplative”; if the mistake rolls off your back, you simply “move quickly.” Test-takers don’t get to see how their scores match up against what their potential employers are looking for, but they do see how they score along the spectrum of each trait.

Polli says the philosophy behind the test is that “what’s truly to your advantage is to be yourself … and find where my constellation of traits makes me best suited.” Still, when I look at my own test results, I’m not sure I feel the thrill of self-recognition. On cognitive abilities, like my efficiency at planning or my level of control over my attention, I mostly score in the middle. Socially, the test shows that I take a lot of ambiguous risks, but I suspect that has more to do with my lack of investment in the outcome of the computer games than how I act in real-world scenarios. I also score extremely high in altruism because I gave all my fake money to an imaginary partner, but I’m not sure I’d be quite so generous if actual dollars were on the line.

At the same time, I have to admit that a resume and cover letter probably offer potential employers even less insight into my true character. Gaining a nuanced understanding of another person’s strengths and weaknesses is, after all, a complicated endeavor. Research suggests that personality is unstable, as is the very construct of a self, and that our behavior is highly dependent on the specifics of the situations that we find ourselves in. Still, employers have to start somewhere. And Pymetrics could well be a better initial proxy than paperwork.


How Pymetrics tries to combat discrimination

Polli is well aware that algorithms can perpetuate bias and discrimination. But she says Pymetrics is having the opposite effect. According to Polli, in the first year after implementing the service, Unilever hired almost 20% more people of color in the roles for which it used Pymetrics. She also says the company increased its socioeconomic diversity—a point that Unilever management only realized after making its hires and suddenly getting, as Polli describes it, “an influx of people saying, I need you to compensate me now for relocation because I don’t have enough money in my bank account to move cross-country.”

In interviews, Polli often emphasizes that her company audits its own algorithms for gender and ethnic bias with a process that it’s posted on GitHub, where it’s available for anyone to review. Using a reference group composed of 50,000 people, the company runs a test on each algorithm to check whether it “is going to have a difference in outcomes that is statistically significant for gender or ethnicity,” Polli says. If there’s a difference, Pymetrics labels the algorithm as biased and finds an alternative.

The specific test that Pymetrics uses to check algorithms for bias is the four-fifths rule, a guideline used by the US Equal Employment Opportunity Commission to check whether a particular employment practice could have a disparate or adverse impact against protected groups. The rule, as explained by the EEOC, identifies “a selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5ths) or eighty percent (80%) of the selection rate for the group with the highest selection rate” as potentially discriminatory.
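In code, the four-fifths check is straightforward. The sketch below, using invented applicant counts, flags any group whose selection rate falls below 80% of the highest-selected group’s rate; it implements the EEOC guideline as quoted above, not Pymetrics’ published audit procedure.

```python
# The EEOC four-fifths (80%) rule, sketched with hypothetical numbers.
# This illustrates the guideline itself, not Pymetrics' GitHub audit code.

def adverse_impact_ratios(selection_rates: dict[str, float]) -> dict[str, float]:
    """Ratio of each group's selection rate to the highest group's rate."""
    top = max(selection_rates.values())
    return {group: rate / top for group, rate in selection_rates.items()}

# Hypothetical outcomes: 60 of 200 applicants selected from group A (30%),
# 22 of 100 from group B (22%).
rates = {"group_a": 60 / 200, "group_b": 22 / 100}

for group, ratio in adverse_impact_ratios(rates).items():
    verdict = "ok" if ratio >= 0.8 else "potential adverse impact"
    print(f"{group}: impact ratio {ratio:.2f} ({verdict})")
# group_b's ratio is 0.22 / 0.30 ≈ 0.73, under the 4/5 threshold, so this
# selection procedure would be flagged as potentially discriminatory.
```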

The AI experts interviewed for this story agree that while the four-fifths rule is one way of testing algorithms for bias, passing the test doesn’t necessarily mean that the algorithm is fair to all. “The four-fifths rule is an initial proxy,” says Venkatasubramanian. “Complying with it does not mean your system is unbiased; it means you’ve passed the test.” 

Computer scientist Rediet Abebe concurs. Abebe is a co-founder of the group Black in AI and a junior fellow at the Harvard Society of Fellows, where her research focuses on equity and justice within the realm of algorithms and AI. The four-fifths rule, she says, is “one metric we should be looking at” but “it’s just one of many ways to detect discrimination.” Meanwhile, in a September 2019 paper (pdf) on mitigating bias in algorithmic hiring, a team of researchers at Cornell notes that “vendors may find it necessary from a legal or business perspective to build models that satisfy the 4/5 rule, but this is not a substitute for a critical analysis into the mechanisms by which bias and harm manifest in an assessment.”

So what else should Pymetrics and other companies that rely on hiring algorithms do to ward against the possibility of bias? For one thing, the Cornell researchers recommend looking out for “differential validity, which occurs when an assessment is better at ranking members of one group than another.” For example, if there’s stronger correlation between a high score on a memory test and good job performance for men than for women, that would suggest that the test is biased in favor of men.
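That check, too, can be sketched in a few lines: compute the score-to-performance correlation separately for each group and compare. The data below are invented for illustration; a real audit would use actual assessment scores and later performance ratings.

```python
# A rough differential-validity check: does the assessment predict job
# performance equally well for different groups? All data are hypothetical.
from statistics import correlation  # Pearson's r; Python 3.10+

# Pairs of (assessment score, later job-performance rating), by group.
group_1 = [(72, 3.1), (85, 4.0), (64, 2.8), (90, 4.4), (78, 3.6)]
group_2 = [(70, 3.9), (88, 3.0), (66, 4.1), (92, 3.2), (75, 3.5)]

for label, pairs in [("group 1", group_1), ("group 2", group_2)]:
    scores, performance = zip(*pairs)
    r = correlation(scores, performance)
    print(f"{label}: validity r = {r:+.2f}")
# If r is high for one group but near zero (or negative) for the other,
# the assessment ranks one group's members more accurately: differential
# validity, meaning the test is effectively biased toward that group.
```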

Venkatasubramanian suggests that companies also should be trying very hard to “red-team” their own systems—that is, to act like a cybersecurity company and hire people to identify the holes and vulnerabilities in their tools. “The reason to do that is to understand internally what the weaknesses of your system are, and secondly, to provide a measure of confidence to the public,” says Venkatasubramanian. If job-seekers are going to get assessed on the basis of games, he argues, they need strong evidence to assure them that “if you have a cold or a bad evening, that’s not going to torpedo your job chances.”

Some AI experts are pushing for third-party audits of algorithms and greater regulation of this space in general. And some argue that AI firms serious about warding off the risks involved in algorithmic hiring should themselves maintain a staff that’s diverse in characteristics like race, gender, and sexual orientation. “Who codes matters,” computer scientist Joy Buolamwini, founder of the Algorithmic Justice League, said in a 2016 TEDx talk. “Are we creating full-spectrum teams with diverse individuals who can check each other’s blind spots?”

Abebe says that it may be useful for companies using machine learning in hiring to also bring in people with backgrounds in fields like economics and sociology, who will be able to use their expertise in issues of inequality to spot potential pitfalls.

For her part, Polli says that Pymetrics is rather diverse for a tech company, with 48% women and 52% men. (The company declined to provide statistics on its racial and ethnic diversity.) She also says that when it comes to warding off bias, “it’s less about who’s literally coding, it’s more about who designed it.” That would be herself and her co-founder, Yoo. As women, she says, “being more sensitive to these issues has strongly influenced the way we think about designing AI.”

Pymetrics keeps track of how its product is performing along four different axes: efficacy, or whether companies that use Pymetrics for hiring are finding people who stick around longer and perform better; efficiency, or whether the product is faster than other processes and is leading to a higher yield of people who are successfully matched with jobs; the diversity of new hires according to gender, race, age, and socioeconomic background; and how job candidates rate their own experiences with the platform.

Pymetrics’ own data says that 95% or more of candidates who take the test are satisfied with the process. Meanwhile, a quick search through message boards on Reddit turns up job-seekers who’ve encountered the tests and deemed them ridiculous, or who are upset after being booted from the applicant pool because of their results—as well as people who are happy with the assessments.

In the happy camp is Peter Torres, a former senior business consultant in Grand Rapids, Michigan, who’s currently looking for a new job. Torres came across Pymetrics a few months ago while applying for a position as a relationship manager with LinkedIn, and has since joined Pymetrics’ candidate advisory council. He says the games “really helped me in terms of coming in on an even playing field—I didn’t feel like I was coming in at advantage or disadvantage.”

Playing the games also “got my competitive juices flowing,” he says. He agreed with his results—and appreciated the fact that the test didn’t try to box him into a particular career path. “In some previous assessments, I found that some people wanted more specifics as opposed to being versatile or broadly experienced,” he says.

Torres says the LinkedIn job didn’t pan out because of the location and other requirements, and he is still looking for a new position. But he’s optimistic about the idea that other companies that also use Pymetrics may find him through the platform and reach out to him about a new role. I ask if he’s heard from anyone in the past few weeks. “Admittedly,” he says, “no, I have not.”

Notably, Pymetrics itself is too small to use the games to evaluate and hire its own staff. “With machine learning, you need a minimum number of examples to train algorithms,” Polli explains. You need at least several dozen people in the same type of role to start building a credible list of traits possessed by successful people in the position. But the company tries to structure its talent-evaluation process as much as possible, using the kinds of rubrics that have been shown to reduce bias in hiring decisions. In keeping with Pymetrics’ philosophy of potential over pedigree, she says, “We have a lot of people here doing their role for the first time, including the CEO.”


The tech law of amplification

It’s inevitable that Pymetrics and its competitors will continue to attract scrutiny and criticism. The stakes, after all, are incredibly high. The worst-case scenario, according to Abebe? “You wind up having algorithms that disproportionately discriminate against individuals who were not represented in developing the algorithms that are used widely as a way of making hiring decisions, and wind up replicating and perpetuating inequality to a point where it’s nearly impossible to undo.”

That’s the darkest-timeline scenario—and why some advocates for workers’ rights are already calling for greater regulatory oversight of AI in recruiting and hiring. Illinois, for example, passed a law meant to give job candidates more transparency about, and control over, the use of AI in video interviews. It goes into effect on Jan. 1, 2020.

According to Abebe, there’s something to be gained from the rise of machine learning in hiring and the ensuing controversies over its application. “Now we can have conversations about not just algorithms and hiring, but hiring more generally,” she says. “Giving people employment is a big deal. Algorithms did not create discrimination; discrimination is already there.”

When thinking about how machine learning could change the landscape of employment, Abebe says it’s useful to consider technology’s law of amplification—a theory developed by Kentaro Toyama, the WK Kellogg professor of community information at the University of Michigan School of Information.

The law of amplification, as Toyama explains in his book Geek Heresy: Rescuing Social Change from the Cult of Technology, holds that technology’s “primary effect is to amplify human forces. Like a lever, technology amplifies people’s capacities in the direction of their intentions. You cannot expect a technology to transcend existing social forces or transform existing intentions; it tends instead to amplify whatever tendencies are already in place.”

What this means for Pymetrics is that, no matter how hard it works to refine its tests and machine learning, it can’t end bias in hiring on its own. To do that, employers, and society as a whole, need to address the underlying dynamics of discrimination that determine who gets the opportunity to work, and in what kind of jobs.