GARBAGE IN

Here’s what we know is wrong with the PPP data

Imprecise marks
Image: REUTERS/Jonathan Ernst/File Photo

The US government has finally released data on which companies took money from the government to support payrolls as part of the Paycheck Protection Program (PPP). It’s a mess. While there is rarely a dataset that doesn’t suffer from some methodological dirtiness, definitional quirks, or collection bias, this data has already become notorious for its failings.

Bad data entry

The data comes from loan applications facilitated by banks. Some banks submitted one application at a time. Others submitted them in bulk. Many clearly were not proofread or validated. For instance, the spelling of city names clearly was not checked.

There are at least 35 different spellings of Chicago in the data, including: CHCAGO, CHIACAGO, CHIACGO, CHIAGO, CHICAAGO, CHICACO, CHICAFO, CHICAG, CHICAGO, CHICAGOI, CHICAGOL, CHICAGOO, CHICAGP, CHICAO, CHICARGO, CHICGAO, CHICSGO, CHIGAGO, CHIOCAGO, and CHOCAGO.

Spellings of Miami in the data include: MAIAMI, MAIMI, MIAI, MIAM, MIAMI, MIAMIA, MIAMIF, MIAMIM, MIANI, MIANMI, MIMAI, and NIAMI.

Misspellings of Dallas include: DALAS, DALASS, DALL, DALLA, DALLAA, DALLAD, DALLASQ, DALLAX, DALLLAS, DALLS, and DALLSA.
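A check like this can be approximated with fuzzy string matching. The sketch below is one possible approach, not the method Quartz used: it relies on Python's standard `difflib`, and the function name `near_spellings` and the 0.8 similarity cutoff are assumptions chosen for illustration. A cutoff that loose can also catch legitimately different place names, so the results would still need a human review.

```python
from difflib import SequenceMatcher


def near_spellings(canonical, values, cutoff=0.8):
    """Return distinct values that resemble `canonical` without matching it exactly.

    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    target = canonical.strip().upper()
    found = set()
    for value in values:
        candidate = value.strip().upper()
        if candidate == target:
            continue  # the correct spelling is not a variant
        if SequenceMatcher(None, target, candidate).ratio() >= cutoff:
            found.add(candidate)
    return sorted(found)


# A handful of city values taken from the lists above.
cities = ["CHCAGO", "CHICAGO", "CHOCAGO", "DALLAS", "MIAMI", "CHICAGOO"]
print(near_spellings("CHICAGO", cities))
```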

Even in fields that are easy to validate, errors were made. Some loans that have Zip codes listed have state codes listed as XX. One loan has a Zip code in Florida, but has the state code listed as FI instead of FL.

The field for business type contains checkable errors too. Excluding organizations listed as non-profits, there are 2,627 loans to organizations with “LLP” in their names that are not listed as a limited liability partnership under the business type. Similarly, there are 21,287 loans to organizations with “LLC” in their names that aren’t listed as a limited liability company.
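The name-versus-type comparison can be sketched in a few lines. This is an illustration under assumptions, not Quartz's code: the business-type labels in the real file may be worded differently, the function name `mismatched_suffix` is hypothetical, and the whole-word match used here would miss variants such as "L.L.C."

```python
import re

# Assumed mapping from a name suffix to text expected in the business-type field.
SUFFIX_TO_TYPE = {
    "LLC": "limited liability company",
    "LLP": "limited liability partnership",
}


def mismatched_suffix(name, business_type):
    """Return the suffix in `name` that the business type contradicts, else None."""
    btype = business_type.lower()
    if "non-profit" in btype:
        return None  # non-profits are excluded, as in the counts above
    for suffix, expected in SUFFIX_TO_TYPE.items():
        in_name = re.search(rf"\b{suffix}\b", name.upper())
        if in_name and expected not in btype:
            return suffix
    return None


print(mismatched_suffix("ACME WIDGETS LLC", "Corporation"))
```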

Then some loan information is just wrong. The loan listed for Ford’s Hometown Services Inc has the company listed at 549 Grove St in Hartford, CT. But that address doesn’t exist. The company’s website says that it’s located 60 miles (100 km) away at 549 Grove St in Worcester, MA.

The Zip code listed for a loan for La Jolla Dentistry in San Diego is 91121, a Zip code that is for Pasadena, California, 110 miles away. The correct Zip code is 92121.

If well-known information like city names is misspelled, we can assume that other fields, like the names of companies or banks, are misspelled too. Garbage in, garbage out.

The Small Business Administration, which is operating the loan program and released the data, did not answer Quartz’s questions about any of the issues we found or provide any substantive comment on our findings.

Bad data translation

It’s clear that some data has been transposed or truncated into the wrong cells. For instance, the most common street address for a business receiving a loan is “PO BOX” without a street address or box number. Some fields also have the wrong information in them.

There are 1,182 loans where numeric digits appear in the city field. Some of those are clearly spillovers or duplication from the address field. On 198 loan listings the city field contains an office suite number.

Quartz was able to identify 842 loans where what appears to be a name associated with the loan is listed in the city field. For 781 of those, the loaned amount was less than $150,000, which meant the recipient’s identity was intended to be withheld by the SBA. This error appears 824 times on loans processed by Bank of America.

A loan listed under Morgan-Keller Inc. says the company is at 70 THOMAS JOHNSON DRIVE in the city of SUITE 200 FREDERICK, MD rather than 70 Thomas Johnson Drive, Suite 200 in the city of Frederick, MD, as their website indicates.

A loan listed under Volta Power Systems LLC has its location listed as SUPERIOR CT in the city of 12550 HOLLAND, MI. On what appears to be the company’s website, a contact address of 12550 Superior Ct. Holland, MI is listed.

For 600 loans the city field contains a five-digit number. For 519 loans, that number matches the listed Zip code. In the loans where those fields don’t match, there are clearly data errors. A loan given to an unnamed business with an address at JFK Airport in New York is listed as being in Michigan. Its Zip code is listed as 48851. It’s certainly not a coincidence that the industry code for “Freight Transportation Arrangement” is 488510.
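The city-field problems described in this section lend themselves to simple pattern checks. The following is a minimal sketch, assuming the city and Zip code are available as strings; the function name `classify_city`, the category labels, and the placeholder Zip code `00000` are invented for illustration and are not from the SBA data.

```python
import re


def classify_city(city, zip_code):
    """Label a city value with the kind of problem it shows, or "ok"."""
    value = city.strip().upper()
    if re.fullmatch(r"\d{5}", value):
        # A bare five-digit number: does it duplicate the Zip code?
        if value == zip_code.strip():
            return "zip_in_city"
        return "five_digit_mismatch"
    if re.search(r"\b(SUITE|STE)\b", value):
        return "suite_in_city"
    if re.search(r"\d", value):
        return "digits_in_city"
    return "ok"


for city, zip_code in [
    ("SUITE 200 FREDERICK", ""),
    ("12550 HOLLAND", ""),
    ("48851", "48851"),
    ("48851", "00000"),  # placeholder Zip code, not a real record
    ("CHICAGO", ""),
]:
    print(city, "->", classify_city(city, zip_code))
```

Checks like these only flag candidates. Detecting a person's name in the city field, for example, would need more than a regular expression.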

Missing fields

There are 224 loans that don’t have a Zip code listed, 247 without a city listed, and 210 with the state listed as XX. Together, there are 166 loans without all three.
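Counts like these come from tallying blank fields per record and then checking how the blanks overlap. A rough sketch follows, assuming each loan is a dictionary with hypothetical `zip`, `city`, and `state` keys (the real column names may differ) and treating a state of XX as missing.

```python
def missing_summary(loans):
    """Count loans missing a Zip code, a city, a state, and all three at once."""
    counts = {"zip": 0, "city": 0, "state": 0, "all_three": 0}
    for loan in loans:
        no_zip = not (loan.get("zip") or "").strip()
        no_city = not (loan.get("city") or "").strip()
        no_state = (loan.get("state") or "").strip().upper() in ("", "XX")
        counts["zip"] += no_zip
        counts["city"] += no_city
        counts["state"] += no_state
        counts["all_three"] += no_zip and no_city and no_state
    return counts


# Made-up records for illustration only.
sample = [
    {"zip": "", "city": "", "state": "XX"},
    {"zip": "33101", "city": "MIAMI", "state": "XX"},
    {"zip": "", "city": "DALLAS", "state": "TX"},
]
print(missing_summary(sample))
```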

Information on loans made for over $150,000 includes the name of the receiving entity. For eight loans, the name of the recipient is missing.

Loan recipients were given the choice whether to provide demographic information. The race and ethnicity field was left blank on 89% of loans. The gender field was blank on 78%. The veteran field was blank on 85%. All three were blank on 76% of the loans. Any analysis of the current dataset to show the share of loans received by race, gender or veteran status will be biased by those who chose to provide that information.

Ambiguous schema

When applicants did provide the race and ethnicity field, it wasn’t fully standardized. Both “Hispanic” and “Puerto Rican” appear in the data, as do “American Indian or Alaska Native” and “Eskimo & Aleut.”

Retaining more jobs than are likely to exist

Because each loan is coded by industry, it’s easy to compare these figures to other statistics about the industry. In 35 industries, the number of jobs retained is greater than what other official statistics show to be the total number of workers in that industry.

There are many explanations for this. Companies may be wrongly categorized in the loan data. They could be using an out-of-date code as the classification system is updated periodically. Companies might be overstating the number of jobs they’re retaining or using a different definition of what constitutes a job.
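The industry comparison described in this section amounts to summing jobs retained per industry code and setting the total against an outside employment figure. Here is a sketch under assumptions: loans are given as (industry code, jobs retained) pairs, `industry_employment` is a hypothetical lookup built from some other official source, and every number in the example is invented.

```python
from collections import defaultdict


def overcounted_industries(loans, industry_employment):
    """Return {code: (jobs_retained, total_employment)} where retained exceeds the total."""
    retained = defaultdict(int)
    for code, jobs in loans:
        if jobs is not None and jobs > 0:  # skip blank and negative counts
            retained[code] += jobs
    return {
        code: (total, industry_employment[code])
        for code, total in retained.items()
        if code in industry_employment and total > industry_employment[code]
    }


# Invented figures; only the 488510 code itself appears in the data discussed above.
loans = [("488510", 400), ("488510", 350), ("000000", 20), ("000000", -3)]
employment = {"488510": 600, "000000": 1000}
print(overcounted_industries(loans, employment))
```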

Improbable wages

There are a couple of ways money received through PPP loans can be used, but for the loan to be forgiven, 60% must be spent on payroll. The amount that is forgiven is proportional to the number of staff a business keeps on in the 10 weeks after receiving the loan.

Of course, some businesses may have just wanted an easy, cheap loan to, say, pay rent, rather than free money with strings attached. Even so, on at least 9% of loans listed, the implied wages would be less than the federal minimum wage. This assumes that a borrower received the maximum amount listed in its value range and used the money exclusively to pay the workers it retained over the 10-week period.

There are 209,558 loans where the minimum possible per-person annual payroll is greater than $100,000. PPP loans are only supposed to finance the pro-rated amount of the first $100,000 of a person’s annual wages.
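The arithmetic behind both checks can be sketched as follows. Quartz does not spell out its exact formulas, so the details here are assumptions: a 40-hour work week for the hourly figure, the federal minimum wage of $7.25 an hour, and the program's general rule that a loan equals 2.5 times average monthly payroll for the annualized figure. The function names are hypothetical.

```python
FEDERAL_MINIMUM_WAGE = 7.25  # dollars per hour


def implied_hourly_wage(loan_max, jobs, weeks=10, hours_per_week=40):
    """Hourly pay if the top of the loan range went entirely to retained workers."""
    if jobs is None or jobs <= 0:
        return None  # blank, zero, or negative job counts cannot be evaluated
    return loan_max / (jobs * weeks * hours_per_week)


def implied_annual_pay(loan_min, jobs, months_covered=2.5):
    """Annualized per-person payroll implied by the bottom of the loan range."""
    if jobs is None or jobs <= 0:
        return None
    return loan_min / months_covered * 12 / jobs


# A $150,000 loan said to retain 100 jobs implies $3.75 an hour.
wage = implied_hourly_wage(150_000, 100)
print(wage, wage < FEDERAL_MINIMUM_WAGE)

# A loan of at least $1 million with one job retained implies far more than $100,000.
print(implied_annual_pay(1_000_000, 1))
```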

Bloomberg News spoke to every borrower with a loan listed for more than $1 million but only one job listed as retained. They all told the publication that there were mistakes in the data.

Conspicuous employee counts

The number of employees at an organization is somewhat random. In the loan data, though, many companies say the money is to retain a number of jobs evenly divisible by five or 10.

There are seven loans that have a negative number of “jobs retained.”

192 loans were given to entities labeled as a sole proprietorship that say they are retaining 500 jobs each. Typically, sole proprietorships are used by the self-employed and hire very few, if any, people. The SBA describes them (pdf) as “very small businesses.”

On 61 loans, the business type is listed as self-employed and the jobs retained figure is 500. There are 16 loans to independent contractors that claim to support 500 jobs.

Conspicuous loan amounts

Similar to employee counts, the value of loans based on monthly payroll should have random qualities, since it’s determined formulaically. Looking at the last digit of the exact loan amounts (which were only given for loans under $150,000) shows that there is again a bias toward numbers evenly divisible by 10. In fact, 94% of the 4.9 million loans under $150,000 end with the digit zero. Just 624 of the 4.9 million loans had values ending in three, four, six, seven, eight, or nine.
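A last-digit tally of this kind is straightforward to reproduce. The sketch below assumes loan amounts are rounded to whole dollars before the final digit is taken, which may not match how Quartz handled cents; the function names and the sample amounts are made up. If final digits were spread evenly, roughly 10% of values would end in each digit.

```python
from collections import Counter


def last_digit_counts(amounts):
    """Tally the final digit of each amount after rounding to whole dollars."""
    return Counter(abs(int(round(amount))) % 10 for amount in amounts)


def share_ending_in(amounts, digit=0):
    """Fraction of amounts whose rounded value ends in `digit`."""
    if not amounts:
        return 0.0
    return last_digit_counts(amounts)[digit] / len(amounts)


# Invented amounts for illustration.
amounts = [20_800, 15_000, 9_300, 1_234]
print(last_digit_counts(amounts))
print(share_ending_in(amounts))
```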

Companies that didn’t apply

Should this data be trusted, or company statements? A number of companies that are listed as loan recipients are now denying that they applied for or received one. Scooter-sharing company Bird is listed as receiving a loan between $5 million and $10 million, but the company says that’s wrong.

A woman in Wisconsin is listed as receiving a loan between $5 million and $10 million, even though she took out a loan for about $9,300.

Restaurant group Benihana says that it applied for a loan but didn’t accept it (paywall). The data show 24 loans made to entities sharing Benihana’s corporate address in Miami. All of them contain “Benihana” in their name.