Numbers Rule Your World - Part 2

Years at Address = 0 to 0.5 AND
Have Major Credit Card = Yes AND
Banking Relationship = Savings AND
Number of Recent Credit Inquiries = 5 AND
Account Balances = 16 to 30 percent of Credit Lines AND
Past Delinquency = None
THEN Score = 660

Now imagine thousands upon thousands of such rules, each matching a borrower to a three-digit number. More precisely, this number is a rating of applicants in the past who share similar characteristics with the present borrower. The FICO score is one such system. FICO modelers use 100 characteristics, grouped into five broad categories, listed here in order of importance:

1. Has the applicant dealt responsibly with past and current loans?
2. How much debt is currently held?
3. How long is the credit history?
4. How eagerly is the applicant seeking new loans?
5. Does the applicant have credit cards, mortgages, department store cards, or other types of debt?
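
To make the mechanics concrete, here is a minimal sketch in Python of how such a rule table might be represented and matched against an applicant. The characteristic names, values, and the single rule simply echo the example above; they are illustrative only and are not FICO's actual characteristics, rules, or weights.

```python
# Illustrative only: a scoring rule as a set of characteristic conditions.
# An applicant who satisfies every condition of a rule receives that rule's score.
RULES = [
    {
        "conditions": {
            "years_at_address": "0 to 0.5",
            "has_major_credit_card": "Yes",
            "banking_relationship": "Savings",
            "recent_credit_inquiries": "5",
            "balances_pct_of_credit_lines": "16 to 30 percent",
            "past_delinquency": "None",
        },
        "score": 660,
    },
    # ...a real system would hold thousands of such rules...
]

def score_applicant(applicant):
    """Return the score of the first rule whose conditions the applicant matches."""
    for rule in RULES:
        if all(applicant.get(name) == value
               for name, value in rule["conditions"].items()):
            return rule["score"]
    return None  # no rule matched

applicant = {
    "years_at_address": "0 to 0.5",
    "has_major_credit_card": "Yes",
    "banking_relationship": "Savings",
    "recent_credit_inquiries": "5",
    "balances_pct_of_credit_lines": "16 to 30 percent",
    "past_delinquency": "None",
}
print(score_applicant(applicant))  # 660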

In general, whether a score exceeds some cutoff level is far more telling than any individual score. Given a cutoff of 700, Mr. 720 is accepted, while Ms. 660 is rejected. Lenders set cutoff scores so that some desired proportion of applicants will be approved. They believe this proportion represents a healthy mix of good and bad risks needed to keep the business afloat.

The computer-harvested rules outperform the handcrafted ones: covering more details, they facilitate more nuanced comparisons, leading to more accurate predictions. For instance, rather than banning all painters, credit-scoring models selectively grant credit to painters based on other favorable traits. There is a low limit to how many characteristics the human mind can juggle, but the computer has the charming habit of digesting everything it is fed. Moreover, under each characteristic, the computer typically places applicants into five to ten groups, while traditional rules use only two. So instead of using debt ratio above or below 36 percent, a computer rule might split borrowers into groups of High (more than 50 percent), Medium (15 to 35 percent), Low (1 to 14 percent), and Zero debt ratio. This extra complexity achieves the same effect as gerrymandering does in creating voter districts. In the United States, the major political parties realize that simple rules based on things like county lines do not round up as many like-minded voters as do meandering boundaries. The new guidelines can appear illogical, but their impact is undeniable.
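
As a small illustration of that extra granularity, here is a sketch, using the debt-ratio groups quoted above, of how a computer rule bins a single characteristic into several levels instead of one above-or-below-36-percent split. The function name is invented, and the point values a real model would attach to each group are not shown.

```python
def debt_ratio_group(debt_ratio_pct):
    """Bin a borrower's debt ratio (as a percentage) into the groups quoted in the text."""
    if debt_ratio_pct == 0:
        return "Zero"
    if 1 <= debt_ratio_pct <= 14:
        return "Low (1 to 14 percent)"
    if 15 <= debt_ratio_pct <= 35:
        return "Medium (15 to 35 percent)"
    if debt_ratio_pct > 50:
        return "High (more than 50 percent)"
    return "Unassigned"  # the quoted groups leave 36 to 50 percent unstated

for ratio in (0, 10, 25, 42, 60):
    print(ratio, "->", debt_ratio_group(ratio))
```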

Automated scoring also has the advantage of being consistent. In the past, different companies, or analysts within the same company, frequently applied different rules of thumb to the same type of applicants, so credit decisions appeared confused and at times contradictory. By contrast, credit-scoring modelers predetermine a set of characteristics upon which all borrowers are evaluated so that no one characteristic dominates the equation. The computer then gives each applicant a rating, taking into account the importance of each characteristic. In the past, analysts weighed relative importance on the fly at their discretion; these days, FICO computers scan large databases to determine the most accurate weights. In these respects, credit scoring is fair.

It used to take generations to calibrate a simple rule such as "Don't lend to painters"; computers can do the job in less than a second because they excel at repetitive tasks like trial and error. This extreme efficiency lends itself to discovering, monitoring, and refining thousands, even millions, of rules. Moreover, computers allow lenders to track the result of each loan decision, rather than knowing only the overall performance of an entire portfolio, thereby facilitating a more surgical diagnosis of why some decisions turned sour. The feedback loop is much shorter, so weaker rules get eliminated fast.

Early adopters reaped immediate and dramatic gains from statistical scoring systems. An experienced loan officer took about twelve and a half hours to process an application for a small-business loan; in the same amount of time, a computer scored fifty applications. In this world, it is hardly surprising that Barbara Ritchie took out an auto loan with so little hassle: over 80 percent of auto loan approvals occur within an hour, and almost a quarter are approved within ten minutes. In this world, it is no wonder Barbara Ritchie was handed a Costco credit card: store clerks can open new accounts in less than two minutes. Thanks to credit scoring, the cost to process card applications has dropped by 90 percent, and the cost to originate a mortgage has been halved.

Lenders have reacted by ratcheting throughput up 25 percent, approving a great many more loans. As a result, the arrival of credit-scoring technology coincided with an explosion of consumer credit. In 2005, American households borrowed $2.1 trillion, excluding mortgages, a sixfold jump in twenty-five years. In turn, this spurred consumer consumption as credit empowered Americans to spend future income on current wants. Today, household expenditures account for two-thirds of the American economy; it is widely believed that heady consumers pulled the United States out of the 2001 recession. Amazingly, higher throughput did not erode quality: the loss rate of new loans turned out to be lower than or equal to that of the existing portfolio, just as the modelers prescribed. Further, all socioeconomic strata shared the bonanza: among households with income in the bottom 10 percent, the use of credit cards leaped almost twentyfold, from 2 percent in 1970 to 38 percent in 2001; among African-American households, it more than doubled from 24 percent in 1983 to 56 percent in 2001.

Before credit card companies fully embraced credit scores in the 1980s, they targeted only the well-to-do; by 2002, each household had on average ten credit cards, supporting $1.6 trillion of purchases and $750 billion of borrowing. During the 1990s, insurers jumped on board, followed by mortgage lenders. As of 2002, providers of substantially all credit cards, 90 percent of auto loans, 90 percent of personal loans, and 70 percent of mortgages utilized credit scores in the approval process. In industry after industry, it appears that, once let in the door, credit scoring is here to stay. What makes it so sticky?

Credit-scoring models rate the creditworthiness of applicants, allowing users to separate good risks from bad. This ability to select customers, balancing good and bad risks, is crucial to many industries. The insurance industry is no exception. People who tend to file more claims-say, reckless drivers-are more likely to want to buy insurance because they know those who do not file claims in effect subsidize those who do. If an insurer takes in too many bad risks, then the good customers will flee, and the business will falter. In the 1990s, insurance companies realized that credit scores could help them manage risk. To understand how this technology came to dominate industries, let's divide the players into Haves (those who use scoring) and Have-Nots. Thanks to scoring, the first Have cherry-picks the good risks with efficiency. The riskier borrowers are turned away, and they line up at the doors of the Have-Nots. Before long, the Have-Nots notice deterioration in their loss performance. The most discerning Have-Not figures out why; it implements scoring to level the playing field, thus becoming a Have. As more and more firms turn into Haves, the Have-Nots see ever-worsening results, and eventually everyone converts. Acknowledging this domino effect, Alan Greenspan once remarked: "Credit-scoring technologies have served as the foundation for the development of our national markets for consumer and mortgage credit, allowing lenders to build highly diversified loan portfolios that substantially mitigate credit risk. Their use also has expanded well beyond their original purpose of assessing credit risk. Today they are used for assessing the risk-adjusted profitability of account relationships, for establishing the initial and ongoing credit limits available to borrowers, and for assisting in a range of activities in loan servicing, including fraud detection, delinquency intervention, and loss mitigation. These diverse applications have played a major role in promoting the efficiency and expanding the scope of our credit-delivery systems and allowing lenders to broaden the populations they are willing and able to serve profitably."

But this story can also be told in an alternative version.

To hear it from consumer advocacy groups, credit scoring is a wolf in sheep's clothing: its diffusion is a national tragedy, the science behind it fatally flawed. Birny Birnbaum, at the Center for Economic Justice, has warned that credit scoring will bring about the "end of insurance." Chi Chi Wu of the National Consumer Law Center has charged that credit scoring is "costing consumers billions and perpetuating the economic racial divide." Norma Garcia, speaking for the Consumers Union, has declared, "Consumers [are] caught in the crossfire." Such was the torrent of disapproval that Contingencies, an actuarial trade publication, gave advice on adjusting to "living without credit scoring." Owing to the unrelenting pressure by consumer groups, at least forty states as of 2004 have passed laws constraining the use of credit scoring. Some states, including California, Maryland, and Hawaii, have barred home and auto insurers from applying the technology. The federal Fair Credit Reporting Act was amended in 1996 and again in 2003. Not a year passes without some legislature holding hearings on the subject. These meetings are like a traveling circus with four acts; the same four basic themes are repeated over and over and over again:

1. Ban or heavily censor credit scoring because the statistical models are flawed. The models fail to link cause and effect. Worse, they unfairly slap minorities and low-income households with lower scores.

2. Ban or heavily censor credit scoring until credit reports contain accurate and complete information on every consumer. Data problems are causing many consumers to pay higher rates for loans and insurance.

3. Require credit-scoring companies to open the "black boxes" that hold the proprietary scoring rules. Consumers have a right to inspect, challenge, and repair their credit scores.

4. Conduct more studies or hearings to seek the perfect model of consumer behavior. This research should focus on establishing cause and effect or on measuring disparate impact on the underprivileged.

Many a critique of credit scoring begins and ends with the horror story of a wronged consumer. James White became one in 2004 after his insurer hiked his rate by 60 percent. He learned that his credit score had been marked down substantially because of twelve recent credit inquiries (this was five times the national average of 2.4). Each inquiry occurred when someone requested his credit report, and White was shopping for a mortgage, an auto loan, and a credit card at the time. Critics complained that lenders inspecting his credit report could not have caused a change in White's creditworthiness, so it was nonsensical for modelers to relate the two. Extending this line of thought, they contended that scoring models should employ only characteristics that have a proven causal relationship with failure to repay loans. To them, predicting the behavior of people is analogous to explaining the origin of diseases.

In response, credit modelers maintain that they have never sought to find cause; their models find personal traits that are strongly correlated with the behavior of defaulting on loans. Correlation describes the tendency of two things moving together, in the same or opposite directions. In James White's case, the model observed that, historically, borrowers who experienced a spurt of credit inquiries were much more likely to miss payments compared with those who did not. Most probably, neither thing directly affected the other.

Indeed, acting on correlation is vital to our everyday experience. Imagine that John, who is trudging through snow five steps ahead of you, slips while turning the corner. Then David, Mary, and Peter slip, too. Slickly veering left, you stay upright. You could have tried to find the black ice, linking effect to cause, but you do not. You presume that walking straight would mean slipping. You act on this correlation. It saves you from the fall. Similarly, when it feels murky outside, you bring an umbrella. You don't study meteorology. Before you move your family to a "good school district," you ask to see standardized test scores. You don't examine whether good schools hire better teachers or just admit smarter students.

How does one live without correlational models? If you think about it, the computer acts like credit officers from yesteryear. If they had noticed the correlation between credit inquiries and failure to pay, they would have used it to disqualify loan applicants, too.

Not only is causation unnecessary for this type of decision, but it is also unattainable. No physical or biological laws govern human behavior exactly. Humans by nature are moody, petulant, haphazard, and adaptive. Statisticians build models to get close to the truth but admit that no system can be perfect, not even causal models. Sometimes, they see things that are not there, and, to borrow language from mutual-fund prospectuses, past performance may not be repeated. To the relief of their creators, credit-scoring models have withstood decades of real-world testing. The correlations that define these models persist, and our confidence in them grows by the day.

Another common grievance concerns the credit reports, which are known to be inaccurate and often incomplete. Common errors include typing mistakes, mistaken identities, identity fraud, duplicated entries, outdated records, and missing information. Critics say that because garbage in means garbage out, feeding "dirty" data to computers necessarily produces unreliable credit scores.

Nobody disputes the messy state of data hygiene; it is fantasy, however, to assume that any credit-reporting system can be free of error. Between them, the three dominant U.S. credit bureaus process 13 billion pieces of data per month; on that base, an error rate as low as 0.01 percent still means one mistake appears every two seconds! Welcome to the real world of massive data. Modelers have developed some powerful strategies to cope. We already observed that each computer rule contains multiple characteristics, and these are often partially redundant. For example, many scoring systems evaluate "years at current residence" together with "years at current job" and "length of credit history." For most people, these three traits are correlated. If one of the triad is omitted or inaccurate, the presence of the other two attenuates the impact. (Sure, a different rule applies, but the score shifts only slightly.) By contrast, credit officers cannot correct course, because their handcrafted rules take one characteristic at a time.
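
A quick back-of-the-envelope check of that error-rate arithmetic, assuming a thirty-day month:

```python
records_per_month = 13_000_000_000  # data items processed by the three bureaus each month
error_rate = 0.0001                 # an error rate of 0.01 percent
errors_per_month = records_per_month * error_rate  # 1.3 million erroneous items
seconds_per_month = 30 * 24 * 60 * 60              # about 2.6 million seconds

print(seconds_per_month / errors_per_month)  # roughly 2 seconds between errors
```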

Does inaccurate or incomplete information always harm consumers? Not necessarily: when mistakes happen, some people will receive artificially lower scores, while others will get undeservedly higher scores. For example, the credit bureau could have confused James White's neighbor, a certain Joe Brown, with a namesake New York corporate lawyer, and so attached the latter's stellar debt repayment record to the former's credit report, bumping up his credit score and qualifying him for lower interest rates. For good reason, we never hear about these cases.

Yet another line of attack by the critics of credit-scoring technology asserts the right of consumers to verify and repair their credit scores. This legislative push engendered the Fair and Accurate Credit Transactions Act of 2003. Innocuous as it sounds, this misguided initiative threatens to destroy the miracle of instant credit. In this era of openness, people disgruntled with bad scores have started knocking on the doors of credit repair agencies, hoping for a quick fix. Dozens of shady online brokers have cropped up to help customers piggyback on the good credit records of strangers by becoming "authorized users" on their credit cards. The customers inherit these desirable credit histories, which elevate their credit scores. This is identity theft turned upside down: cardholders with high FICO scores willingly rent out their identities for $125 a pop. The online brokers, who charge $800 per repaired account, act as the eBay-style marketplace for buying and selling creditworthiness! This dubious tactic distorts credit scores, blurring the separation between good and bad risks. Faced with more losses, lenders would eventually have to turn away more applicants or raise rates. In 2007, FICO moved to close the loophole, eliminating the "authorized user" characteristic from its scoring formulas. However, this solution harms legitimate types of authorized users such as young adults leveraging their parents' history or one spouse rehabilitating the other's credit.

Piggybacking is but one example of credit repair scams, which will multiply as more knowledge about credit-scoring algorithms becomes available. In the extreme, unscrupulous credit repair services promise to remove negative but accurate items, and some bombard credit bureaus with frivolous disputes in the hope that creditors will fail to respond within thirty days, after which time such disputed items must be temporarily removed as per the law. The trouble with disclosure, with opening up the "black boxes," is that people with bad scores are more likely to be actively looking for errors, and that only negative items on reports will be challenged or corrected. Over time, the good risks will get lower scores than they deserve because they have not bothered to verify their credit reports, while the bad risks will get higher scores than they deserve because only beneficial errors remain on theirs. Consequently, the difference between bad risks and good risks may vanish along with other positives associated with credit scoring. Such an outcome would harm the majority of law-abiding, creditworthy Americans. There is a real danger that overly aggressive consumer protection efforts will backfire and kill the goose that lays the golden egg.

Much alarming rhetoric has also been spewed over discriminatory practices allegedly hidden in credit-scoring technology. Both sides acknowledge that the underprivileged have a lower average credit score than the general population. But income disparity is the economic reality which no amount of research will erase. Credit-scoring models, which typically do not use race, gender, or income characteristics, merely reflect the correlation that poorer people are less likely to have the means to pay back loans. Simple rules of old would have turned down the entire class; that was why credit cards were originally toys for the rich. Credit-scoring models, by virtue of complexity, actually approve some portion of underprivileged applicants. Recall that in the past, certain lenders rejected all painters and plumbers, but computers today accept some of them on account of other positive traits. The statistics bear out this point: from 1989 to 2004, households earning $30,000 or less were able to boost borrowing by 247 percent. Many studies have shown that access to credit has expanded in all socioeconomic strata since credit scoring began. Going backward would only reverse this favorable trend.

In their dogged campaign against credit scoring, consumer groups have achieved mixed results to date. For example, Representative Wolens's bid to ban credit scoring by insurers in Texas was defeated. Similar legislative drives failed in Missouri, Nevada, New York, Oregon, and West Virginia. The strategy of attacking the statistical foundation directly has proven unwise, given how solid the science is. Credit-scoring technology is inarguably superior to the old way of underwriting by handcrafted rules of thumb. Having been deployed at scale, it gains affirmation with each passing day.

In this chapter, we have seen two innovations of statistics that have made a tremendous, positive impact on our lives: epidemiology and credit scoring. The breed of statisticians known as the modelers has taken center stage. A model is an attempt to describe the unknowable by using that which is known: In disease detection, the model describes the path of infection (for all cases, including the unreported), based on interview responses, historical patterns, and biological evidence. In credit scoring, the model describes the chance of loan default based on personal traits and historical performance.

These two examples represent two modes of statistical modeling; both can work magic, if done carefully. Epidemiology is an application in which finding cause is the only meaningful objective. We can all agree that some biological or chemical mechanism probably exists to cause disease. Taking brash action based on correlations alone might result in entire industries being wiped out while the disease continues to spread. Credit scoring, by contrast, relies on correlations and nothing more. It is implausible that something as variable as human behavior can be attributed to simple causes; modelers specializing in stock market investment and consumer behavior have also learned similar lessons. Statisticians in these fields have instead relied on accumulated learning from the past.

The standard statistics book grinds to a halt when it comes to the topic of correlation versus causation. As readers, we may feel as if the authors have taken us along for the ride! After having plodded through the mathematics of regression modeling, we reach a section that screams, "Correlation is not causation!" and, "Beware of spurious correlations!" over and over. The bottom line, the writers tell us, is that almost nothing we have studied can prove causation; their motley techniques measure only correlation. The greatest statistician of his generation, Sir Ronald Fisher, famously scoffed at Hill's technique to link cigarette smoking and lung cancer; he offered that the discovery of a gene that predisposes people to both smoking and cancer would discredit such a link. (This gene has never been found.) In this book, I leave philosophy to the academics (they have been debating the issue for decades). I do not deny that this is a fundamental question. But this is not the way statistics is practiced. Causation is not the only worthwhile goal, and models based on correlations can be very successful. The performance of credit-scoring models has been so uniformly spectacular that one industry after another has fallen in love with them.

George Box, one of our most preeminent industrial statisticians, has observed, "All models are wrong but some are useful." Put bluntly, this means even the best statistical model cannot perfectly represent the real world. Unlike theoretical physicists, who seek universal truths, applied statisticians want to be judged by their impact on society. Box's statement has become a motto to aspiring modelers. They do not care to beat some imaginary perfect system; all they want is to create something better than the status quo. They understand the virtue of being (less) wrong. FICO's scoring technology has no doubt improved upon handcrafted rules of thumb. The assortment of modern techniques like case-control studies and DNA fingerprint matching advanced the field of epidemiology.

Despite being cut from the same cloth, modelers in these two fields face divergent reception from consumer advocacy groups. Generally speaking, these groups support the work of disease detectives but deeply distrust credit-scoring models. However, epidemiologists face the more daunting task of establishing causation with less data and time, and in that sense, their models are more prone to error. It is clear that a better grasp of the cost and benefit of product recalls will further consumer interest more than yet another study on causality in credit scoring. In the meantime, take heart that modelers are looking out for our health and wealth.

3.

Item Bank / Risk Pool
The Dilemma of Being Together

I can define it, but I can't recognize it when I see it.

-LLOYD BOND, EDUCATION SCHOLAR

Millionaires living in mansions on the water are being subsidized by grandmothers on fixed incomes in trailer parks.

-BOB HARTWIG, INSURANCE INDUSTRY ECONOMIST

The late CEO of Golden Rule Insurance, J. Patrick Rooney, made his name in the 1990s as the father of the health savings account. For this political cause, he spent $2.2 million of his own fortune and partnered with conservative icon Newt Gingrich. For years, he was a generous donor to the Republican Party and a prominent voice within. In 1996, he ran for governor of Indiana. In business, he was equally astute, turning Indianapolis-based Golden Rule into one of the largest marketers of individual health insurance. When Rooney sold his company in 2003 to the industry giant UnitedHealth Group for half a billion dollars, he delivered a windfall of about $100,000 to every Golden Rule employee.

Even more than his commercial success, it was Rooney's extracurricular activities that kept him in the news. He was a maverick before mavericks became fashionable in the Republican set. Given his politics, Rooney was an unlikely defender of civil rights. In the mid-1970s, he noticed that all his insurance agents in Chicago were white-no wonder the firm had struggled to make inroads into the city's key black neighborhoods. He argued persuasively that the licensing examination unfairly disqualified blacks. He sued the developer of the disputed test, Educational Testing Service (ETS), which is better recognized as the company that administers the SAT to millions of high school students in the United States. At the time, test developers adhered to a "don't ask, don't tell" policy when it came to fair testing. The subsequent "Golden Rule settlement" between ETS and Rooney pioneered scientific techniques for screening out unfair test questions, defined as those for which white examinees outperformed black examinees by a sizable margin. But statisticians were none too happy about this apparently sensible rule, and the president of ETS publicly regretted his settlement. Let's examine why.

Even those who did not share Rooney's politics found him endearing. He possessed a flair for effortlessly linking his social causes, his profiteering self-interest, and his Christian duty. He made his name as the father of health savings accounts (HSAs), which provide tax benefits to encourage participants to set aside money for medical expenses. Right before Congress authorized HSAs in 2003, a watershed event that owed much to his political savvy and expensive lobbying, Rooney sold his family insurance business to industry giant UnitedHealth Group for $500 million and then created a new business called Medical Savings Insurance (MSI), becoming one of the first in the nation to sell HSAs. Rooney proclaimed, "I am doing the right thing, and I think the Lord will be pleased about it," satisfied that "when I die, I would like God to welcome me." When asked why MSI made it a habit to underpay hospitals that served his customers, he contended, "We're trying to help the people that are not able to help themselves." In his mind, he was leading the good fight against "shameful" practices of "sinful" executives. The same motives, as much selfless as self-serving, drove him to file the lawsuit against the Illinois Department of Insurance and the Educational Testing Service.

In October 1975, Illinois had launched a new licensing examination for insurance agents, developed by ETS. In short order, it emerged that the passing rate of the new test was merely 31 percent, less than half that of the previous version. In Chicago, one of Rooney's regional managers worried about the supply of black insurance agents needed to reach the 1.2 million blacks in the Windy City. Rooney knew Chicago was a key market for Golden Rule Insurance, so when he got wind of this information, he once again seized the role of a social-justice champion: he charged that "the new test was for all practical purposes excluding blacks entirely from the occupation of insurance agent."

The Illinois Department of Insurance tried to ward off the lawsuit by twice revamping the licensing examination, bringing the passing rate up above 70 percent. But Rooney was a tenacious opponent, and he had an alluring argument. The overall passing rate obscured an unsightly gap between black and white examinees. In the revamped exam, the passing rate of blacks rose in tandem with that of whites, leaving the gap intact. Finally, in 1984, the two sides acceded to the so-called Golden Rule settlement, which required ETS to conduct scientific analysis on tests to ensure fairness.

However one felt about the commercial motivation behind Rooney's lawsuit, it was clear that his confrontational advocacy stimulated some serious rethinking about fair testing. The implications extended well beyond the insurance industry. ETS spearheaded the bulk of this new research, which was to make its strongest impact on college and graduate school admissions tests, as these, after all, supplied the nonprofit test developer and administrator its biggest source of revenue.

As admissions to U.S. colleges grow ever more competitive, parents become ever more invested in how their kids perform on admissions tests like the SAT. In Tia O'Brien's neighborhood in Marin County, the lush region just north of San Francisco Bay, boomer parents with college-age children exercise plain old good manners: "Less-than-near-perfect test scores should not be discussed in polite company." Bound for college in 2008, O'Brien's daughter was a member of the biggest class ever to finish high school in California, which has the largest population as well as the most extensive and prestigious public-university system in America. Everywhere she looked, O'Brien saw insanity: anxious parents hired "SWAT teams of test-prep experts," who promised to boost test scores; they paid "stagers" to make over their kids' images; their counselors established "grade-point targets" for various colleges; and "summer-experience advisors" planned activities "every week of the summer," such as community service projects, trips abroad to study languages, and Advanced Placement classes.

These type A parents inadvertently have been nudging the United States out of isolation from the rest of the world. In some European countries and especially in Asia, the shortage of university spots coupled with a long-standing reliance on standardized tests has produced generations of obsessed parents who strive as hard at micromanaging their kids' lives as they do at fulfilling unrealistic expectations. Previously, the United States was an island of serenity in the wild, wide world of university entrance derbies. These days, that is no longer the case.

In Marin, discussing test scores in public is taboo, though we know most students there take the SAT (as do other students in California, where only one in ten college applicants submit scores from the ACT, the other recognized college admissions examination). First administered by ETS in 1926, the SAT was taken by 1.5 million college-bound seniors in 2007; many of them took the test more than once.

The SAT measures a nebulous quantity known as "academic potential"-let's just call it ability-and its advocates point to the strong, proven correlation between SAT scores and eventual college GPA as the technical justification for the exam. Since 2005, each SAT test contains ten sections, three each of reading, mathematics, and writing, plus one "experimental" section. The first section of every test involves essay writing, while the other nine appear in random order. Students are permitted about four hours in which to complete the exam. The sixty-seven items in the reading sections (formerly called verbal) are split between sentence completion (nineteen items) and reading passages (forty-eight). All reading items use the multiple-choice format, requiring students to select the correct answer from five choices. Antonyms and analogies were discontinued in 1994 and 2005, respectively. The three mathematics sections (formerly called quantitative) together contain forty-four multiple-choice items plus ten "grid-in" items requiring direct response, which replaced quantitative comparison items in 2005. Besides the essay, the remaining writing sections consist of multiple-choice items focused on grammar.

Even though the format of each SAT section is fixed, any two students can see different sets of questions, even if they are seated next to each other in the same test center at the same time. This unusual feature acts as a safeguard against cheating but also strikes some of us as an unfair arrangement. What if one version of the test contains more difficult items? Would not one of the two students be disadvantaged? Here is where the "experimental" section comes to the rescue. This special section can measure reading, mathematics, or writing ability and is indistinguishable from the other corresponding sections of the test, but none of its questions count toward the student's total score. This experimental section should properly be considered the playground for psychometricians, statisticians specializing in education. In creating tests, they lift specific items from the scored sections of one version and place them into the experimental section of another. These shared items form a common basis upon which to judge the relative difficulty of the two versions and then to adjust scores as needed.
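
A toy sketch of how those shared items can anchor the adjustment, assuming the crudest possible method (mean equating on percent-correct scores); ETS's actual equating procedures are far more sophisticated, so treat this only as a picture of the logic, with invented numbers.

```python
def form_b_adjustment(anchor_a, anchor_b, scored_a, scored_b):
    """Estimate how much harder form B was, using the shared anchor items.

    anchor_a, anchor_b: average percent correct on the SAME anchor items
        for takers of form A and form B.
    scored_a, scored_b: average percent correct on each form's own scored items.
    """
    ability_gap = anchor_a - anchor_b       # group A's edge in ability (same items)
    raw_gap = scored_a - scored_b           # group A's edge on the scored items
    difficulty_gap = raw_gap - ability_gap  # the part attributable to the form itself
    return difficulty_gap                   # add this to form B takers' percent correct

# Example: form A's takers beat form B's by 3 points on the anchors (a stronger
# group) but by 8 points on the scored items, so form B was about 5 points harder.
print(form_b_adjustment(anchor_a=70, anchor_b=67, scored_a=65, scored_b=57))  # 5
```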

Statisticians have many other tricks up their sleeves, including how to make tests fair, a topic we shall examine in detail.

Asking whether a test item is fair is asking whether it presents the same level of difficulty to comparable groups of examinees. An item is deemed more difficult if a lesser proportion of test takers answers it correctly. Conversely, an easier item has a higher rate of correct answers. Statisticians make tests fair by identifying and removing questions that favor one group over others-say, whites over minorities or males over females. The cure is as simple as it gets. So why did it take almost ten years for Golden Rule and ETS to agree on an operating procedure to address Rooney's complaint about unfair testing? How can something that sounds so simple be so difficult to put into practice?

To explore this issue, let's inspect a set of sample test items, all of which were long-ago candidates for inclusion in SAT verbal sections. The first four are analogy items, and the other two are sentence completion items. See if you can figure out which items proved more difficult for the college-bound students.

1. PLAIT:HAIR
A. knead:bread
B. weave:yarn
C. cut:cloth
D. fold:paper
E. frame:picture

2. TROUPE:DANCERS
A. flock:birds
B. ferry:passengers
C. barn:horses
D. dealership:cars
E. highway:trucks

3. MONEY:WALLET
A. rifle:trigger
B. dart:spear
C. arrow:quiver
D. golf:course
E. football:goalpost

4. DYE:FABRIC
A. thinner:stain
B. oil:skin
C. paint:wood
D. fuel:engine
E. ink:pen

5. In the past the general had been ________ for his emphasis on defensive strategies, but he was ________ when doctrines emphasizing aggression were discredited.
A. criticized . . . discharged
B. parodied . . . ostracized
C. supported . . . disappointed
D. spurned . . . vindicated
E. praised . . . disregarded

6. In order to ________ the health hazard caused by an increased pigeon population, officials have added to the area's number of peregrine falcons, natural ________ of pigeons.
A. reduce . . . allies
B. promote . . . rivals
C. starve . . . prey
D. counter . . . protectors
E. lessen . . . predators

Since there were five choices for each question, if all examinees guessed randomly, we would still expect 20 percent to be lucky. Actual test results ranked the items from hardest to easiest as follows:

Item 5: 17 percent correct (hardest)
Item 1: 47 percent correct
Item 3: 59 percent correct
Items 2 and 6: 73 percent correct
Item 4: 80 percent correct (easiest)

Notice that the sentence completion item about war strategy (item 5) tripped up so many that the overall correct rate of 17 percent barely departed from the guessing rate. At the other extreme, 80 percent of test takers answered the DYE:FABRIC analogy (item 4) correctly. Comparatively, the MONEY:WALLET analogy (item 3) turned out to be markedly more difficult.

How well did this ranking match your intuition? If there were a few surprises, you would not be alone. Even seasoned analysts have learned that predicting the difficulty level of a test item is easier said than done. No longer teenagers, they are unable to think like teenagers. Moreover, what we just estimated was the overall difficulty for the average test taker; what about which questions put minorities at a disadvantage? In the six sample items, three items were unfair. Can you spot them? (The offending items will be disclosed later.) It should be obvious by now how hopeless it is to ask human minds to ferret out unfair test questions. At least most of us get to be teenagers once, but alas, a white man would never experience the world of a black female. This was what the eminent scholar of education Lloyd Bond meant when he parodied Justice Potter Stewart's celebrated anti-definition of the obscene: "I could not define it, but I know it when I see it." Bond experiences unfairness as something he can define (mathematically), but when he sees it, he cannot recognize it. When statisticians try their hand at spot-the-odd-item exercises as you just did, they too concede defeat. Their solution is to pretest new items in an experimental section before including them in real exams; they forgo subjective judgment, allowing actual test scores to reveal unfair items. So even though performance in the experimental SAT section does not affect anyone's test score directly, it has a profound effect on what items will appear in future versions of the test. Because test takers do not know which section is experimental, test developers can assume they exert the same effort on the experimental section as on other sections.

Constructing SAT test forms, assembling the test items, is a massive undertaking well concealed from the casual test taker. Dozens of statisticians at ETS fuss over minute details of a test form, because decades of experience have taught them that the design of the test itself may undesirably affect scores. They know that changing the order of questions can alter scores, all else being equal, and so can replacing a single word in an item, shuffling the answer choices, or using specialist language. Therefore, much care goes into selecting and arranging test items. In any year, hundreds of new questions enter the item bank. It takes at least eighteen months for a new item to find its way onto a real test. Each item must pass six to eight reviews, with about 30 percent failing to survive.

There are people whose full-time job is to write new SAT questions. The typical item writer is middle-aged, a former teacher or school administrator, and from the middle class. If anything, the writers are extremely dedicated to their work. Chancey Jones, the retired executive director of test development at ETS, fondly recalled one of his earliest experiences with the company: "[My mentor] told me to keep a pad right by the shower. Sure enough, I came up with items when taking a shower, and I wrote them down right away. I still keep Post-Its everywhere. You never know."

One of Jones's key responsibilities was to ensure that all SAT questions were fair; in particular, no one should be disadvantaged merely by the way a test item was written or presented. Before the 1980s, the statisticians steadfastly abided by a "don't ask, don't tell" brand of ethic. Test developers believed that, because they wore race blinders, their handiwork was unaffected by the racial dimension and, ipso facto, was fair to all racial groups. Transmitting a total trust in their own creation, they viewed the SAT score as a pure measure of ability, so a differential between two scores was interpreted as a difference in ability between two examinees and nothing more. It did not occur to them to ask whether a score differential could result from unfair test items. The statisticians presumed they did no harm; they certainly meant no harm.

Our set of sample questions had already passed several preliminary filters for validity before it got screened for fairness. The items were neither too easy nor too difficult; if everyone-or no one-knew the right answer, that question had nothing to say about the difference in ability between students. Also pruned were motley items offensive to ETS, such as elitist words (regatta, polo), legal terms (subpoena, tort), religion-specific words, regionalisms (hoagie, submarine), and words about farms, machinery, and vehicles (thresher, torque, strut), plus any mention of abortion, contraception, hunting, witchcraft, and the like, all deemed "controversial, inflammatory, offensive or upsetting" to students.

Rooney did to standardized testing what he would later do to hospital bills. He rounded up an elephant of a problem and dropped it in the center of the room. The gap between blacks and whites in test scores had been evident for as long as score statistics have been published. It was not something new that emerged after Rooney filed his lawsuit. Harvard professor Daniel Koretz, in his book Measuring Up, acknowledges, "The difference has been large in every credible study of representative groups of school-age students." Koretz further estimated that, in the best case, the average black student scored below 75 percent of white students. According to ETS, the average SAT scores for blacks and whites, respectively, in 2006 were 434 and 527 in reading, and 429 and 536 in mathematics.

How the racial gap in scores should be interpreted is an extremely challenging and contentious matter for all concerned. Conceptually, group differences in test scores may arise from unequal ability, unfair test construction, or some combination of both. Prior to the Golden Rule settlement, the psychometrics profession was convinced that its "don't ask, don't tell" policy produced fair tests, so score differentials could largely equate to unequal ability. Educators generally agreed that African-Americans had inferior access to high-caliber educational resources, such as well-funded schools, top-ranked teachers, small classes, effective curricula, and state-of-the-art facilities, a condition that guaranteed a handicap in ability, which in turn begat the racial gap in test scores. Rooney and his supporters rejected this line of logic, arguing that score differentials were artifacts of unfair tests, which systematically underestimated the true ability of black examinees. In all likelihood, both extreme views had it wrong, and both factors contributed in part to the racial gap. This debate would not be settled until someone figured out how to untangle the two competing factors.

At the time Patrick Rooney sued ETS in 1976, the vetting of test questions for fairness was practiced as a craft, informally and without standards or documentation. The Golden Rule settlement became the first attempt to formalize the fairness review process. Further, it mandated explicit consideration of race in the development of tests. In a break from the past, ETS agreed to collect demographic data on test takers and to issue regular reports on the comparative scores of different groups. There followed a period of spectacular advance in scientific techniques to measure fairness. By 1989, test developers at ETS had broadly adopted the technical approach known as differential item functioning (DIF) analysis to augment the traditional judgmental process.

Brokered in 1984 between Rooney and Greg Anrig, the former president of ETS, the Golden Rule settlement imposed two chief conditions for validity on every test item: the overall correct rate must exceed 40 percent, and the correct rate for blacks must fall within 15 percentage points of that for whites. Thus, if 60 percent of whites answered a question correctly, then at least 45 percent of blacks must also have answered it correctly for the item to qualify. These new rules were originally developed for the Illinois insurance licensing exam, but several states began to explore applications in educational and other testing.
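
Expressed as a filter, the settlement's two conditions amount to something like the following sketch (rates as fractions; this is only the Golden Rule screening rule described above, not the DIF procedure ETS adopted later):

```python
def passes_golden_rule(overall_correct, white_correct, black_correct):
    """Apply the two Golden Rule settlement conditions to a single test item."""
    not_too_hard = overall_correct > 0.40                     # overall correct rate above 40 percent
    gap_acceptable = (white_correct - black_correct) <= 0.15  # black rate within 15 points of white rate
    return not_too_hard and gap_acceptable

# The example from the text: 60 percent of whites correct requires at least
# 45 percent of blacks correct for the item to qualify.
print(passes_golden_rule(overall_correct=0.55, white_correct=0.60, black_correct=0.45))  # True
print(passes_golden_rule(overall_correct=0.55, white_correct=0.60, black_correct=0.40))  # False
```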

Within three years, however, Anrig was to concede publicly that the Golden Rule settlement had been a "mistake." Why did this about-face occur?

Researchers reported that the new scientific review would have cast doubt on 70 percent of the verbal items on past SAT tests they reexamined. While inspecting many of the offending items that appeared to favor whites, test developers were baffled when they tried to identify what about them could have disadvantaged blacks. In practice, the Golden Rule procedure produced many false alarms: statisticians feared that hosts of perfectly fair questions would be arbitrarily rejected. The settlement significantly expanded the ability to identify potentially unfair test items, but it offered no help in explaining why they were unfair.

Consider sample item 3. Suppose substantially fewer boys than girls got the right answer (denoted by the asterisk). What might account for this difference in correct rates?

3. MONEY:WALLET
A. rifle:trigger
B. dart:spear
C. arrow:quiver (*)
D. golf:course
E. football:goalpost

Perhaps boys, being generally more active, gravitated to choices that mentioned popular sports like golf and football. Perhaps more girls fell for Robin Hood-style folklore, where they encountered a word like quiver. Ten people would likely come up with ten narratives.

Lloyd Bond, the education scholar, frowned upon this type of judgmental review. He once relayed an enlightening story in which he and his graduate student developed elaborate reasons why certain test items favored one group of examinees, only to discover later that they had accidentally flipped the direction of the preference, and were further embarrassed when they had to reverse their previous points to support the now-reversed position. What if item 3 actually favored boys rather than girls? What might account for this difference in correct rates? Perhaps boys, being more active, grasped the relationship between golf and course and between football and goalpost, so they were less affected by those distracters. Perhaps fewer girls were familiar with military words like quiver and rifle. The trouble is that our fertile imagination frequently leads us astray. (By the way, girls underperformed boys by 20 percent on item 3 in real tests.) If reasonable people could not ascertain the source of unfairness even after a test item showed a difference between groups, there was no reasonable basis on which to accuse test developers of malpractice. The problem of false alarms demonstrated that some group differences were not caused by test developers but by differential ability, elevating the need to untangle the two factors. Henceforth, the mere existence of a racial gap should not automatically implicate item writers in the creation of unfair tests. While the initial foray into the scientific method turned out bad science, it nevertheless produced some good data, paving the way to rapid technical progress. By 1987, Anrig could turn his back on the Golden Rule procedure because the team at ETS had achieved the breakthrough needed to unravel the two factors.

Simply put, the key insight was to compare like with like. The statisticians learned not to carelessly lump together examinees with varying levels of ability. Before computing correct rates, they now match students with similar ability. High-ability whites should be compared with high-ability blacks, and low-ability whites with low-ability blacks. A test item is said to favor whites only if black examinees tend to have greater difficulty with it than whites of comparable ability. The blame can be safely assigned to test developers when two groups with like ability perform differently on the same test item, since the matching process has made moot any gap in ability. In this way, a light touch of statistical analysis unbundled the two intertwined factors.
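
The "compare like with like" step can be sketched as follows: stratify examinees by an ability measure (here, total test score), then compute the black-white gap in correct rates within each stratum. This captures only the matching idea; the procedure ETS adopted wraps it in a formal DIF statistic, and the score bins and field names below are invented for illustration.

```python
from collections import defaultdict

def within_ability_gaps(examinees, bins=((200, 400), (400, 500), (500, 600), (600, 801))):
    """For one test item, compute the white-minus-black gap in correct rates
    separately within each ability stratum.

    examinees: iterable of dicts with keys 'group' ('white' or 'black'),
        'ability' (e.g., total test score), and 'correct' (True/False).
    """
    counts = defaultdict(lambda: {"white": [0, 0], "black": [0, 0]})  # [correct, total]
    for person in examinees:
        for lo, hi in bins:
            if lo <= person["ability"] < hi:
                cell = counts[(lo, hi)][person["group"]]
                cell[0] += int(person["correct"])
                cell[1] += 1
                break

    gaps = {}
    for stratum, groups in sorted(counts.items()):
        if groups["white"][1] and groups["black"][1]:
            white_rate = groups["white"][0] / groups["white"][1]
            black_rate = groups["black"][0] / groups["black"][1]
            gaps[stratum] = white_rate - black_rate  # > 0 suggests the item favors whites
    return gaps

# A gap near zero in every stratum suggests a fair item; a persistent gap across
# strata suggests the item itself, not ability, is driving the difference.
```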

In hindsight, statisticians added just three extra words ("of comparable ability") to the Golden Rule settlement and made a world of difference. Anrig realized belatedly that the flawed Golden Rule procedure incorporated the hidden and untenable assumption that white and black examinees were identical, and thus comparable, except for the color of their skin. In reality, the two groups should not be compared directly, because blacks were overrepresented among lower-ability students, and whites overrepresented among higher-ability students. As a result, the correct rate of white examinees leaned toward that of high-ability students, while the correct rate of black examinees leaned toward that of low-ability students. The differential mix of ability levels by itself produced a group difference in correct rates; the gap would have appeared even if high-ability blacks performed just as well as high-ability whites, and low-ability blacks just as well as low-ability whites.

It was this important breakthrough, known as DIF analysis, that finally made the scientific review of test fairness practical. Today, statisticians use it to flag a reasonable number of suspicious items for further review. A question that "shows DIF" is one in which a certain group of examinees-say, boys-performs worse than another group of like ability. Of course, explicating the source of unfairness remains as slippery as ever. In addressing this task, two test developers at ETS, Edward Curley and Alicia Schmitt, took advantage of the "experimental" SAT sections to test variations of verbal questions previously shown to be unfair. How good were their theories for why certain groups performed worse than others? Could they neutralize a bad item by removing the source of unfairness?

Our list of sample test items provided a few clues. They were, in fact, extracted from Curley and Schmitt's research. Results from real SAT tests indicated that items 1, 3, and 5 showed DIF, while the three even-numbered items did not. (Did these match your intuition?) First, consider the DYE:FABRIC a.n.a.logy (item 4). Eighty percent of all students answered this question correctly, every racial group performed just as well as whites, and girls did just as well as boys.

4. DYE:FABRIC
A. thinner:stain
B. oil:skin
C. paint:wood (*)
D. fuel:engine
E. ink:pen

A variant of this item (4-b), with the words paint and stain interchanged, also presented minimal difficulty, with an overall correct rate of 86 percent.

4-b. DYE:FABRIC
A. thinner:paint
B. oil:skin
C. stain:wood (*)
D. fuel:engine
E. ink:pen

However, this variant was found to favor whites over every other racial group with comparable ability by 11 to 15 percent, an alarming differential. This result confirmed the hypothesis that the secondary meaning of the word stain could flummox nonwhite test takers. If the goal of the question was to assess recognition of word relationships rather than knowledge of vocabulary, then item 4 would be vastly preferred to item 4-b.

Next, look at item 5, which all examinees found to be very hard (17 percent correct):

5. In the past the general had been ________ for his emphasis on defensive strategies, but he was ________ when doctrines emphasizing aggression were discredited.
A. criticized . . . discharged
B. parodied . . . ostracized
C. supported . . . disappointed
D. spurned . . . vindicated (*)
E. praised . . . disregarded

Remarkably, all racial groups performed as well as whites of comparable ability, but girls appeared disadvantaged against boys of like ability by 11 percent. Researchers believed that unfairness arose from the association with conflict and tried switching the context to economics (item 5-b):

5-b. Heretofore ________ for her emphasis on conservation, the economist was ________ when doctrines emphasizing consumption were discredited.
A. criticized . . . discharged
B. parodied . . . ostracized
C. supported . . . disappointed
D. spurned . . . vindicated (*)
E. praised . . . disregarded

With this change, the group difference shrank to 5 percent. Recall that this 5 percent was calculated after matching ability: in practice, this meant Curley and Schmitt rebalanced the mix of ability levels of boys to mimic those of girls.
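
What "rebalancing the mix of ability levels" amounts to is direct standardization: take each group's correct rate within each ability stratum, then average both groups over the same ability mix, so neither group is helped by having more high scorers. A sketch with invented numbers (not Curley and Schmitt's data):

```python
# Correct rates on one item, within three ability strata (invented numbers).
boys_rate = {"low": 0.10, "middle": 0.18, "high": 0.30}
girls_rate = {"low": 0.08, "middle": 0.13, "high": 0.24}

# Share of girls in each ability stratum (invented); both groups are weighted by this mix.
girls_mix = {"low": 0.30, "middle": 0.50, "high": 0.20}

boys_standardized = sum(girls_mix[s] * boys_rate[s] for s in girls_mix)
girls_standardized = sum(girls_mix[s] * girls_rate[s] for s in girls_mix)

# The ability-matched gap on this item, in percentage points.
print(round(100 * (boys_standardized - girls_standardized), 1))  # about 4 points
```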

Item 1 showed an unusual racial DIF: the proportion of blacks answering correctly exceeded that of like-ability whites by an eye-popping 21 percent.

1. PLAIT:HAIR
A. knead:bread
B. weave:yarn (*)
C. cut:cloth
D. fold:paper
E. frame:picture

In hindsight, it appeared that the word plait might be more common in the African-American community. Researchers tried using braid instead (item 1-b):

1-b. BRAID:HAIR
A. knead:bread
B. weave:yarn (*)
C. cut:cloth
D. fold:paper
E. frame:picture

The group difference disappeared, and, not surprisingly, the item became much easier: the overall correct rate jumped from 47 percent to 80 percent. The item writers took pains to point out that the question passed judgmental reviews before DIF analysis, once again showing how hard it was to identify questions that might favor certain groups without real test data. That the advantage, in this case, accrued to the minority group raised yet another practical challenge: should this type of item be removed from the test? Because test developers at ETS regard DIF in either direction as "invalid," their standard procedure calls for its removal.

Through iterations of testing and learning, Curley and Schmitt validated some but not all of the theories used to explain why items showed DIF; they also demonstrated that items could indeed be edited in ways that disadvantaged one group relative to another.

While the United States is home to world-class test developers, American parents have only recently begun to express world-class angst over test scores. What might the future bring? We can do no worse than observing the Asians. In Hong Kong, some test-prep tutors have achieved pop-star status, their headshots plastered on gigantic billboards in the city center. Lecturers quit their university jobs, opting to "teach to the test" for better pay and, dare one say, greater respect. In Tokyo, kyōiku mamas (education-obsessed mothers) prepare care packages for their kids to carry to the university entrance examinations. Every January, Nestle makes a killing because Kit Kat, its chocolate wafer bar, sounds like kitto katsu, which means "sure win" in Japanese. It might not be long until Dr. Octopus, the comic character, emerges as a brand of study aid, since it sounds like okuto pasu, or "if you put this on your desk, you will pass."

To figure out whether an SAT question favors whites over blacks, it would seem natural to compare the correct rate of whites with that of blacks. If the gap in performance is too wide, one would flag the item as unfair. In the Golden Rule settlement, instigated by Rooney, the acceptable gap was capped at 15 percent.

Remarkably, the statisticians at ETS disowned such an approach to the problem. In fact, they saw the Golden Rule procedure as a false start, merely a discredited precursor to a new science for screening out unfair test items. When it comes to group differences, statisticians always start with the question of whether or not to aggregate groups. The ETS staff noticed that the Golden Rule mandate required lumping together examinees into racial groups regardless of ability, which stamped out the diversity of ability levels among test takers, a key factor that could give rise to score differentials. They achieved a breakthrough by treating high-ability students and low-ability students as distinct groups. The procedure of matching ability between black and white examinees served to create groups with like ability, so any difference in correct rates signaled unfairness in the design of the test question. This so-called DIF analysis addressed the unwelcome reality that black students were already disadvantaged by inferior educational opportunities, which meant they were already suffering lower test scores, even without injustice traced to test construction.
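To see why the ETS statisticians refused to pool everyone, consider a toy calculation with invented counts. When one group's examinees cluster at lower ability levels, the pooled correct rates can differ sharply even on an item that behaves identically within every ability stratum; a Golden Rule-style comparison would flag such an item, while a matched comparison would not. (ETS's production procedure rests on more refined statistics, such as the Mantel-Haenszel method, but the stratify-then-compare logic is the same.) The following is only a sketch under these invented numbers:

```python
# Toy illustration (invented counts) of pooled vs. ability-matched comparisons.
# counts[group][stratum] = (number correct, number of examinees)
counts = {
    "white": {"low": (120, 400), "high": (540, 600)},
    "black": {"low": (210, 700), "high": (270, 300)},
}

def pooled_rate(group):
    correct = sum(c for c, n in counts[group].values())
    total = sum(n for c, n in counts[group].values())
    return correct / total

# Naive, pooled comparison (Golden Rule style): mixes the ability strata
naive_gap = pooled_rate("white") - pooled_rate("black")
print(f"Pooled gap (white - black): {naive_gap:+.1%}")

# Matched comparison (DIF style): compare within each ability stratum
for stratum in ("low", "high"):
    rates = {g: counts[g][stratum][0] / counts[g][stratum][1] for g in counts}
    print(f"{stratum}-ability gap: {rates['white'] - rates['black']:+.1%}")
```

With these numbers, the pooled gap is 18 percent, comfortably beyond the Golden Rule's 15 percent cap, yet within each ability stratum the two groups answer the item equally well; the apparent unfairness is entirely an artifact of how ability is distributed across the groups.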

In the course of creating the new techniques, test developers realized how hopeless it was to use their intuition to pick out unfair questions, and they smartly let actual test results guide their decisions. By placing new items in experimental test sections, they could observe directly how students performed under real testing conditions, eliminating the need to guess how an item would behave. Pinpointing the source of unfair treatment is sometimes still elusive, but items that show positive or negative DIF are typically dropped whether or not an explanation can be proffered.

The issue of group differences is fundamental to statistical thinking. The heart of the matter concerns which groups should be aggregated and which should not. In analyzing standardized test scores, the statisticians strayed from what might be regarded as the natural course of comparing racial groups in aggregate. They stressed that all black examinees should not be considered as one, and neither should white examinees, because of the wide range in ability levels. (However, if the mix of ability had been similar across races, the statisticians would have chosen to lump them together.) In general, this dilemma of whether to lump groups together or keep them apart awaits anyone investigating group differences. We will next encounter it in Florida, where the hurricane insurance industry finally woke up to this problem after repeatedly losing tens of billions of dollars in single years.

Like J. Patrick Rooney, Bill Poe Sr. was an accomplished entrepreneur who built and ran an insurance business eventually worth many millions of dollars, and like Rooney, he spent his personal wealth pursuing public causes.

Poe enlivened the airwaves in 1996 when he poured $1 million of his own money into opposing public funding of a new stadium for the Tampa Bay Buccaneers football team. He filed a lawsuit against all involved parties that went up to the Florida Supreme Court, where he lost the decision. A native of Tampa, Poe was heavily involved in local politics, including a stint as mayor. In his professional life, he made a career of selling and underwriting property insurance. His first business venture was a runaway success, rapidly becoming the largest insurance brokerage in Florida. When he retired, he sold his ownership stake for around $40 million. By 2005, he made the news yet again, but in a less illustrious manner. His newest insurance venture, Poe Financial, buckled after a string of eight hurricanes battered the Florida coast in the course of two years. While Poe's customers vilified him, many of his peers sympathized, convinced that it was the breakdown of the disaster insurance market that had precipitated Poe's downfall. Indeed, as Poe had prospered in the prior decade, other insurers, particularly the national giants such as State Farm, were plotting their exits from the Sunshine State.

These developments made it increasingly certain that the average Florida resident would, willingly or not, subsidize those who choose to live near the vulnerable coast. We will now examine why.