How To Diagnose What Ails the New York Law Exam? Call the Exam Doctor
5.19.2025
Above: Paul Sackett, the Beverly and Richard Fink Distinguished Professor of Psychology and Liberal Arts at the University of Minnesota.
In 2016, more than 20 years after the New York bar exam was given a clean bill of health by three independent experts who specialize in stress-testing licensing exams, New York discarded its professionally validated version of the bar exam. It substituted the current New York Law Exam, which has never been subjected to the same intensive professional analysis as its predecessor, nor has it been certified by any expert psychometrician as valid, reliable, and non-discriminatory. Not surprisingly, the NYLE has generated criticism from virtually all quarters. Fortunately, the switch to the NextGen bar exam in July 2028 gives New York sufficient time to retain psychometricians to help it diagnose the problems with the NYLE accurately, identify a cure appropriate to New York’s unique conditions and circumstances, and launch law schools and law students on a treatment regimen that will enhance student success – both in passing the bar and in practicing at the bar – after 2028.
The New York Law Exam Isn’t Doing Well
Since its inception in 2016 as part of the state’s adoption of the Uniform Bar Exam, the NYLE has suffered from a failure to thrive. It has failed as an incentive to law students to study New York law, failed as an inducement to law schools to make teaching New York law a core part of their curriculum, failed as an effective tool for ensuring that new lawyers who appear in our courts are competent to practice New York law, and failed to achieve a reputation among law students, professors, and practitioners as a test that protects the public from lawyer incompetence. The troubling evidence of failure to thrive has been detailed in previous issues of the Bar Journal, which can be accessed at: https://nysba.org/presidents-message-time-for-change-future-attorneys-in-new-york-need-a-rigorous-exam-to-be-better-prepared; https://nysba.org/a-rigorous-new-york-law-exam-nuisance-or-necessity-a-view-from-the-bench/; https://nysba.org/new-yorks-next-bar-exam-where-should-we-go-from-here/.
How To Construct an Effective Diagnostic Process
In January, the New York Court of Appeals empaneled a committee to “study and report to the Court on various options for a robust New York-specific bar eligibility requirement, including the possibility of an in-person New York law component.” The court’s committee has held public hearings seeking comments about, among other things, “the advantages and/or disadvantages of alternatives to the current NYLC and NYLE, including eliminating or replacing an examination requirement.” The court has announced that this September, the committee will deliver a report “reviewing its findings.”[1]
Don’t Neglect the Patient’s History
This isn’t the first time the court has convened a commission that evaluated the New York bar exam for possible unreliability and unfairness. In 1988, the court established the New York State Judicial Commission on Minorities. Chaired by civil rights pioneer Franklin H. Williams, the commission investigated, among other things, the allegation that the New York bar exam was discriminatory, had not been shown to be job-related, and should be abolished. In 1991, the commission issued a report in which it recommended against abolition of the New York bar exam. It explained that the principal argument in favor of retaining an examination of New York law was that “the examination tends to ensure that law schools are not graduating students who lack certain basic skills.”[2] The commission went on to dismiss as fatally flawed the proposed alternatives to the New York bar exam, such as automatic admission to the bar upon receipt of a law school diploma, admission upon satisfactory completion of an apprenticeship, or substitution of skills testing for traditional testing via multiple-choice and essay questions.[3] Instead of amputating the exam entirely from the lawyer licensing process to cure the effects of pass rate disparities, as some critics demanded, the commission recommended that data be gathered so that the exam could be “evaluated for cultural and economic bias and for job relatedness.”[4]
Gather Relevant Data and Apply Careful, Professional Scrutiny
The year after the commission issued its recommendations, the Court of Appeals retained a team of independent professional psychometricians under the direction of Professor Jason Millman of Cornell University, which included Professor Paul Sackett of the University of Minnesota and Professor William Mehrens of Michigan State University. The court asked the Millman team to gather and analyze data regarding the New York law portion of the two-day bar examination in effect in 1992.[5] It gave the team as its “starting point” the fact that “the Bar Examination was a licensure examination and, as such, its purpose is to protect the public against incompetent lawyers.”[6] In 1993, Dr. Millman’s team submitted a nearly 300-page report analyzing reams of statistical data, public hearing testimony, and responses to interviews and surveys from New York practitioners, bar exam graders, and bar admission officials. Applying the tools and techniques endorsed by the psychometrics profession in the Standards for Educational and Psychological Testing, the Millman team concluded that “[t]he validity, reliability, lack of bias, and other aspects of the New York State Bar Examination and its implementation surpass acceptable levels.”[7]
Don’t Experiment With Treatment Fads or Fashions – Call Your Doctor
To New York’s great good fortune, one member of the team that prepared the Millman report is an active member of the psychometrics profession and continues to research, write and advise clients regarding professional school admissions and licensing. In April, we asked Dr. Paul Sackett, the Beverly and Richard Fink Distinguished Professor of Psychology and Liberal Arts at the University of Minnesota, to provide an insider’s view of the 1993 Millman report and offer some guidance to the court as to issues it should consider when evaluating proposed non-exam alternatives to the current NYLE. Our conversation below has been edited and condensed for clarity, continuity, and length.
David Marshall: Could you start off, Professor Sackett, by telling us what a psychometrician does in connection with the evaluation of a professional licensing exam like the bar exam?
Paul Sackett: It turns out that designing a test involves a wide variety of skills. The knee-jerk notion is, oh, I need a test, I’ll go home and make one up tonight and give it tomorrow. Write some questions and we have a test. The field of psychometrics makes clear that we’re so far removed from that for any professional test. For a licensure exam, we start by saying, okay, we’ve got to determine what is the content domain, the body of knowledge and skill that we’re going to use to decide you are or aren’t above a threshold for minimal competence. You can’t measure everything, so we identify the most important knowledge, skills, and abilities. Now we’ve got the content domain for a test.
Next, we’re going to build a test specification. We design a plan that samples from each of those domains to make sure that everything we measure makes sense. We use expert panels to make ratings and linkages between items and the intended content domain. Is this important for practice? Is it essential, is it helpful, or could you do without it?
From there, we move to designing exams, using a set of skills that are very quantitative. We want to make sure that the individual items work as intended. We’ve got processes for identifying and removing items that are not functioning as we hoped they would. And then we use a batch of analytic tools that work at the level of the overall test score. We examine the relationship between all kinds of measurable characteristics of test takers from the typical demographics of race, gender, and ethnicity to background features such as where you went to school, your law school GPA, and your LSAT score. We examine relationships between all these pieces in order to form an informed decision about whether this test makes sense.
DM: I don’t want to put you on the spot with this question, and I hope you’ll answer it without worrying about the tender sensibilities of lawyers, but what can psychometricians do that lawyers can’t do? For example, you said you would convene a panel of lawyers to say what domains are important to be a competent lawyer. Why can’t we leave you psychometricians out entirely and just let that panel of lawyers come up with the questions for the exam, grade the answers, and figure out what the passing score is?
PS: I think it’s simply the fact that we have developed systematic processes for doing these things. Unless you are in the habit of thinking about it, there is a lot that can go wrong. I think there’s tremendous value in working with someone deeply involved in the test development and evaluation process who is aware of possible pitfalls.
DM: There are two concepts used in the articles that you and others in your profession have written about professional licensing testing: validity and reliability. Can you give us a summary of the meaning and application of each of those concepts?
PS: When you give a test and produce a score in the licensing arena, you’re making a claim or drawing an inference that a person above a threshold meets the state’s interest in safe and efficient practice of the profession. So, validity means, do we have evidence that would support that inference? Does the test measure what it’s supposed to measure?
Reliability involves consistency of measurement. Would you get the same score upon retesting? Imagine I give you a set of six essays and we get a score. Now that six is drawn from some broad domain of what could be given. Let’s give candidates six more. How well does the score on one set of six relate to another set of six? That ensures fairness to the individual. So that’s reliability and that’s essential.
You can’t have validity without also having reliability. A key contributor to reliability involves test length: the number of items. Any one item that you give, be it a multiple-choice item, be it an essay, has a lot of what I’ll call “noise” in the answer. Just by chance, that one item might be something you happen to have studied and invested in a lot and somebody else hasn’t. And if you were given the next item, that might be different. But as we accumulate items, there is a fundamental mathematical principle of psychometrics, namely, that noise or error averages out and the truth emerges. Why don’t we just give a three-item bar exam? Because it’s going to be unreliable. We examine how many items we need to reach a level that we can say confidently, “Here’s your score with a small enough plus or minus range around it.”
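The link between test length and reliability described here can be sketched with the Spearman-Brown formula, a standard result in classical test theory. The interview does not name the formula, and the Millman report may have used other methods. The sketch assumes every item is equally reliable and comparable, and the per-item reliability of 0.05 is a hypothetical value chosen only for illustration.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Projected reliability of a test built from n_items comparable items.

    Assumes the items are "parallel" in the classical-test-theory sense:
    each one is equally reliable and measures the same thing.
    """
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Hypothetical per-item reliability, chosen only for illustration.
ITEM_RELIABILITY = 0.05

for n in (3, 50, 200):
    # Reliability climbs toward 1.0 as items accumulate and noise averages out.
    print(f"{n:>3} items -> projected reliability {spearman_brown(ITEM_RELIABILITY, n):.2f}")
```

Under these assumptions, a three-item exam stays far from a dependable score, while a few hundred items bring the projected reliability close to 1.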
DM: Is it fair to say that if you want to make an exam harder and more reliable, just make it longer and that will give you a harder, more reliable exam?
PS: More reliable doesn’t mean harder. If I could give you all possible test items, I would then know your absolute true score. We can’t get that. We’re trying to approximate it with a sample of items. That’s the whole mathematical idea of reliability: how close do you come, from a fixed subset of items, to a truth that would be known if we gave you not a sample but the whole population of all possible test items?
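The distinction between "more reliable" and "harder" can be illustrated with a simple binomial model. This is a sketch under a strong simplifying assumption that is not from the interview: a candidate answers each item correctly, independently, with a probability equal to their true score. The function name and the 0.70 true score are hypothetical.

```python
import math


def score_summary(true_proportion: float, n_items: int) -> tuple[float, float]:
    """Return (expected percent correct, plus-or-minus range in percent).

    Simplifying assumption: each item is answered correctly, independently,
    with probability equal to the candidate's true score.
    """
    expected_pct = 100 * true_proportion
    standard_error_pct = 100 * math.sqrt(
        true_proportion * (1 - true_proportion) / n_items
    )
    return expected_pct, standard_error_pct


# Hypothetical candidate whose true score is 70 percent.
for n in (3, 50, 200):
    expected, spread = score_summary(0.70, n)
    print(f"{n:>3} items: expected {expected:.0f}%, plus or minus about {spread:.1f} points")
```

In this model the expected score is the same at every length, so the longer exam is no harder. Only the plus-or-minus range around the score shrinks.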
DM: Has your profession developed metrics or measurement standards that would allow you to compare the validity and reliability of a pencil and paper test versus the validity and reliability of alternatives such as a supervised period of on-the-job apprenticing or the evaluation by a panel of professors or lawyers of a candidate’s portfolio of simulated work product?
PS: There are different issues involved when you’re giving a standardized, multiple-choice test, which is objectively scored versus when judgment is involved in scoring, as in an essay answer. For the latter, we have to look at the reliability not only of the examinee’s performance, how consistently you are performing from item to item, but also the reliability of the rating or grading. The fundamental idea is that the examinee should be indifferent as to who is assigned to read their essay.
But if the content domain is specified, then issues of reliability and validity can be addressed regardless of the approach taken to assessing a candidate. Overall, validity is an informed judgment of how well a given approach supports the inference we want to draw about whether a candidate merits licensure.
DM: I don’t think the issue of reliability is always well understood by people who are very opinionated about the superiority of having people perform or simulate tasks as a way of measuring their competence.
PS: Work sample testing is appealing and I’m all for it when one’s got the time and resources to do it. It just becomes really hard to do on a large scale and get a quality measure out of it.
There are all kinds of large-scale assessments where you’ve got these large numbers of different raters. The biggest is probably writing exams for high school graduation. People write all these essays, and a big company like Educational Testing Service will employ large numbers of essay graders, but there is extensive quality control and elaborate training in place to make sure that ratings are comparable. It’s not just, oh, let’s get some volunteers and grade some essays.
DM: Has your profession determined that there’s a ranking of the methods for conducting professional licensing assessments, in which at the top, the best, is apprenticing, the next is a portfolio of simulated work, next is a GPA score that you achieve at the end of your professional schooling, and, at the bottom, is a pencil and paper test like the bar exam?
PS: I would say no. To me, there are settings where any of those make good sense and there are settings where any of those is questionable. There’s no universal. I resonate with the notion of being as realistic as possible. The idea of asking people to do actual lawyering tasks sounds good. Work samples appeal because they are realistic. My general principle is that it takes a lot more testing time to get a given unit of information from a work sample or a simulation than from a multiple-choice test. So, if you’re saying, I want to reach a level of reliability where I can say, yep, if you use assessment format A versus format B, you’d get the same score – generally, that takes a lot more time, often more time than people feel they can afford to take. There are settings where I use and love work samples, but you need a lot of time generally to get a reliable measure.
When you bring judges and raters in, you have to make sure that you’re taking into account issues of bias on the part of the evaluators, and ask to what extent you are adding more problems than you’re avoiding.
The work that I’ve seen on work samples in place of bar exams produces some findings that might be counterintuitive. A common belief is that racial and ethnic passing rate differences are due to a flaw in the multiple-choice test. The big California experiment with work samples produced essentially the same group differences on the work sample as on the multiple-choice test. The implicit assumption that it’s worth going through all that effort because we’re going to see differences disappear wasn’t borne out there and often isn’t borne out.
Melinda Saran, University at Buffalo School of Law: The multiple-choice test, and the bar exam as a whole, has an issue of speededness. Unlike the medical profession, the legal profession, except in certain cases, is not a speeded profession. So, does that change a psychometrician’s view of validity when you’re asking someone to do a task very quickly that real practice does not require them to do very quickly?
PS: That’s a great question, and if someone waded through our 280-page report from 1993, you’d see we devoted a lot of time to the speededness issue. And we agree with you. There’s no evidence the exam should be measuring your performance under time pressure. That means to me it’s important to take speededness into account because a fundamental goal of a bar exam is that it should not be substantially speeded.
Virtually all exams have time limits, just for the logistics of how it works. But you try to work it out such that the vast, vast, vast majority of people can give it a solid shot within the time limit that you allot.
DM: I want to drill down a little bit more on that 1993 report you prepared for the court. According to the executive summary, and I quote, “the validity, reliability, lack of bias, and other aspects of the New York State Bar examination and its implementation surpass acceptable levels.”
Can we pick that apart a little bit? One of the things that you folks said you were looking at was, did that 1992 exam, which consisted of a full day of essays and multiple-choice questions, measure legal knowledge and legal reasoning ability? Can you talk about what you found with respect to the validity of the New York bar exam as a test of legal knowledge and as a test of specific skills?
PS: Yeah. We found it very solid as a measure of foundational knowledge and reasoning skills. And that’s great because to us that’s a foundation for other pieces, doing the specific skills that build on that foundation. An exam can’t do everything. Factual investigation and case planning were not included in the 1992 exam. Would it be potentially useful to do so? Absolutely yes. That’s this idea of, can you build in some kind of work sample? Our report ends up saying these are worth investigating as mechanisms that might be added to subsequent exams.
DM: You were also asked to address the question of whether the 1992 exam tested memorization more than it tested legal reasoning and knowledge.
PS: As a general principle, I don’t buy the notion that per se a multiple-choice item is mere memorization. One can craft good items that really get at useful information. You can get into arcane, trivial memorization or you can build into a multiple-choice question a story, a scenario, and build a judgment into the correct answer. It’s something to aspire to.
We looked at the 1992 exam and found that the multiple-choice items sorted into the subset that is more memory oriented and the subset that’s more reasoning oriented. One of the interesting things we found, and to me one of the most interesting things in the whole study, is that if I rank order people on the memory items and I rank order them on the reasoning items, the correspondence is close to perfect.
It really gets at one underlying thing that’s a fundamental principle that says capable people go to law school. They know what’s going to be tested and they set out to do it. The more capable they are, they’re going to show it on all the components of the exam. We don’t see differential performance by subsets.
Now, somebody might say, well, if you get the same score from each piece or component, let’s make our lives easier and get rid of the essays because the essays are time consuming, and we’ve got to train the graders, and we’re going to get the same answer out of the essay piece anyway. That idea doesn’t hold. Because the reason you get correspondence is, as I say, these smart people know they’re going to have to write legal essays. They know they’re going to have to answer multiple choice questions. They invest in all that. If you tell them, no, the exam’s only going to be on this part, not that, odds are good that the set of skills that aren’t included on the exam drop away, aren’t developed.
It’s within that context that I like to tell this story. I was heavily involved in the last revision of the Medical College Admission Test, or MCAT, which, up until that revision, asked how much science, how much biology, and how much chemistry the test taker knew. With everything we know about the human side of medicine, and behavioral medicine and psychological and sociological aspects of health, we added a new subtest to the MCAT looking at behavioral health issues. And, what happened? All these pre-med students who never took psychology and sociology before now are taking those courses. They’re investing. They’re building a new body of knowledge. Is the new MCAT any more valid? No, it’s just as predictive of subsequent medical school performance as it was before, but the fact that you’re testing new subjects meant that people developed a body of skill that’s important in medicine. So, I feel that we really changed how the next generation of doctors is prepared, even if the test itself ends up pretty much in the same place.
DM: I want to turn to the question of disparate impact because that was a large part of what prompted the study that you undertook in 1993. Can you comment on what your findings were with respect to whether the multiple choice and essay questions were valid and reliable, or whether they were actually overly influenced by race or ethnicity or gender or disability?
PS: The general public perception would be that if there’s a difference in passing rates on a test, that defines bias, that a biased test is one on which any two groups that you pick don’t score on average the same. The psychometrics profession doesn’t accept that. We say – and the Supreme Court said the same thing in Griggs v. Duke Power – that a finding of a mean difference, a group difference, a difference in passing rates, should trigger scrutiny. It says this is something we should look at and ask what’s the cause.
There are two possibilities: bias in the test; or the test is measuring what it’s supposed to, but there are differences in preparation or other factors that produced these differences in passing rates. We have a set of techniques, some of which operate at the level of each individual item, that ask, do people with the same overall score, but one is male and one is female, do they have the same probability of getting item one right? Do they have the same probability of getting item two right? We call that differential item functioning. The key thing is not to ask whether there is a difference in the passing rate, in the percentage correct for item one, for men versus women. The key thing is whether men and women with the same overall score have the same chance of getting item one right. That’s our technique for looking at bias at the item level.
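The item-level comparison described here can be sketched with the Mantel-Haenszel common odds ratio, one widely used statistic for differential item functioning. The interview does not say which statistic the Millman team used, so treat this as an illustration only. The group labels, function names, and all counts below are invented for the example.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item, comparing groups within score strata.

    records: iterable of (group, total_score, item_correct) tuples, where
    group is "ref" or "focal" and item_correct is True or False.
    A value near 1.0 means candidates with the same overall score have the
    same odds of answering the item correctly, whichever group they are in.
    """
    # Per stratum: [ref right, ref wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, total_score, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[total_score][index] += 1

    numerator = denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata.values():
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("not enough variation in the data to compute the ratio")
    return numerator / denominator


def raw_rate(records, group):
    """Unadjusted proportion of a group answering the item correctly."""
    answers = [correct for g, _, correct in records if g == group]
    return sum(answers) / len(answers)


def make(group, score, right, wrong):
    """Build invented records for one group at one overall-score level."""
    return [(group, score, True)] * right + [(group, score, False)] * wrong


# Invented data: within each score level both groups succeed at the same
# rate (40% at the low level, 80% at the high level), but the groups are
# spread differently across the two levels.
records = (
    make("ref", 10, 4, 6) + make("focal", 10, 12, 18)
    + make("ref", 20, 24, 6) + make("focal", 20, 8, 2)
)

print("raw rate, ref:  ", raw_rate(records, "ref"))
print("raw rate, focal:", raw_rate(records, "focal"))
print("MH odds ratio:  ", mantel_haenszel_odds_ratio(records))
```

In this invented data set the raw rates differ between the groups, yet the odds ratio is 1.0 once candidates are matched on overall score. That is the pattern of a difference in passing rates on an item without item-level bias.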
At the overall score level, we will say there are all kinds of features we can measure about people in terms of the things that they come to the table with. One of the biggest features that affects overall score level, in terms of being highly related to bar exam performance, is law school GPA. How well did you do in law school? The people who did well in law school, on average, do better on the bar exam. It’s not perfect. If it were, we wouldn’t need to think about a bar exam. Statistically, we’ll say that once one controls for differences in school attended, performance, etc., what happens to racial or ethnic differences is that those essentially fade from sizable to small to negligible. This is what led us to the conclusion in 1993 that racial and ethnic disparity is not an inherent flaw in the test used by New York in 1992. The test measures what it’s supposed to measure. Disparities are caused by differences in background and preparation.
If you start with the notion that the state has an interest in competent practice of law, we ask the question, do people meet that standard? The reason that we said we don’t see bias inherent in the test is that those surface differences disappear when you control for these background factors. I can’t say I’m happy about that. I wish we had a world where those background factors were equated for everybody. Everybody has equal opportunity. Everybody got to go to the same law school. But that’s just not where we are.
I thank goodness that there are retest opportunities. I can’t say I know the legal data as well as, for example, as I know medical data, but you’ll see very comparable findings. What you’re seeing with bar exams isn’t atypical in the context of professional licensure. For all programs requiring post-collegiate study, you’ll find gaps in the initial passing rate, but those gaps disappear with retake. Tests are not a permanent bar to the profession for very, very large numbers of people. But I think some people need more preparation to be able to master the content.
DM: Our Court of Appeals has commissioned a working group to study the options available for a bar admission eligibility requirement, whether it’s an exam or an alternative non-exam pathway. Right now, applicants to the bar in New York take a multiple-choice, open book, at-home exam to test knowledge of New York law. How would you advise the working group to go about the business of evaluating the options and giving the court guidance as to which bar admission requirement is going to achieve the ultimate goal of ensuring that the bar exam and the bar licensing process protect the public from lawyer incompetence?
PS: We start with a specification of what it is that you’re certifying – in this case how do we evaluate if you’re competent. We define competence in terms of whatever it is, knowledge, reasoning, in a certain set of domains. A standardized exam is one that has been developed to make sure that it adequately samples all of that.
The option of an on-the-job evaluation by your boss may not be adequate if your job duties only deal with a small piece of the set of domains that come into play in assessing competence. So, each option would be evaluated against the standard of whether it matches up with what you claim you’re certifying.
For whatever people may think about it, one of the things about a standardized test is that it tries to take strong control away from a variety of biasing factors that are involved when people are making judgments of others. As to this notion that we’ll just have your boss on your first job make a rating and that’ll determine competence, I can’t say I know for sure what will happen, but I’ve got a guess.
I’m going to use as an example my current employer, the University of Minnesota. At every university, professors come up for tenure. After half a dozen years on the job, they get reviewed and the university has to decide, are you allowed to stay or do you have to go? The process always involves getting outside evaluations. So, you write to half a dozen experts in their field and ask them for a letter evaluating this candidate’s work. The state of Minnesota is not alone in this, but it’s one of the states that has a sunshine law that says that those letters would be made available to the candidate being evaluated. As a result, quite a large number of people who get requests for letters respond saying that they have a policy against writing such a letter in a state where it’s going to go to the candidate. Others shrug and say, okay, well, it’s something we have to do, but their letters have limited evidentiary value. It’s really rare to get a letter containing even a hint that that candidate is less than absolutely perfect. The fact is that no one’s going to go on record as saying, I think so-and-so’s scholarship is shoddy and the person shouldn’t be promoted. They worry about their legal liability for doing that, [to] their reputation. So, I worry about those kinds of features when there are stakes for the evaluator. That issue is, of course, not what’s at play in any standardized assessment where the evaluation and evaluators are anonymous and have no vested interest in a candidate.
So, I worry about the issue of motivation. The notion of a competency assessment that says that passing a certain set of courses in law school is sufficient to certify competence is one that I worry about. A given school is competing for students with other schools and may not want a reputation as a place that grades hard and thus keeps people from being licensed. So, I think there’s a series of impediments that have to be evaluated when we’re relying on judgmental evaluations by people potentially with vested interests in an individual student’s success.
DM: I’m going to put you on the spot one more time. I know you’re very busy, but if the New York Court of Appeals said, “Can we retain you to give us 10 hours to develop a strategic plan for how we should be going about our work,” would you be able to do that, building on what you did for the court in 1993?
PS: Can I find a few hours? Yes. Could I come in and redo the 1993 report? No, I would not have the time or availability to do anything of that sort. If I can be of value in a meeting with some group, I can find a few hours for that.
David Marshall co-chairs the Committee on Legal Education and Admission to the Bar. He is an adjunct professor and co-director of the Center for Labor and Employment Law at St. John’s University School of Law. He began his career at the National Labor Relations Board in Washington, D.C., before entering private practice in New York City, where he practiced labor and employment law for nearly four decades.
Endnotes:
[1] Notice to the Bar from Heather Davis, Chief Counsel and Legal Counsel to the State of New York Court of Appeals, Public Hearings: New York State Bar Examination (updated May 2, 2025), https://www.nycourts.gov/ctapps/news/nottobar/nottobar050225.pdf.
[2] Report of the New York State Judicial Commission on Minorities, 19 Fordham Urb. L.J. 181, 266 (1992), https://ir.lawnet.fordham.edu/ulj/vol19/iss2/3.
[3] Id.
[4] Id.
[5] Jason Millman, William Mehrens & Paul Sackett, An Evaluation of the New York State Bar Examination, ES-1 (1993) (unpublished report) (on file in the SUNY/Buffalo Law Library: KFN 5CD76 M55 1993) (“Millman Report”).
[6] Millman Report, p. 1-2.
[7] Millman Report, p. ES-1.





