Why ChatGPT-4’s Score on the Bar Exam May Not Be So Impressive

By David Alexander

April 16, 2024


One of the most celebrated moments of the artificial intelligence revolution was when ChatGPT’s developer OpenAI announced that the chatbot scored in the 90th percentile on the bar exam and completed the test in just six minutes.

But was ChatGPT-4’s stellar performance exaggerated? Eric Martinez, a doctoral student in MIT’s brain and cognitive sciences department, thinks so.

Martinez spoke at a recent New York State Bar Association continuing legal education course that explored whether GPT-4’s bar exam performance is indicative of its competence as a lawyer. Luca CM Melchionna, chair of the Tech and Venture Law Committee of the New York State Bar Association’s Business Law Section, joined Martinez for the 90-minute discussion.

Martinez said one of the reasons he is downplaying the achievement is the bar exam’s grading method, which rewards test takers who pass regardless of their score, reducing the incentive for prospective lawyers to maximize their grades.

“If I can draw the analogy of if you’re trying to pass a fitness test for the military and you need to get a 7-minute mile, you might not train very hard to get a faster time than that,” he said. “That doesn’t mean that you’re not capable of a much faster time, it might be more efficient to use resources elsewhere.”

Martinez also pointed out that ChatGPT-4’s jump from the 10th to the 90th percentile on the bar exam over its predecessor, ChatGPT-3.5, far exceeded its gains on related exams such as the LSAT, where GPT-4’s score rose by 40 percentage points.

He also said that having GPT-4’s score measured against the February test takers gave it an unfair advantage. That’s because prospective lawyers who sit for the February exam are mostly those who failed in July and repeat test takers tend to score lower than first-timers.

“It seems the most accurate comparison would be against first-time test takers or to the extent that you think that the percentile should reflect GPT-4’s performance as compared to an actual lawyer, then the most accurate comparison would be to those who pass the exam,” said Martinez.

GPT-4 scored closer to the 60th percentile when compared against first-time test takers in both July and February, and that figure dropped to around the 40th percentile when only the exam’s essay portion was scored, according to Martinez’s paper on the matter.

“Although the leap from GPT-3.5 was undoubtedly impressive and very much worthy of attention, the fact that GPT-4 particularly struggled on essay writing compared to practicing lawyers indicates that large language models, at least on their own, struggle on tasks that more closely resemble what a lawyer does on a daily basis,” said Martinez.

There is also the difficulty of assessing the capabilities of an AI system compared with those of a practicing attorney. The Uniform Bar Exam is not based on the laws of any particular jurisdiction, so performance on it may not translate into an understanding of specific laws.

The bar exam results aside, Melchionna posed perhaps the most important question lawyers have about AI. Can it be an instrument to facilitate attorneys’ professional lives or replace the way in which they work today?

Martinez said that, in the short-term, AI can streamline research and cite cases if attorneys verify that those cases exist, but it is not as clear that it will be proficient at drafting documents or providing clients with guidance.

“I do think that it is really important, though, and hopefully as researchers we can figure out how to help our lawyer counterparts to figure out which types of AI models are appropriate and when,” he said.

A member of the audience asked whether, as AI use widens, people might generate fewer new ideas, depriving the technology’s models of more innovative input.

“If it just becomes sort of like a self-fulfilling loop, then ironically, we’ll see less innovation at some point in terms of legal reasoning and then everything will just look a certain way,” said Martinez. “I think that’s a fear, especially under the current paradigm, where AI models are just approximating kind of the average in some way. And so, if we’re just getting average output all the time without new input, then I think that is a concern.”

The program was sponsored by the Committee on Continuing Legal Education, the Technology and Venture Law Committee, the Task Force on Artificial Intelligence, and the Business Law Section.

The program is available to register for on demand.
