The Experiment Failed: AI Isn’t Yet Ready to Be a Good Document Summarizer

Have you worried about whether you are missing a trick in incorporating AI into your business? I recently did an experiment that revealed AI is not yet ready for prime time for a key function: legal document summarization. What I learned applies to summarizing any important document.

I worked with a computer science Ph.D. candidate to try to develop a case-law summarizer. Ultimately, the project didn’t produce useful summaries, but the failure taught me a lot about the current state of AI.

We started with TextRank. It’s a graph-based ranking algorithm, a cousin of Google’s PageRank, that treats the pieces of a document, such as its sentences, as nodes in a network and scores how important each one is. Its summaries frequently repeated the same information, sometimes several times, and often didn’t state the case holding accurately enough to be useful.
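For readers who want to peek under the hood, here is a minimal sketch of the TextRank idea in Python, simplified for illustration and not the exact code we used; it assumes the networkx library. Each sentence becomes a node in a graph, word overlap between sentences becomes edge weights, and a PageRank-style score picks the winners.

    # Minimal TextRank sketch: rank sentences with PageRank over a
    # word-overlap graph. Simplified for illustration.
    import re
    import networkx as nx

    def textrank_summary(text: str, num_sentences: int = 3) -> str:
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = [set(re.findall(r"\w+", s.lower())) for s in sentences]

        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                overlap = len(words[i] & words[j])
                if overlap:
                    graph.add_edge(i, j, weight=overlap)

        scores = nx.pagerank(graph, weight="weight")
        top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
        # Reassemble the winning sentences in their original order.
        return " ".join(sentences[i] for i in sorted(top))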

We then used ChatGPT, at first standing alone, and then having it process the output of TextRank. ChatGPT usually eliminated the repetition and produced a reasonably accurate summary, but it boiled things down too much. Also, we couldn’t stop it from infusing its summary of a case with outside commentary absorbed from its training data.
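The second stage looked roughly like this sketch, which assumes OpenAI’s Python library (version 1 or later) and an API key in the environment; the prompt wording is illustrative, not our actual prompt.

    # Sketch of the second stage: have GPT-4 clean up TextRank's output.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def polish_summary(extracted_sentences: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "You summarize court opinions for attorneys. "
                            "Remove repetition but preserve the holding."},
                {"role": "user", "content": extracted_sentences},
            ],
        )
        return response.choices[0].message.content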

In the end, neither TextRank nor ChatGPT, separately or in combination, produced a summary with sufficient detail and accuracy.

I asked my technical consultant why it didn’t work out. He said each approach has significant limitations.

A large language model (LLM), such as the one behind ChatGPT, is autocomplete on steroids. It is trained on a massive volume of text to predict the most likely next word after a string of words, and it needs an enormous volume of data to do a reasonably good job.
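You can watch this autocomplete behavior directly with a small open-source model such as GPT-2, a far smaller cousin of GPT-4; this sketch assumes the transformers and torch Python packages.

    # Show the five words GPT-2 considers most likely to come next.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    prompt = "The court held that the defendant"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Probabilities for the single next token, given everything before it.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, 5)
    for prob, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.3f}")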

Currently, the best publicly available LLM is ChatGPT running GPT-4. But ChatGPT is not trained only on legal documents, so its output when summarizing a legal case will be heavily and negatively influenced by the non-legal writing in its training materials.

Also, one step of LLM training, called reinforcement learning from human feedback, has humans rate the quality of the model’s output so it can learn what good output looks like. That human rating wasn’t done solely by knowledgeable attorneys seeking to produce the kind of output attorneys want.

Worse yet, an LLM has a low limit, called its context window, on how much text it can take in one bite. If you want anything bigger summarized, you have to break it into parts and summarize each part.
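The usual workaround looks something like this sketch, which splits a document at paragraph breaks so each chunk fits under the limit; it counts characters for simplicity, though real systems count tokens, and each chunk would then be summarized separately (with the partial summaries summarized once more).

    # Split a long document into chunks small enough for the model.
    def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
        paragraphs = text.split("\n\n")
        chunks: list[str] = []
        current = ""
        for para in paragraphs:
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)  # a lone oversized paragraph passes through
        return chunks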

Instead of using an LLM, you can run an extractive summarization program, such as TextRank, an approach that usually involves applying a dictionary of important terms. Here, a “dictionary” is not a literal English-language dictionary of all English words but, instead, a database of the key terms most important to the reader wanting the summary.

That leads to a problem: the dictionary must be tuned to the particular issues that concern the audience for the summary. That vocabulary list will change with the document’s subject matter and with what you want to summarize from it. Ultimately, either the dictionary will be too narrow for general use or, if it’s really broad, it won’t be tuned to the aspects of the document that matter most to the reader.
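In code, the dictionary approach looks something like this sketch; the term list is a tiny, hypothetical legal “dictionary” with invented weights, and a real one would be far larger and would need rebuilding for every practice area.

    # Score sentences by the domain terms they contain; keep the top few.
    import re

    LEGAL_TERMS = {  # hypothetical terms and weights, for illustration only
        "holding": 3.0, "affirmed": 2.0, "reversed": 2.0,
        "injunction": 2.0, "infringement": 2.0, "likelihood of confusion": 3.0,
    }

    def score_sentence(sentence: str) -> float:
        lowered = sentence.lower()
        return sum(w for term, w in LEGAL_TERMS.items() if term in lowered)

    def dictionary_summary(text: str, num_sentences: int = 3) -> str:
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        ranked = set(sorted(sentences, key=score_sentence,
                            reverse=True)[:num_sentences])
        # Keep the winning sentences in their original order.
        return " ".join(s for s in sentences if s in ranked)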

It’s possible companies are offering specialized generative AI products or summarizers that might do a better job. Still, technical limits make it unlikely that either an LLM or a summarizer will do a good job soon.

Regarding LLMs, the “large” in large language model remains the sticking point. LLMs need a massive amount of data to perform well. If you limit the training data to high-quality, relevant documents, there likely would be too little of it for the system to do a good job. You could compensate by having people with subject-matter expertise provide a lot of feedback to the LLM, but that can make producing the LLM extraordinarily expensive.

And this presumes the LLM could access all the needed training data at low or no cost, to make it economical. Some important information sits in paywalled databases, which may be inaccessible or cost-prohibitive to access.

Also, there is a trade-off between cutting repetition from the summary and missing important details in what is summarized. This applies to generative AIs such as ChatGPT and summarizers such as TextRank.

The repetition in the output is usually caused by the source document addressing the same issue several times. This is a tough problem to crack because sometimes that repetition is mere verbosity, which could be distilled to a single statement, while in other cases each instance adds important new details or nuance that the summary should keep.
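To see why, consider a naive de-duplicator like this sketch, which drops any sentence too similar to one already kept. The similarity threshold is the crux: set it low and you lose near-repeats that carry new detail; set it high and mere verbosity survives.

    # Drop sentences too similar to ones already kept.
    from difflib import SequenceMatcher

    def deduplicate(sentences: list[str], threshold: float = 0.8) -> list[str]:
        kept: list[str] = []
        for sentence in sentences:
            if all(SequenceMatcher(None, sentence, k).ratio() < threshold
                   for k in kept):
                kept.append(sentence)
        return kept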

Overall, we need further technological advancement to get a good summarizer tool for important documents.

Our experiment sounds like a failure, but it’s informative: AI isn’t yet ready to be a great case-law summarizer. Perhaps we’ll get there in a few years. For now, at least I learned that I’m not missing a trick.

Written on August 16, 2023

by John B. Farmer

© 2023 Leading-Edge Law Group, PLC. All rights reserved.