(for IPLC Discovery-Day 2024-October)
on this page...
- slides
- notes
- approach (contains code-links)
- presentation text
- text and summarization example
Slides: https://docs.google.com/presentation/d/1EBd6ACzT8dAoN1MjhsXcNJwKSX1ebO8S4Irzu6cvVVo/
-
The Brown Digital Repository (BDR) automatically displays info in the MODS <mods:abstract> element.
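As a small aside, a sketch (not the BDR's actual ingest code) of where such a generated summary would land; the mods:abstract element and namespace are standard MODS, the rest is illustrative:

```python
# Build a minimal MODS record with an abstract holding the generated summary.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

mods = ET.Element(f"{{{MODS_NS}}}mods")
abstract = ET.SubElement(mods, f"{{{MODS_NS}}}abstract")
abstract.text = "The text describes ... (auto-summarization of extracted-text)"
print(ET.tostring(mods, encoding="unicode"))
```
-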
This offers possibilities for "titles" and "sub-titles" using the same technique with minor prompt-tweaks.
-
Concerns:
- How to avoid implying Library 'agreement' with statements in the summarization?
- Note "(auto-summarization of extracted-text)" added at end of detail summarization.
- Note that although the intial sentences almost invaribly begin with text like "The text describes" (which is good), sometimes the following sentence or two does not. Follow up quality-control code could check this and force a regeneration.
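A minimal sketch of that kind of quality-control check -- the function names are hypothetical, and `summarize` is assumed to wrap whatever model call produces the summary:

```python
# Hypothetical quality-control check: regenerate until the opening
# sentences keep the detached "The text describes ..." framing, then
# append the "(auto-summarization of extracted-text)" note.
import re

def opens_detached(summary: str, n_sentences: int = 2) -> bool:
    """True if the first few sentences keep a neutral, third-person framing."""
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())[:n_sentences]
    return bool(sentences) and all(
        s.startswith(("The text", "The document", "It ")) for s in sentences
    )

def summarize_with_qc(text: str, summarize, max_attempts: int = 3) -> str:
    """`summarize` wraps whatever model call produces the summary."""
    candidate = summarize(text)
    for _ in range(max_attempts - 1):
        if opens_detached(candidate):
            break
        candidate = summarize(text)  # force a regeneration
    return candidate + " (auto-summarization of extracted-text)"
```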
-
In testing, even fliers yielded reasonably good summaries.
-
This summarization-work opens the possibility of combining all OCR-ed text for an organization, and running a summarizer on that.
- Note that a common technique for that is to iteratively summarize parts of the text -- and then summarize the summaries (sketched after this list). I experimented with this, with good results.
- Note that models are rapidly improving and allowing a greater number of tokens to be used at a time. (Tokens are not words, but can roughly be thought of as words.)
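A minimal sketch of that iterative summarize-the-parts-then-summarize-the-summaries technique, assuming a `summarize(text)` function wrapping the model call, and using a word count as a rough stand-in for the token limit:

```python
# Iterative "map-reduce" style summarization: summarize chunks, then
# summarize the concatenated chunk-summaries, recursing until the text
# fits in one call. Assumes each summary is shorter than its input.
def chunk_words(text: str, size: int = 1000):
    words = text.split()
    for i in range(0, len(words), size):
        yield " ".join(words[i : i + size])

def summarize_long(text: str, summarize, size: int = 1000) -> str:
    if len(text.split()) <= size:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunk_words(text, size)]
    return summarize_long(" ".join(partials), summarize, size)
```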
-
I did a bunch of googling and found that most summarization seemed to involve extracting actual-text segments that were representative of the overall document. This is true even of newer code using large-language-models trained to be good at extraction, and even of some experimental work.
-
I didn't want that -- I instead wanted what we'd think of as summaries.
-
I knew ChatGPT could do a fantastic job on this, and then remembered work I had done some six months ago on getting an open-source chat-oriented large-language-model running, for experimentation.
-
I got that code running again, following a video-tutorial, and built on it to experiment with using chat for summarization.
- Interesting: when I tried that code six months ago, it worked pretty smoothly. But now, that model is old (the link 404ed). I had a hard time finding it, and the libraries that worked with it then no longer do (I had to downgrade to older versions). Shows how fast things are changing!
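Not the actual code from that tutorial, but a sketch of the general shape, using the llama-cpp-python library; the model path is a placeholder for whatever local chat-tuned GGUF model is available:

```python
# Sketch of local chat-model summarization via llama-cpp-python.
# The model file is a placeholder; any chat-tuned GGUF model would do.
from llama_cpp import Llama

llm = Llama(model_path="models/some-chat-model.gguf", n_ctx=4096)

def summarize(text: str) -> str:
    response = llm.create_chat_completion(
        messages=[
            {"role": "system",
             "content": "Summarize the user's OCR-ed text in two or three "
                        "sentences, beginning with 'The text describes'."},
            {"role": "user", "content": text},
        ],
        temperature=0.3,
    )
    return response["choices"][0]["message"]["content"].strip()
```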
-
Though I did get the summary-of-summaries approach working, for the demo I switched back to a simpler approach of just summarizing the first 1,000 words.
- (Strictly speaking, the limit is in tokens, not words -- but words are a workable proxy.)
- For the demo this produced good results, since everything is a single-page scan and the full text usually fit within that limit anyway. (A sketch of the truncation follows this list.)
- I keep hearing that newer models are better, faster, and handle larger numbers of tokens -- so for the Hall-Hoag project I may not use the summary-of-summaries approach, except to experiment with organization-as-a-whole summarization.
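The truncation itself is trivial; a sketch, with `summarize` as in the earlier sketch:

```python
def summarize_first_chunk(text: str, summarize, n_words: int = 1000) -> str:
    """Summarize only the first n_words words (a rough stand-in for the token limit)."""
    return summarize(" ".join(text.split()[:n_words]))
```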
-
Note the prompt-experimentation for the description-text -- and for subtitle-text.
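Illustrative prompts only (not the actual prompts from the talk) -- the point is that minor tweaks to the same prompt yield description-text versus subtitle-text:

```python
# Hypothetical prompt variants; `llm` is the llama-cpp-python instance
# from the earlier sketch.
DESCRIPTION_PROMPT = (
    "Summarize the following OCR-ed text in two or three sentences, "
    "beginning with 'The text describes'. Be neutral and factual."
)
SUBTITLE_PROMPT = (
    "Write a short, neutral subtitle of ten words or fewer "
    "for the following OCR-ed text."
)

def generate(system_prompt: str, text: str, llm) -> str:
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        temperature=0.3,
    )
    return response["choices"][0]["message"]["content"].strip()
```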
-
Note also meta experience of using a large-language-model to explain the knobs (parameters) for working with a large-language-model. :)
Greetings all,
I'd like to give a short summary of an experiment using a large-language-model to summarize OCR-ed text to facilitate browsing-discovery and possibly search-discovery in our Brown Digital Repository (the BDR).
The reason for this experiment...
We have an unusual project in which we're ingesting into the BDR a collection of three-quarters of a million individual page-scans, grouped only by organization. A scan may be a one-page flyer; or a part of a four-page letter; or a part of a ten-page article -- with no metadata indicating that any of the pages are connected to one another.
We've OCR-ed the scans, and eventually will make the full-text searchable. But we're not indexing the full-text as we normally would because this is a collection of sometimes-offensive extremist literature. (We have indexing ideas.)
So as a researcher is browsing through the pages of the collection, a brief summary of the contents of the individual page would be very useful.
I was dissatisfied with initial experiments using natural-language-processing packages to extract summaries. They yielded supposedly "representative" excerpts of the full-text -- but I wanted actual summaries.
I kept thinking "ChatGPT could do this" -- and remembered that some six months before, I had experimented with running a large-language-model locally -- not for summaries, but just to play with. So I revisited that code to evaluate its summarizations.
As with all chat-type large-language-models, the "prompt" (the instructions you give the chat-tool) makes all the difference. Once I refined the prompt, the results were very good.
We haven't implemented summarizations for this project for two reasons:
- It can take up to a minute for some of the summarizations, and though that work could happen concurrently, our initial grant-driven focus has been on implementing a high-speed ingestion-pipeline.
- More importantly, we would need to do much more careful testing, and implement additional quality-control checks, given the sensitive nature of this collection. It's one thing for source-materials to potentially contain offensive language; it's very much another for library metadata to unnecessarily reproduce offensive language.
Over time we'll be improving this and other Brown Digital Repository collections. Though we haven't decided how to proceed with summarizations for this particular collection, large-language-model summarization results show real promise for improving browse-discovery and search-discovery.
-
The text is a critical analysis of the actions and attitudes of Dean Acheson, former Secretary of State, during his tenure in office. It highlights the contrast between Acheson's self-proclaimed role as an impartial coroner and his apparent efforts to justify the US's actions in Korea and Formosa, which ultimately led to the Communist takeover of these areas. The text also mentions the abandonment of Asia to the Communists and the scolding... (auto-summarization of extracted-text)
This inquest on China is not the work of a serene and detached coroner but of a vitally interested party in the catastrophe. Unfortunately, at this point one of these vital interests is self-justification, which certainly is the enemy of objective analysis.

In 1948 we had begun withdrawing our army from South Korea, though it was known that for years the Soviet training schools in Siberia had been training and arming the North Koreans. The Korean National Assembly, on November 20, 1948, begged that we remain. A secret memorandum to the State Department from Lattimore proposed that Korea should be allowed, as he subsequently stated most succinctly in the fellow-traveling Daily Compass, “to fall without making it look as if the United States had pushed it.” Acheson announced that no policy decisions could be made until “after the dust has settled.”

In the Spring of 1949 the State Department recommended that economic aid only, to the extent of $150 million, be extended to South Korea. Congress demanded that military aid be included; the provision for military aid was included; but no military aid was shipped.

On December 23, 1949, Acheson instructed all diplomatic and consular personnel that Formosa had no strategic value. But in May, 1951, under oath, Acheson stressed Formosa's “strategic importance.”

On January 5, 1950, the British extended recognition to Red China with the full knowledge and approval of our State Department. (Statement by Beverly Baxter, Member of Parliament.) On this same day President Truman declared: “The United States government will not provide military aid or advice to Chinese forces on Formosa.”

On January 12, 1950, in a speech to the National Press Club in Washington, Acheson discussed our defensive policy for Asia. “Our defensive perimeter,” he said, “runs along the Aleutians to Japan and then goes to the Ryukyus.” This policy abandoned Asia to the Communists and invited them to seize both South Korea and Formosa, since these areas lay outside the announced perimeter.

In this same speech Acheson jeered at Senator Robert A. Taft’s proposal that we send a fleet to protect Formosa against Red invasion. He said he didn’t realize that Mr. Taft was a military expert.

On June 10, 1950, speaking in St. Louis, Truman said: “We are closer to world peace now than at any time in the last three years.”

On June 25, 1950, the Communists invaded South Korea, and three days later American troops began their “police action” on Korean soil.

On June 27, 1951, Acheson stated we would be satisfied if the Communists withdrew behind the 38th Parallel. On August 2, 1951, he declared that the 38th Parallel “as a line of demarcation” was “unacceptable.”

On September 13, 1951, President Truman said that Dean Acheson will be Secretary of State “as long as I am President of the United States.”

THAT’S THE RECORD of Dean Gooderham Acheson as a fighter for freedom. Only one additional entry need be made.

On March 5, 1952, Acheson scolded Washington reporters for inquiring why he had single-handedly “cleared” Oliver Edmund Clubb, whom the State Department’s own board had judged a security risk. “In the future,” he announced, “I hope you will understand that always the responsibility for these decisions must rest with me.”

During the period of Acheson’s stewardship at the State Department, more than six hundred million human beings have been dragged into Soviet slavery. Soviet power has mushroomed until it now threatens the existence of freedom everywhere on earth. It therefore seems fair to ask this question: have the actions and attitudes of Dean Acheson aided the cause of freedom in its struggle against Soviet slavery, or have they impeded it?

On November 4, 1952, the American people will have the opportunity to answer this question.

TO OUR READERS: Reprints of this article are available at our offices, 11 East 36th Street, New York 16, N. Y., at the following rates:

10 copies    $ 1.00
25 copies      2.00
50 copies      3.00
100 copies     5.50
1000 copies   45.00

Minimum order accepted, ten copies.
(end)