Machine Learning

The Essential Guide to Effectively Summarizing Massive Documents, Part 2

In the previous article, we tackled one of the main challenges in document summarization, i.e., handling documents that are too large for a single API request. We also explored the pitfalls of the infamous ‘Lost in the Middle’ problem and demonstrated how clustering techniques like K-means can help structure and manage the information chunks effectively.

We divided the GitLab Employee Handbook into chunks and used an embedding model to convert those chunks of text into numerical representations called vectors.

Now, in the long overdue (sorry!) Part 2, we will get to the meaty (no offense, vegetarians) stuff, playing with the new clusters we created. With our clusters in place, we will focus on refining summaries so that no critical context is lost. This article will guide you through the next steps to transform raw clusters into actionable and coherent summaries. Hence, improving current Generative AI (GenAI) workflows to handle even the most demanding document summarization tasks!


A quick technical refresher

Okay, class! I am going to concisely go over the technical steps we have taken so far in our solution approach:

  1. Files required
    A big document, in our case, we are using the GitLab Employee Handbook, which can be downloaded here.
  2. Tools required: 
    a. Programming Language: Python
    b. Packages: LangChain, LangChain Community, OpenAI, Matplotlib, Scikit-learn, NumPy, and Pandas
  3. Steps followed until now:

Textual Preprocessing:

  • Split documents into chunks to limit token usage and retain semantic structure.

Feature Engineering:

  • Utilized OpenAI embedding model to convert document chunks into embedding vectors, retaining semantic and syntactic representation, allowing easier grouping of similar content for LLMs.

Clustering:

  • Applied K-means clustering to the generated embeddings, grouping embeddings sharing similar meanings into groups. This reduced redundancies and ensured accurate summarization.

A quick reminder note, for our experiment, the handbook was split into 1360 chunks; the total token count for those chunks came to 220035 tokens, the embeddings for each of those chunks produced a 1272-dimensional vector, and we finally set an initial count of clusters to 15.
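For intuition, the whole Part 1 pipeline can be sketched end to end in a few lines. This is a toy reconstruction, not the real code: the splitter below is a naive character-based one (the real pipeline uses LangChain’s splitter), random vectors stand in for the OpenAI embeddings, and the placeholder string stands in for the handbook; only the K-means step is the genuine article.

```python
import numpy as np
from sklearn.cluster import KMeans

def chunk_text(text, chunk_size=200):
    """Naive character-based splitter standing in for LangChain's text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def fake_embed(chunks, dim=32, seed=0):
    """Random vectors standing in for the OpenAI embedding model."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(chunks), dim))

document = "..." * 1000          # placeholder for the handbook text
chunks = chunk_text(document)    # step 1: textual preprocessing
embeddings = fake_embed(chunks)  # step 2: feature engineering
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10).fit(embeddings)  # step 3: clustering
print(f"{len(chunks)} chunks grouped into {kmeans.n_clusters} clusters")
```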

Too technical? Think of it this way: you dumped an entire office’s archive on the floor. When you divide the pile of documents into folders, that’s chunking. Embedding would attach a unique “fingerprint” to those folders. And finally, when you compartmentalize those folders into different topics, like financial documents together and policy documents together, that effort is clustering.


Class is resumed…welcome back from the holidays!

Now that we all have had a quick refresher (if it wasn’t detailed enough, you can check Part 1, linked above!), let’s see what we will be doing with those clusters we got. But first, let us look at the clusters themselves.

# Display the labels in a tabular format
import pandas as pd
labels_df = pd.DataFrame(kmeans.labels_, columns=["Cluster_Label"])
labels_df['Cluster_Label'].value_counts()

In layman’s terms, this code is simply counting the number of labels given to each chunk of content. That is all. In other words, the code is asking: “after sorting all the pages into topic piles according to which cluster each page belongs to, how many pages are in each topic pile?” The size of each of these clusters is important to understand, as large clusters indicate broad themes within the document, while small clusters may indicate niche topics or content that is included in the document but that does not appear very often.

Cluster label counts. Redesigned by GPT 5.4

The cluster label counts table above shows the distribution of the embedded text chunks across the 15 clusters formed by the K-means clustering process. Each cluster represents a grouping of semantically similar chunks. From the distribution, we can identify the dominant themes in the document and prioritize summarization efforts for larger clusters while not overlooking smaller or more niche ones. This ensures that we do not lose critical context during the summarization process.
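If you prefer the counts ordered by cluster id rather than by frequency, a small variation of the earlier snippet does it. The toy labels here are hypothetical, standing in for `kmeans.labels_`:

```python
import pandas as pd

# Hypothetical toy labels standing in for kmeans.labels_
toy_labels = [0, 2, 1, 0, 0, 2, 1, 1, 1]

labels_df = pd.DataFrame(toy_labels, columns=["Cluster_Label"])

# value_counts() orders by frequency; sort_index() reorders by cluster id,
# which makes it easier to scan cluster by cluster
counts = labels_df["Cluster_Label"].value_counts().sort_index()
print(counts)
```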


Getting up close and personal

Let’s dive deeper into understanding our clusters, as they are the foundation of what will essentially become our summary. For this, we will generate a few insights about the clusters themselves to understand their quality and distribution.

To perform our analysis, we need to implement what is known as Dimensionality Reduction. This is nothing more than reducing the number of dimensions of our embedding vectors. If the class recalls, we discussed how each vector can have multiple dimensions (values) to describe any given word/sentence, depending on the logic and math the embedding model follows (e.g., [2, 3, 5]). For our model, the produced vectors have a dimensionality of 1272, which is far too many to visualize (because humans can only see in 3 dimensions, i.e., 3D).

It is like trying to make a rough floor plan of a huge warehouse full of boxes organized according to hundreds of subtle characteristics. The plan will not encompass all of the details of the warehouse and its contents, but it can still be immensely useful in identifying which of the boxes tend to be grouped.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from umap import UMAP

chunk_embeddings_array = np.array(chunk_embeddings)

num_clusters = 15
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(chunk_embeddings_array)

silhouette_avg = silhouette_score(chunk_embeddings_array, labels)

umap_model = UMAP(n_components=2, random_state=42)
reduced_data_umap = umap_model.fit_transform(chunk_embeddings_array)

cmap = plt.cm.get_cmap("tab20", num_clusters)

plt.figure(figsize=(12, 8))
for cluster in range(num_clusters):
    points = reduced_data_umap[labels == cluster]
    plt.scatter(
        points[:, 0],
        points[:, 1],
        s=28,
        alpha=0.85,
        color=cmap(cluster),
        label=f"Cluster {cluster}"
    )

plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.title(f"UMAP Scatter Plot of Book Embeddings (Silhouette Score: {silhouette_avg:.3f})")
plt.legend(title="Cluster", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()

The embeddings are first converted into a NumPy array (for processing efficiency). K-means then assigns a cluster label to each chunk, after which we calculate the silhouette score to estimate how well separated the clusters are. Finally, UMAP reduces the 1272-dimensional embeddings to two dimensions so we can plot each chunk as a colored point.

But…what is UMAP?

Imagine you run a giant bookstore and someone hands you a spreadsheet with 1,000 columns describing every book: genre, tone, pacing, sentence length, themes, reviews, vocabulary, and more. Technically, that is a very rich description. Practically, it is impossible to see. UMAP helps by squeezing all of that high-dimensional information down into a 2D or 3D map, while trying to keep similar items near each other. In machine-learning terms, it is a dimensionality-reduction method used for visualization and other kinds of non-linear dimension reduction.

UMAP scatter plot of the handbook embeddings

So what are we actually looking at here? Each dot is a chunk of text from the handbook. Dots with the same color belong to the same cluster. When the same-colored dots bunch together nicely, that suggests the cluster is reasonably coherent. When different colors overlap heavily, that tells us the document topics may bleed into one another, which is honestly not shocking for a real employee handbook that mixes policy, operations, governance, platform details, and all sorts of enterprise life forms.

Some groups in the plot are fairly compact and visually separated, especially those out on the right side. Others overlap in the center like attendees at a networking event who all keep drifting between conversations. That is useful to know. It tells us the clusters are informative, but not magically perfect. And that, in turn, is exactly why we should treat clustering as a practical tool rather than a sacred revelation handed down by the algorithm gods.

But! What’s a Silhouette Score?! And what does 0.056 mean?!

Good question, young Padawan, answer you shall receive below.


Yeah, I’m not convinced by our Clusters yet

Wow, what a tough crowd! But I like that; one must not trust the graphs just because they look good. Let’s dive into the numbers and evaluate these clusters.

from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

calinski_score = calinski_harabasz_score(chunk_embeddings_array, kmeans.labels_)
davies_score = davies_bouldin_score(chunk_embeddings_array, kmeans.labels_)

print(f"Calinski-Harabasz Score: {calinski_score}")
print(f"Davies-Bouldin Score: {davies_score}")

Calinski-Harabasz Score: 25.1835818236621
Davies-Bouldin Score: 3.566234372726926

Silhouette Score: 0.056

This one already appears in the UMAP plot title. I like to explain the silhouette score with a party analogy. Imagine every guest is supposed to stand with their own friend group. A high silhouette score means most people are standing close to their own group and far from everyone else. A low score means people are floating between circles, half-listening to two conversations, and generally causing social ambiguity. Here, 0.056 is low, which tells us the handbook topics overlap quite a bit. That is not ideal, but it is also not disqualifying. Real-world documents are messy, and useful clusters do not have to look like flawless textbook examples.
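To see the party analogy in numbers, here is a small illustrative experiment on synthetic 2D data (not the handbook embeddings): two tight, well-separated groups versus two heavily overlapping ones.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
labels = np.array([0] * 50 + [1] * 50)

# Two tight, well-separated "friend groups"
tight = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

# Two heavily overlapping groups (guests drifting between circles)
loose = np.vstack([
    rng.normal(loc=0.0, scale=2.0, size=(50, 2)),
    rng.normal(loc=0.5, scale=2.0, size=(50, 2)),
])

print(f"Separated:   {silhouette_score(tight, labels):.3f}")   # close to 1
print(f"Overlapping: {silhouette_score(loose, labels):.3f}")   # close to 0
```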

Calinski-Harabasz Score: 25.184 (rounded)

This metric rewards clusters that are internally tight and well separated from each other. Think of a school cafeteria. If each friend group sits close together at its own table and the tables themselves are well spaced out, the cafeteria looks organized. That is the kind of pattern Calinski-Harabasz likes. In our case, the score gives us one more signal that there is some structure in the data, even if it is not perfectly crisp.

Davies-Bouldin Score: 3.566 (rounded)

The last metric measures the degree of overlap between clusters; the lower, the better. Let’s go back to the school cafeteria from the previous example. If each table of students stuck to their own conversation, the din of the room would feel coherent. But if tables were cross-talking with one another, each to a different degree, the room would feel chaotic. There is a catch, though: for documents, especially large ones, it is important to maintain context throughout the text, so some overlap between clusters is expected. Our Davies-Bouldin score tells us there is meaningful overlap between the clusters, though not so much that the separation collapses entirely.
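The same cafeteria intuition can be checked numerically. Again on synthetic data rather than the real embeddings, well-spaced tables should push Calinski-Harabasz up and Davies-Bouldin down:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
labels = np.array([0] * 60 + [1] * 60)

# Well-spaced cafeteria tables
separated = np.vstack([
    rng.normal(0.0, 0.3, size=(60, 2)),
    rng.normal(6.0, 0.3, size=(60, 2)),
])

# Tables talking over each other
overlapping = np.vstack([
    rng.normal(0.0, 2.0, size=(60, 2)),
    rng.normal(1.0, 2.0, size=(60, 2)),
])

# Calinski-Harabasz: higher is better; Davies-Bouldin: lower is better
print(f"CH separated:   {calinski_harabasz_score(separated, labels):.1f}")
print(f"CH overlapping: {calinski_harabasz_score(overlapping, labels):.1f}")
print(f"DB separated:   {davies_bouldin_score(separated, labels):.3f}")
print(f"DB overlapping: {davies_bouldin_score(overlapping, labels):.3f}")
```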

Well, hopefully three metrics, with honest numbers behind them, are enough to convince us to move forward with measured confidence in our clustering technique.


It’s time to represent!

Now that we know the clusters are at least directionally useful, the next question is: how do we summarize them without summarizing all 1360 chunks one by one? The answer is to pick a representative example from each cluster.

# Find the closest embeddings to the centroids

# Create an empty list that will hold your closest points
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):

    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(chunk_embeddings_array - kmeans.cluster_centers_[i], axis=1)

    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)

    # Append that position to your closest indices list
    closest_indices.append(closest_index)

selected_indices = sorted(closest_indices)
selected_indices

Now here is where some mathematical magic happens. We know that each cluster is essentially a group of vectors, and that group has a centre, known in geometry as the centroid. The centroid is essentially the centre point of the cluster. We then measure how far each chunk is from this centroid; this is known as its Euclidean distance. The vector with the smallest Euclidean distance from its centroid is chosen from each cluster. This gives us a set of vectors, one per cluster, that best represent their clusters semantically.

This part works by pulling out the single most telling sheet from every stack of documents, sort of like how one would pick the clearest face in a crowd. Rather than making the LLM go through all the pages, it gets handed just the standout examples at the start. Running this in the notebook gave back these specific chunk positions.

[110, 179, 222, 298, 422, 473, 642, 763, 983, 1037, 1057, 1217, 1221, 1294, 1322]

That means our next summarization stage works with fifteen strategically chosen chunks rather than all 1360. That is a serious reduction in effort without resorting to random guessing.
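As a side note, the per-cluster distance loop above can be collapsed into a single scikit-learn call, `pairwise_distances_argmin_min`. Here it is sketched on toy data, with `X` and `km` as stand-ins for `chunk_embeddings_array` and the fitted `kmeans` model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Toy stand-ins for chunk_embeddings_array and the fitted KMeans model
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(X)

# For each centroid, find the index of the nearest point in one call;
# this mirrors the per-cluster argmin loop used in the article
closest_indices, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
selected_indices = sorted(closest_indices.tolist())
print(selected_indices)
```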


Can we start summarizing the document already?

Okay, yes, I apologize, it’s been a bunch of math-bombing and not much document summarizing. But from here on, in the next few steps, we will focus on generating the most representative summaries for the document.

For each representative chunk per cluster, we plan to summarize each one on its own (since it is text at the end of the day). This is almost akin to a map-reduce style summarization flow where we treat each selected chunk as a local unit, summarize it, and save the result.

from langchain.prompts import PromptTemplate
map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

There is nothing mystical happening here. We are simply telling the model, “Take one chunk at a time and explain it thoroughly.” This is much easier for the model than trying to reason over the entire handbook in one go. It is the difference between asking someone to summarize one chapter they just read versus asking them to summarize a giant manual they only skimmed while boarding a train.

from langchain.chains.summarize import load_summarize_chain
map_chain = load_summarize_chain(llm=llm3,
                             chain_type="stuff",
                             prompt=map_prompt_template)

selected_docs = [splits[doc] for doc in selected_indices]

# Make an empty list to hold your summaries
summary_list = []

# Loop through a range of the length of your selected docs
for i, doc in enumerate(selected_docs):

    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])

    # Append that summary to your list
    summary_list.append(chunk_summary)

    print(f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]}\n")

This block of code wires the prompt into a summarization chain, grabs the 15 representative chunks, and then loops through them one by one. Each chunk is summarized on its own, and the result is appended to a list. In practice, this means we are creating 15 local summaries, each representing one major region of the document.

Output of all 15 summaries. Redesigned by GPT 5.4

The raw notebook outputs can look a bit rough, so I used my trusted GPT 5.4 to make them look good for us! We can see that those representative chunks cover a broad range of the handbook’s main topics: harassment policy, stockholder meeting requirements, compensation committee governance, data team reporting, warehouse design, Airflow operations, Salesforce renewal processes, pricing structures, CEO shadow instructions, pre-sales expectations, demo systems infrastructure, and more. This form of information extraction is exactly what we are aiming for. We are not just getting 15 random pages from the handbook; we are sampling the handbook’s main thematic spread.


Was it all worth it?

We will now ask the LLM to summarize those summaries into one rich overview. But before we proceed and pop the champagne, let’s see whether all the math and multi-summary generation has actually paid off in reducing memory and LLM context load. We take the 15 summaries, join them ad hoc (for now), convert the result back into a document, and count the tokens.

from langchain.schema import Document
summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")

Your total summary has 4219 tokens

Success! This new intermediate document is much smaller than the source. The combined summary weighs in at 4219 tokens, which is a far cry from the original 220035-token beast. We have achieved a 98% reduction in context window token consumption!
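The arithmetic behind that percentage is quick to verify with the token counts reported above:

```python
# Token counts reported earlier in the article
original_tokens = 220035
summary_tokens = 4219

reduction = 1 - summary_tokens / original_tokens
print(f"Context reduction: {reduction:.1%}")  # roughly 98%
```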

This is the kind of optimization that makes an enterprise workflow practical. We did not pretend that the original document is small; we are building a compact proxy for it that still carries the major themes forward.


Singularity

Now we are ready for the final “reduce” step: converging all the summaries we have generated into the final holistic document summary.

combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""

combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

reduce_chain = load_summarize_chain(llm=llm4,
                             chain_type="stuff",
                             prompt=combine_prompt_template,
                             verbose=True # Set this to False to hide the inner workings
                                   )

output = reduce_chain.run([summaries])
print (output)

We start by creating a second summarization prompt and wiring it into a second summarization chain. The intermediate document we created in the previous step is then fed as the input to this chain. In simple terms, first we made the model understand each of the boroughs of NYC, and now we are asking it to understand NYC as a whole using those understandings.

The final output text. Redesigned by GPT 5.4

As we can see, the final output does read well. The information is clear and easy to follow. But here is the slightly awkward part: the report leans much harder into the demo systems and Kubernetes parts of the handbook than into the full spread of topics we saw earlier. This does not mean the whole workflow collapsed and the experiment failed.

The smaller cluster summaries touched governance, pricing, Salesforce, Airflow, Okta, customer engagement, etc. By the time we reached the final combined summary, much of that had thinned out. So yes, the prose got cleaner, but the coverage also got narrower.

Why did this happen? What can we do to improve on this? Let’s look at these questions more in-depth.


Where did we go Right?

Enterprise documents are always messy. The topics within their content overlap, the useful pieces of information can appear anywhere, and sending the whole thing in one shot is too expensive and guarantees inaccuracies.

By clustering the split document chunks, choosing a fairly reliable representative out of those chunks, and then using them to summarize, we got something much more usable than brute forcing the whole handbook through one prompt. The LLM is no longer walking around a minefield blind.

We were able to take a 220035-token handbook and reduce it to a manageable set of representative chunks of text. The preview summaries covered a broad range of relevant themes of the handbook.

The intermediate summary of the chunks shrank the problem again into something the model could actually work with. So even though the reducer butterfingers the last handoff a bit, the results before it show that clustering and representative-chunk selection make this problem far easier to handle in a reliable way.


Where did we go Wrong?

Just as we recognize and acknowledge our strengths, we must also acknowledge our weaknesses. This system is not perfect, and its flaws are evident. The chunk-summary step preserved a diverse range of themes, but the final reduce and summarize step narrowed that diversity. Ironically, this led to a second round of the same problem we were trying to avoid: important information was lost during aggregation, even after it was preserved upstream.

Still, a single representative text chunk can miss nuances from the cluster. Overlapping clusters can blur the topic boundaries. The final synthesized LLM interaction can focus on the strongest or most detailed theme in the batch, as seen in this case. This doesn’t render the workflow useless; it highlights the areas for improvement.

The next round of fixes should include a stronger reduction prompt that requires coverage across major themes, multiple representatives per cluster (or an increased number of clusters), and a final topical-sanity check against the information spread observed in the previews.
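The “multiple representatives per cluster” fix is easy to sketch. The helper below is a hypothetical extension of the earlier centroid-distance idea, shown on random toy vectors instead of the real 1272-dimensional embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def top_k_representatives(embeddings, kmeans, k=3):
    """Return, per cluster, the indices of the k member chunks closest to the centroid."""
    reps = {}
    for c, center in enumerate(kmeans.cluster_centers_):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - center, axis=1)
        reps[c] = members[np.argsort(dists)[:k]].tolist()
    return reps

# Toy demo: random vectors in place of the real embeddings
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 16))
km = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)
print(top_k_representatives(X, km, k=3))
```

Feeding three chunks per cluster into the map step triples the map-stage cost, but it hands the reducer a wider view of each theme, which is exactly the breadth the final summary was missing.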

If this workflow is used in domains where data loss is critical, such as medicine, legal review, or security, then validation of the final output is essential. Additionally, retrieval layers or a human-in-the-loop feedback step may be necessary.

“Useful” doesn’t imply “infallible.” It means we have a scalable system that is good enough to learn from and worth improving.


Class Dismissed, This Time for Real

Part 1 was about surviving the scale problem. Part 2 was about turning that survival strategy into an actual summarization pipeline. We started with 1360 chunks from a 220035-token handbook, grouped them into 15 clusters, visualized their structure, sanity-checked the grouping quality, picked representative chunks, summarized them individually, compressed those summaries into a 4219-token intermediate document, and then generated a final combined summary.

Clustering helps with the scale problem. Representative-chunk selection gives the workflow more structure. But the final summarization prompt still needs tuning for the whole-document coverage. To me, that is the practical value of this experiment. It gives us something useful right now, and it also points pretty clearly to what we should fix next.

So no, this is not a neat little mission accomplished ending. I think that is better, honestly. We now have a summarization pipeline that works well enough to teach us something real: keeping breadth alive in the final aggregation step matters just as much as reducing the document in the first place.

Photo by Wilhelm Gunkel on Unsplash

If you have made it this far, thank you again for reading and for tolerating my classroom metaphors. I hope this helped make large-document summarization feel a little less like it’s all AI magic and a little more buildable.
