Beyond Boundaries: Summarizing Entire Books into Enlightening Insights

Akash A Desai
12 min read · Apr 8, 2024


Different methods for summarizing an entire book

Level 1: Map Reduce — Summarizing Multiple Pages

This process passes each document through a language model (LLM) individually and then combines the per-document results using the ReduceDocumentsChain. It covers the same scenarios as the ReduceDocumentsChain alone, but adds an initial LLM call over each document before the reduce step.

A key consideration in building a summarizer is how to incorporate documents into the LLM’s context window. Two common approaches are:

  1. Stuffing: Simply combine all documents into a single prompt. This straightforward method is detailed further in the create_stuff_documents_chain constructor.
  2. Map-reduce: Summarize each document individually in a “map” step, then combine the summaries into a final summary. This approach is implemented by the MapReduceDocumentsChain (see the sketch after this list).
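
As a minimal sketch of the two options (assuming the classic LangChain load_summarize_chain API used throughout this post, and that openai_api_key is already defined):

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

# Option 1: "stuff" concatenates all docs into one prompt (breaks once you exceed the token limit)
stuff_chain = load_summarize_chain(llm=llm, chain_type="stuff")

# Option 2: "map_reduce" summarizes each doc, then summarizes the summaries
map_reduce_chain = load_summarize_chain(llm=llm, chain_type="map_reduce")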


(Diagram of the two strategies. Source: LangChain documentation.)

When summarizing many pages, hitting the model's token limit is a common challenge. Token limits won't always be a problem, but it's essential to know how to address them when they arise. One effective way to manage them is the “Map Reduce” chain type.

This method involves generating summaries for smaller chunks of text that fit within the token limit, followed by summarizing these individual summaries.

from langchain import OpenAI
from langchain.document_loaders import PyPDFLoader

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
BOOK_PATH = "/content/Rich-Dad-Poor-Dad.pdf"

# Load the book and preprocess text (the slice keeps the main body of the book)
loader = PyPDFLoader(BOOK_PATH)
pages = [page.page_content.replace('\t', ' ') for page in loader.load()[26:277]]
book_text_normal = "".join(pages)

llm.get_num_tokens(book_text_normal)
81291

That's too many tokens. To ensure our text fits within the prompt limit, let's split it into smaller chunks. We'll use a chunk size of 8,000 characters, which works out to roughly 2,000 tokens at the usual ~4 characters per token.

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=8000, chunk_overlap=400)

docs = text_splitter.create_documents([book_text_normal])
num_docs = len(docs)

num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")
Now we have 47 documents and the first one has 1997 tokens
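
Before running the chain, it's worth a quick sanity check (a hypothetical addition, not in the original notebook) that every chunk, not just the first, fits under the limit:

# Hypothetical check: token count of the largest chunk
token_counts = [llm.get_num_tokens(doc.page_content) for doc in docs]
print(f"Largest chunk has {max(token_counts)} tokens")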

Now we're ready to proceed. Let's use LangChain's load_summarize_chain for the map-reduce task. But first, let's set up our chain.

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce', verbose=True)

Now run it

output = summary_chain.run(docs)
output:
"Rich Dad Poor Dad" is a book that challenges traditional financial advice
and emphasizes the importance of financial intelligence and building assets.
It shares personal experiences and offers tips for achieving financial
success, while also discussing the importance of adapting to change
and finding one's life's purpose. The author's other books and teachings
aim to change the way people think about money and life.

The initial summary is good, but let's turn it into bullet points.

I'll instruct the model to generate bullet points using custom prompts. I'll keep the map_prompt unchanged and adjust only the combine_prompt.

from langchain import PromptTemplate

map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which cover the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

summary_chain = load_summarize_chain(llm=llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     # verbose=True
                                     )
output = summary_chain.run(docs)
print (output)
- Two boys, Mike and the narrator, try to make money by pouring lead into plaster molds
- Narrator's dad catches them and explains counterfeiting, encourages them to keep trying
- Boys ask Mike's successful dad to teach them, he agrees if they work for him
- Mike starts working for his friend's father, learns valuable lessons about life and money
- Rich dad teaches author about money and importance of learning from life experiences
- Rich dad advises thinking like a Texan, taking risks, and focusing on goals for financial success
- Importance of having a success-focused mindset, positive thinking, and self-discipline
- Tips for success in real estate investing, taking action, and using financial intelligence
- Importance of converting earned income into passive and portfolio income for financial freedom
- Robert Kiyosaki challenges traditional advice and promotes financial education through his books and products
- Importance of being financially savvy, especially for women.
Execution time: 0:00:23.879601
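
The execution-time line above was presumably produced by timing the call; a minimal sketch (my assumption, using Python's standard time and datetime modules):

import time
import datetime

start_time = time.time()
output = summary_chain.run(docs)
print(f"Execution time: {datetime.timedelta(seconds=time.time() - start_time)}")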

Level 2: Best Representation Vectors — Summarize an entire book

In the above method, we still pass every token of the book through the LLM (all ~81K of them, chunk by chunk). But what if you have even more tokens than that, or want to avoid processing them all?

What if you had a book you wanted to summarize? Let's load one up; we're going to load the Rich-Dad-Poor-Dad PDF.

# normal method
import time
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

BOOK_PATH = "https://sopheaksrey.files.wordpress.com/2012/04/rich_dad_poor_dad_by_robert_t-_kiyosaki.pdf"

# Load the book and preprocess text
loader = PyPDFLoader(BOOK_PATH)
pages = [page.page_content.replace('\t', ' ') for page in loader.load()[26:277]]
book_text_normal = "".join(pages)


# Measure execution time
start_time = time.time()

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)

docs = text_splitter.create_documents([book_text_normal])

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

num_docs = len(docs)

num_tokens = llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and {num_tokens} tokens")
Now we have 29 documents and the first one has 2446 tokens

So how do we do this without sending every token through the model? Pick random chunks? Pick equally spaced chunks?

That's where the Best Representation Vectors (BRV) method comes in.

The goal is to chunk the book, get embeddings of the chunks, and pick a subset of chunks that represents a holistic but diverse view of the book. Put another way: can we pick the ~10 passages that describe the book best?

Once we have the chunks that best represent the book, we can summarize those chunks and hopefully get a pretty good summary.
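
For contrast, the naive "equally spaced chunks" baseline from the question above is a two-liner (a hypothetical sketch; BRV replaces this with embedding-based selection):

# Hypothetical naive baseline: take ~10 equally spaced chunks
step = max(len(docs) // 10, 1)
selected_docs = docs[::step][:10]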

Below are the BRV steps:

  1. Load your book into a single text file
  2. Split your text into large-ish chunks
  3. Embed your chunks to get vectors
  4. Cluster the vectors to see which are similar to each other and likely talk about the same parts of the book
  5. Pick embeddings that represent the cluster the most (method: closest to each cluster centroid)
  6. Summarize the documents that these embeddings represent

Another way to phrase this process is, "Which ~10 documents from this book represent most of the meaning? I want to build a summary of those."

Note: There will be a bit of information loss, but show me a summary of a whole book that doesn't have information loss ;)

# Loaders
from langchain.schema import Document
# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Model
from langchain.chat_models import ChatOpenAI
# Prompts
from langchain import PromptTemplate
# Embedding Support
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain
# Data Science
import numpy as np
from sklearn.cluster import KMeans

I'm going to initialize two models, GPT-3.5 and GPT-4 (defined below as llm3 and llm4). I'll use GPT-3.5 for the first set of summaries to reduce cost, and GPT-4 for the final pass, which should hopefully increase quality.

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=10000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])
num_documents = len(docs)

print (f"Now our book is split up into {num_documents} documents")
ow our book is split up into 78 documents

Let's get our embeddings of those 78 documents

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

vectors = embeddings.embed_documents([x.page_content for x in docs])
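
A quick sanity check (a hypothetical addition): we should now have one vector per chunk, each 1,536-dimensional for OpenAI's default embedding model.

# Hypothetical check: one embedding per chunk, 1536 dimensions each
print(f"{len(vectors)} vectors of dimension {len(vectors[0])}")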

Now let's cluster our embeddings. There are a ton of clustering algorithms you can choose from. Please try a few out to see what works best for you!
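
For example, here is a sketch that swaps in scikit-learn's AgglomerativeClustering (an alternative I'm assuming for illustration, not what this notebook uses; note it exposes no cluster_centers_, so centroids must be computed as each cluster's mean):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Alternative clustering (hypothetical): hierarchical instead of K-means
X = np.array(vectors)
agg = AgglomerativeClustering(n_clusters=11).fit(X)
centroids = np.array([X[agg.labels_ == i].mean(axis=0) for i in range(11)])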

# Assuming 'embeddings' is a list or array of 1536-dimensional embeddings
# Choose the number of clusters, this can be adjusted based on the book's content.
# I played around and found ~10 was the best.
# Usually if you have 10 passages from a book you can tell what it's about
num_clusters = 11

# Perform K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)

Here are the clusters that were found. It's interesting to see the progression of clusters throughout the book. This is expected: as the plot changes, different clusters emerge with different semantic meaning.

kmeans.labels_

Whenever there's a clustering task, it's tempting to visualize it. Make sure to color the points by cluster for better distinction. We also need dimensionality reduction to shrink the vectors from 1,536 dimensions down to 2.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Taking out the warnings
import warnings
from warnings import simplefilter

# Filter out FutureWarnings
simplefilter(action='ignore', category=FutureWarning)

# Perform t-SNE and reduce to 2 dimensions
tsne = TSNE(n_components=2, random_state=42)
reduced_data_tsne = tsne.fit_transform(vectors)

# Plot the reduced data
plt.scatter(reduced_data_tsne[:, 0], reduced_data_tsne[:, 1], c=kmeans.labels_)
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Book Embeddings Clustered')
plt.show()

Awesome, not perfect, but pretty good directionally. Now we need to get the vectors that are closest to the cluster centroids (the center).

# Find the closest embeddings to the centroids

# Create an empty list that will hold your closest points
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):

    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)

    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)

    # Append that position to your closest indices list
    closest_indices.append(closest_index)

Now sort them (so the chunks are processed in order)

selected_indices = sorted(closest_indices)
selected_indices
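
The same selection can be done in a single vectorized call (an equivalent alternative using scikit-learn's pairwise_distances_argmin_min, not the notebook's code):

from sklearn.metrics import pairwise_distances_argmin_min

# For each centroid, index of the nearest embedding (equivalent to the loop above)
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)
selected_indices = sorted(closest.tolist())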

It's interesting to see which chunks pop up as most descriptive. How does your distribution look?

Let's create our custom prompts. We're using GPT-4 (which has a bigger token limit) for the combine step, so I'm asking for long summaries in the map step to reduce information loss.

llm3 = ChatOpenAI(temperature=0,
                  openai_api_key=openai_api_key,
                  max_tokens=1000,
                  model='gpt-3.5-turbo')

map_prompt = """
You will be given a single passage of a book. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])
map_chain = load_summarize_chain(llm=llm3,
                                 chain_type="stuff",
                                 prompt=map_prompt_template)

Then grab the docs that those top vectors represent.

selected_docs = [docs[doc] for doc in selected_indices]

Let's loop through our selected docs and get a good summary for each chunk. We'll store the summary in a list.

# Make an empty list to hold your summaries
summary_list = []

# Loop through the length of your selected docs
for i, doc in enumerate(selected_docs):

    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])

    # Append that summary to your list
    summary_list.append(chunk_summary)

    print (f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")

Great, now that we have our list of summaries, let's get a summary of the summaries

summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")

llm4 = ChatOpenAI(temperature=0,
                  openai_api_key=openai_api_key,
                  max_tokens=3000,
                  model='gpt-4',
                  request_timeout=120)
combine_prompt = """
You will be given a series of summaries from a book. The summaries will be enclosed in triple backticks (```)
Your goal is to give a verbose summary of what happened in the story.
The reader should be able to grasp what happened in the book.

```{text}```
VERBOSE SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
reduce_chain = load_summarize_chain(llm=llm4,
                                    chain_type="stuff",
                                    prompt=combine_prompt_template,
                                    # verbose=True # Set this to true if you want to see the inner workings
                                    )

Run it! Note this will take a while.

output = reduce_chain.run([summaries])
print (output)
"Rich Dad Poor Dad" is a book that explores the author's journey towards financial literacy and independence, guided by the teachings of his rich dad. The author's rich dad emphasizes the importance of not working for money, but rather having money work for you. He explains that most people are trapped in the cycle of working for money due to fear and lack of financial education. The rich dad challenges the author to work for him for free, in order to learn the lesson of not working for money. This experience is meant to teach the author the value of financial education and the importance of having money work for you.

The author also discusses the difference in perception between his rich dad and poor dad when it comes to owning a house. The rich dad views a house as a liability, while the poor dad sees it as an asset. The author explains that owning a home can lead to financial strain due to the expenses associated with it, such as mortgage payments, property taxes, insurance, maintenance, and utilities. He argues that many people work their entire lives paying for a home they never truly own, as they continually take out new loans to pay off previous ones.

The author emphasizes the importance of understanding the difference between assets and liabilities. He suggests that instead of focusing on buying a bigger house, individuals should invest in income-producing assets that can generate cash flow to cover expenses. He illustrates this point by comparing the financial statements of his poor dad, whose liabilities outweigh his assets, with those of his rich dad, who has focused on investing and minimizing liabilities. The rich dad's asset column generates enough income to cover expenses and continue growing, leading to increased wealth over time.

The author discusses the importance of minding your own business and focusing on building assets rather than just relying on a job for income. He explains that many people spend their lives working for others and end up with nothing to show for it. The educational system is criticized for focusing on preparing students for jobs rather than teaching them about financial independence. The author emphasizes the difference between a profession and a business, stating that it is essential to have your own business in addition to a job. Financial security is achieved by focusing on building assets rather than just increasing income.

The author discusses the importance of financial intelligence and how it can impact one's ability to create wealth. He highlights the concept of creating money through financial intelligence rather than simply saving money. He emphasizes the idea that financial intelligence involves having more options and being creative in solving financial problems. He gives examples of people missing out on opportunities due to lack of financial intelligence and the importance of being able to see and seize opportunities when they arise.

The author contrasts the beliefs of his two father figures - his educated dad and his rich dad. His educated dad believed in the traditional idea of specialization, where one should focus on a specific field to excel and advance in their career. In contrast, his rich dad encouraged him to learn a little about a lot, gaining knowledge and experience in various areas to become a well-rounded individual. This advice led the author to work in different departments of his rich dad's companies, gaining a diverse skill set.

The author discusses the importance of taking risks and having the right attitude towards failure in order to achieve financial freedom. He emphasizes the need to start accumulating wealth early and to have a positive attitude towards failure, as it can lead to growth and success. The passage also highlights the impact of cynicism and doubt on one's ability to take risks and make investments. The author uses examples of individuals who missed out on opportunities due to listening to negative opinions from others.

The author discusses the importance of financial intelligence and being the master of money rather than letting money control you. By training oneself and loved ones to be masters of money early on, one can avoid being a slave to money and instead have money obey them. The author emphasizes the power of choosing heroes and emulating them as a way of learning and gaining inspiration. By having heroes, individuals can tap into a source of raw genius and make difficult tasks seem easier, thus motivating themselves to achieve similar success.

The author emphasizes the importance of taking action and using financial intelligence to solve common problems in life. The author shares a story of a friend who was struggling to save money for his children's college education, but with the author's guidance, he was able to invest in real estate and generate passive income to fund his children's education and his own retirement. Through strategic investments and financial intelligence, the friend was able to turn a $7,900 investment into a successful real estate venture that provided him with a steady income stream and substantial returns.

The author reflects on the personal insights gained while transitioning from the Employee (E) and Self-Employed (S) side to the Business Owner (B) and Investor (I) side of the CASHFLOW Quadrant. The shift in mindset and perspective is highlighted as the author navigates through different quadrants, indicating a growth in understanding and awareness of financial independence and wealth-building strategies. The author emphasizes the importance of surrounding oneself with individuals who align with the desired quadrant. By listing the six adults they spend the most time with and categorizing them into quadrants, the author underscores the significance of influence and environment in shaping one's financial goals and aspirations.

The book concludes with the author's commitment to providing financial education through his company, offering courses and resources to support individuals in evolving into the B and I quadrants of financial success. The author's journey of personal growth and education, moving away from traditional schooling and towards self-discovery and entrepreneurship, is also highlighted. The author reflects on the pressure of achieving high grades and prestigious degrees in his family of educators, but ultimately decides to pursue a path of learning for personal development rather than external validation. This shift in mindset leads him to question the flaws in traditional education, such as the lack of preparation for the real world and the disconnect between academic success and real-life success

Wow, that was a long process, but you get the gist. Of the methods covered here, BRV gives the best summary of a full-length book.
