Advanced RAG: Extracting Complex PDFs Containing Tables and Text Using LlamaParse

In this blog, we’ll compare LangChain and LlamaIndex for extracting data from PDFs, especially those containing both tables and text. Here’s what we’ll cover:
- Q&A on PDF data using LangChain
- Q&A on PDF data using LlamaIndex
- Q&A on PDF data using LlamaIndex with LlamaParse
We’ll use LanceDB as the vector database for this Q&A. By the end of this post, you’ll have a clear idea of which method is best for table extraction.
Step-by-Step Practical Guide
Let’s start by downloading a complex PDF that includes tables and text. We’ll use the uber_10q_march_2022 PDF file, which will help us evaluate the performance and results of each method.
Set Up the Environment
First, we need to install both LlamaIndex and LangChain, along with their LanceDB integrations.
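A minimal install sketch is below; the package list is inferred from the imports used later in this post (an assumption, not an official requirements file), so pin versions as needed:

!pip install llama-index llama-parse llama-index-vector-stores-lancedb \
  llama-index-postprocessor-flag-embedding-reranker \
  langchain langchain-community langchain-openai langchain-text-splitters \
  lancedb pypdf

Once installed, you can use the following code to set up the environment: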
import nest_asyncio
nest_asyncio.apply()  # allow nested event loops (needed by LlamaParse in notebooks)
import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-proj-"
Download the Example PDF: As noted above, we’re using uber_10q_march_2022.
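One way to fetch the file in a Colab environment (the URL below is a placeholder; substitute wherever you host the 10-Q filing):

!mkdir -p /content/data_pdf
!wget -O /content/data_pdf/uber_10q_march_2022.pdf "<PDF_URL>"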
1. Q&A on PDF Data Using LangChain
The code below is adapted from the LangChain RAG reference documentation:
from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import LanceDB
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

# Load the downloaded PDF
loader = PyPDFLoader("/content/data_pdf/uber_10q_march_2022.pdf")
docs = loader.load()

# Chunk the document and index the chunks in LanceDB
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = LanceDB.from_documents(documents=splits, embedding=OpenAIEmbeddings())
# Retrieve and generate using the relevant snippets of the PDF
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
qa_langchain_query1 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
rag_chain.invoke(qa_langchain_query1)
Output:
I don't know.
Unfortunately, LangChain with this standard RAG setup did not return an answer for the sample PDF in our tests. I’m only showing one sample question here; the Colab notebook I’ve provided includes more questions.
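If you want to run several questions through the same chain, LCEL runnables support batching; here’s a short sketch (the second question is illustrative, not one of the notebook’s exact questions):

# Batch a few questions through the chain in one call
more_questions = [
    qa_langchain_query1,
    "What was the total revenue for the three months ended March 31, 2022?",
]
for question, answer in zip(more_questions, rag_chain.batch(more_questions)):
    print(f"Q: {question}\nA: {answer}\n")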
2. Q&A on PDF Data Using LlamaIndex
Now let’s use LlamaIndex with its standard PDF loader, SimpleDirectoryReader.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

# Load every file in the directory (here, just our PDF)
reader = SimpleDirectoryReader(input_dir="/content/data_pdf/")
documents_pdf_loader = reader.load_data()

# Index the documents in LanceDB
vector_store_pdf = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")
storage_context_pdf = StorageContext.from_defaults(vector_store=vector_store_pdf)
lance_index_pdf = VectorStoreIndex.from_documents(
    documents_pdf_loader, storage_context=storage_context_pdf
)
# Rerank the retrieved nodes before answering
reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)
lance_index_query_pdf = lance_index_pdf.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)
qa_lama_query1 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
output1 = lance_index_query_pdf.query(qa_lama_query1)
print(output1.response)
Output:
$22

We got only “$22”, with no period context; the results are the same with or without the reranker.
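To confirm the reranker makes no difference on this question, you can query both configurations side by side; a quick sketch reusing the objects defined above:

# Same index queried with and without the reranking postprocessor
plain_engine = lance_index_pdf.as_query_engine(similarity_top_k=10)
rerank_engine = lance_index_pdf.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)
for label, engine in [("no reranker", plain_engine), ("with reranker", rerank_engine)]:
    print(label, "->", engine.query(qa_lama_query1).response)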
3. Q&A on PDF Data Using LlamaIndex with LlamaParse
Now let's do some experiments with LlamaParse.
LlamaParse is an API developed by LlamaIndex for efficient file parsing and representation. It integrates seamlessly with LlamaIndex frameworks, providing a robust solution for document processing. Currently, LlamaParse supports only PDF files and is available for free.
from llama_parse import LlamaParse
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, StorageContext

node_parser = SimpleNodeParser()

# Parse the PDF into markdown so table structure is preserved
documents = LlamaParse(result_type="markdown").load_data("/content/data_pdf/uber_10q_march_2022.pdf")

# Use a separate LanceDB location so this index doesn't mix with the earlier one
vector_store_lancedb = LanceDBVectorStore(uri="/tmp/lancedb_llamaparse")
nodes = node_parser.get_nodes_from_documents(documents)
storage_context = StorageContext.from_defaults(vector_store=vector_store_lancedb)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(),
)
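Because result_type="markdown" preserves table structure, it’s worth a quick sanity check on what LlamaParse extracted before querying:

# Peek at the parsed output; tables appear as pipe-delimited markdown
print(documents[0].text[:2000])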
query_engine = index.as_query_engine(similarity_top_k=15)
query = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
response_1 = query_engine.query(query)
print("\n**LlamaParse + LlamaIndex**")
print(response_1)
Output:
The Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information was $41 million for the three months ended March 31, 2021, and $22 million for the three months ended March 31, 2022.
By following the above steps, you can compare the performance of LangChain, LlamaIndex, and LlamaIndex with LlamaParse for extracting data from PDFs containing tables and text. This comparison will help you determine the best method for your specific needs.
For a more comprehensive comparison, check out our Colab notebook, where we’ve included more examples.
LlamaParse performs well in this case. Try different PDFs and questions to compare the results.