Advanced RAG: Extracting Complex PDFs Containing Tables & Text Using LlamaParse

May 25, 2024

In this post, we’ll compare LangChain and LlamaIndex on question answering over PDF data, especially PDFs that contain both tables and text. Here’s what we’ll cover:

Q&A on PDF data using LangChain
Q&A on PDF data using LlamaIndex
Q&A on PDF data using LlamaIndex with LlamaParse

We’ll use LanceDB as the vector database for this Q&A. By the end of this post, you’ll have a clear idea of which method is best for table extraction.

Step-by-Step Practical Guide

Let’s start by downloading a complex PDF that includes tables and text.

We are using the PDF file uber_10q_march_2022.pdf. Running every method on the same file lets us evaluate the performance and results of each one.

Set Up the Environment

First, we need to install both LlamaIndex and LangChain. Once installed, you can use the following code to set up the environment:


import nest_asyncio
nest_asyncio.apply()

import os

# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-proj-"
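The keys above are shown only as prefixes, so it can help to confirm that real values are set before running the rest of the notebook. The following is a minimal sketch: `find_unset_keys` is a hypothetical helper (not part of LlamaIndex or LangChain), and it simply assumes that a value equal to a bare prefix such as `llx-` is still a placeholder.

```python
import os

# Key names and placeholder prefixes are taken from the setup code above.
REQUIRED_KEYS = {
    "LLAMA_CLOUD_API_KEY": "llx-",
    "OPENAI_API_KEY": "sk-proj-",
}


def find_unset_keys(env, required=REQUIRED_KEYS):
    """Return the names of keys that are missing, empty, or still a bare prefix."""
    unset = []
    for name, placeholder in required.items():
        value = env.get(name, "").strip()
        if not value or value == placeholder:
            unset.append(name)
    return unset


if __name__ == "__main__":
    problems = find_unset_keys(os.environ)
    if problems:
        print("Set these keys before continuing:", ", ".join(problems))
```

Passing the environment in as an argument keeps the check easy to try with a plain dictionary before pointing it at `os.environ`.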

Download the Example PDF:

We are using uber_10q_march_2022.pdf.
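The post does not show the download itself, so here is one way it might be done with the Python standard library. This is a sketch under assumptions: `download_file` is a hypothetical helper, and `PDF_URL` in the commented example call is a placeholder for wherever your copy of the file lives, not a link given in this post.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Download `url` to `dest` unless a non-empty file is already there; return the path."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # Skip the download when a copy already exists.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (PDF_URL is a placeholder; substitute the location of your copy):
# download_file(PDF_URL, "/content/uber_10q_march_2022.pdf")
```

Skipping the download when the file already exists makes the cell safe to re-run in a notebook.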

1. Q&A on PDF Data Using LangChain

The code below is taken from the LangChain reference.


from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import LanceDB
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

loader = PyPDFLoader("")  # set this to the path of the PDF file
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = LanceDB.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


qa_langchain_query1 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
rag_chain.invoke(qa_langchain_query1)

I don't know.

Unfortunately, the LangChain Q&A RAG chain did not return an answer for the sample PDF in our tests.

I’m only showing one sample question here; the Colab notebook includes more questions.
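When a chain answers "I don't know", a useful first check is whether the retrieved chunks contain the relevant line of the table at all. Below is a minimal sketch of that check; `find_chunks_with_phrase` is a hypothetical helper that works on plain strings, and the commented lines show how it could be fed from the retriever defined above.

```python
def find_chunks_with_phrase(chunks, phrase):
    """Return (index, chunk) pairs whose text contains `phrase`, ignoring case and line breaks."""
    needle = " ".join(phrase.lower().split())
    hits = []
    for i, chunk in enumerate(chunks):
        # Collapse whitespace so phrases split across extracted lines still match.
        normalized = " ".join(chunk.lower().split())
        if needle in normalized:
            hits.append((i, chunk))
    return hits


# With the LangChain objects defined above, the chunks could be collected as:
# chunks = [doc.page_content for doc in retriever.invoke(qa_langchain_query1)]
# hits = find_chunks_with_phrase(chunks, "Cash paid for Income taxes")
```

If there are no hits, the problem is probably in retrieval or text extraction. If there are hits but the model still cannot answer, the chunk text may have lost the table structure that ties each number to its row and column.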

2. Q&A on PDF Data Using LlamaIndex

Now let’s use LlamaIndex with a normal PDF loader.

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)
from llama_index.vector_stores.lancedb import LanceDBVectorStore

reader = SimpleDirectoryReader(input_dir="/content/data_pdf/")

documents_pdf_loader = reader.load_data()

vector_store_pdf = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")
storage_context_pdf = StorageContext.from_defaults(vector_store=vector_store_pdf)
lance_index_pdf = VectorStoreIndex.from_documents(
    documents_pdf_loader, storage_context=storage_context_pdf
)


reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

Lance_index_query_pdf = lance_index_pdf.as_query_engine(
    similarity_top_k=10, node_postprocessors=[reranker]
)

qa_lama_query1 = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
output1 = Lance_index_query_pdf.query(qa_lama_query1)
print(output1.response)

$22

We got only "$22", with no unit or reporting period. The result is the same with or without the reranker.

3. Q&A on PDF Data Using LlamaIndex with LlamaParse

Now let's do some experiments with LlamaParse.

LlamaParse is an API developed by LlamaIndex for efficient file parsing and representation. It integrates directly with the LlamaIndex framework, which makes it a convenient option for document processing. At the time of writing, LlamaParse supports only PDF files and is available for free.


from llama_parse import LlamaParse
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, StorageContext


node_parser = SimpleNodeParser()

documents = LlamaParse(result_type="markdown").load_data("/content/uber_10q_march_2022.pdf")


vector_store_lancedb = LanceDBVectorStore(uri="/tmp/lancedb_lamaindex")



nodes = node_parser.get_nodes_from_documents(documents)

storage_context = StorageContext.from_defaults(vector_store=vector_store_lancedb)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(),
)


query_engine = index.as_query_engine(similarity_top_k=15)
query = "how is the Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information?"
response_1 = query_engine.query(query)
print("\n**LlamaParse + LlamaIndex**")
print(response_1)

Output:

The Cash paid for Income taxes, net of refunds from Supplemental disclosures of cash flow information was $41 million for the three months ended March 31, 2021, and $22 million for the three months ended March 31, 2022.
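One plausible reason for the better answer is that `result_type="markdown"` asks LlamaParse to return tables as markdown, so each value stays attached to its row label and column header. The sketch below illustrates that idea only. The sample table is hand-written from the two figures in the answer above and is not actual LlamaParse output, and `parse_markdown_table` is a hypothetical helper for simple pipe-delimited tables.

```python
def parse_markdown_table(markdown):
    """Turn a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator line
    return [dict(zip(header, row)) for row in body]


# Hand-written sample, NOT real LlamaParse output; values come from the answer above.
sample = """
| Item (in millions) | Three months ended March 31, 2021 | Three months ended March 31, 2022 |
|---|---|---|
| Cash paid for income taxes, net of refunds | $41 | $22 |
"""

for record in parse_markdown_table(sample):
    print(record)
```

Each printed record pairs a value with its period, which is the context a bare "$22" lacks.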

By following the above steps, you can compare the performance of LangChain, LlamaIndex, and LlamaIndex with LlamaParse for extracting data from PDFs containing tables and text. This comparison will help you determine the best method for your specific needs.
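To make the comparison repeatable across several questions, the three pipelines can be put behind one small harness. This is a sketch, not part of either library: `compare_engines` is a hypothetical helper that accepts any callables taking a question string, and the commented wiring assumes the objects built in the three sections above.

```python
def compare_engines(engines, questions):
    """Ask every question of every engine; return {question: {engine_name: answer}}."""
    results = {}
    for question in questions:
        results[question] = {}
        for name, ask in engines.items():
            try:
                results[question][name] = str(ask(question)).strip()
            except Exception as exc:  # Keep going if one pipeline fails.
                results[question][name] = f"ERROR: {exc}"
    return results


# With the objects built in the three sections above, the engines could be wired up as:
# engines = {
#     "LangChain": rag_chain.invoke,
#     "LlamaIndex": Lance_index_query_pdf.query,
#     "LlamaIndex + LlamaParse": query_engine.query,
# }
# results = compare_engines(engines, [query])
```

Converting every answer with `str` lets the harness treat LangChain's string output and LlamaIndex's response objects the same way.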

For a more comprehensive comparison, you can check our Colab notebook, where we’ve included more examples.

Google Colab

colab.research.google.com

LlamaParse does well in this case. Try different PDFs and questions to compare the results.