Elevate Your Results: Enhancing LLM Capabilities with Instructor and LanceDB

Akash A Desai
8 min read · Mar 31, 2024

Using advanced RAG methods such as hybrid search with a reranker

Discover how to harness the full potential of structured outputs from large language models (LLMs) with the Instructor and LanceDB libraries. This comprehensive guide walks you through the process of leveraging these powerful tools to enhance your workflows and achieve more accurate and reliable results.

Efficient Data Processing with LangChain

First things first, let’s simplify the process of handling PDF documents. With LangChain’s PyPDFLoader and RecursiveCharacterTextSplitter modules, you can easily extract and organize data from PDF files of any size. This sets the stage for smoother analysis down the line.

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the Zomato annual report. This may take 1-2 minutes since the PDF is large
sec_filing_pdf = "https://b.zmtcdn.com/investor-relations/Zomato_Annual_Report_2022-23.pdf"

# Create your PDF loader
loader = PyPDFLoader(sec_filing_pdf)

# Load the PDF document
documents = loader.load()

# Chunk the pdf
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
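To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap using only the standard library. It is a simplification, not LangChain's algorithm: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, while the hypothetical chunk_text helper below simply slides a window over the characters.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2500-character string standing in for the PDF text
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1024, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still shows up whole in at least one chunk.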

Harnessing LanceDB for Structured Data

Next up, LanceDB comes into play. This vector store, combined with OpenAI embeddings, stores each text chunk alongside its embedding so that it can be searched later. The Schema class below tells LanceDB which field holds the source text and which field holds the vector computed from it.

from langchain_community.vectorstores import LanceDB
from langchain.embeddings.openai import OpenAIEmbeddings
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import Vector, LanceModel

openai = get_registry().get("openai").create()

class Schema(LanceModel):
    text: str = openai.SourceField()
    vector: Vector(openai.ndims()) = openai.VectorField()

embedding_function = OpenAIEmbeddings()

db = lancedb.connect("~/langchain")
table = db.create_table(
    "zomatofood",
    schema=Schema,
    mode="overwrite",
)

# Load the document into LanceDB
db = LanceDB.from_documents(docs, embedding_function, connection=table)
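For intuition about what a vector store does at query time, the sketch below ranks a few rows by cosine similarity to a query vector. The three-dimensional vectors and row texts are made up for illustration; real OpenAI embeddings have far more dimensions, and the distance metric and indexing LanceDB actually uses depend on how the table is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=2):
    """Return the texts of the k rows most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "embeddings", not real model output
rows = [
    ("food delivery platform", [0.9, 0.1, 0.0]),
    ("statutory auditor appointment", [0.0, 0.2, 0.9]),
    ("restaurant marketing tools", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, rows, k=2)
print(results)
```

This brute-force scan compares the query against every row, which is fine for a handful of chunks but is the part a vector database replaces with an index at scale.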

Fine-Tuning Search with LanceDB’s Reranking Methods

But we’re not done yet. LanceDB offers reranking methods such as ColBERTv2 and the Cohere reranker, which reorder search results so that the most relevant chunks come first. Here we run a hybrid search, which combines vector search with full-text search, and rerank the top five results.

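from lancedb.rerankers import ColbertReranker

query = "what is zomato key buisness offerings ?"  # the question used in the outputs below
table.create_fts_index("text")  # hybrid search needs a full-text index (assumed not built yet)

reranker = ColbertReranker()
docs = table.search(query, query_type="hybrid").limit(5).rerank(reranker=reranker).to_pandas()["text"].to_list()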

Below are the results from the reranking method:

docs
['which enable them to engage and acquire customers to grow their business while also providing a reliable and efficient last mile delivery service.\n Note: As p\ner the Indian Accounting Standards, end-users and delivery partners are considered as \nZomato’s customers only under limited circumstances. For the purpose of this BRSR disclosure, the definition of customers includes end-users while delivery partners have been considered as value chain partners.',
'We may not be able to fully manage expectations of some of our stakeholders including grievances of key stakeholders, such as customers, merchants, and delivery partners. Customer preferences are dynamic in nature and failure in keeping up with these emerging trends can result in loss of trust or dissatisfaction which may have a negative impact on the Company. To address grievances effectively, Zomato has dedicated tools and teams in place. These resources track, monitor, and resolve complaints across various communication channels including real-time chat / call support through the Zomato app. For unresolved issues, stakeholders can directly write to us through designated email addresses which are available on the Zomato website. Additionally, Zomato offers an SOS Help Desk service which provides immediate assistance to delivery partners in case of emergencies.',
'Lack of product innovation can result in Zomato’s offerings becoming less relevant compared to other market players as customer preferences are dynamic in nature and keep on evolving. This can lead to a negative impact on the Company. We remain committed to enhancing overall stakeholder experience with a focus on driving long-term engagement through innovation. Zomato continuously collects feedback from various stakeholders to improve its offerings. Zomato also has a process in place to ensure testing is done before any feature / product is rolled out to our customers. \n16. Auditors and auditors’ reports\ni. Statutory auditor\n M/s. Deloitte Haskins & Sells, Chartered Accountants, (FRN: 015125N), are appointed as the Statutory Auditors of the Company for a term of 5 (five) consecutive years to hold office from the conclusion of the 10\nth AGM till the conclusion of the 15th AGM.\n M/s. De\nloitte Haskins & Sells, Chartered Accountants, \nStatutory Auditors have confirmed that:\na. their app',
'• ISO certification- Zomato is committed to adhering to global best practic\nes for data protection \nand has secured ISO 27001 certification for the management of information security. \n• Periodic assessment- The company has a review mechanism in place to evalua\nte the security \nposition of the company including independent assessment, such as audits, Vulnerability Assessment and Penetration Testing (VAPT) assessments, third-party reviews, bug bounty programs, etc.Negative implications- Inadequate mitigation measures may lead to data breach or loss of confidential information, resulting in negative financial impact.\n5. Governance\nManagement \nof key stakeholders (End-users, restaurant partners and delivery partners) Risk Risk- Ineffective \nmanagement of our key stakeholder expectations, and inadequate redressal of grievances may lead to dissatisfaction resulting in business disruption, loss of trust, impact on reputation and long-term growth, among others.',
'Zomato Limited operates in India and UAE.\n a. Number of locations \nLocations Number\nNational (No. of States) 34 (31 States and 3 Union Territories)\nInternational (No. of Countries) 1 (UAE)\n b. What is the contribution of exp\norts as a percentage of the total turnover of the entity?\n There i\ns a limited export for Zomato IP to its overseas group entities and marketing services to a third \nparty. Total export is 0. 11% of total revenue from operations of Zomato Limited for FY23.\n c. A brief on types of cust\nomers\n For the pu\nrpose of this BRSR disclosure, we have two types of customers as defined below-\n 1. End-us\ners of our platform - End-users are customers who use our platform to search and \ndiscover restaurants, read and write customer generated reviews and view and upload photos, \norder food delivery, book a table and make payments while dining-out at restaurants.\n 2. Restaura\nnt partners - We provide restaurant partners with industry-specific marketing tools']
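Hybrid search has to merge two ranked lists, one from vector search and one from full-text search, before or while reranking. The sketch below shows one common fusion strategy, reciprocal rank fusion. It is an illustration of the merging idea only: it is not what ColbertReranker does (that reranker scores query and document pairs with a ColBERT model), and the chunk identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; items ranked high in any list score well."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the item's total score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists, best match first
vector_ranking = ["chunk_a", "chunk_b", "chunk_c"]
keyword_ranking = ["chunk_c", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while a chunk found by only one search method is still kept as a candidate.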

Enhancing Output Quality with Instructor Library

Now, let’s take the retrieved chunks and use them to answer a question with the Instructor library. Instructor builds on Pydantic: you describe the output you want as a Pydantic model, and Instructor validates the LLM’s response against it, so you don’t need manual checks or parsing.

from openai import OpenAI
from pydantic import BaseModel
import instructor

# Apply the patch to the OpenAI client
# enables response_model keyword
client = instructor.patch(OpenAI(api_key='sk-yourkey'))

class QuestionAnswer(BaseModel):
    question: str
    answer: str

context = docs
question = query


qa_normal: QuestionAnswer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=QuestionAnswer,
    messages=[
        {
            "role": "system",
            "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
        },
        {
            "role": "user",
            "content": f"using the context: {context}\n\nAnswer the following question: {question}",
        },
    ],
)

print(qa_normal)

Here are the results obtained from the code execution:

question='what is zomato key buisness offerings ?' 
answer="Zomato's key business offerings include providing a platform
for end-users to search, discover, and order food as well as
offering industry-specific marketing tools to restaurant partners."
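Conceptually, response_model asks the LLM for data matching the QuestionAnswer schema and checks the reply before handing it back as a typed object. The sketch below imitates that check with the standard library only, swapping Pydantic for a dataclass and a hand-written parser. The parse_question_answer helper and both JSON strings are hypothetical placeholders, not Instructor internals or real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class QuestionAnswer:
    question: str
    answer: str


def parse_question_answer(raw: str) -> QuestionAnswer:
    """Parse a JSON string and check that both fields are present strings."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in ("question", "answer"):
        if not isinstance(data.get(field), str):
            raise ValueError(f"field '{field}' must be a string")
    return QuestionAnswer(question=data["question"], answer=data["answer"])


# Hypothetical raw replies: one well-formed, one missing the answer field
raw_ok = '{"question": "how zomato is making revenue?", "answer": "placeholder answer"}'
raw_bad = '{"question": "how zomato is making revenue?"}'

qa = parse_question_answer(raw_ok)
print(qa)

try:
    parse_question_answer(raw_bad)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

With Pydantic, the field checks come from the type annotations on the model, so none of this parsing code has to be written by hand.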

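The output looks satisfactory. Now, let’s use Instructor to validate the answer as well. The llm_validator below checks the answer against a plain-language rule, and max_retries lets Instructor ask the model again if validation fails.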

import instructor
from instructor import llm_validator
from openai import OpenAI
from pydantic import BaseModel, BeforeValidator
from typing_extensions import Annotated

context = docs
question = query

client = instructor.patch(OpenAI(api_key='sk-your'))


class QuestionAnswerNoEvil(BaseModel):
    question: str
    # The answer is checked by a second LLM call before the model accepts it
    answer: Annotated[
        str,
        BeforeValidator(
            llm_validator("don't say objectionable things", allow_override=True, model="gpt-3.5-turbo", client=client)
        ),
    ]


try:
    qa_with_instructor: QuestionAnswerNoEvil = client.chat.completions.create(
        model="gpt-4",
        response_model=QuestionAnswerNoEvil,
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": "You are a system that answers questions based on the context. answer exactly what the question asks using the context.",
            },
            {
                "role": "user",
                "content": f"using the context: {context}\n\nAnswer the following question: {question}",
            },
        ],
    )
    # Print inside the try block so a failed call does not trigger a NameError
    print(qa_with_instructor)
except Exception as e:
    # Raised when the answer still fails validation after the retries
    print(e)

Output using Instructor:

question='what is zomato key buisness offerings ?' 
answer="Zomato's key business offerings include providing end-users with a
platform to search and discover restaurants, read and write customer
generated reviews, view and upload photos, order food delivery,
book a table, and make payments while dining-out at restaurants.
Also, the company provides restaurant partners with
industry-specific marketing tools that enable them to engage
and acquire customers to grow their business while also providing
a reliable and efficient last-mile delivery service."
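The validate-and-retry idea behind this call can be sketched without any LLM. In the sketch below, fake_llm and banned_word_validator are hypothetical stand-ins for the model and for llm_validator, and ask_with_retries is an illustrative helper, not Instructor's implementation. In particular, it does not try to mirror how Instructor counts its max_retries setting.

```python
calls = []  # records the feedback passed to the fake model on each attempt


def fake_llm(question, feedback):
    """Hypothetical model: gives a bad answer first, a clean one after feedback."""
    calls.append(feedback)
    if feedback is None:
        return "Returns are guaranteed."
    return "Revenue comes from platform services."


def banned_word_validator(answer):
    """Rule-based stand-in for an LLM-backed validator."""
    banned = ["guaranteed"]
    found = [word for word in banned if word in answer.lower()]
    if found:
        raise ValueError(f"answer contains banned words: {found}")
    return answer


def ask_with_retries(ask, validate, question, max_attempts=2):
    """Ask, validate, and re-ask with the validation error as feedback."""
    feedback = None
    for _ in range(max_attempts):
        answer = ask(question, feedback)
        try:
            return validate(answer)
        except ValueError as exc:
            feedback = str(exc)  # the next attempt sees why the last one failed
    raise RuntimeError(f"no valid answer after {max_attempts} attempts: {feedback}")


result = ask_with_retries(fake_llm, banned_word_validator, "how zomato is making revenue?")
print(result)
print(len(calls))
```

The key design point is that the validation error is fed back into the next attempt, which gives the model a chance to correct itself instead of repeating the same mistake.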

This answer is more detailed and stays closer to the wording of the report than the first one. Keep in mind that the validated call also switched from gpt-3.5-turbo to gpt-4, so part of the improvement may come from the model rather than from validation alone.

Below are a few more questions, answered both with and without Instructor, to show its effect:


Question 1:

# Normal LLM

question 1 = 'What are the key operational and financial highlights discussed in the report?'

answer='The key operational and financial highlights discussed in the report include total income, total expenses, exceptional items, share of profit of an associate and joint venture, profit/loss before tax, tax expenses, profit/loss for the year, and items in other comprehensive income/loss such as remeasurements of defined benefit plans, equity instruments, exchange differences on translation of foreign operations, and debt instruments.'


# With Instructor

question 1 = 'What are the key operational and financial highlights discussed in the report?'

answer="The key financial highlight discussed in the report is the Company’s financial statements on a standalone and consolidated basis for the financial years ended on March 31, 2023 and 2022. It includes details such as total income, total expenses, exceptional items, share of profit of an associate and joint venture, profit or loss before tax, tax expenses and profit or loss for the year. It also mentioned other comprehensive income or loss, items that will not be reclassified to profit or loss in subsequent periods and items that will be reclassified to profit or loss in subsequent periods on a standalone and consolidated basis. The Company's risks such as market risk, credit risk and liquidity risk have also been mentioned."


___________________________________________________________________________

Question 2:

# Normal LLM

question 2 = 'how zomato is making revenue?'

answer='Zomato generates revenue through its offerings, including services for end-users and restaurant partners, marketing services to third parties, and limited export of Zomato IP to overseas group entities. Revenue is also generated through advertising on the platform and supply of high-quality ingredients to Restaurant Partners.'

# With Instructor

question 2 = 'how zomato is making revenue?'

answer='The Company operates as an internet portal which helps in connecting the Users, Restaurant Partners and Delivery Partners. It provides platform to restaurant partners to advertise themselves to the target audience in India and abroad. They also supply high quality ingredients to Restaurant Partners.'

In this post, we saw how to combine LangChain for PDF loading and chunking, LanceDB for hybrid search with reranking, and Instructor for structured, validated LLM output.

To learn more about advanced techniques with LanceDB, visit our documentation page, and chat with us on Discord about your use cases!

