Optimizing GraphRAG with Microsoft for CSV Data: A Guide with LanceDB

Aug 29, 2024

Conceptual image: GraphRAG for people finding

This blog shows how to use Microsoft's GraphRAG for efficient CSV data processing with LanceDB, through a practical application and a step-by-step tutorial.

GraphRAG, developed by Microsoft Research, structures raw text into hierarchical knowledge graphs, which makes it useful for processing data such as CSV files. It goes beyond traditional RAG methods, which rely on vector similarity alone, by mapping the relationships within the data to improve retrieval and reasoning.

Practical Application of GraphRAG

GraphRAG, short for Graph Retrieval Augmented Generation, improves semantic search by constructing detailed knowledge graphs. It captures not only the text but also the connections within it, organizing the data into a network from which it derives a hierarchy of communities and a summary for each. This helps language models handle complex queries over data such as plain text and CSV, and return more accurate, context-relevant responses.
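To make the contrast with pure vector similarity concrete, here is a toy sketch. It is not GraphRAG's actual implementation: it only stores a few hypothetical (subject, relation, object) triples as a graph and shows that a relationship question becomes an edge lookup. All names and relation labels below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical triples of the kind an entity-extraction step might produce.
TRIPLES = [
    ("Asha", "reports_to", "Ravi"),
    ("Meera", "reports_to", "Ravi"),
    ("Asha", "based_in", "London"),
    ("Meera", "based_in", "Pune"),
    ("Ravi", "based_in", "London"),
]


def build_graph(triples):
    """Index triples by (relation, object) so incoming edges are easy to find."""
    incoming = defaultdict(list)
    for subject, relation, obj in triples:
        incoming[(relation, obj)].append(subject)
    return incoming


def who(graph, relation, obj):
    """Return every subject linked to `obj` by `relation`, sorted by name."""
    return sorted(graph.get((relation, obj), []))


graph = build_graph(TRIPLES)
direct_reports = who(graph, "reports_to", "Ravi")
in_london = who(graph, "based_in", "London")
print(direct_reports)
print(in_london)
```

A question such as "who reports to Ravi" is answered by following edges rather than by ranking text chunks by similarity, which is the intuition behind the graph-based approach.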

We have compared RAG and GraphRAG in a separate blog post, which you can check out.

Utilizing LanceDB with GraphRAG

In our setup, LanceDB serves as the primary vector database for indexing, making it an integral part of the GraphRAG system. This combination allows for efficient data management and enhances the searchability of large, complex datasets.

In this guide, we’ll explore how to use Microsoft GraphRAG for processing and indexing CSV files. This powerful tool can enhance data accessibility and querying capabilities, making it ideal for managing complex datasets.

Microsoft GraphRAG should be installed. You can install it using pip:


!pip install graphrag

Step 1: Download the Data

First, download the dataset required for this tutorial:

!wget https://raw.githubusercontent.com/akashAD98/dummy_data/main/Lancedbai_Full_Employees_GraphRAG_With_DOB.csv

Note: I have already combined all the other columns into a single text column. If your data has multiple columns, you can combine them into one column in the same way.
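If your own CSV still has separate columns, a minimal sketch of that merge step could look like the following. It uses only the standard library; the sample column names (name, designation, location) and the helper combine_columns are assumptions for illustration and may not match the real dataset.

```python
import csv
import io


def combine_columns(rows, keep=("source", "timestamp")):
    """Fold every column not listed in `keep` into one 'text' column."""
    combined = []
    for row in rows:
        parts = [f"{name}: {value}" for name, value in row.items() if name not in keep]
        new_row = {name: row[name] for name in keep if name in row}
        new_row["text"] = ". ".join(parts)
        combined.append(new_row)
    return combined


# Hypothetical sample; the real dataset's column names may differ.
SAMPLE = """name,designation,location,source,timestamp
Asha,Data Engineer,London,hr_export,2024-08-01
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
combined = combine_columns(rows)

# Write the result back out as CSV with the three columns used later on.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["source", "timestamp", "text"])
writer.writeheader()
writer.writerows(combined)
print(out.getvalue())
```

Prefixing each value with its column name keeps the meaning of the value inside the text, which gives the entity extraction step more to work with than bare values would.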

Step 2: Initialize Indexing

Initialize the indexing project, which creates the necessary configuration files:

!python3 -m graphrag.index --init --root /content/input_people

This command creates the initial project files under the root directory, including a settings.yaml file, which you will configure in the next step.

Once the command succeeds, you can see the generated folder structure.

Folder structure: the output folder is generated after init

Step 3: Configure the settings.yaml File

Modify the settings.yaml file to handle your CSV file correctly. Here’s how you should configure it:

input:
  type: file # Use 'blob' if processing data from blob storage
  file_type: csv
  base_dir: "input"
  file_encoding: "utf-8"
  file_pattern: ".*\\.csv$"
  source_column: "source"            
  text_column: "text"
  timestamp_column: "timestamp"
  timestamp_format: "%Y-%m-%d"

file_type: Set to ‘csv’ as we are working with CSV files.
source_column: A metadata field indicating the document’s origin, not used in indexing but useful for UIs post-summarization.
text_column: Column containing the main textual content.
timestamp_column: Column holding the date the data was created; its values are parsed using timestamp_format.
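Before indexing, it can be worth checking that the CSV actually matches these settings. The sketch below is a hypothetical pre-flight check, not part of GraphRAG: check_csv and the sample rows are invented for illustration, and it assumes the file pattern is applied as a regular expression against the file name.

```python
import csv
import io
import re
from datetime import datetime

# Values copied from the input settings above.
FILE_PATTERN = r".*\.csv$"
REQUIRED_COLUMNS = ("source", "text", "timestamp")
TIMESTAMP_FORMAT = "%Y-%m-%d"


def check_csv(filename, handle):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: the pattern is matched against the file name as a regex.
    if not re.match(FILE_PATTERN, filename):
        problems.append(f"{filename} does not match file_pattern")
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT)
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: bad timestamp {row['timestamp']!r}")
    return problems


# Hypothetical rows for illustration only.
good = io.StringIO("source,text,timestamp\nhr,Asha is a Data Engineer,2024-08-01\n")
bad = io.StringIO("source,text,timestamp\nhr,Asha is a Data Engineer,01/08/2024\n")
good_result = check_csv("people.csv", good)
bad_result = check_csv("people.csv", bad)
print(good_result)
print(bad_result)
```

To check a real file, open it with the same encoding as file_encoding and pass the handle to check_csv.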

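Step 4: Run the Query

Once the index has been built (by running python3 -m graphrag.index with --root pointing at your project root), we can run a few test queries against our GraphRAG index:

python3 -m graphrag.query --root ./rag_ad_new_people_finder --method local "what is Divit Dara birthday date is"

Below are a few example questions with their answers.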
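When trying several test questions, it can be convenient to build this command from Python instead of retyping it. The following is a minimal sketch that uses only the flags shown in the command above; build_query_command is a hypothetical helper, not part of GraphRAG.

```python
import shlex


def build_query_command(root, question, method="local"):
    """Return the argument list for the graphrag.query call shown above."""
    return [
        "python3", "-m", "graphrag.query",
        "--root", root,
        "--method", method,
        question,
    ]


cmd = build_query_command(
    "./rag_ad_new_people_finder",
    "What is birthday date of Divit Dara",
)
# shlex.join quotes the question so the line can be pasted into a shell.
print(shlex.join(cmd))
# To execute it from Python instead: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a question contains quotes or ampersands.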

Q1 = "What is birthday date of Divit Dara"


SUCCESS: Local Search Response: Divit Dara's date of birth is September 3, 2000.
This information highlights Divit Dara as a Full Stack Developer with expertise in both front-end and back-end development aspects,
indicating a comprehensive skill set in software development. 
Dara is managed by Rasha Baral and is based in Bangalore, 
which is a significant hub for technology and innovation in India.
This geographical location, combined with Dara's role and expertise, places him within a vibrant ecosystem for tech professionals [Data: Sources (30)].

Q2 = "give me employee who are wokring from london & they are Data engineer"

Identified Data Engineers in London:

- Shanaya Kaul is a Data Engineer located in London with her manager Prerak Chand. Born on September 16, 1996, her role highlights London's status as a key center for technology professionals, especially in data engineering. [Data: Entities (61); Relationships (83)].
- Pranay Raja, also based in London, serves as a Data Engineer, reinforcing the city's importance as a hub for data engineering talent and its appeal to the tech community. [Data: Entities (42); Relationships (75)].

Both professionals contribute to London's vibrant and diverse tech environment, playing integral roles in the thriving tech industry which fosters innovation and collaboration. Their presence underscores the city's capability to nurture tech careers and foster professional connections in the technology sector.

Q3 = "how many people reporting to Onkar Wali"


Onkar Wali leads a diverse team spread across London, Bangalore, and Pune, indicating his key leadership role within the organization. His team consists of eight members, each in different technical and managerial roles, showcasing the broad scope of projects and responsibilities he oversees.

### Team Members Reporting to Onkar Wali
1. **Advik Butala** - AI Engineer based in London.
2. **Kanav Dyal** - Data Engineer located in Bangalore.
3. **Anahita Sha** - Product Manager working in Pune.
4. **Akarsh Bhatnagar** - Software Engineer positioned in London.
5. **Samaira Sankar** - Software Tester also based in London.
6. **Neysa Walla** - Full Stack Developer in Bangalore.
7. **Samar Thakur** - Data Scientist located in Bangalore.
8. **Arnav Ganesh** - Software Engineer based in London.

This team composition highlights Onkar Wali's significant role in managing a varied and geographically dispersed team, underscoring his strong leadership and organizational capabilities within a complex, multi-location structure.

Q4 = "give me all AI engineer list of compnay"

Based on the provided data, here is a detailed list of AI Engineers, including their roles, managers, locations, and birth dates. This list highlights the diversity and geographical distribution of AI talent in Pune, London, and Bangalore, offering insights into the professional networks and managerial relationships within the tech industry.

### AI Engineers in Pune

- Inaaya Sengupta
  - Designation: AI Engineer
  - Manager: Rasha Baral
  - Location: Pune
  - Date of Birth: 09-08-2001

- Emir Mani
  - Designation: AI Engineer
  - Manager: Prerak Chand
  - Location: Pune
  - Date of Birth: 12-10-1988

- Fateh D’Alia
  - Designation: AI Engineer
  - Manager: Nayantara Iyer
  - Location: Pune
  - Date of Birth: 30-09-1987

- Riya Mallick
  - Designation: AI Engineer
  - Manager: Rasha Baral
  - Location: Pune
  - Date of Birth: 15-07-1996

### AI Engineers in Bangalore

- Alisha Kunda
  - Designation: AI Engineer
  - Manager: Shanaya Guha
  - Location: Bangalore
  - Date of Birth: 05-10-1985

- Yashvi Tak
  - Designation: AI Engineer
  - Manager: Khushi Magar
  - Location: Bangalore
  - Date of Birth: 01-01-1988

### AI Engineers in London

- Purab Buch
  - Designation: AI Engineer
  - Manager: Shanaya Guha
  - Location: London
  - Date of Birth: 02-07-2004

- Advik Butala
  - Designation: AI Engineer
  - Manager: Onkar Wali
  - Location: London
  - Date of Birth: 22-05-1985

- Ela Varkey
  - Designation: AI Engineer
  - Manager: Anaya Karpe
  - Location: London
  - Date of Birth: 27-01-1989

This compilation underscores the global nature of the tech industry, with major cities like Pune, Bangalore, and London serving as key hubs for AI development and innovation, facilitated by a network of influential tech leaders.

Check our Colab notebook (colab.research.google.com) for more on the examples above.

Conclusion

GraphRAG, with LanceDB as its vector database, can index and query a complex dataset such as an employee CSV. In the examples above it answered direct lookups, filtered lists, and questions about reporting relationships. To explore vector databases and large language models further, visit the LanceDB repository for more examples and applications.

We also have another blog post comparing RAG and GraphRAG.

For high-quality resources & applications for LLMs, multi-modal models and VectorDBs visit https://github.com/lancedb/vectordb-recipes