Leveraging Doctran and LLM to enhance the RAG
Pratik Raj
10 min read Aug 12, 2024

Introduction:

In our previous exploration, we embarked on an ambitious journey to harness data from PDF images of floor plans to enhance the capabilities of our Large Language Models (LLMs). This innovative approach yielded promising results, with the LLM demonstrating commendable accuracy in responding to a variety of queries about these floor plans. Our journey provided valuable insights and set the stage for further refinement.

Challenges:

Despite the success, we encountered significant limitations, particularly when addressing more complex, in-depth questions. A critical factor contributing to these limitations was the methodology of similarity search within our Vector Database. The main challenge stemmed from the inherently unstructured nature of the extracted data, which hindered the effectiveness of our similarity search algorithms. This limitation impacted the LLM’s ability to generate precise and relevant responses, necessitating a strategic reassessment of our approach.

Actions:

Recognizing the need for a more structured approach to data handling, we propose a strategic shift towards preliminary data processing. This step aims to structure the extracted data into a more refined and abstract format, facilitating improved accuracy and relevance in similarity searches within our Vector Database. The first step in this process involves the meticulous extraction of pertinent information from the unstructured data, converting it into a structured JSON format. This not only enhances data processing but also provides the flexibility to tailor the data to better meet our analytical needs. In this blog, we delve into the nuances of leveraging DocTran for optimizing LLM chatbots in the context of floor plan data analysis, highlighting the transformative impact of structured data on LLM performance and offering insights into the future of intelligent data processing and chatbot interaction.

Let’s get started…

First, we have to convert the PDF pages to images and then extract the data, as we did in the previous blog.
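
For completeness, here is a minimal sketch of that conversion step. It assumes the `pdf2image` package (a wrapper around poppler) and a local file named `floor-plans.pdf`; both are our own assumptions, not details from the original notebook. The file names mirror the `page0001-NN.jpg` pattern that shows up in the extracted metadata later.

```python
def page_image_name(page_no: int) -> str:
    # Mirrors the "page0001-NN.jpg" naming seen in the extracted metadata.
    return f"./floor-plan-images/page0001-{page_no:02d}.jpg"

def pdf_to_images(pdf_path: str) -> list:
    # Lazy import so page_image_name stays usable without poppler installed.
    from pdf2image import convert_from_path  # assumed dependency
    saved = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200), start=1):
        name = page_image_name(i)
        page.save(name, "JPEG")
        saved.append(name)
    return saved
```

The resulting images can then be passed through the same OCR/extraction pipeline from the previous post.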

Now that we have extracted the data, we can see just how unstructured it is. Our task is to work some magic and convert this unstructured data into something more structured.

The first thing we have to do is extract the PDF data into a more readable JSON format, and for this, once again, an LLM comes to the rescue.
We will use the Doctran library, which is also integrated with LangChain, to extract the required fields from the unstructured data we extracted before.


from langchain.document_transformers import DoctranPropertyExtractor

properties = [
    {
        "name": "Flat Unit",
        "description": "Identifies the specific unit within the building. Each unit is designated by a unique code.",
        "type": "string",
        "enum": ["X01", "X02", "X03", "X04", "X05", "X06", "X07", "X08", "X09", "X10"],
        "required": True
    },
    {
        "name": "No Of Bedrooms",
        "description": "Indicates the total number of bedrooms present in the flat unit. This helps in understanding the unit's capacity and size.",
        "type": "string",
        "enum": ["1", "2", "3", "4", "5", "6"],
        "required": True
    },
    {
        "name": "Dimension",
        "description": "Provides detailed dimensions for each room within the unit, including bedrooms, bathrooms, kitchen, and other spaces. Each entry should specify the room type and its corresponding dimensions. Implement a standardization process where all variations like 'bathroom' and 'toilet' are uniformly labeled as 'Bathroom' in the system.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "Specifies the type of room, with a standardization process to ensure uniform room names (e.g., all 'toilet' entries are standardized to 'Bathroom')."
                },
                "dimension": {
                    "type": "string",
                    "description": "Provides the dimensions of the specified room, typically in length x width format."
                }
            }
        },
        "required": True
    },
    {
        "name": "Total Area",
        "description": "Details the total area of the flat unit, including separate entries for Suite Area, Balcony Area, and the overall Total Area. Each entry should clearly specify the area type and its measurement.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "Indicates the category of the area being measured (e.g., Suite Area, Balcony Area)."
                },
                "dimension": {
                    "type": "string",
                    "description": "Specifies the total measurement of the area in an appropriate unit (e.g., square feet, square meters)."
                }
            }
        },
        "required": True
    }
]

property_extractor = DoctranPropertyExtractor(
    properties=properties,
    openai_api_model='gpt-3.5-turbo-16k'
)

Now that we have defined the properties we want, the next step is to run the Doctran transformer to get the actual data in JSON format.

extracted_document = await property_extractor.atransform_documents(
   pages_data, properties=properties
)

Exciting news! We’ve successfully extracted metadata from unstructured data. Let’s evaluate the quality of our extracted data.

import json
extracted_document_json = json.dumps(extracted_document[1].metadata, indent=2)
print(extracted_document_json)

The transformation returns a list of page documents, each carrying its own metadata. What we are interested in is the metadata of each page, so let's check that out.

{
  "source": "./floor-plan-images/page0001-09.jpg",
  "extracted_properties": {
    "Flat Unit": "X07",
    "No Of Bedrooms": "1",
    "Dimension": [
      {
        "name": "Bathroom",
        "dimension": "2.20 X 1.80"
      },
      {
        "name": "Bedroom",
        "dimension": "4.65 X 3.70"
      },
      {
        "name": "Kitchen",
        "dimension": "3.50 X 1.80"
      },
      {
        "name": "Balcony",
        "dimension": "3.18 X 1.48"
      }
    ],
    "Total Area": [
      {
        "name": "Suite Area",
        "dimension": "3127 SQM"
      },
      {
        "name": "Balcony Area",
        "dimension": "5.08 SQM"
      },
      {
        "name": "Total Area",
        "dimension": "36.35 SQM"
      }
    ]
  }
}

Remarkably, the extracted data is not only clean and accessible but also presented in JSON format, offering us the flexibility to creatively manipulate it. This versatility allows us to enhance the performance of our LLM significantly.
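
Since the output is plain JSON, we can also sanity-check it programmatically. As a hedged sketch (our own addition, not part of the original notebook), a tiny validator can flag pages whose extracted properties fall outside the enums we declared in `properties`:

```python
# Valid unit codes, copied from the "Flat Unit" enum in the schema above.
VALID_UNITS = {"X01", "X02", "X03", "X04", "X05",
               "X06", "X07", "X08", "X09", "X10"}

def validate_extraction(props):
    """Return a list of problems found; an empty list means the page looks clean."""
    issues = []
    if props.get("Flat Unit") not in VALID_UNITS:
        issues.append(f"unexpected Flat Unit: {props.get('Flat Unit')!r}")
    if not props.get("Dimension"):
        issues.append("no room dimensions extracted")
    return issues
```

Running this over every page's `extracted_properties` before indexing catches extraction failures early, instead of letting them silently degrade retrieval quality.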

Looking at the metadata, it is clear that the data is organized around each flat unit and its features. To help our Large Language Model (LLM) make the most of this data, we present it from multiple angles. This strategy ensures the LLM can answer questions about the data thoroughly, which is exactly what we need to do next.

Let’s change the perspective!!!

import re

def restructure_room_name(room_name):
    # Lower-case, strip spaces and digits so e.g. "Bedroom 2" -> "bedroom"
    restructured_name = room_name.lower().replace(" ", "")
    restructured_name = re.sub(r'\d+', '', restructured_name)
    return restructured_name

room_data = {}
for page in extracted_document:
    extracted_page_data = page.metadata
    flat_unit = extracted_page_data['extracted_properties']['Flat Unit']
    for dimension in extracted_page_data['extracted_properties']['Dimension']:
        room_name = restructure_room_name(dimension['name'])
        room_dimension = dimension['dimension']
        if room_name not in room_data:
            room_data[room_name] = {flat_unit: {f"{room_name}1": room_dimension}}
        elif flat_unit not in room_data[room_name]:
            room_data[room_name][flat_unit] = {f"{room_name}1": room_dimension}
        else:
            # Number repeated rooms within a unit: bedroom1, bedroom2, ...
            num = len([key for key in room_data[room_name][flat_unit] if room_name in key]) + 1
            room_data[room_name][flat_unit][f"{room_name}{num}"] = room_dimension

print(json.dumps(room_data))
print(room_data.keys())

We’ve shifted the data’s focus from being flat unit-based to emphasising the individual room types, detailing dimensions within each flat unit. Let’s explore the updated data.

{"bedroom": {"X06": {"bedroom1": "3.80 X 3.90"}, "X07": {"bedroom1": "4.65 X 3.70"}, "X03": {"bedroom1": "4.2m x 3.1m", "bedroom2": "3.8m x 2.9m", "bedroom3": "438 X 340"}, "X01": {"bedroom1": "3.40 X 330"}, "X05": {"bedroom1": "3.85 X 3.60"}, "X10": {"bedroom1": "438 X 340"}, "X04": {"bedroom1": "4.60 X 3.60"}, "X08": {"bedroom1": "3.18 X 148"}, "X02": {"bedroom1": "4.38 X 3.40"}, "X09": {"bedroom1": "4.65 X 3.65", "bedroom2": "3.75 X 3.60"}}}

The output above shows the dimensions of bedrooms across various flat units, showcasing just one of the numerous ways data can be tailored to improve retrieval for Large Language Models (LLMs). Ultimately, the intelligence and effectiveness of any custom model or LLM hinge on the quality of the data it is given.
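
Notice the inconsistent notation in that output ('4.2m x 3.1m', '438 X 340', '3.40 X 330'), a leftover of OCR noise. A small normalization helper (our own addition; the centimetre heuristic is an assumption about how the OCR dropped decimal points) could clean these before they reach the vector store:

```python
import re

def normalize_dimension(dim):
    """Normalize strings like '4.2m x 3.1m' or '438 X 340' to 'L x W' in metres.

    Heuristic sketch: integer values >= 100 are assumed to be centimetres
    (an OCR artefact where the decimal point was lost) and are divided by 100.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", dim)
    if len(numbers) != 2:
        return dim  # leave anything unexpected untouched
    sides = []
    for n in numbers:
        value = float(n)
        if value >= 100 and "." not in n:
            value /= 100  # e.g. '438' -> 4.38 m
        sides.append(f"{value:.2f}")
    return f"{sides[0]} x {sides[1]}"
```

Applied to each `dimension` value during the restructuring loop, this would give the similarity search one consistent format to match against.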

Now, we’ll organize our extracted and refined data into a document format, preparing it for storage in our vector database.

  1. Let’s first structure the Flat Unit-based extracted data.

     from langchain.schema import Document

     def structure_json_to_text(extracted_document):
         documents = []
         for page in extracted_document:
             props = page.metadata['extracted_properties']
             # The last two characters of the file name give the page number,
             # e.g. "page0001-09.jpg" -> "09"
             page_number = page.metadata['source'].split('/')[-1].split(".")[0][-2:]
             string = f"page number:{page_number} flat unit:{props['Flat Unit']} Area: "
             for area in props['Total Area']:
                 string += f"{area['name']}={area['dimension']} "
             string += "Dimension: "
             for room in props['Dimension']:
                 string += f"{room['name']}={room['dimension']} "
             documents.append(Document(page_content=string))
         return documents

     documents = structure_json_to_text(extracted_document)
    
  2. Let’s then structure our updated data from the rooms’ perspective.

from langchain.schema import Document

doc = []

def create_document(room_data):
    # Build one document per room type, listing each flat unit's dimensions,
    # e.g. "bedroom: X06: bedroom1=3.80 X 3.90 X07: bedroom1=4.65 X 3.70 ..."
    for room_name, room_info in room_data.items():
        document_content = ''
        flag = 0
        for flat_unit, room_dimensions in room_info.items():
            prev_room_name = None
            for dimension_key, room_dimension in room_dimensions.items():
                if room_name == prev_room_name:
                    document_content += f'{dimension_key}={room_dimension} '
                elif flag == 0:
                    # First entry: prefix with the room type
                    document_content += f'{room_name}: {flat_unit}: {dimension_key}={room_dimension} '
                    flag = 1
                else:
                    document_content += f'{flat_unit}: {dimension_key}={room_dimension} '
                prev_room_name = room_name
        doc.append(Document(page_content=document_content))

create_document(room_data)
print(len(doc))
doc

Fantastic! We now have clean and processed data, all set to be compiled and stored in vector stores. Let’s proceed to store our processed data in the Vector Stores.
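
One small point: the embedding step indexes the flat-unit documents, but to search across both perspectives at once, the two lists can simply be concatenated before embedding (a sketch using the `documents` and `doc` variables from the steps above):

```python
def combine_perspectives(unit_docs, room_docs):
    # Plain concatenation: both views land in the same index, so a
    # similarity search can match whichever phrasing is closer to the query.
    return unit_docs + room_docs
```

For example, `all_pages = combine_perspectives(documents, doc)` could be passed to the embedding function below.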

Once again, we’ll utilize our trusted FAISS for vector storage, leveraging its robust capabilities for managing our processed data.

from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

def embed_data(all_pages, file_path):
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=200)
    documents = text_splitter.split_documents(all_pages)
    vectordb = FAISS.from_documents(
        documents,
        embedding=OpenAIEmbeddings(),
    )
    return vectordb

vector_db = embed_data(documents, "vector_db-faiss")

Phew! We have our vector DB ready. Shall we test its similarity search for a better understanding, and see whether our different-perspective data made any difference (pun intended)?

results = vector_db.similarity_search(
    query="What is the dimension of toilet of X05 unit?",
    k=3)
for count, result in enumerate(results):
    print(f"{count}-{result}")

Fantastic, the results look promising! The first three recommendations from our similarity search offer the detailed insights we were aiming for.

0-page_content='toilet: X05: toilet1=160 X 130'
1-page_content='page number:02 flat unit:X01 Area: Dimension:'
2-page_content='bathroom: X07: bathroom1=2.20 X 180 X01: bathroom1=210 X 165 X10: bathroom1=2.50 X 1.60 X04: bathroom1= X08: bathroom1=2.20 X 190 X02: bathroom1=2.50 X 1.60 X03: bathroom1=250 X 160 X09: bathroom1=220 X 170 bathroom2=240 X 1.70'

Let’s move to the main action: connecting our Large Language Model (LLM) to the retriever and observing its responses and performance.

1. First, initialize memory so the interaction behaves like a chat session.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI

def initialize_memory():
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer"
    )
    return memory

memory_chat_qa = initialize_memory()

2. For this blog, we’ll employ the LangChain ConversationalRetrievalChain to evaluate our RAG setup. This tool is versatile and can also be integrated with agents and various other types of chains for enhanced results.

def create_chain(vector_db, memory_chat_qa, prompt):
    qa_chain = ConversationalRetrievalChain.from_llm(
        # Same model used for the extraction step earlier
        ChatOpenAI(temperature=0.1, model_name='gpt-3.5-turbo-16k'),
        retriever=vector_db.as_retriever(search_kwargs={"k": 8}),
        return_source_documents=True,
        memory=memory_chat_qa,
        verbose=True,
    )
    return qa_chain

qa_chain_chat = create_chain(vector_db, memory_chat_qa, "")

Finally, our RAG (Retrieval-Augmented Generation) pipeline is assembled and ready for testing. While we’ll start by testing it with a single question, the notebook is structured to allow the addition of multiple questions to the list. This way, you can test numerous questions in one run and analyze the cost associated with each run.

from langchain.callbacks import get_openai_callback
import json

queries = [
    "What is the dimension of toilet of X05 unit?"
]

responses = {}
cost = 0
for query in queries:
    with get_openai_callback() as cb:
        result = qa_chain_chat({"question": query})
        print(result["answer"])
        print(f'Source Document is {len(result["source_documents"])}')
        responses[query] = result["answer"]
        cost = cost + cb.total_cost

print(json.dumps(responses))
print(f"Total cost of this run is ${cost}")

With the pipeline now retrieving from diverse, well-structured documents, the query should return a precise, relevant answer that reflects the multiple data perspectives we’ve built in.

And within a second we got the response, and it is…

The size of the toilet in unit X05 is 160 cm by 130 cm.

Voila! We received the correct answer from the PDF page, showcasing a significant improvement in accuracy and speed over the results from directly using the extracted data. This highlights the effectiveness of our refined approach and the power of our trained model.
