I’ve recently come into some unscheduled downtime and have been trying to keep busy. One of the things I’ve been exploring is LLMs and generative applications. So I took a short course targeted at developers, and while working through the material, I stumbled upon a few tutorials showing how to enhance RAG output using knowledge graphs. I’m writing this post, and maybe a few others to follow, to document my process and notes as I try to replicate it.

Components and Structure

RAG in this context stands for Retrieval Augmented Generation.

It is a process of enhancing the output generated by a large language model by providing it with relevant context from outside its training data along with your query/prompt.

Typically, you provide an LLM with a snippet of text (a query or prompt) that it completes (answers) based on its training data. In a RAG application, you also provide specific material (text, audio, etc.) that the LLM references in addition to its training data when generating the response.
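
To make that concrete, here’s a minimal illustration of the shape of a RAG prompt; the template wording and placeholder values are my own, not from the course:

```python
# A minimal illustration of the RAG prompt shape: retrieved context is
# passed alongside the question. The wording and values are placeholders.
question = "What were the project's main risks?"
context = "...chunks retrieved from the uploaded document..."

prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
"""
```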

Ingestion

For a typical RAG application, the augmenting document is uploaded by a user and read using one of an ever-growing library of document loaders. You then chunk it up and compute a numerical/vector representation of each chunk, called a vector embedding, which you store in a vector store. This numerical representation allows you to search the data by similarity in an efficient way.
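
Here’s a sketch of that path, assuming LangChain with OpenAI embeddings and Chroma; the file name, chunk sizes, and paths are placeholders:

```python
# A minimal sketch of vector ingestion, assuming LangChain + OpenAI +
# Chroma; file name, chunk sizes, and paths are placeholders.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

docs = PyPDFLoader("report.pdf").load()              # read the upload
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)                              # chunk it up
vector_store = Chroma.from_documents(                # embed + store
    chunks, OpenAIEmbeddings(), persist_directory="./chroma_db"
)
```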

For a graph-RAG application, however, you convert the document into a knowledge graph and generate vector embeddings from this graph. Alternatively, you can generate the vector embeddings the usual way, as described above.
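
A sketch of the graph path, continuing from the chunks above and assuming LangChain’s experimental LLMGraphTransformer with a local Neo4j instance; the URL, credentials, and model name are placeholders:

```python
# A sketch of graph ingestion, assuming langchain_experimental and a
# running Neo4j instance; URL, credentials, and model are placeholders.
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph
from langchain_openai import ChatOpenAI

graph = Neo4jGraph(url="bolt://localhost:7687",
                   username="neo4j", password="password")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Each chunk goes through the LLM to extract nodes and relationships;
# this is the expensive step discussed further down.
graph_documents = LLMGraphTransformer(llm=llm).convert_to_graph_documents(chunks)
graph.add_graph_documents(graph_documents)
```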

Querying

Next comes finding the chunks most relevant to the supplied query and passing those to the LLM for generation.

Most datastores come packaged with functionality to query the underlying data efficiently. You can invoke these on the appropriate stores (Chroma, Neo4j) and get the relevant data to pass to the LLM for generation.
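
On the vector side, continuing from the store built above, the lookup is a similarity search; the question text and k are placeholders:

```python
# Vector-side retrieval: a similarity search over the Chroma store
# built during ingestion. The question and k are placeholders.
question = "What caused the delays described in the report?"
relevant = vector_store.similarity_search(question, k=4)
vector_context = "\n\n".join(doc.page_content for doc in relevant)
```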

Worth noting is the entity-extraction step. An LLM chain is used to get named entities from the user query, and these entities are used to query the graph database for relevant pieces of information.
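
A hedged sketch of that step, reusing the llm and graph objects from above; the Entities schema and the Cypher pattern are my own illustration, not from a specific tutorial:

```python
# Entity extraction: the LLM pulls named entities from the question,
# and each entity seeds a graph lookup. Schema and Cypher are illustrative.
from pydantic import BaseModel, Field

class Entities(BaseModel):
    names: list[str] = Field(description="Named entities found in the question")

entity_chain = llm.with_structured_output(Entities)

def graph_context(question: str) -> str:
    """Return graph facts for every entity the LLM finds in the question."""
    facts = []
    for name in entity_chain.invoke(f"Extract named entities from: {question}").names:
        rows = graph.query(
            "MATCH (e {id: $name})-[r]-(n) "
            "RETURN e.id AS s, type(r) AS p, n.id AS o LIMIT 25",
            params={"name": name},
        )
        facts += [f"{row['s']} {row['p']} {row['o']}" for row in rows]
    return "\n".join(facts)
```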

I then set these up to show the retrieved context and the generation results side by side for the vector-store-only and hybrid (vector and graph) modes.
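
A rough sketch of that comparison harness, reusing the pieces from the sketches above; the prompt wording and question are mine:

```python
# Side-by-side comparison, reusing llm, question, vector_context, and
# graph_context from the earlier sketches.
def answer(question: str, context: str) -> str:
    prompt = (f"Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm.invoke(prompt).content

print("--- vector only ---\n", answer(question, vector_context))
print("--- hybrid (vector + graph) ---\n",
      answer(question, vector_context + "\n" + graph_context(question)))
```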

One of the bottlenecks I ran into almost immediately was the time/resources required to create the knowledge graph from the uploaded document. Even on paid tiers of the most popular inference provider platforms, a significant amount of time/tokens/credits is required to generate the graph for a single 30-50 page PDF.

I think further research and refinement of my configuration could cut this down significantly.

Initial Testing and Results

Firstly, it’s important to highlight that I think it’s still too early to say anything conclusive. To test this, I set up a couple of LLM chains, vector-only (Chroma) and hybrid (Neo4j knowledge graph), to show their output side by side. Surprisingly, after testing both with the same questions about the same 40-50 page PDF file, I didn’t notice an astounding difference between the outputs the two generated. From the articles I read, the hybrid setup should perform better at surfacing implied information.

The hybrid setup is more detailed/verbose in its output and drew out more nuance in one or two instances, but overall the generated text is largely similar in accuracy/nuance.

Improvements

As mentioned earlier, performance improvements in the knowledge graph generation would go a long way toward making this viable as a feature in a full-fledged product. Additionally, more exhaustive testing using one of the more recently developed frameworks for tracking/comparing LLM output would be a more objective way of comparing the performance of the two setups.
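
One candidate is Ragas. A sketch of what that might look like, assuming its dataset-based evaluation API; the sample rows are invented, and you’d run the same questions through both setups:

```python
# A sketch of more objective evaluation with Ragas; the sample rows are
# invented placeholders. Run identical rows for each setup and compare.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

rows = Dataset.from_dict({
    "question": ["What caused the delays described in the report?"],
    "answer": ["The report attributes the delays to ..."],      # chain output
    "contexts": [["...retrieved chunk 1...", "...retrieved chunk 2..."]],
})
print(evaluate(rows, metrics=[faithfulness, answer_relevancy]))
```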