Not getting source_documents from vector search to LLM

Hello, I am following the GraphAcademy course 'Build a Neo4j-backed Chatbot using Python'. When running the movies database, I can return source documents and they seem to get presented to the LLM. When I create my own database with my own text chunks in it, source_documents is empty and the response from the LLM reads like it hasn't seen those source documents. Here is my code:

neo4jvector = Neo4jVector.from_existing_index(
    embeddings,
    url=st.secrets["NEO4J_URI"],
    username=st.secrets["NEO4J_USERNAME"],
    password=st.secrets["NEO4J_PASSWORD"],
    index_name="BestPracticeContent",
    node_label="TextChunk",
    text_node_property=["text"],
    embedding_node_property="embedding",
)

retriever = neo4jvector.as_retriever(k=15)
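# note: depending on the LangChain version, the top-k may need to be passed via
# search_kwargs instead, e.g. neo4jvector.as_retriever(search_kwargs={"k": 15})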

kg_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # using 'prompt stuffing'
    retriever=retriever,
    verbose=True,
    return_source_documents=True,
)
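# debugging idea (not part of the app): calling the chain directly shows whether
# the retriever returns any documents at all, e.g.
#   result = kg_qa.invoke({"query": "treatment options for COPD"})
#   print(result["source_documents"])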

tools = [
    # Tool.from_function(
    #     name="General Chat",
    #     description="For general chat not covered by other tools",
    #     func=llm.invoke,
    #     return_direct=True
    # ),
    Tool.from_function(
        name="Vector Search Index",
        description="Provides information about Best Practice using Vector Search",
        func=kg_qa,
        return_direct=False
    ),
    Tool.from_function(
        name="Cypher QA",
        description="Provide information about Best Practice questions using Cypher",
        func=run_cypher,
        return_direct=True
    ),
]

memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True,
)

agent_prompt = PromptTemplate.from_template("""
You are a medical expert providing information to medical professionals based solely on the specific documents or outputs from tools provided. It is crucial to follow these instructions carefully:

  • ONLY use information from the provided tools or documents in your responses.
  • DO NOT rely on or include your pre-trained knowledge for medical advice.
  • Be helpful and polite.
  • If the information is not available in the documents or tool outputs, clearly state that the required information is not available.
  • Example of Correct Use: "Based on the information provided by documents from BMJ Best Practice, the best practice for treating condition X is..."
  • Example of Incorrect Use: "Generally, condition X is treated with..."
  • DO NOT state the necessity to consult a doctor/nurse/healthcare provider/medical professional

Remember, the accuracy and relevance of your responses depend entirely on the use of the provided information. Any deviation from these instructions is unacceptable and compromises the quality of the advice given to healthcare professionals. These are healthcare professionals, so don't ask them to contact any healthcare professional.

TOOLS:

You have access to the following tools:

{tools}

If asked a medical question, always use a tool. Please use the following format:

Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

Thought: Do I need to use a tool? No
Final Answer: [your response here]

Begin!

Previous conversation history:
{chat_history}

New input: {input}
{agent_scratchpad}
""")

agent = create_react_agent(llm, tools, agent_prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
    handle_parsing_errors=True
)

def generate_response(prompt):
    response = agent_executor.invoke({"input": prompt})
    return response['output']

And this is the response I get:

Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: Vector Search Index
Action Input: treatment options for COPD

Entering new RetrievalQA chain...

Finished chain.
{'query': 'treatment options for COPD', 'result': "Treatment options for Chronic Obstructive Pulmonary Disease (COPD) often involve a combination of medication, lifestyle changes, and potentially surgery. Medications can include bronchodilators, inhaled steroids, combination inhalers, and oral steroids. Pulmonary rehabilitation programs can help manage the disease through exercise, disease management training, and nutritional advice. Oxygen therapy may also be necessary in severe cases. In extreme cases, surgeries like a lung transplant or lung volume reduction surgery might be considered. It's important to consult with a healthcare provider for personalized treatment plans.", 'source_documents': []}Do I need to use a tool? No
Final Answer: Based on the information obtained from the Vector Search Index, treatment options for Chronic Obstructive Pulmonary Disease (COPD) often involve a combination of medication, lifestyle changes, and potentially surgery. Medications can include bronchodilators, inhaled steroids, combination inhalers, and oral steroids. Pulmonary rehabilitation programs can help manage the disease through exercise, disease
management training, and nutritional advice. Oxygen therapy may also be necessary in severe cases. In extreme cases, surgeries like a lung transplant or lung volume reduction surgery might be considered.

Finished chain.

I know I haven't defined a 'retrieval_query' in the neo4jvector, but I thought at least something should be returned?
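(For reference, my understanding is that a custom retrieval_query for Neo4jVector has to return text, score and metadata, along the lines of the untested sketch below, which would then be passed as retrieval_query= to from_existing_index. The source and section properties assume the chunk nodes carry them.)

# untested sketch of a retrieval_query for Neo4jVector
retrieval_query = """
RETURN node.text AS text, score,
       {source: node.source, section: node.section} AS metadata
"""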

I think I've found my mistake. I've uploaded my embeddings as strings of lists of floats rather than lists of floats, so of course they are not being retrieved!
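In case it helps anyone else, here is a rough, untested sketch of a one-off fix for nodes that already have string embeddings (it assumes the same TextChunk label and embedding property as above, and a driver pointed at the same database):

import ast
from neo4j import GraphDatabase

URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"  # same instance as above
AUTH = ("neo4j", "Password")
driver = GraphDatabase.driver(URI, auth=AUTH)

with driver.session(database="neo4j") as session:
    # fetch every chunk that has an embedding property
    rows = session.run(
        "MATCH (c:TextChunk) WHERE c.embedding IS NOT NULL "
        "RETURN elementId(c) AS id, c.embedding AS emb"
    ).data()
    for row in rows:
        if isinstance(row['emb'], str):
            # parse the stringified list back into a list of floats and write it back
            session.run(
                "MATCH (c) WHERE elementId(c) = $id SET c.embedding = $emb",
                id=row['id'], emb=[float(x) for x in ast.literal_eval(row['emb'])]
            )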

Hello @enicholson, would you share what kind of information you imported into Neo4j GraphDB - structured or unstructured, PDF or CSV files, or some other source of data? And how is the newly imported information (data, chunks) embedded - is the index recreated, or is there a different approach?
Thanks in advance.

Hi @milena.kondeva, sure, I can share some more info about how this was done.

I was using CSV. It was structured like this:

Topic ID; Topic name; Section; Row ID; Chunk ID; Raw text; Unique ID

I restructured the df and embedded the columns of chunked text like this:

import pandas as pd

# df below is the DataFrame holding the raw chunked text loaded from the source CSV

# Getting chunks of a particular topic/section on the same row
df.sort_values(by=['Section', 'Row ID', 'Chunk ID'], inplace=True)
agg_df = df.groupby(['Topic ID', 'Topic name', 'Section', 'Row ID']).agg({
    'Raw text': lambda x: list(x)  # Aggregate 'Raw text' into a list
}).reset_index()

max_chunks = agg_df['Raw text'].apply(len).max()

# Expand each list of 'Raw text' into separate columns
expanded_df = pd.DataFrame(agg_df['Raw text'].tolist(), index=agg_df.index)
expanded_df.columns = ['Chunk_' + str(i) for i in range(max_chunks)]

# Merge expanded columns back with the original aggregated DataFrame
final_df = pd.concat([agg_df.drop(['Raw text'], axis=1), expanded_df], axis=1)

# Save the processed DataFrame to a new CSV for Neo4j import
final_df.to_csv('name.csv', index=True)

import os

from openai import OpenAI

# set the OPENAI API key here
os.environ["OPENAI_API_KEY"] = "xxx"

client = OpenAI()

# function to embed a text chunk
def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# function to apply the embeddings to a row of the df
def process_row(row):
    for column in row.index:
        text = row[column]
        if pd.notnull(text):  # Checks if the cell is not None/NaN
            row[column] = get_embedding(text)
    return row

# embedding the df raw text columns
for column in final_df.columns:
    if column.startswith("Chunk"):
        final_df[f'{column}_embedding'] = final_df[column].apply(lambda text: get_embedding(text) if pd.notnull(text) else None)

My resultant csv file had these columns:
'Index', 'Topic ID', 'Topic name', 'Section', 'Row ID', 'Chunk_0',
'Chunk_1', 'Chunk_2', 'Chunk_3', 'Chunk_4', 'Chunk_5', 'Chunk_6',
'Chunk_7', 'Chunk_8', 'Chunk_9', 'Chunk_10', 'Chunk_11',
'Chunk_0_embedding', 'Chunk_1_embedding', 'Chunk_2_embedding',
'Chunk_3_embedding', 'Chunk_4_embedding', 'Chunk_5_embedding',
'Chunk_6_embedding', 'Chunk_7_embedding', 'Chunk_8_embedding',
'Chunk_9_embedding', 'Chunk_10_embedding', 'Chunk_11_embedding', 'URI'

And I loaded them into Neo4j like this:

# start a new instance of a graph on Aura DB and swap out the details below:

from neo4j import GraphDatabase, basic_auth

URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AUTH = ("neo4j", "Password")

# create the driver once and keep it open for the import (a `with` block here
# would close it before the sessions below run)
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()

# Loading Topic, Section and Chunk_0 nodes

# this is connecting to where the raw files are

import ast
import csv
import requests

file_path = 'filepath/name.csv'

with open(file_path, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        embedding_list = ast.literal_eval(row['Chunk_0_embedding'])  # Convert string to list
        cypher_command = """
                MERGE (topic:Topic {id: $topic_id, name: $topic_name})
                MERGE (chunk0:TextChunk {index: $index})
                ON CREATE SET chunk0.text = $chunk_text,
                              chunk0.rowID = $row_id,
                              chunk0.embedding = $embedding_list,
                              chunk0.section = $section_name,
                              chunk0.source = $source
                MERGE (topic)-[:HAS {section: $section_name}]->(chunk0)
            """
        # Execute the Cypher command
        with driver.session(database="neo4j") as session:
            results = session.execute_write(
                    lambda tx: tx.run(cypher_command,
                                      topic_id=row['Topic ID'],
                                      topic_name=row['Topic name'],
                                      section_name=row['Section'],
                                      index=f"{row['Topic ID']}-{row['Index']}-0",
                                      chunk_text=row['Chunk_0'],
                                      row_id=row['Row ID'],
                                      source=row['URI'],
                                      embedding_list=embedding_list).data())

def process_additional_chunks(file_path, max_chunks):
    for i in range(1, max_chunks + 1):
        with open(file_path, newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                embedding_key = f'Chunk_{i}_embedding'
                if embedding_key in row and row[embedding_key]:
                    embedding_list = ast.literal_eval(row[embedding_key])
                    cypher_command = """
                        MATCH (prevChunk:TextChunk {index: $prev_index})
                        MERGE (chunk:TextChunk {index: $index})
                        ON CREATE SET chunk.text = $chunk_text,
                                      chunk.embedding = $embedding_list,
                                      chunk.rowID = $row_id,
                                      chunk.section = $section_name,
                                      chunk.source = $source
                        MERGE (prevChunk)-[:NEXT]->(chunk)
                    """
                    try:
                        with driver.session(database="neo4j") as session:
                            session.execute_write(lambda tx: tx.run(cypher_command,
                                prev_index=f"{row['Topic ID']}-{row['Index']}-{i-1}",
                                section_name=row['Section'],
                                index=f"{row['Topic ID']}-{row['Index']}-{i}",
                                chunk_text=row[f'Chunk_{i}'],
                                row_id=row['Row ID'],
                                source=row['URI'],
                                embedding_list=embedding_list))
                    except Exception as e:
                        print("Error executing Cypher command:", e)


process_additional_chunks(file_path, 11)

Then I created the vector index in Neo4j.
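For anyone following along, creating that index would look roughly like this (the exact syntax depends on the Neo4j version, and the 1536 dimensions assume text-embedding-ada-002):

# create the vector index over the TextChunk embeddings (Neo4j 5.13+ syntax)
create_index = """
CREATE VECTOR INDEX BestPracticeContent IF NOT EXISTS
FOR (c:TextChunk) ON (c.embedding)
OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
}}
"""

with driver.session(database="neo4j") as session:
    session.run(create_index)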

I hope this helps. Let me know if you would like any other info, or if you can see any improvements to the script.

Thanks for your fast and comprehensive response. The only thing that probably might be improved is the reading of the CSV file in the process_additional_chunks function - instead of reading it once per chunk, it is more efficient to read it once.
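For example, something along these lines (an untested sketch of the same function, reading the file once):

def process_additional_chunks(file_path, max_chunks):
    # read the CSV once, then walk through the chunks of each row
    with open(file_path, newline='') as csvfile:
        rows = list(csv.DictReader(csvfile))

    for row in rows:
        for i in range(1, max_chunks + 1):
            embedding_key = f'Chunk_{i}_embedding'
            if embedding_key in row and row[embedding_key]:
                embedding_list = ast.literal_eval(row[embedding_key])
                cypher_command = """
                    MATCH (prevChunk:TextChunk {index: $prev_index})
                    MERGE (chunk:TextChunk {index: $index})
                    ON CREATE SET chunk.text = $chunk_text,
                                  chunk.embedding = $embedding_list,
                                  chunk.rowID = $row_id,
                                  chunk.section = $section_name,
                                  chunk.source = $source
                    MERGE (prevChunk)-[:NEXT]->(chunk)
                """
                with driver.session(database="neo4j") as session:
                    session.execute_write(lambda tx: tx.run(cypher_command,
                        prev_index=f"{row['Topic ID']}-{row['Index']}-{i-1}",
                        section_name=row['Section'],
                        index=f"{row['Topic ID']}-{row['Index']}-{i}",
                        chunk_text=row[f'Chunk_{i}'],
                        row_id=row['Row ID'],
                        source=row['URI'],
                        embedding_list=embedding_list))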