Prompting to improve LLM responses from a text2cypher retriever

I am working on text2cypher based application and have hit a bit of a wall with the LLM responses.

The aim of the application is to interrogate complex technical infrastructure to answer questions about dependencies between items/objects. So far the application is working well with the graph we have developed, and overall performance has improved vastly with the introduction of several well-considered few-shot examples and exact-match pruning of the rather sizable schema.
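To give a sense of what I mean by exact-match pruning, here is a stripped-down sketch of the idea (the schema, labels, and helper names below are placeholders for illustration, not our actual code):

```python
# Simplified sketch of the exact-match schema pruning step.
# FULL_SCHEMA and RELATIONSHIPS stand in for our real (much larger) schema.

FULL_SCHEMA = {
    "Server": ["hostname", "os", "environment"],
    "Application": ["name", "version", "owner"],
    "Database": ["name", "engine"],
}
RELATIONSHIPS = [
    ("Application", "DEPENDS_ON", "Database"),
    ("Application", "RUNS_ON", "Server"),
]

def prune_schema(question: str) -> str:
    """Keep only node labels (and relationships between them) whose names
    appear verbatim in the question, to shrink the schema in the prompt."""
    q = question.lower()
    kept = {label for label in FULL_SCHEMA if label.lower() in q}
    lines = [f"(:{label} {{{', '.join(props)}}})"
             for label, props in FULL_SCHEMA.items() if label in kept]
    lines += [f"(:{a})-[:{r}]->(:{b})"
              for a, r, b in RELATIONSHIPS if a in kept and b in kept]
    return "\n".join(lines)

# Few-shot examples passed to the model alongside the pruned schema.
EXAMPLES = [
    "USER INPUT: 'Which databases does the billing application depend on?' "
    "QUERY: MATCH (a:Application {name: 'billing'})-[:DEPENDS_ON]->(d:Database) RETURN d.name",
]

print(prune_schema("Which databases does the billing application depend on?"))
```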

We are getting very good Cypher queries as responses that, under testing, are valid and actually answer the question. The challenge now is that the LLM takes those valid generated Cypher queries and then refuses to answer the question. The biggest issue is the LLM's propensity to return SQL versions of the generated Cypher query rather than the results of the Cypher query with the requisite context. Our test environment is currently limited to Llama 3.2:3b.

A bit of digging into the documentation surfaces the default text2cypher prompt, but there is little to no information on how to tweak the prompt to improve the responses. I have implemented a custom prompt that explicitly forbids SQL responses, but the LLM seems to ignore that instruction. In fact, it seems to ignore all the instructions.
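For reference, the custom answer prompt I have been experimenting with looks roughly like this (a sketch; the wording and variable names are mine, not the library default):

```python
# Sketch of the custom answer-generation prompt. The only requirement is that the
# final string contains the question, the generated Cypher, and the query results.

ANSWER_PROMPT = """You are answering questions about technical infrastructure.
You are given a question, the Cypher query that was run against the graph,
and the records that query returned.

Rules:
- Answer ONLY from the returned records.
- Do NOT write SQL. Do NOT rewrite or translate the Cypher query.
- If the records are empty, say that no matching data was found.

Question: {question}
Cypher query: {cypher}
Query results: {results}

Answer:"""

def build_answer_prompt(question: str, cypher: str, results: list) -> str:
    """Fill the template before sending it to the LLM for the final answer."""
    return ANSWER_PROMPT.format(question=question, cypher=cypher, results=results)
```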

Has anyone had success with custom text2cypher prompting? Is there a format or approach we should be investigating? Or is this more likely an issue with our use of the smaller, local Llama 3.2:3b model? If anyone is willing to share what has worked for them, it would be greatly appreciated.

Have you tried to connect to a remote model (e.g. Claude Opus) for testing?

It is likely you have a problem with the model, since 3b is literally like a baby.

I do some text analysis and the difference between Llama 8B and 405B is night and day … but the response time goes from seconds to minutes :)

(you will still have the potential for 'weird' issues, so you had better log everything as well).

I suspect you are right about the model size. It is strange because about a third of the time the model provides a nice, coherent response; it is good at counting things and can occasionally generate lists of things. The rest of the time it just spits back SQL queries of questionable quality or flat-out hallucinates, even with questions that are explicitly in the example set.

I had been hoping that there was some prompt tweaking that would help things along, but that may be something to revisit after a model upgrade.
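In the meantime I have been playing with a crude guard step that rejects anything SQL-shaped and asks Neo4j to plan the query before it reaches the answer stage. Roughly this sketch (the connection details, keywords, and names are placeholders, not our production code):

```python
import re
from neo4j import GraphDatabase

# Crude list of SQL keywords that should not appear in a Cypher query.
SQL_MARKERS = re.compile(r"\b(SELECT|FROM|INSERT INTO|GROUP BY|JOIN)\b", re.IGNORECASE)

def looks_like_sql(text: str) -> bool:
    """Rough check for SQL leaking into the model output."""
    return bool(SQL_MARKERS.search(text)) and "MATCH" not in text.upper()

def validate_cypher(driver, query: str) -> bool:
    """Ask Neo4j to plan the query without running it; invalid Cypher raises."""
    try:
        with driver.session() as session:
            session.run("EXPLAIN " + query).consume()
        return True
    except Exception:
        return False

# Placeholder connection details.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

generated = "MATCH (a:Application)-[:DEPENDS_ON]->(d) RETURN a.name, d.name"
if looks_like_sql(generated) or not validate_cypher(driver, generated):
    print("Reject the output and retry generation")
```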

Thanks for taking a moment to respond, really appreciated.

You have to remember that the "billions of parameters" in a model represent everything that went into training it … which means that unless you have a model specifically trained for natural-language-to-SQL work, your target domain accounts for only a small percentage of the total model's size.

Since the common models focus mostly on natural language, it is only as the model grows that it starts covering other domains as well.

So a smaller model might be 1% SQL, whilst a very large model might be 3% (and when you scale up, that 3% works out to 20x or more in absolute terms).

You need to be mindful that GenAI + RAG + whatever is, so far, incapable of removing hallucinations, and if you have multi-step processes that call into GenAI, you have to multiply the per-step accuracies:

- 1-step call: 99.5%
- 2-step call: 0.995 × 0.995 ≈ 99%

(and in practice you will usually get around 80-90% accuracy per step)
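A quick way to see how it compounds (just the arithmetic, nothing model-specific):

```python
# Per-step accuracy compounds multiplicatively across a multi-step pipeline.
for per_step in (0.995, 0.90, 0.80):
    for steps in (1, 2, 3):
        print(f"{per_step:.1%} per step, {steps} step(s): {per_step ** steps:.1%}")
```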

I would absolutely accept 80% accuracy. It would still save us hours either way. Thanks again for the information. It gives me lots to think about to get this working as expected.

I suppose 🙂

The problem is that the 80% might look like "2 bad, 8 good" … but it could just as well be "10 partially good". And as queries get more complex, that is a lot of debugging.

Also, I was referring to 80% per step … 2 steps = 0.8 × 0.8 = 64% … 3 steps ≈ 51%.