We are building a GraphRAG application to make a collection of old documents accessible in an intelligent manner. The "diffbot" method from the "LLM Knowledge Graph Builder" doesn't work for us, since Diffbot only knows public entities (not the ones from our organization: products, customers, ...).
So we trained a spaCy model for NER, and we now get a list of named entities for every document and paragraph. However, since the spelling of the extracted entities sometimes differs from the name or title attributes of the nodes in our graph, I am wondering whether there is research or best practice on how to effectively onboard or link entities from incoming text documents to existing graph nodes. Levenshtein distance? Application logic? Or is there something helpful available in Neo4j itself?
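For illustration, here is a minimal sketch of the fuzzy-matching step I have in mind, using only Python's standard library (the entity and node names are made up; a real pipeline would presumably use a dedicated library such as rapidfuzz, or a similarity function on the database side):

```python
from difflib import get_close_matches

def link_entity(entity, node_names, cutoff=0.8):
    """Return the best-matching graph node name for an extracted entity,
    or None if nothing is similar enough (cutoff is a 0..1 ratio)."""
    matches = get_close_matches(entity, node_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical node names loaded from the graph:
nodes = ["Acme Corporation", "WidgetPro 3000", "Jane Doe"]

# An NER result with a spelling variant of an existing node:
print(link_entity("Acme Corp.", nodes, cutoff=0.6))  # "Acme Corporation"
```

Of course this only covers surface-level spelling variants; abbreviations, aliases, and renamed products would still need an alias table or embedding-based matching on top.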
BTW, the combination of LLMs and 'traditional' NLP methods, as in this case, looks quite promising.
Thanks for any input!
Cheers,
Chris