Fulltext indexing that can include context

Neo4j + Lucene are a powerful combination. There is one feature I would love to see in this integration. Lucene was built around documents, where a single document contains a collection of key-value pairs. When we read a document we expect all related information to be contained in that document. So when you index documents in a MongoDB database, or records in a MySQL database (i.e. rows of a single table), in both cases you are limited to the key:value pairs contained inside the container.
But a graph database differs from document stores; in a graph you generally benefit from a more exploded view of everything. So a Person node may not contain fields describing that person; instead, Person may have relationships to other nodes like Hobbies, Employment, Address...

When I index a Person, I would like to be able to use more context when I describe that Person to the Lucene index. I would like to be able to write a query that composes the set of fields for my Person node index, so that it includes Address.city, Employment.currentEmployer, Hobbies.favorite.

I can't see that there is any way to do this other than to run a query that creates an actual field in my Person node derived from those related nodes, and then base my index on that materialised field. Lucene will accept only one label at a time, and there is no place to specify a query.
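A minimal sketch of that workaround, for anyone who wants to try it. Everything here is an assumption rather than something from a real schema: the HAS_ADDRESS relationship type, the name and city properties, the searchContext property, the personContext index name, and the search term. It also assumes a Neo4j version that supports the CREATE FULLTEXT INDEX syntax.

```cypher
// Step 1: copy the context from related nodes onto the Person node.
// collect() skips nulls, so a Person without an Address gets an empty string.
MATCH (p:Person)
OPTIONAL MATCH (p)-[:HAS_ADDRESS]->(a:Address)
WITH p, collect(DISTINCT a.city) AS cities
SET p.searchContext = trim(reduce(text = '', city IN cities | text + city + ' '));

// Step 2: index the Person's own field together with the materialised one.
CREATE FULLTEXT INDEX personContext IF NOT EXISTS
FOR (p:Person) ON EACH [p.name, p.searchContext];

// Step 3: every hit is now a Person node, even when the match came from the city.
CALL db.index.fulltext.queryNodes("personContext", "London") YIELD node, score
RETURN node.name AS person, score;
```

The materialised property is only a snapshot, so step 1 has to be re-run (or otherwise kept in sync) whenever a related Address changes.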

Perhaps a great feature would be to allow the index creator to include fields from connected nodes.

PS: I do see a challenge in implementing this. The index for the Person node above would have to be aware when the Address node changes, so that any Person node connected to it would be re-indexed.

Hi,
Say you have (Person)-->(Address) and you want to index the combination of the two?

When creating a full-text index in Neo4j you can give more than one label. One caveat is that the index works at the node level, not at the path level.

CALL db.index.fulltext.createNodeIndex("PersonAddr",["Person", "Address"],["firstName", "lastName", "line1", "city","state"])

This creates an index that encompasses both Person and Address nodes. If you search for "Smith", you get Person nodes that have Smith in their name and Address nodes that have Smith in line1 or city.

If you have a Person named Smith living at "1 Smith ln", the search response will include both the Person node and the Address node.

If you do need to index at the path level, one way is to keep the properties on the Person node, as you surmised. Either you add them to the Person node manually, or you implement a transaction handler that updates these properties as part of beforeCommit or afterCommit.

I get it. That index may return one result that points to a Person node plus one result that points to an Address node, but the Address node could very well be connected to some other Person node. I think in that case I should search only on Address, and traverse to the Person node.
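Roughly, the traversal back to Person could look like the sketch below. It queries the PersonAddr index from the reply above; the HAS_ADDRESS relationship type and its direction are assumptions about the model, so adjust them to whatever actually connects Person and Address.

```cypher
CALL db.index.fulltext.queryNodes("PersonAddr", "Smith") YIELD node, score
// If the hit is an Address, hop back to the people who live there.
// If the hit is already a Person, the optional match finds nothing and we keep the hit itself.
OPTIONAL MATCH (resident:Person)-[:HAS_ADDRESS]->(node)
WITH coalesce(resident, node) AS person, score
// Drop Address hits that have no Person attached.
WHERE person:Person
// A Person can be reached through several hits; keep the best score for each.
RETURN person, max(score) AS score
ORDER BY score DESC;
```

An Address shared by several people produces one row per resident, which covers the case where the Address is connected to some other Person.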

And perhaps one day, Neo4j will find a way to allow us to specify a mother node, connected nodes, and fields on mother node and connected nodes, and then return to us just the mother node.

It's me again, the original poster. It's been about 4 years. Fulltext indexing is now better integrated into Neo4j. I have the same question. Would it be possible to take advantage of context in Neo4j and not feed Lucene only one document but specify an immediate neighborhood of nodes to be indexed? On a match the index would send back only the id of a chosen central node of that neighborhood. For example you may have a Person node and a tightly coupled Address node. Lucene indexes both, and any hits in Lucene direct the user back to the Person node. Creating the index would look something like this:

CREATE FULLTEXT INDEX PersonAddress FOR (p:Person, a:Address FROM (p)-[:HAS_ADDRESS]->(a)) ON EACH [p.name, a.city]
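For comparison, the closest thing I know of in the existing syntax is a multi-label index. A sketch, assuming the same name and city properties and a hypothetical index name:

```cypher
// Existing syntax: one index over two labels, still one Lucene entry per node.
CREATE FULLTEXT INDEX personOrAddress IF NOT EXISTS
FOR (n:Person|Address) ON EACH [n.name, n.city];
```

This indexes Person and Address nodes separately, so a hit on city still comes back as an Address node and has to be traversed to its Person by hand.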

I see that LLMs are moving into this search space of capturing objects plus their context and providing better search results. But I feel this feature I am asking for would pretty much nullify that advantage!

It's particularly important for a graph, as opposed to say MongoDB, because the MongoDB document store encourages all information related to a core concept to be stored together in a single document, even if it means duplicating some auxiliary or shared content. Graphs are modeled to not repeat information. The more granular the modelling of the graph, the more true that becomes. Allowing the user to include, say, a one-hop, relationship-connected neighborhood in the Lucene search index would have a profound impact on search result quality. And to the extent that index options influence modeling, this would encourage us not to think 'document' while designing our graph.
