Indexing nodes by each entry in a list property
- Indices are primarily used to improve performance.
- Performant Cypher on large and complex data requires careful and clever management of lists, maps, and collections.
- Neo4j 4 will support only basic indices and fulltext indices.
However, Neo4j 4 no longer provides any way to index the individual values of a list property, such as:
{ mylistprop: ['name1', 'name2','indexme2'] }
There are many uses for an index in which name1, name2, and indexme2
are each a key pointing to a single node.
Example simple graph
CREATE (:Thing {name: 'thing1', listprop: ['thing1', 'alias', 'alias 2']})
CREATE (:Thing {name: 'thing2', listprop: ['thing2', 'aka', 'another thing']})
Desired index:
'alias' → (thing1)
'alias 2' → (thing1)
'aka' → (thing2)
'another thing' → (thing2)
Intended Use
CALL apoc.load.json(url) YIELD value
WHERE exists(value.name)
OPTIONAL MATCH (prime:Thing {listprop: value.name})
USING INDEX prime:Thing(listprop)
WITH value as imported, CASE WHEN prime IS NOT NULL THEN prime ELSE value END AS target
MERGE (x:Thing {name: target.name})
SET x = target
SET x.listprop = apoc.coll.toSet(target.listprop + imported.name)
MERGE (:Meta {usefuldetail: 'graph-power'})-[:ABOUT]->(x)
DEPRECATED (will be removed in Neo4j 4.x):
Neo4j, Cypher manual indexing, and apoc.index.*.
That leaves four ways to accomplish this goal, all of which are bad options:
- Create a node for every property in the lists being indexed.
- Significantly inflates your DB, and is not ideal when needing an index for large and complex graphs, which is the primary application for this kind of index.
- Use manual indexing, which locks the application to Neo4j 3.
- Build a Lucene analyzer specifically for this purpose.
- Convert the list to a single string, replacing the spaces within each entry with underscores, and use the whitespace fulltext index analyzer.
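For reference, the last option could be sketched roughly as follows. This is only an illustration, assuming Neo4j 3.5 or an early 4.x release where the db.index.fulltext.* procedures are available. The property name aliasText and the index name thingAliases are made up for the example. As far as I know the whitespace analyzer does not lowercase tokens, so lookups would be case-sensitive, and any Lucene special characters in an entry would need escaping in the query string.

```cypher
// Step 1: store a search-friendly copy of the list in a string property.
// 'aliasText' is a hypothetical property name. Spaces inside an entry become
// underscores so that each entry survives as one whitespace-delimited token.
MATCH (t:Thing)
SET t.aliasText = trim(reduce(s = '', entry IN t.listprop | s + ' ' + replace(entry, ' ', '_')));

// Step 2: create a fulltext index over that property using the whitespace analyzer.
// 'thingAliases' is a hypothetical index name.
CALL db.index.fulltext.createNodeIndex('thingAliases', ['Thing'], ['aliasText'], {analyzer: 'whitespace'});

// Step 3: look up a node by a single list entry, applying the same
// space-to-underscore conversion to the search term.
CALL db.index.fulltext.queryNodes('thingAliases', replace('alias 2', ' ', '_'))
YIELD node, score
RETURN node.name AS name, score;
```

The cost is that aliasText has to be kept in sync with listprop on every write, and the lookup goes through a procedure call rather than a plain MATCH with USING INDEX.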
I suspect I'm missing something simple, and I may simply go with option 1 in the interest of preserving data. However, in many cases, including mine, this creates an n-to-n problem, where the result is on the order of Nodes^n nodes and relationships, many times larger than necessary.
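To make option 1 concrete, here is a minimal sketch of what it would look like on the example graph. The :Alias label, its value property, and the :ALIAS_OF relationship type are names invented for this illustration. The CREATE INDEX ON form is the Neo4j 3.x syntax and may need adjusting on later versions.

```cypher
// Basic index on the hypothetical :Alias(value) property.
CREATE INDEX ON :Alias(value);

// Create one :Alias node per list entry and link it to its :Thing.
// This is where the extra nodes and relationships come from.
MATCH (t:Thing)
UNWIND t.listprop AS entry
MERGE (a:Alias {value: entry})
MERGE (a)-[:ALIAS_OF]->(t);

// Look up a :Thing by any single entry of its original list.
MATCH (:Alias {value: 'alias 2'})-[:ALIAS_OF]->(t:Thing)
RETURN t.name AS name;
```

Because the alias nodes are merged on value alone, an entry shared by several things becomes one :Alias node with several :ALIAS_OF relationships, which is one way the n-to-n shape shows up.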
Am I missing something obvious to anyone?