Indexing values in a node list property {list: ['index me']}

Indexing nodes by each entry in a list property

  • Indices are primarily used to improve performance.
  • Performant Cypher on large and complex data requires careful and clever management of lists, maps, and collections.
  • Neo4j 4 will support only basic indices and fulltext indices.

However, there is no longer any path for indexing values in a list property.

{ mylistprop: ['name1', 'name2', 'indexme2'] }

There are multiple uses and applications for an index in which name1, name2, and indexme2 are each a key pointing to a single node.

Example simple graph

CREATE (:Thing {name: 'thing1', listprop: ['thing1', 'alias', 'alias 2']})
CREATE (:Thing {name: 'thing2', listprop: ['thing2', 'aka', 'another thing']})

Desired index:

'alias' → (thing1)
'alias 2' → (thing1)
'aka' → (thing2)
'another thing' → (thing2)
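To make the desired mapping concrete, here is a minimal sketch that lists the same key-to-node pairs with a plain scan. It assumes the two :Thing nodes from the example graph exist, and it drops each node's own name from the keys to match the list above. It does not use any index, so it only shows what the index should contain, not how to get index-backed lookups.

```cypher
// Assumes the example graph above has been created.
MATCH (t:Thing)
UNWIND t.listprop AS key
// Skip the entry that just repeats the node's own name.
WITH t, key
WHERE key <> t.name
RETURN key, t.name AS node
ORDER BY node, key
```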

Intended Use

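CALL apoc.load.json(url) YIELD value
WHERE value.name IS NOT NULL
// Desired behavior: find the node whose listprop contains value.name via an index.
// As written, this compares the whole list to a string and finds nothing.
OPTIONAL MATCH (prime:Thing {listprop: value.name})
USING INDEX prime:Thing(listprop)

// Reuse the matched node when there is one; otherwise fall back to the imported map.
WITH value AS imported, CASE WHEN prime IS NOT NULL THEN prime ELSE value END AS target
MERGE (x:Thing {name: target.name})
SET x = target
SET x.listprop = apoc.coll.toSet(coalesce(target.listprop, []) + imported.name)
// Attach the relationship to x, since target may be a plain map rather than a node.
MERGE (:Meta {usefuldetail: 'graph-power'})-[:ABOUT]->(x)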

DEPRECATED (will be removed in Neo4j 4.x):
Manual indexing in Neo4j and Cypher, and apoc.index.*.

That leaves four ways to accomplish this goal, all of which are bad options:

  1. Create a node for every entry in the lists being indexed.
    • This significantly inflates the DB and is a poor fit for large and complex graphs, which are the primary application for this kind of index.
  2. Use manual indexing, locking the application to Neo4j 3.
  3. Build a Lucene analyzer specifically for the purpose.
  4. Convert the list to a single string, replacing spaces with underscores, and use the whitespace fulltext index analyzer.
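For what it's worth, option 4 could look roughly like the sketch below. The property name aliasText and the index name thingAliases are made up for illustration, and the statements are meant to be run one at a time. It uses the fulltext procedures available in Neo4j 3.5 and 4.x, plus apoc.text.join.

```cypher
// Assumption: aliasText and thingAliases are illustrative names, not from the original post.
// Flatten each list into one string, with spaces inside an entry replaced by underscores.
MATCH (t:Thing)
SET t.aliasText = apoc.text.join([entry IN t.listprop | replace(entry, ' ', '_')], ' ');

// Fulltext index that splits on whitespace only, so each entry stays one token.
CALL db.index.fulltext.createNodeIndex('thingAliases', ['Thing'], ['aliasText'], {analyzer: 'whitespace'});

// Look up the node for the entry 'alias 2'.
CALL db.index.fulltext.queryNodes('thingAliases', 'alias_2') YIELD node
RETURN node.name;
```

Caveats with this approach: the whitespace analyzer is case-sensitive, entries that already contain underscores become ambiguous, and entries containing Lucene query syntax characters need escaping at query time.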

I suspect I'm missing something simple, and I may simply go the way of option 1 in the interest of preserving data. However, in many cases, including mine, this creates an n-to-n problem: the result grows to something like Nodes^n nodes and relationships, many times larger than necessary.

Am I missing something obvious to anyone?

I guess for now I'll go with the "blow up my database" option, and hopefully find some time to explore adding a better solution to Neo4j 4.5 or so at some point.
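For reference, a minimal sketch of what that option 1 workaround might look like. The :Alias label, the value property, the ALIAS_OF relationship type, and the index name alias_value are all names invented here. The index statement uses Neo4j 4.x syntax; on 3.x the equivalent is CREATE INDEX ON :Alias(value).

```cypher
// Assumption: Alias, value, ALIAS_OF, and alias_value are illustrative names.
// Ordinary index on the new alias nodes (Neo4j 4.x syntax).
CREATE INDEX alias_value FOR (a:Alias) ON (a.value);

// One :Alias node per list entry, linked back to the node that owns it.
MATCH (t:Thing)
UNWIND t.listprop AS entry
MERGE (a:Alias {value: entry})
MERGE (a)-[:ALIAS_OF]->(t);

// Index-backed lookup from a list entry to its node.
MATCH (a:Alias {value: 'alias 2'})-[:ALIAS_OF]->(t:Thing)
RETURN t.name;
```

Note that if two :Thing nodes share an entry, MERGE reuses the same :Alias node, so that key would point to more than one node.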