TL;DR
Comparing list properties pairwise across all nodes is inefficient. How can this query be improved?
Should I use a proper programming language instead? If so, which one is recommended?
MATCH (a:example_label),(b:example_label)
WHERE any(temp IN a.list WHERE temp IN b.list)
MERGE (a)-[:relationship]->(b);
Setup / Data Structures
I am on Neo4j 4.3.2.
All values and labels made short for simplicity.
I have nodes with the label nC. They have two list properties: list_intern and list_extern. See the example below.
MATCH (n:nC)
RETURN * LIMIT 1;
{
"identity": 324,
"labels": [
"nC"
],
"properties": {
"list_intern": [
"123",
"abc"
],
"list_extern": [
"456",
"123"
],
"id": "319414",
"name": "foobar"
}
}
Lists are sometimes null (i.e., the property does not exist). Lists are sometimes empty. List items are sometimes equal to the string "null".
Most lists contain 2 to 15 items, some lists have over 50 items.
Lists are sorted, but sometimes contain duplicate elements. Duplicates provide no additional value; it would be possible to discard them.
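For illustration, the cleanup described above (drop missing values, drop the literal string "null", discard duplicates) could be sketched like this in Python; the function name clean_list is my own, not part of the dataset:

```python
def clean_list(raw):
    """Return a sorted, duplicate-free list without None, empty, or "null" entries.

    raw may be None, since the list property does not always exist.
    """
    if raw is None:
        return []
    # A set drops duplicates; the filter removes empty strings and the
    # literal string "null", which carry no matching value.
    return sorted({item for item in raw if item and item != "null"})
```
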
Goal
I need to find all pairs of nodes where at least one non-empty entry of list_intern matches any non-empty entry of list_extern, and create a relationship between each such pair of nodes. My data source does not provide relationship information, so I have to compute it.
Trivial example
Using the any() function.
MATCH (a:nC),(b:nC)
WHERE any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->(b);
Improved Version
I was able to exclude some useless data before it is processed by any():
MATCH (a:nC),(b:nC)
WHERE a.list_intern IS NOT NULL AND b.list_extern IS NOT NULL
AND NOT all(temp IN b.list_extern WHERE temp="null")
AND any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->(b);
Recently I discovered that I could replace any() with
apoc.coll.intersection(a.list_intern, b.list_extern) <> []
but that made no difference in performance.
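As a side note on what any() and apoc.coll.intersection compute here: in a general-purpose language the same check is a set-membership test. A minimal Python sketch (the function name lists_overlap is my own):

```python
def lists_overlap(list_intern, list_extern):
    """True if the two lists share at least one usable element.

    Mirrors: any(temp IN a.list_intern WHERE temp IN b.list_extern),
    plus the extra guards against null/"null" values from the query above.
    """
    if not list_intern or not list_extern:
        return False
    # Build the lookup set once; membership tests are then O(1) per item.
    extern = set(list_extern) - {"null"}
    return any(item in extern for item in list_intern if item != "null")
```
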
Performance improved by using a SOLR index
Conveniently, the data this database uses is provided by an Apache SOLR instance.
For those not familiar: SOLR builds inverted indices, so you can search your dataset quickly. It offers a REST API through which I can request all objects that fulfill certain conditions.
I can make a GET request to receive all docs where item is contained in list_extern:
CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')
where I replace <URL of SOLR> with a valid value.
Note that SOLR documents have a property type which is equal to the label used: type=nC.
This is my best attempt at the problem, still using Cypher:
MATCH (a:nC)
WHERE a.list_intern IS NOT NULL
UNWIND a.list_intern AS item
CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')
YIELD value
UNWIND value.response.docs AS doc
WITH a, doc.id AS bid
MATCH (b:nC {id: bid})
MERGE (a)-[:relationship]->(b);
The machine has 8 GB of RAM and the dataset is tiny: 27k nodes, of which 25k have the label nC.
This is still too slow. The dataset is going to be extended with similar data, at least doubling the node count.
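The SOLR query above is essentially an inverted-index lookup. For comparison, the same idea can be sketched as an in-memory computation in Python (the data layout and names below are my own assumptions, not the actual export format): map each list_extern item to the node ids containing it, then look up each list_intern item. This is roughly linear in the total number of list items instead of quadratic in the node count.

```python
from collections import defaultdict

def match_pairs(nodes):
    """nodes: dict of node_id -> {"intern": [...], "extern": [...]}.

    Returns the set of (a, b) pairs where some item of a's intern list
    appears in b's extern list, skipping the useless "null" entries.
    """
    # Inverted index: item -> ids of nodes whose extern list contains it.
    index = defaultdict(set)
    for nid, props in nodes.items():
        for item in props.get("extern") or []:
            if item != "null":
                index[item].add(nid)
    # One index lookup per intern item instead of one scan per node pair.
    pairs = set()
    for nid, props in nodes.items():
        for item in props.get("intern") or []:
            for other in index.get(item, ()):
                pairs.add((nid, other))
    return pairs
```
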
Question
How can this query be improved?
I know that performance also depends heavily on hardware. I want to reduce the time complexity and the computation time needed to complete the query.
Should I instead compute these relationships in a programming language? Any recommendations?
Do you recommend using a Neo4j driver to send the MERGE instructions, or should I build a new JSON/CSV file containing the relationship information, which is then processed using Cypher?