What's a better way to search/compare list elements? any() function

TL;DR

Comparing array elements is inefficient - how to improve?
Should I use a proper programming language? Which can be recommended?

MATCH (a:example_label),(b:example_label)
WHERE any(temp IN a.list WHERE temp IN b.list)
MERGE (a)-[:relationship]->(b);

Setup / Data Structures

I am on Neo4j 4.3.2.
All values and labels have been shortened for simplicity.
I have nodes with label nC. They have two list properties: list_intern and list_extern.
See code below.

MATCH (n:nC)
RETURN * LIMIT 1;
{
  "identity": 324,
  "labels": [
    "nC"
  ],
  "properties": {
    "list_intern": [
      "123",
      "abc"
    ],
    "list_extern": [
      "456",
      "123"
    ],
    "id": "319414",
    "name": "foobar"
  }
}

Lists are sometimes null (do not exist). Lists are sometimes empty. List items are sometimes equal to the string null.
Most lists contain 2 to 15 items, some lists have over 50 items.
Lists are sorted, but sometimes contain duplicate elements. Duplicates provide no further use; it would be possible to discard them.
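
Since the duplicates (and the literal string "null") carry no information, one option would be to normalise the lists once up front. A minimal sketch, assuming APOC is available and using the property names shown above (list_extern would be handled the same way):

// Drop "null" entries and duplicates from list_intern in place
MATCH (n:nC)
WHERE n.list_intern IS NOT NULL
SET n.list_intern = apoc.coll.toSet([x IN n.list_intern WHERE x <> "null"]);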

Goal

I need to find all pairs of nodes where at least one non-empty entry of list_intern matches any non-empty entry of list_extern, and create a relationship between each such pair of nodes. My source for this data doesn't provide relationship information, so I have to compute the relationships myself.

Trivial example

Using the any() function.

MATCH (a:nC),(b:nC)
WHERE any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->(b);

Improved Version

I was able to exclude some useless data from being processed by any():

MATCH (a:nC),(b:nC)
WHERE a.list_intern IS NOT NULL AND b.list_extern IS NOT NULL
AND NOT all(temp IN b.list_extern WHERE temp="null")
AND any(temp IN a.list_intern WHERE temp IN b.list_extern)
MERGE (a)-[:relationship]->(b);

Recently I discovered that I could replace any() with

apoc.coll.intersection(a.list_intern, b.list_extern) <> []

but that didn't make a difference.
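
For completeness, slotted into the improved query above the intersection variant would read roughly like this (same filters; just a sketch):

// Same query as above, with any() swapped for apoc.coll.intersection()
MATCH (a:nC),(b:nC)
WHERE a.list_intern IS NOT NULL AND b.list_extern IS NOT NULL
AND NOT all(temp IN b.list_extern WHERE temp="null")
AND apoc.coll.intersection(a.list_intern, b.list_extern) <> []
MERGE (a)-[:relationship]->(b);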

Performance improved by using a SOLR index

So, conveniently, the data this DB uses is provided by an Apache SOLR instance.
For those not familiar: SOLR builds inverted indices, so you can search your dataset quickly. It offers a REST API through which I can request all objects that fulfill certain conditions.

I can make a GET request to receive all docs where item is contained in list_extern:

CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')

where I replace <URL of SOLR> with a valid value.
Note that SOLR documents have a property type which is equal to the label used (type=nC).
This is my best attempt at this problem, still using Cypher:

MATCH (a:nC)
WHERE a.list_intern IS NOT NULL
UNWIND a.list_intern AS item
CALL apoc.load.json('<URL of SOLR>select?omitHeader=true&wt=json&rows=10000&fl=*&q=type:nC%20AND%20list_extern:%22'+item+'%22')
YIELD value UNWIND value.response.docs AS bcsv
WITH a, bcsv.id AS bid
MATCH (b:nC {id: bid})
MERGE (a)-[:relationship]->(b);
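
As an aside, if there is no index on nC.id yet, the lookup MATCH (b:nC {id: bid}) should benefit from one; a sketch using Neo4j 4.x syntax (the index name is arbitrary):

// Index to speed up the id lookup in the query above
CREATE INDEX nC_id IF NOT EXISTS FOR (n:nC) ON (n.id);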

The machine has 8 GB RAM and the dataset is tiny: 27k nodes, of which 25k have the label nC.
Still too slow. The dataset is going to grow with similar data, at least doubling the node count.

Question

How can this query be improved?
I know that performance also heavily depends on hardware. I want to reduce time complexity and computation time needed to complete the query.

Should I instead compute these relationships using a programming language?

Any recommendations?
Would you recommend using a Neo4j driver to send the MERGE instructions, or should I build a new JSON/CSV containing the relationship information (which would then be processed using Cypher)?

Thanks for reading, appreciate any comments!

Looking at it a different way, is it possible to represent the list elements as nodes and then replace the lists of elements stored as properties with relationships to the terms? Assuming the item nodes have a label of Term, your query may look like the following:

MATCH (a:nC),(b:nC)
WHERE exists ( (a)-[:HAS_INTERNAL_TERM]->(:Term)<-[:HAS_EXTERNAL_TERM]-(b)  )
MERGE (a)-[:relationship]->(b)

If the lists come with the ingested data and are only used for the purpose of linking ‘a’ to ‘b’, then you can create the terms and the links to the terms as you ingest each node. You can then add the deletion of the HAS_*_TERM relationships to the above query, and periodically clean up unused Term nodes.
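
A minimal sketch of that ingestion step for the internal list, assuming the :Term label and the HAS_INTERNAL_TERM relationship type from the query above (list_extern / HAS_EXTERNAL_TERM would be handled analogously):

// Turn each distinct internal list entry into a :Term node and link it
MATCH (a:nC)
WHERE a.list_intern IS NOT NULL
UNWIND a.list_intern AS t
WITH a, t WHERE t <> "null"
MERGE (term:Term {value: t})
MERGE (a)-[:HAS_INTERNAL_TERM]->(term);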

If the HAS_*_TERM relationships were permanent, would you even need the ‘relationship’ between the two nodes? You could use the common linkage to at least one term to relate any two ‘nC’ nodes.

Sorry in advance if this doesn’t make sense for your application.

Back to your question. How are you running these queries now, manually in the Neo4j Browser? I don’t see how to use Java and the driver to improve this. You need to match two candidate nodes and then test whether the two nodes are similar by comparing their lists of terms. I don’t see how you would split this up between Cypher and Java.

It would be possible to build a custom function using the Neo4j Java API that takes two nodes and returns a Boolean value: true if they have at least one term in common and false otherwise. The function would use Java to compare the lists. Assuming this function is called ‘isSimilar’, your query would be the following:

MATCH (a:nC),(b:nC)
WHERE isSimilar(a, b)
MERGE (a)-[:relationship]->(b)

The custom function solution requires deploying the code on the server. Do you have access to your server, or are you using Aura? Deploying custom code is not allowed with Aura. I can write the custom function if you want to try it.

Thanks for your reply.

This is possible. Great idea: indices on lists are not supported, but this way indices on properties can be used.
When creating the data model I did think of that possibility, but discarded it, not knowing about indices in depth.
I will test this ASAP.
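
For the :Term nodes to be matched via an index, something like this should work; a sketch assuming the term string is stored in a property called value:

// Index on the term value used by the MERGE during ingestion
CREATE INDEX term_value IF NOT EXISTS FOR (t:Term) ON (t.value);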

Displayed relationships have to be (:nC)-[]->(:nC).
When making a request to the graph application, users should not see :Term nodes. Their content is not secret; it may still appear as list properties of :nC nodes, though.
apoc.create.vRelationship() will be useful if I decide to keep them.
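
A rough sketch of how that could look, assuming the Term model from above; apoc.create.vRelationship returns virtual relationships, so the :Term nodes stay out of the result:

// Show nC-to-nC links without returning the :Term nodes themselves
MATCH (a:nC)-[:HAS_INTERNAL_TERM]->(:Term)<-[:HAS_EXTERNAL_TERM]-(b:nC)
WITH DISTINCT a, b
RETURN a, b, apoc.create.vRelationship(a, 'relationship', {}, b) AS rel;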

yes / cypher-shell sometimes.

I was thinking about manipulating the source data before importing, before creating any nodes. I would read the .json I get from SOLR, compute the relationships, and add them to the .json or write them to another file. Those files would be loaded in a Cypher query like the one above.
I am thinking that using a "real" programming language or an ETL tool (no DB reads, just some items in lists) will be quicker.
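
A sketch of what loading such a precomputed file could look like; the file name and the from/to field names are made up for illustration:

// Hypothetical relationships.json: [{"from": "319414", "to": "..."}, ...]
CALL apoc.load.json('file:///relationships.json') YIELD value
MATCH (a:nC {id: value.from})
MATCH (b:nC {id: value.to})
MERGE (a)-[:relationship]->(b);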

yes, full access.

If you want to, sure let's find out what works best! :grin:

Thank you so far for this in-depth response.

I finished the custom function. Do you have a domain name? If so, I can use it as the prefix to the method when calling it. It is best to have your own namespace so your custom methods don't conflict with others with the same name.

I am not sure what you are planning to do with the custom function. How are domain names relevant?
I have no idea; I am curious.
I assumed the custom function would be deployed somewhere local on the server that is also running Neo4j, so there's no need to worry about name conflicts.

Regarding the topic:
I am currently testing apoc.periodic.iterate() and apoc.periodic.commit(), have read a ton about memory configuration, and have bookmarked similar topics in this forum.
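
For reference, one way to wrap the earlier any()-based query in apoc.periodic.iterate might look like this (the batch size is just a guess):

// Iterate over candidate a-nodes in batches; do the pairing and MERGE per batch
CALL apoc.periodic.iterate(
  "MATCH (a:nC) WHERE a.list_intern IS NOT NULL RETURN a",
  "MATCH (b:nC) WHERE b.list_extern IS NOT NULL AND any(temp IN a.list_intern WHERE temp IN b.list_extern) MERGE (a)-[:relationship]->(b)",
  {batchSize: 100, parallel: false});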
You'll hear from me again next week.
Also, I found this comment:

which reinforces taking a look at computing relationships before importing the data.

Ok, I will use a reasonable name. I will just upload the source and get you the jar, so you don't have to build it. You can try it if you want to see if it helps.

Preprocessing the data before ingestion is also an option to investigate. If you can process the data, find the pairs of nodes with common elements in the two arrays, and create a file with these pairs, your ingestion should be much faster. Are these nodes being streamed in real time, or do you get them offline and can process them beforehand?

I can't upload binary items to the board. If you are interested, send me an email address and I can send you the source code and the jar so you don't have to build it to test.

Your query would look like what is shown below. I externalized the property names, so in case they change you can just update your query instead of the source. It's assumed that each of the two properties consists of a single string value or a list of string values; they can have any combination of list and scalar values. isSimilar will return true when the two properties have at least one value in common, false otherwise.

MATCH (a:nC),(b:nC)
WHERE customFunctions.isSimilar('list_intern', a, 'list_extern', b)
MERGE (a)-[:relationship]->(b)