I have yet to understand how to profile Neo4j correctly, but I am attempting to apply best practices in writing queries. I am running into some performance problems.
In my use case, I need to add ties between nodes that, for now, we can identify by a "token_id" variable. The network is quite dense, so each query, grouped by originating node, adds about 100 ties to other nodes.
In my data model, ties have weights and a timestamp. Given this, I decided on a layout where the ties are represented as nodes (so that I can index timestamps on the ties), connected by anonymous directed relationships. That is:
(a:token {token_id: x})-[:onto]->(e:edge {weight: ..., timestamp: ...})-[:onto]->(b:token)
I have a uniqueness constraint on token_id and an index on timestamp.
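For reference, this is roughly how I set those up (syntax as in Neo4j 4.4+; older versions use the CREATE CONSTRAINT ON ... ASSERT form instead):

```cypher
// Uniqueness constraint on token_id (this also creates a backing index)
CREATE CONSTRAINT token_id_unique IF NOT EXISTS
FOR (t:token) REQUIRE t.token_id IS UNIQUE;

// Index on the timestamp property of the edge nodes
CREATE INDEX edge_timestamp IF NOT EXISTS
FOR (e:edge) ON (e.timestamp);
```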
Performance for retrieval is then quite good. But I need to do a lot of merging, and that is terribly slow.
After some research, I found that I should use a double UNWIND. Suppose my data is in a list called "sets". Each element has two fields: "ego" for the originating token, and "ties" for all ties that originate there. My program supplies this via JSON as a parameter.
Here is an example (to run it, you need to add nodes with the corresponding token_ids first; p1 and p2 are unindexed parameters):
:param sets=>[{ego: 12, ties: [{alter: 11, time: 20000101, weight: 0.5, p1: 15, p2: 0},{alter: 13, time: 20000101, weight: 0.5, p1: 15, p2: 0}]},{ego: 12, ties: [{alter: 14, time: 20000101, weight: 0.5, p1: 15, p2: 0},{alter: 11, time: 20000101, weight: 0.5, p1: 15, p2: 0}]}]
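For completeness, this is a sketch of how I assemble that parameter on the Python side; the flat edges list and its field order here are just an illustration of my own data, not anything Neo4j-specific:

```python
from collections import defaultdict

# Hypothetical flat edge records: (ego, alter, time, weight, p1, p2)
edges = [
    (12, 11, 20000101, 0.5, 15, 0),
    (12, 13, 20000101, 0.5, 15, 0),
    (12, 14, 20000101, 0.5, 15, 0),
]

def build_sets(edges):
    """Group flat edge tuples into the {ego, ties} structure used as $sets."""
    by_ego = defaultdict(list)
    for ego, alter, time, weight, p1, p2 in edges:
        by_ego[ego].append(
            {"alter": alter, "time": time, "weight": weight, "p1": p1, "p2": p2}
        )
    return [{"ego": ego, "ties": ties} for ego, ties in by_ego.items()]

sets_param = build_sets(edges)
```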
The corresponding query is then:
UNWIND $sets AS set
MATCH (a:token {token_id: set.ego})
WITH a, set
UNWIND set.ties AS tie
MATCH (b:token {token_id: tie.alter})
MERGE (b)<-[:onto]-(r:edge {weight: tie.weight, time: tie.time, p1: tie.p1, p2: tie.p2})<-[:onto]-(a)
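One variant I have been considering, assuming it is acceptable to treat (a, b, time) as the identity of a tie, is to MERGE on a minimal pattern and set the remaining properties only on creation, so MERGE does not have to compare every property:

```cypher
UNWIND $sets AS set
MATCH (a:token {token_id: set.ego})
UNWIND set.ties AS tie
MATCH (b:token {token_id: tie.alter})
// MERGE only on the identifying part of the pattern ...
MERGE (a)-[:onto]->(r:edge {time: tie.time})-[:onto]->(b)
// ... and fill in the non-identifying properties when the tie is new
ON CREATE SET r.weight = tie.weight, r.p1 = tie.p1, r.p2 = tie.p2
```

I am not sure whether this actually changes the plan much, which is part of my question.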
This could be faster. If I run PROFILE, I get a curiously huge plan which I don't understand.
My main intent was to have Neo4j match the indexed originating node first, THEN unwind the ties and add them. That way, it would go row by row, where each row is an element of "sets" and corresponds to one originating node.
The plan suggests Neo4j does a lot of additional global index work on top of this - but there is a good chance I misunderstand what is going on.
I can vary the size of "sets" per query programmatically. Currently I pass about 100-200 such sets per HTTP request from Python, where each set has about 100 ties. In every case it is the same query, only with different parameters, so I was hoping Neo4j could get some more performance out of that.
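The batching itself is trivial on my side - a sketch, where batch_size is the knob I vary per request:

```python
def chunk_sets(sets, batch_size):
    """Split the full list of sets into batches of at most batch_size,
    so each HTTP request carries one batch as the $sets parameter."""
    return [sets[i:i + batch_size] for i in range(0, len(sets), batch_size)]

# e.g. 450 sets at 200 per request -> 3 requests of sizes 200, 200, 50
batches = chunk_sets(list(range(450)), 200)
```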
Is this the best I can do? If so, I would try to squeeze out more by posting these queries concurrently with my analysis (I have already found out that parallel writes at these volumes are a no-no). Or can I improve the query?
In my situation, what do you think would be an efficient query?