Performance question - how to work with indexes

Hello all!
this is gonna be a bit of a generic question.

I have a graph with nodes that have 20+ properties, lets say we have an id, hash, groupA, groupB properties + some more that aren't relevant.

I have a constraint on the id property which must be unique

I periodically pull data from an api and send it to the DB, to CRUD the nodes, so if the hash of any node changed I update the updateTime, if I send a node with an id that doesnt exists it creates it and populates the createTime, and if there is already in the DB a node with an id that wasnt sent it should "delete" the node (actually just updates the label and not really deletes it). now all this happens for different "groups" (A and B)

at first I did it all in the cypher query, sent all the nodes to the cypher and had a big query that did all the logic, the problem was that most times I was sending lots of data that wasnt being updated and the query took to long to run, which I figured was caused by sending all the data to the DB when it wasnt necessary, so we moved the CRUD logic into the server.

so now in order to know what to update/create/delete we run more queries,
the first one looks something like this match (n:Label {groupA: 'a', groupB: 'b'}) return n.id as id, n.hash as hash
then in the server side we check which nodes hash changed, which id's didnt exists in the db to create and which id's dont need to exists anymore.

the problem is that this query takes a long time, and I was wondering how to speed it up.
my guess is adding some indexing, but I didnt understand very well from the docs how the indexes work. (we have around 150k+ nodes of the same label, groupA and groupB)

should I add 3 indexes for each property - groupA, groupB, hash(I guess id doesnt need index because of the constraint)
or should I add 1 index with all 3 properties?
and will that even help?

thanks a lot.

Hi @aideslucas !

Welcome to the community!

You have share a lot of info and looks like you are on the right track. Are you using APOC on your Neo4J installation? You may like considering Triggers for the update/create logic on updateTime and createTime. Yes, it may be considered as vendor lock-in but it depends on your server load managment.

About your question, you need indexes just on the properties you are planning to conditionally query, in this case, groupA and groupB. Having extra properties may help you as a cache mechanism if you wanna avoid access to the nodes, but it's not that worthy.

Please try :

CREATE INDEX GROUP_A_GROUP_B_LABEL for (n.Label) on (n.groupA,n.groupB);

Then retry the query. If you have any issues, add Explain to the query and share the execution plan here.

Have fun!

Bennu

Ok, I created the index, and I dont see that it made the performance better, I see in the profile that there are way less db hits, but because the amount of data is quite big the query still takes around 1.5 minutes to finish.

here is the before profile

and after the index profile

I have lots of different labels where the index will help, but on this specific label most of the data has the same values for both groupA and groupB, so I'm not sure if this will help

Hi @aideslucas ,

Are you measuring this time on your server or just the query itself? This query should not take that long for sure. What kind of operations are you doing server side? Are you collecting the data?

If you wanna play a little bit, you can try with a 4 properties index so instead of a projection you will have cache usage, it can't be faster than that.

Something like,

CREATE INDEX GROUP_A_GROUP_B_ID_HASH_LABEL for (n.Label) on (n.groupA,n.groupB, n.id, n.hash);

Bennu

I'm measuring the time on the neo4j browser.
since I run the query until it shows on the table view..

with the new index (including id and hash) it went back to the same profile like without indexes...