Hi, I have been experiences extremely slow relationship merges to Neo4j. When rerunning a merge of data already inserted, the query runs 10x faster (as there are no writes to perform), but when none of the nodes / relationships exist, the query runs very slowly.
I have the following code:
UNWIND $author.organizations AS orgs
# fulltext lookup to find organization (variable orgResult)
MERGE (p: Person {{ name: $author.name }})
MERGE (organization: Organization {{ name: orgResult }})
MERGE (p)-[:AFFILIATED_WITH]-(organization)
I have verified with PROFILE that index lookups on name are being used for both Persons and Organizations. I also have tried using a directional relationship merge. I have tried batch sizes from 2000 relationships to 20000 relationships and the writes / second is always the same (13 relationships / second). With tens of millions of relationships this would take well over a day to insert.
Query performance can be greatly increased by creating proper indexing. So please review your indexing and use it appropriately.
Many thanks
Mr Sameer Sudhir G
Bennu, sorry for being vague. Basically I have a fulltext index for :Organization nodes and invoke db.index.fulltext.queryNodes and some extra logic to find the first two potential matches based on the organization name for an author.
Since the query runs much more quickly AFTER data has been inserted once, my assumption is that this part of the query is not a bottleneck
We might get some benefit with call { } IN TRANSACTONS cypher syntax as described at CALL {} (subquery) - Cypher Manual but this is only available in 4.4 forward. And so if you are not running 4.4 then
After revisiting the fulltext lookup, it actually seems that it may be interacting strangely with the rest of the query. The goal was to return the top two fulltext matches if there was a tie, top match otherwise. And if no matches at all, return null so that I could coalesce later in the query. It looks like this:
CALL {
WITH author
CALL db.index.fulltext.queryNodes("OrganizationNameFulltextIndex", org)
YIELD node, score
WITH collect(node) as nodes, collect(score) as scores
UNWIND (
CASE
WHEN size(nodes) >= 2 AND scores[0] = scores[1] THEN nodes[0..2]
ELSE [head(nodes)]
END
) AS orgResult
RETURN orgResult
}
I have no idea if this will be impactful. I assume only if you had a large number of results returned from your queryNode call. Anyway, you could insert 'with node, score limit 2' between your 'call db' and 'with collect' phrases. This would avoid collecting over a large list of irrelevant results and you could replace 'nodes[0..2]' with 'scores', avoid the list reduction too.
That is a good thought and indeed I had tried that as well as introducing a WITH * and LIMIT 2 before the collect but nothing changed.
Indeed the query also runs just as slowly on an empty DB (no Organization nodes yet) as it does with a full one.
The whole point of the subquery really is so that I can merge using a fulltext lookup instead of using traditional merge with the BTREE lookup. Without the subquery, if there were no fulltext index matches, the zero rows would short circuit the query due to the cartesian product with the existing rows
Sorry if my comments are naive, but maybe just throwing out comments will stir up some ideas on your end.
You stated the performance is significantly worse with no data in the db. The big difference in that case would be that the data needs to be created. I doubt the issue is with creating the person and organization nodes. How about the updating of the indexes. Do you have lots of indexes on these nodes? How about the time to update the full-text indexes? Could you test by dropping the full-text index and having the full-text subquery just return a simple test node, so the query runs through?
Yes, that's what really confuses me and initially why I disregarded the index lookup as a source of inefficiency. I figured that if the query ran faster the second time (and the second time it would have to do the fulltext index lookups on more data) then that couldn't be the cause. Seems like a bad assumption though.
There is only one fulltext index on Organization nodes. There is a BTREE index for another field on Organizations as well.
As for your question about dropping the index lookup, I tested replacing the contents of the subquery with a simple return of a test node and the query was indeed fast.
Is there some common approach people have (perhaps an apoc function) that approximates what I am trying to do? Basically just merge / upserting but allowing the lookup to be done with fulltext index