I am having performance trouble with a simple data model but complex queries. I have person nodes (p) and item nodes (i); a person can have an unlimited number of item nodes, but the most we really see is 12 per person. The first part of the data model is simply (p)-[conn:HAS]->(i) relationships, where we store data about the person's relationship to the item on the conn relationship. An item node can have as many as 1 million people pointing to it, but usually in the range of 100 to 10,000.
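For concreteness, here is a minimal sketch of how one person/item pair is created (the property values are placeholders, not real data):

CREATE (p:Person { id: "PERSON_ID" })
CREATE (i:Item:Consumable { id: "ITEM_ID", name: "ITEM_NAME" })
CREATE (p)-[:HAS { title: "SOME_TITLE" }]->(i);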
The problematic queries come up when I want to find information from one person (x) and the "most similar" people to that person. I find these using a longer version of the following query (my current version has a few more FOREACH loops to set more features on the SIMILARITY relationship):
MATCH (x:Person { id: "PERSON_ID" })-[xi:HAS]->(i:Item)<-[ix:HAS]-(m:Person)
// Undirected MERGE so each pair gets a single SIMILARITY relationship
MERGE (x)-[conn:SIMILARITY]-(m)
ON CREATE SET
  conn += { sameItem: 0, sameTitle: 0 }
// FOREACH over a zero- or one-element list acts as a conditional SET
FOREACH(_ IN CASE WHEN i:Consumable THEN [true] ELSE [] END |
  SET conn += { sameItem: 1, itemElement: i.name, itemId: i.id, xTitle: xi.title, mTitle: ix.title }
)
FOREACH(_ IN CASE WHEN i:Consumable AND ix.title IS NOT NULL AND xi.title IS NOT NULL AND ix.title = xi.title THEN [true] ELSE [] END |
  SET conn.sameTitle = 1
);
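I assume the anchor lookup is backed by an index on :Person(id); if it is not, something like the following would create one (Neo4j 4.x+ syntax, the index name is arbitrary):

CREATE INDEX person_id IF NOT EXISTS FOR (p:Person) ON (p.id);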
When this query is executed for a person who ends up with ~1,000 SIMILARITY relationships it takes under a second, but the cost scales roughly quadratically: ~10,000 SIMILARITY relationships take around thirty seconds, and ~180,000 SIMILARITY relationships took 3 hours.
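For reference, a count like the ones above can be read back with something along these lines ($personId being a stand-in parameter):

MATCH (:Person { id: $personId })-[s:SIMILARITY]-()
RETURN count(s);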
Is there anything obvious I am missing here to optimize the performance of the longer-running queries?