I'm having a hard time optimizing a query to get co-occurrence data from my graph. The data in my graph represents projects and companies working on them. I'm trying to get data on how often each company worked with each of the other companies on a project located in the same address. Here's the part of my schema that is relevant to this query:
And here's my query:
MATCH (company:Company)-[:WORKED_ON]->(project:Project)-[:LOCATED_IN]->(address:Address) WITH DISTINCT company, address MATCH (other_company:Company)-[:WORKED_ON]->(other_project:Project)-[:LOCATED_IN]->(address) WHERE company <> other_company WITH company, other_company, COUNT(DISTINCT address) AS frequency ORDER BY frequency DESC RETURN company.uuid, COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence
This query should return all companies in the graph and for each return an array of other companies that worked on a project in the same address, along with a frequency of how often that occurred for that pair of companies.
Here's an example output when I run the query on a smaller subset of companies:
company.uuid, co_occurrence 5CB009264D78E758233282, [5CAFDF9568280959694542: 12,5CB00653EFB7C920999687: 3,5CAFE3F0651CD240072545: 1]
My graph size is ~25GB. I tried running the query with 30GB cache and 30GB heap allocated to Neo4j. I'm running Neo4j 3.5.1.
My graph has 1.7M company nodes, 4.9M project nodes, and 3.5M address nodes. There's about 11M WORKED_ON relationships between companies and projects. Here's the result of running EXPLAIN on my query:
I can get the query to return results within a couple of minutes when I limit the number of companies to 10k. Once I go to 100k I can't get it to return results. It runs forever and eventually brings Neo4j down.
Any suggestions for how I can optimize this query?