Optimizing Co-occurrence Query

I'm having a hard time optimizing a query to get co-occurrence data from my graph. The data in my graph represents projects and companies working on them. I'm trying to get data on how often each company worked with each of the other companies on a project located in the same address. Here's the part of my schema that is relevant to this query:

image

And here's my query:

MATCH (company:Company)-[:WORKED_ON]->(project:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
MATCH (other_company:Company)-[:WORKED_ON]->(other_project:Project)-[:LOCATED_IN]->(address)
WHERE company <> other_company
WITH company, other_company, COUNT(DISTINCT address) AS frequency ORDER BY frequency DESC
RETURN
company.uuid, 
COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence

This query should return all companies in the graph and for each return an array of other companies that worked on a project in the same address, along with a frequency of how often that occurred for that pair of companies.

Here's an example output when I run the query on a smaller subset of companies:

company.uuid,	co_occurrence
5CB009264D78E758233282, [5CAFDF9568280959694542: 12,5CB00653EFB7C920999687: 3,5CAFE3F0651CD240072545: 1]

My graph size is ~25GB. I tried running the query with 30GB cache and 30GB heap allocated to Neo4j. I'm running Neo4j 3.5.1.

My graph has 1.7M company nodes, 4.9M project nodes, and 3.5M address nodes. There's about 11M WORKED_ON relationships between companies and projects. Here's the result of running EXPLAIN on my query:

I can get the query to return results within a couple of minutes when I limit the number of companies to 10k. Once I go to 100k I can't get it to return results. It runs forever and eventually brings Neo4j down.

Any suggestions for how I can optimize this query?

Hello @hisham,

Just to know, are you using UNIQUE CONSTRAINTS (https://neo4j.com/docs/cypher-manual/current/administration/constraints/) on your nodes?

Regards,
Cobra

Hi @Cobra,

Thank you for your question. Yes, for the nodes involved in this query, I have one UNIQUE constraint on the Address node:

ON ( address:Address ) ASSERT address.full_address IS UNIQUE

I also have the following indexes:

   ON :Address(street_address_lower) ONLINE 
   ON :Address(uuid) ONLINE 
   ON :Company(cohort_id) ONLINE 
   ON :Company(name_lower) ONLINE 
   ON :Company(publish_score) ONLINE 
   ON :Company(uuid) ONLINE 
   ON :Project(uuid) ONLINE 
   ON NODE:Company(name) ONLINE 
   ON :Address(full_address) ONLINE  (for uniqueness constraint)

Thanks,
Hisham

Only on Address nodes or are there also constraints on Project and Company?

Regards,
Cobra

Only on Address node. No constraints on Project or Company nodes.

Try to add unique constraints on them so and try again:)

Does it matter which property to add the UNIQUE constraint for? How would that help?

I saw there were uuids on Company and Project nodes so you can use them to create the unique constraints:)

Generally, it's a good practice to create a unique constraint on a property that will be different for each node like a product id for example or a company id:)

The unique constraint on id will help Neo4j to speed up the query but you can still add unique constraint on another property if you want them unique:)

Thanks, @Cobra. I added the UNIQUE constraint for the 3 nodes. It seemed to speed things up slightly when I limit the number of companies. But I still can't the query to run on the entire graph.

That's a good news, I have another idea:)

Pur the second MATCH clause in a apoc.cypher.run() procedure :slight_smile:
Doc: https://neo4j.com/docs/labs/apoc/current/cypher-execution/

Regards,
Cobra