Optimizing Co-occurrence Query

I'm having a hard time optimizing a query to get co-occurrence data from my graph. The data in my graph represents projects and the companies working on them. I'm trying to get data on how often each company worked with each of the other companies on a project located at the same address. Here's the part of my schema that is relevant to this query:

[schema image]

And here's my query:

MATCH (company:Company)-[:WORKED_ON]->(project:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
MATCH (other_company:Company)-[:WORKED_ON]->(other_project:Project)-[:LOCATED_IN]->(address)
WHERE company <> other_company
WITH company, other_company, COUNT(DISTINCT address) AS frequency ORDER BY frequency DESC
RETURN
company.uuid, 
COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence

This query should return all companies in the graph and for each return an array of other companies that worked on a project in the same address, along with a frequency of how often that occurred for that pair of companies.

Here's an example output when I run the query on a smaller subset of companies:

company.uuid, co_occurrence
5CB009264D78E758233282, [5CAFDF9568280959694542: 12, 5CB00653EFB7C920999687: 3, 5CAFE3F0651CD240072545: 1]

My graph size is ~25GB. I tried running the query with 30GB cache and 30GB heap allocated to Neo4j. I'm running Neo4j 3.5.1.

My graph has 1.7M Company nodes, 4.9M Project nodes, and 3.5M Address nodes. There are about 11M WORKED_ON relationships between companies and projects. Here's the result of running EXPLAIN on my query:

[EXPLAIN plan image]

I can get the query to return results within a couple of minutes when I limit the number of companies to 10k. Once I go to 100k I can't get it to return results. It runs forever and eventually brings Neo4j down.

Any suggestions for how I can optimize this query?

Hello @hisham,

Just to check: are you using UNIQUE constraints (Constraints - Cypher Manual) on your nodes?

Regards,
Cobra

Hi @cobra,

Thank you for your question. Yes, for the nodes involved in this query, I have one UNIQUE constraint on the Address node:

ON ( address:Address ) ASSERT address.full_address IS UNIQUE

I also have the following indexes:

   ON :Address(street_address_lower) ONLINE 
   ON :Address(uuid) ONLINE 
   ON :Company(cohort_id) ONLINE 
   ON :Company(name_lower) ONLINE 
   ON :Company(publish_score) ONLINE 
   ON :Company(uuid) ONLINE 
   ON :Project(uuid) ONLINE 
   ON NODE:Company(name) ONLINE 
   ON :Address(full_address) ONLINE  (for uniqueness constraint)

Thanks,
Hisham

Only on Address nodes or are there also constraints on Project and Company?

Regards,
Cobra

Only on Address node. No constraints on Project or Company nodes.

Try adding unique constraints on them and then try again :)

Does it matter which property I add the UNIQUE constraint on? How would that help?

I saw there are uuids on the Company and Project nodes, so you can use those to create the unique constraints :)

Generally, it's good practice to create a unique constraint on a property that will be different for each node, like a product id or a company id :)

The unique constraint on the id will help Neo4j speed up the query, but you can still add a unique constraint on another property if you want it to be unique :)
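For example, since you are on Neo4j 3.5, the constraints could be created like this (just a sketch using the uuid properties; your Address nodes already have their constraint on full_address):

// unique constraints on the uuid properties (Neo4j 3.5 syntax)
// note: you may need to drop the existing plain indexes on :Company(uuid) and :Project(uuid) first
CREATE CONSTRAINT ON (c:Company) ASSERT c.uuid IS UNIQUE;
CREATE CONSTRAINT ON (p:Project) ASSERT p.uuid IS UNIQUE;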

Thanks, @cobra. I added the UNIQUE constraint for the 3 node types. It seemed to speed things up slightly when I limit the number of companies, but I still can't get the query to run on the entire graph.

That's good news! I have another idea :)

Put the second MATCH clause in an apoc.cypher.run() procedure :)
Doc: https://neo4j.com/docs/labs/apoc/current/cypher-execution/
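Something along these lines, just a sketch adapted from the query above (the map argument passes company and address into the inner statement as parameters):

MATCH (company:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
// run the second MATCH as an inner statement per (company, address) pair
CALL apoc.cypher.run('
    WITH $company AS company, $address AS address
    MATCH (other_company:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(address)
    WHERE other_company <> company
    RETURN DISTINCT other_company
', {company: company, address: address}) YIELD value
WITH company, value.other_company AS other_company, COUNT(DISTINCT address) AS frequency
ORDER BY frequency DESC
RETURN company.uuid, COLLECT(other_company.uuid + ': ' + toString(frequency)) AS co_occurrence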

Regards,
Cobra

I'm having similar issues with co-occurrence link creation. I am trying to create bipartite links between any two authors that have the same publishers. When the number of items is small, the query finishes in a matter of minutes. But as the numbers increase, the query almost never finishes. My PC has 64 GB of RAM (only about 80% used) and a 10th-gen Core i7 (16 virtual cores, only about 30% used). All drives are SSDs at 1 to 3% usage.

Is there a better way to code this query below:

Match (n:Author)-[:HasPublisher]->(p:Publisher)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND LEFT(p.Name,1)='A'
WITH n,p as pubs
UNWIND pubs as publish
// find other authors published by same publisher
Match (publish)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>a.Orcid AND n.First <>a.First AND n.Last <> a.Last
Merge (n)-[:SamePublisher{Name:publish.Name}]->(a)

Notice that I already deliberately limited the returned items by filtering for only publishers with names starting with "A", and still the query never finishes!

After 8 hours of running, I'm now wrapping it inside a call to apoc.periodic.iterate:

CALL apoc.periodic.iterate(
"cypher runtime=slotted
MATCH (n:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>'NONE' AND a.Orcid<>'NONE' AND n.Orcid<>a.Orcid AND n.First <>a.First AND n.Last <> a.Last
Merge (n)-[r:SamePublisher]->(a)",
"ON create set r.Name=p.Name",
{batchsize:500, parallel:true, iterateList:true}
)

Still running as we speak.

Hello @wilson.chua :)

I assume that you are using the latest version of Neo4j. You could try this query:

MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)

Or, to batch the writes, wrap it in apoc.periodic.iterate:

CALL apoc.periodic.iterate('
    MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
    WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
    RETURN a, b, p
    ', '
    MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)
    ', {batchSize: 1000})

What are the characteristics of your database?

Regards,
Cobra

Hi Cobra Maxime, thanks for looking into this. To answer your question: I am NOT using the latest 1.3.10 (that version does not allow me to save into a different folder location); I am currently using 1.3.4.

The graph database has 1.3M nodes and 43M relationships so far. The database folder is about 2.36 GB.

  • Did you try these queries?
  • I was not precise: what is the version of Neo4j itself (not the Browser)?
  • How much RAM have you allocated to Neo4j?

Hi Cobra Maxime. Neo4j Desktop is version 1.3.4.
I have 64 GB of RAM.
I did run your code; it's been over 13 hours now and it is still running. The db size has grown from 2.36 GB to 239 GB.

I see that the relationships have grown from 43M to 458M and counting.

As of this writing, the relationships have grown to over 700M and are still growing.

Hi, there is a difference between the Neo4j database, Neo4j Desktop, and Neo4j Browser. I need to know the configuration of the Neo4j database itself (version, RAM, etc.). The computer having 64 GB of RAM is not the same as the database having 64 GB of RAM allocated to it.

Have you used UNIQUE constraints to create your nodes?

Regards,
Cobra

It successfully ended at 769M links! Thanks Cobra Maxime. It took almost 48 hours to run.

Do you have indexes on author first and last name?

Better yet would be a composite index on the author's first + last name:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/#administration-indexes-create-a-composite-index

And you can search using the composite index:
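Something like this, for example (a sketch assuming Neo4j 4.x syntax, the First/Last property names from the queries above, and a made-up index name and author):

// composite index on both name properties (Neo4j 4.x syntax)
CREATE INDEX author_name FOR (a:Author) ON (a.First, a.Last);

// an equality lookup on both properties can then use the composite index
MATCH (a:Author)
WHERE a.First = 'Ada' AND a.Last = 'Lovelace'
RETURN a;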

(sorry if this is too obvious...)