Optimizing Co-occurrence Query

I'm having a hard time optimizing a query to get co-occurrence data from my graph. The data in my graph represents projects and the companies working on them. I'm trying to get data on how often each company worked with each of the other companies on a project located at the same address. Here's the part of my schema that is relevant to this query:

[schema image]

And here's my query:

MATCH (company:Company)-[:WORKED_ON]->(project:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
MATCH (other_company:Company)-[:WORKED_ON]->(other_project:Project)-[:LOCATED_IN]->(address)
WHERE company <> other_company
WITH company, other_company, COUNT(DISTINCT address) AS frequency ORDER BY frequency DESC
RETURN
company.uuid, 
COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence

This query should return all companies in the graph and for each return an array of other companies that worked on a project in the same address, along with a frequency of how often that occurred for that pair of companies.

Here's an example output when I run the query on a smaller subset of companies:

company.uuid, co_occurrence
5CB009264D78E758233282, [5CAFDF9568280959694542: 12, 5CB00653EFB7C920999687: 3, 5CAFE3F0651CD240072545: 1]

My graph size is ~25GB. I tried running the query with 30GB cache and 30GB heap allocated to Neo4j. I'm running Neo4j 3.5.1.

My graph has 1.7M Company nodes, 4.9M Project nodes, and 3.5M Address nodes. There are about 11M WORKED_ON relationships between companies and projects. Here's the result of running EXPLAIN on my query:

[EXPLAIN plan image]

I can get the query to return results within a couple of minutes when I limit the number of companies to 10k. Once I go to 100k I can't get it to return results. It runs forever and eventually brings Neo4j down.

Any suggestions for how I can optimize this query?

Hello @hisham,

Just to check: are you using UNIQUE constraints (Constraints - Cypher Manual) on your nodes?

Regards,
Cobra

Hi @cobra,

Thank you for your question. Yes, for the nodes involved in this query, I have one UNIQUE constraint on the Address node:

ON ( address:Address ) ASSERT address.full_address IS UNIQUE

I also have the following indexes:

   ON :Address(street_address_lower) ONLINE 
   ON :Address(uuid) ONLINE 
   ON :Company(cohort_id) ONLINE 
   ON :Company(name_lower) ONLINE 
   ON :Company(publish_score) ONLINE 
   ON :Company(uuid) ONLINE 
   ON :Project(uuid) ONLINE 
   ON NODE:Company(name) ONLINE 
   ON :Address(full_address) ONLINE  (for uniqueness constraint)

Thanks,
Hisham

Only on Address nodes or are there also constraints on Project and Company?

Regards,
Cobra

Only on Address node. No constraints on Project or Company nodes.

Try adding unique constraints on them and then try again :)

Does it matter which property I add the UNIQUE constraint on? How would that help?

I saw there are uuids on the Company and Project nodes, so you can use those to create the unique constraints :)

Generally, it's good practice to create a unique constraint on a property that will be different for each node, like a product id or a company id :)

The unique constraint on the id will help Neo4j speed up the query, but you can still add a unique constraint on another property if you want it to be unique :)
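For example, since you are on Neo4j 3.5, the constraints could be created like this (just a sketch using the uuid properties; your Address nodes already have their constraint on full_address):

// unique constraints on the uuid properties (Neo4j 3.5 syntax)
// note: you may need to drop the existing plain indexes on :Company(uuid) and :Project(uuid) first
CREATE CONSTRAINT ON (c:Company) ASSERT c.uuid IS UNIQUE;
CREATE CONSTRAINT ON (p:Project) ASSERT p.uuid IS UNIQUE;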

Thanks, @cobra. I added the UNIQUE constraint for the 3 node types. It seemed to speed things up slightly when I limit the number of companies, but I still can't get the query to run on the entire graph.

That's good news! I have another idea :)

Put the second MATCH clause in an apoc.cypher.run() procedure :)
Doc: https://neo4j.com/docs/labs/apoc/current/cypher-execution/
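Something along these lines, just a sketch adapted from the query above (the map argument passes company and address into the inner statement as parameters):

MATCH (company:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
// run the second MATCH as an inner statement per (company, address) pair
CALL apoc.cypher.run('
    WITH $company AS company, $address AS address
    MATCH (other_company:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(address)
    WHERE other_company <> company
    RETURN DISTINCT other_company
', {company: company, address: address}) YIELD value
WITH company, value.other_company AS other_company, COUNT(DISTINCT address) AS frequency
ORDER BY frequency DESC
RETURN company.uuid, COLLECT(other_company.uuid + ': ' + toString(frequency)) AS co_occurrence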

Regards,
Cobra

I'm having similar issues with co-occurrence link creation. I am trying to create bipartite links between any two authors that have the same publishers. When the number of items is small, the query finishes in a matter of minutes. But as the numbers increase, the query almost never finishes. My PC has 64 GB of RAM (only about 80% used) and a 10th-gen Core i7 (16 virtual cores, only about 30% used). All drives are SSDs at 1 to 3% usage.

Is there a better way to code this query below:

Match (n:Author)-[:HasPublisher]->(p:Publisher)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND LEFT(p.Name,1)='A'
WITH n,p as pubs
UNWIND pubs as publish
// find other authors published by same publisher
Match (publish)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>a.Orcid AND n.First <>a.First AND n.Last <> a.Last
Merge (n)-[:SamePublisher{Name:publish.Name}]->(a)

Notice that I already deliberately limited the returned items by filtering for only publishers with names starting with "A", and still the query never finishes!

After 8 hours of running, I'm now wrapping it inside a call to apoc.periodic.iterate:

CALL apoc.periodic.iterate(
"cypher runtime=slotted
MATCH (n:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>'NONE' AND a.Orcid<>'NONE' AND n.Orcid<>a.Orcid AND n.First <>a.First AND n.Last <> a.Last
Merge (n)-[r:SamePublisher]->(a)",
"ON create set r.Name=p.Name",
{batchsize:500, parallel:true, iterateList:true}
)

Still running as we speak.

Hello @wilson.chua :)

I assume that you are using the latest version of Neo4j. You could try this query:

MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)

Or, to batch the writes, wrap it in apoc.periodic.iterate:

CALL apoc.periodic.iterate('
    MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
    WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
    RETURN a, b, p
    ', '
    MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)
    ', {batchSize: 1000})

What are the characteristics of your database?

Regards,
Cobra

Hi Cobra Maxime, thanks for looking into this. To answer your question: I am NOT using the latest 1.3.10 (that version does not allow me to save into a different folder location); I am currently using 1.3.4.

The graph database has 1.3M nodes and 43M relationships so far. The database folder is about 2.36 GB.

  • Did you try these queries?
  • I was not precise: what is the version of Neo4j itself (not the Browser)?
  • How much RAM have you allocated to Neo4j?

Hi Cobra Maxime. Neo4j Desktop is version 1.3.4.
I have 64 GB of RAM.
I did run your code; it's been over 13 hours now and it is still running. The db size has grown from 2.36 GB to 239 GB.

I see that the relationships have grown from 43M to 458M and counting.

As of this writing, the relationships have grown to over 700M and are still growing.

Hi, there is a difference between the Neo4j database, Neo4j Desktop, and Neo4j Browser. I need to know the configuration of the Neo4j database itself (version, RAM, etc.). The computer having 64 GB of RAM is not the same as the database having 64 GB of RAM allocated to it.

Have you used UNIQUE constraints to create your nodes?

Regards,
Cobra

It successfully ended at 769M links! Thanks Cobra Maxime. It took almost 48 hours to run.

Do you have indexes on author first and last name?

Better yet would be a composite index on the author's first + last name:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/#administration-indexes-create-a-composite-index

And you can search using the composite index:
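Something like this, for example (a sketch assuming Neo4j 4.x syntax, the First/Last property names from the queries above, and a made-up index name and author):

// composite index on both name properties (Neo4j 4.x syntax)
CREATE INDEX author_name FOR (a:Author) ON (a.First, a.Last);

// an equality lookup on both properties can then use the composite index
MATCH (a:Author)
WHERE a.First = 'Ada' AND a.Last = 'Lovelace'
RETURN a;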

(sorry if this is too obvious...)