Hi all,
I have a problematically large cypher query that seems to run forever.
Little background.
I am running neo4j in a docker container as part of docker compose.
The machine this is running on has 4 vCPUs and 32 GiB of memory.
Neo4j currently uses 5gb of disk space.
Neo4j docker container:
neo4j:
image: neo4j:4.4.3
hostname: neo4j
container_name: neo4j
environment:
- NEO4J_dbms_logs_debug_level:debug
- NEO4J_dbms_connector_https_advertised__address:localhost:7473
- NEO4J_dbms_connector_http_advertised__address:localhost:7474
- NEO4J_dbms_connector_bolt_advertised__address:localhost:7687
- NEO4J_apoc_uuid_enabled=true
- NEO4JLABS_PLUGINS=["apoc","graph-data-science"]
- NEO4J_dbms_security_procedures_unrestricted=gds.*, apoc.*
- NEO4J_dbms_memory_heap_initial__size=12000m
- NEO4J_dbms_memory_heap_max__size=12000m
- NEO4J_dbms_memory_pagecache_size=12100m
- NEO4J_dbms_jvm_additional=-XX:+ExitOnOutOfMemoryError
- NEO4J_dbms_logs_gc_enabled=true
Relevant database schema:
(d:Document)<-[:IN_DOC]-(t:Table)<-[:IN_TABLE]-(tr:TableRow)<-[:IN_ROW]-(p:Phrase)
Indexes:
CREATE CONSTRAINT ON (p:Phrase) ASSERT p.content IS UNIQUE;
CREATE INDEX on :Phrase(content);
CALL db.index.fulltext.createNodeIndex("Phrases", ["Phrase"], ["content"]);
CREATE INDEX on :TableRow(encoded_content);
CALL db.index.fulltext.createNodeIndex("TableRows", ["TableRow"], ["text"]);
Different versions of the query I have tried
Query v1:
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr)
MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)
Query v2:
CALL apoc.periodic.iterate(
'
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr) MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2
RETURN p1, p2, c1
',
'MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)',
{batchSize:10000, parallel:true, params:{doc_id: $doc_id}}
);
Query v3: This was an attempt to use less ram
CALL apoc.periodic.commit(
"
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr) MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2 AND NOT (p1)-[:CONNECTED_TO]->(p2)
WITH c1, p2, p1 LIMIT $limit
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2) RETURN count(*)
",
{limit:1000, doc_id: $doc_id}
);
Some database stats:
MATCH (d:Document {id:$doc_id}) RETURN count(d)
1
MATCH (d:Document {id:$doc_id})--(t:Table) RETURN count(t)
1
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow) RETURN count(tr)
253,762
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow)--(p:Phrase) RETURN count(p)
7,612,860
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow)
MATCH (tr)--(p1:Phrase)
MATCH (tr)--(p2:Phrase)
WHERE NOT p1=p2
RETURN count(p2)
203,009,870
So this is a huge query and while it justifiably should take a long time to run, however I haven't been able to get this query to complete while even giving it 12+ hours to run.
Any recommendations on how to fix this query or what I might be doing wrong are greatly appreciated.