Please help me!
I am working on a project to benchmark Neo4j against Postgres using a dataset of 176M research citations from Semantic Scholar.
I definitely should have asked for help sooner, but I like to try to figure things out for myself as much as I can...
I have a script that converts the original JSON data into CSV format as normalized tables for easy import into both Neo4j and Postgres. Each JSON file has 1 million records. I have tables for Papers (id, title, DOI, year), Authors (id, name), paperAuthors (author_id, paper_id), outCitations (id, outcit_id), and inCitations (id, incit_id).
Importing into Postgres took an average of about 5 s per 1M rows with all indexes turned off. Adding the indexes and deleting duplicates at the end took about 9 hours for the full database (I later learned that I could speed this up by filtering into a new table rather than deleting...)
However, the Neo4j import has been giving me trouble from the start...
I first tried the command-line tool "neo4j-admin import" according to this guide:
I started on an AWS EC2 m5.large instance (8GB RAM), but quickly ran out of memory. I could not find any guidelines about how much RAM was needed, and I did not like the all-or-nothing behavior of that tool, so I rewrote everything to use Cypher's "LOAD CSV" instead (that way I could at least work incrementally if I ran out of memory along the way).
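For reference, the command I was running was along these lines (Neo4j 3.x syntax; the file names here are illustrative placeholders, not my exact paths):

```
# Offline bulk import; runs against an empty, stopped database.
# File names below are placeholders for my generated CSVs.
neo4j-admin import \
  --nodes:Paper "papers-header.csv,papers.csv" \
  --nodes:Author "authors-header.csv,authors.csv" \
  --relationships:CITES "cites-header.csv,cites.csv" \
  --relationships:HAS_AUTHOR "paperAuthors-header.csv,paperAuthors.csv" \
  --delimiter "|" \
  --ignore-duplicate-nodes
```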
I was able to get through 50M records this way on an r5.xlarge (32GB RAM), though it ran out of memory building the relationships for the last million records. This took ~1.6 hours to build the nodes (Paper, Author) and ~28 hours to build the relationships (CITES, IS_CITED_BY, HAS_AUTHOR).
However, I have had an r5.2xlarge (64GB RAM) running all week and it is still not finished... All of the nodes (Papers, Authors) are loaded, but I am only up to 87M of 176M relationships. The processing time has been increasing steadily, and is now around 3 hours per million records.
My original query looks like this:
LOAD CSV WITH HEADERS
FROM "data/csv/s2-corpus-090-cites.csv" AS row
FIELDTERMINATOR "|"
MATCH (p1:Paper {id:row.id}), (p2:Paper {id:row.outcit_id})
MERGE (p1)-[:CITES]->(p2);
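In case it matters, I have also seen batching with USING PERIODIC COMMIT recommended for large LOAD CSV runs; the batched version of the same query would look like this (10000 is an arbitrary batch size I have seen suggested, not something I have tuned):

```cypher
// Same query, but committing every 10k rows
// instead of one large transaction
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS
FROM "data/csv/s2-corpus-090-cites.csv" AS row
FIELDTERMINATOR "|"
MATCH (p1:Paper {id: row.id})
MATCH (p2:Paper {id: row.outcit_id})
MERGE (p1)-[:CITES]->(p2);
```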
I paused the process (kill -19, i.e. SIGSTOP) to troubleshoot, and tried a single query:
match (p:Paper {id:"892cbd4fe56bd56cbdb69810631648e6f8d1e407"}),
(q:Paper {id:"24c6fc7aa8968d0a6a7a7b8d3bb002c53a29df13"})
return p,q;
1 row available after 14 ms, consumed after another 1 ms
explain match (p:Paper {id:"892cbd4fe56bd56cbdb69810631648e6f8d1e407"}),
(q:Paper {id:"24c6fc7aa8968d0a6a7a7b8d3bb002c53a29df13"}) return p,q;
+------------------------+----------------+-------------+------------+------------+
| Operator | Estimated Rows | Identifiers | Ordered by | Other |
+------------------------+----------------+-------------+------------+------------+
| +ProduceResults | 1 | p, q | p.id ASC | |
| | +----------------+-------------+------------+------------+
| +CartesianProduct | 1 | p, q | p.id ASC | |
| |\ +----------------+-------------+------------+------------+
| | +NodeUniqueIndexSeek | 1 | q | q.id ASC | :Paper(id) |
| | +----------------+-------------+------------+------------+
| +NodeUniqueIndexSeek | 1 | p | p.id ASC | :Paper(id) |
+------------------------+----------------+-------------+------------+------------+
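For context, the NodeUniqueIndexSeek in that plan comes from a uniqueness constraint I created on the Paper ids before loading (3.x syntax):

```cypher
// Uniqueness constraint backing the :Paper(id) index seeks above
CREATE CONSTRAINT ON (p:Paper) ASSERT p.id IS UNIQUE;
```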
14 ms × 1M rows = 14,000 s ≈ 4 hours, so that may explain the holdup...
I read that splitting the matches can be faster:
match (p:Paper {id:"892cbd4fe56bd56cbdb69810631648e6f8d1e407"})
match (q:Paper {id:"24c6fc7aa8968d0a6a7a7b8d3bb002c53a29df13"})
return p,q;
However, splitting it this way did not seem to make any difference; EXPLAIN looks the same...
(It returned in 0 ms, but I am pretty sure that was due to caching, since repeating the original query also returned in 0 ms.)
I tried "explain" on my main import query too, though I am not sure how to interpret it:
EXPLAIN LOAD CSV WITH HEADERS
FROM "data/csv/s2-corpus-090-cites.csv" AS row
FIELDTERMINATOR "|"
MATCH (p1:Paper {id:row.id})
MATCH (p2:Paper {id:row.outcit_id})
MERGE (p1)-[:CITES]->(p2);
+-----------------------------------+----------------+------------------------+------------------------------+
| Operator | Estimated Rows | Identifiers | Other |
+-----------------------------------+----------------+------------------------+------------------------------+
| +ProduceResults | 1 | anon[204], p1, p2, row | |
| | +----------------+------------------------+------------------------------+
| +EmptyResult | 1 | anon[204], p1, p2, row | |
| | +----------------+------------------------+------------------------------+
| +Apply | 1 | anon[204], p1, p2, row | |
| |\ +----------------+------------------------+------------------------------+
| | +AntiConditionalApply | 1 | anon[204], p1, p2 | |
| | |\ +----------------+------------------------+------------------------------+
| | | +MergeCreateRelationship | 1 | anon[204], p1, p2 | |
| | | | +----------------+------------------------+------------------------------+
| | | +Argument | 1 | p1, p2 | |
| | | +----------------+------------------------+------------------------------+
| | +AntiConditionalApply | 1 | anon[204], p1, p2 | |
| | |\ +----------------+------------------------+------------------------------+
| | | +Optional | 1 | anon[204], p1, p2 | |
| | | | +----------------+------------------------+------------------------------+
| | | +ActiveRead | 0 | anon[204], p1, p2 | |
| | | | +----------------+------------------------+------------------------------+
| | | +Expand(Into) | 0 | anon[204], p1, p2 | (p1)-[anon[204]:CITES]->(p2) |
| | | | +----------------+------------------------+------------------------------+
| | | +LockNodes | 1 | p1, p2 | p1, p2 |
| | | | +----------------+------------------------+------------------------------+
| | | +Argument | 1 | p1, p2 | |
| | | +----------------+------------------------+------------------------------+
| | +Optional | 1 | anon[204], p1, p2 | |
| | | +----------------+------------------------+------------------------------+
| | +ActiveRead | 0 | anon[204], p1, p2 | |
| | | +----------------+------------------------+------------------------------+
| | +Expand(Into) | 0 | anon[204], p1, p2 | (p1)-[anon[204]:CITES]->(p2) |
| | | +----------------+------------------------+------------------------------+
| | +Argument | 1 | p1, p2 | |
| | +----------------+------------------------+------------------------------+
| +Apply | 1 | p1, p2, row | |
| |\ +----------------+------------------------+------------------------------+
| | +ValueHashJoin | 1 | p1, p2, row | p2.id = row.outcit_id |
| | |\ +----------------+------------------------+------------------------------+
| | | +NodeUniqueIndexSeek(Locking) | 1 | p1, row | :Paper(id) |
| | | +----------------+------------------------+------------------------------+
| | +NodeUniqueIndexSeek(Locking) | 1 | p2, row | :Paper(id) |
| | +----------------+------------------------+------------------------------+
| +LoadCSV | 1 | row | |
+-----------------------------------+----------------+------------------------+------------------------------+
If it makes any difference, I have autocommit turned off... Maybe I should turn that on? (Though I have 64GB of RAM and it hasn't crashed yet.)
I have adjusted neo4j.conf to account for my available RAM as follows:
dbms.memory.heap.initial_size=23900m
dbms.memory.heap.max_size=23900m
dbms.memory.pagecache.size=27g
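I have not tuned these beyond rough proportions; newer Neo4j versions (3.5+) include a helper that prints recommended heap/pagecache splits for a given amount of RAM, which I could double-check my numbers against:

```
# Print recommended memory settings for a 64 GB machine
neo4j-admin memrec --memory=64g
```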
Ubuntu 18.04 on AWS has swap disabled by default, so it should not be a virtual-memory issue...
Any ideas??