I'm ingesting about 68K items from a CSV, and I'm wondering if I can make the process go faster. At the moment, it's taking over an hour (1:16:41), or about 14.8 items per second.
I'm using "apoc.periodic.iterate" with a batch size of 1000.
The items query returns a list of nodes labeled "Datapoint".
Each "Datapoint" node has an indexed property ("fipsID") that uniquely identifies a node with a different label ("FIPSItem").
The "fipsID" property is covered by a uniqueness constraint on "FIPSItem".
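For reference, the constraint was presumably created with something along these lines (reconstructed, not the original DDL; in Neo4j 4.x a uniqueness constraint is backed by an index of the same name, hence the name matching the index listed further down):

    CREATE CONSTRAINT FIPSItem_fipsIDIndex IF NOT EXISTS
    FOR (f:FIPSItem) REQUIRE f.fipsID IS UNIQUE;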
The apply query uses the value of "fipsID" on each datapoint to look up the "FIPSItem" node with that same "fipsID" value.
Once "f" is bound in the Cypher context, the apply query uses CREATE to add a :FEATURE relationship from the "Datapoint" node to the "FIPSItem" node.
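Concretely, the apply query (copied verbatim from the log excerpt below) is:

    MATCH (f:FIPSItem {fipsID: dp.fipsID})
    CREATE (dp)-[:FEATURE]->(f)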
The apply query is therefore a single MATCH (an index seek on "FIPSItem" using the "fipsID" value read from each "Datapoint") followed by a single CREATE. With a batch size of 1000, the apply query is called 69 times to handle all 68,097 Datapoint instances.
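As a sanity check on the plan, the lookup half can be profiled in isolation; the fipsID literal below is just a made-up sample value:

    // '01001' is a placeholder fipsID, not a value from my data
    PROFILE
    MATCH (f:FIPSItem {fipsID: '01001'})
    RETURN f;

With the uniqueness constraint in place, I'd expect this to show a NodeUniqueIndexSeek operator rather than a label scan.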
I'm seeing an elapsed time of 76:41 -- 4,601 seconds -- for this batch query. That's about 67 msec per item, or about 14.8 items per second.
That is slower than I expect (but perhaps my expectations are unrealistic). I'm running Neo4j Enterprise v4.4.6 on a robust AWS EC2 instance. I've done the usual configuration following the guidance of neo4j-admin memrec.
I wonder if the batchSize (1000) is too small for the 68K items this query is handling. According to PROFILE, a single batch of 1000 items without the CREATE uses about 5 MB and takes 72 msec. The .png of that profile is attached.
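If the batch size is the bottleneck, one variant to try would be raising it; something like this (a sketch, untested: I've dropped the ORDER BY since the relationship creation doesn't depend on ordering, and setting parallel:true could cause lock contention when several Datapoints share the same FIPSItem):

    CALL apoc.periodic.iterate("
      MATCH (dp:Datapoint)-[:DATASET]->(:Dataset {datasetID: 'nytimes.20220301.us-counties'})
            -[:SOURCE]->(:Source {sourceID: 'nytimes.us-counties'})
      RETURN dp
    ", "
      MATCH (f:FIPSItem {fipsID: dp.fipsID})
      CREATE (dp)-[:FEATURE]->(f)
    ", {batchSize: 10000, parallel: false})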
The database has a "node_label_lookup_index" and a "rel_type_lookup_index", as per the most recent documentation, each with an indexProvider of "token-lookup-1.0".
The Datapoint index is named "Datapoint_fipsIDIndex" and is a "NONUNIQUE" index on the "fipsID" property of "Datapoint"; its indexProvider is "native-btree-1.0".
The FIPSItem index is named "FIPSItem_fipsIDIndex" and is a "UNIQUE" index on the "fipsID" property of "FIPSItem"; its indexProvider is "native-btree-1.0".
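For reference, those index details can be listed with something like:

    SHOW INDEXES
    YIELD name, uniqueness, labelsOrTypes, properties, indexProvider
    WHERE name IN ['Datapoint_fipsIDIndex', 'FIPSItem_fipsIDIndex']
    RETURN name, uniqueness, labelsOrTypes, properties, indexProvider;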
Here are the log entries (in query.log) for the start and stop of the batch query:
2022-08-10 15:24:06.036+0000 INFO Query started: id:3 - 90 ms: 0 B - bolt-session bolt neo4j-python/4.4.1 Python/3.9.6-final-0 (linux) client/172.30.2.147:40924 server/172.30.2.147:7687> tms-covid-a - tms_neo4j - call apoc.periodic.iterate("
MATCH(dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
RETURN dp ORDER BY dp.stateCountyFIPSCode
", "
MATCH (f:FIPSItem {fipsID: dp.fipsID})
CREATE (dp)-[:FEATURE]->(f)
", {batchSize:1000}) - {} - runtime=null - {}
...
2022-08-10 16:40:47.504+0000 INFO id:3 - 4601558 ms: 176 B - bolt-session bolt neo4j-python/4.4.1 Python/3.9.6-final-0 (linux) client/172.30.2.147:40924 server/172.30.2.147:7687> tms-covid-a - tms_neo4j - call apoc.periodic.iterate("
MATCH(dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
RETURN dp ORDER BY dp.stateCountyFIPSCode
", "
MATCH (f:FIPSItem {fipsID: dp.fipsID})
CREATE (dp)-[:FEATURE]->(f)
", {batchSize:1000}) - {} - runtime=pipelined - {}
Is 15 items per second about as good as can be done by Neo4j Enterprise? I invite and appreciate any guidance on how I can improve the performance of operations like this.