cancel
Showing results for 
Search instead for 
Did you mean: 

Performance issue while load data into Neo4j

nitinjkt
Node Link

Hi All,

We are uploading bulk data into neo4j using below script in which we are using apoc.load.jdbc with periodic iterate to connect our oracle database and create nodes and relationships into Neo4j.
we are using neo4j enterprise 3.5.5 version in our testing environment and have set heap initial and max heap memory 8 GB and page cache as 10 GB as recommended by neo4j admin memrec .
Our target is to load aroud 2000 million (node+property+relationships) .
once we execute our script , initially it execute with very fast speed around 10000 nodes per second and gradually performance degrades.
For further investigation we are using Jconsole for memory and thread utilization tracking in which we identified that in heap memory Old Gen is not getting cleared by GC.
Find below attachment for reference.

We need to know what is the issue in script i ran profile also but didn't observed any eager operation in it.

My question here are

1- Is it right approach to do bulk data insert or is any other method in neo4j ?
2- if we use this method then why old gen area is not getting cleared by GC?

 with "select 'JioSite-XX-XX-XX-'||SITE_INST_ID uuid,a.SITE_INST_ID, a.SITE_HUM_ID, a.CLLI, 
a.NUM, a.BASE_NUM, a.LATITUDE, a.LONGITUDE, a.RESTRICTIONS, a.CONTACTS, a.ADDRESS, a.POST_CODE_1, a.POST_CODE_2, a.CITY, a.STATE_PROV, a.COUNTY,
 a.COUNTRY, a.ROOM, a.FLOOR, a.NPA_NXX, a.COMMENTS, TO_CHAR(nvl(LAST_MOD_TS,'01-JAN-9999'),'YYYY-MM-DD HH24:MI:SS')LAST_MOD_TS, a.LAST_MOD_BY, a.INST_VER,
 TO_CHAR(nvl(LOCK_TS,'01-JAN-9999'),'YYYY-MM-DD HH24:MI:SS')LOCK_TS, a.LOCK_BY, a.PARENT_SITE_INST_ID, a.STATUS, a.FPLAN_DEPTH, a.FPLAN_WIDTH, a.FPLAN_HEIGHT,
 a.FPLAN_NAME, a.FPLAN_ORIGIN_X, a.FPLAN_ORIGIN_Y from SITE_INST a 
 where a.num<>'CUSTOMER' AND a.STATUS<> 'IN USE' and SITE_INST_ID >=50000000 AND SITE_INST_ID <= 55000000 " as query
CALL apoc.periodic.iterate('
CALL apoc.load.jdbc("jdbc:oracle:thin:rilprod/jioprod@10.147.161.47:1522/XCOMSTB2","'+query+'") YIELD row ',
"MATCH (m:classes{name:'gFacility'}) 
MERGE (n:Facility { _uuid: row.UUID})
SET 
n.site_inst_id = toInteger(row.SITE_INST_ID),
n.name = row.SITE_HUM_ID,
n.clli = row.CLLI,
n.num = row.NUM,
n.base_num = row.BASE_NUM,
n.site_type = row.SITE_TYPE,
n.site_id = row.SITE_ID,
n.parent_site_inst_id = toInteger(row.PARENT_SITE_INST_ID),
n.status = row.STATUS
MERGE (n)-[r:INSTANCE_OF]->(m)
",
{ batchSize:2000, parallel:false}) 
YIELD batches, total , timeTaken , failedOperations , failedBatches , errorMessages , committedOperations , operations
RETURN '-----SITE_INST--------',batches, total , timeTaken , failedOperations , failedBatches , errorMessages , committedOperations , operations;
7 REPLIES 7

First of all for that volume I would recommend exporting the table to a gzipped CSV and use neo4j-admin import bulk importer to import the data into your neo4j database.

If possible you should also update to a current neo4j which is 4.4.x

If you really want to import this volume transactionally, it will take quite some time, make sure you have a fast NVMe SSD for writing.

Remove the match and the relationship-creation, the type of facility is already in the label, that class node becomes a gigantic supernode for no purpose at all.

I presume you have an unique constraint on :Facility(_uuid) ?

If you have not yet imported data for such an initial import then I suggest switching MERGE to CREATE to avoid the extra lookup.

Afterwards, increate batch size to 100000 and set parallel true.

You can also do a test run with say 100M rows first.

CALL apoc.periodic.iterate('
CALL apoc.load.jdbc("jdbc:oracle:thin:rilprod/jioprod@10.147.161.47:1522/XCOMSTB2","'+query+'") YIELD row ',
"
CREATE (n:Facility { _uuid: row.UUID})
SET 
n.site_inst_id = toInteger(row.SITE_INST_ID),
n.name = row.SITE_HUM_ID,
n.clli = row.CLLI,
n.num = row.NUM,
n.base_num = row.BASE_NUM,
n.site_type = row.SITE_TYPE,
n.site_id = row.SITE_ID,
n.parent_site_inst_id = toInteger(row.PARENT_SITE_INST_ID),
n.status = row.STATUS
",
{ batchSize:100000, parallel:true}) 
YIELD batches, total , timeTaken , failedOperations , failedBatches , errorMessages , committedOperations , operations
RETURN '-----SITE_INST--------',batches, total , timeTaken , failedOperations , failedBatches , errorMessages , committedOperations , operations;

nitinjkt
Node Link

Hi Michael,

Thanks for revert.
Sure we will try to load data through admin import as first time data upload in new neo4j Database. But in this case csv generation is major challenge for such large data with number of relationship and property. Please advise if you have any mechanism to generate csv faster.
Currently we can't upgrade on 4.4.x.
Yes , we already have index on Facility(_uuid) column.
I tried with your suggestion with batch size 100000 and parallel true and remove relationship part from my script and ran for 1M records for testing it but still heap memory not getting released , it still be more than 6 GB . I still clueless about heap memory utilization (in which G1 Old Gen) consuming almost 5 GB. We are using G1GC as default garbage collector for database.
Please let me know if anything else need to specify in neo4j.conf or do we need to change anything else in scripts.
I am attaching neo4j.conf file for further reference.

neo4j_conf.txt (39.3 KB)

Hi Michael,

One more observation, we were using open jdk 1.8.0_275 and we read some article about issues in Open JDK regarding garbage collection. So we changed from open jdk to oracle Jdk

3X_5_b_5bf7e936288559dcb7d4dea1ca39418862b6ebd9.png

after change on oracle Jdk, we ran the same load but still getting the same issue about Heap memory (G1 old Gen) as show below in which almost 3 minutes is being taken to clear old Gen system almost dead .

nitinjkt
Node Link

Hi Michael,

for further reference please find enclosed debug.log and gc.logs in which we clearly observed VM stop the world pause error and system dead. To resume the services, we will have to stop the services and start it to clear the cache .
gc_log.txt (1.4 MB)
debug.txt (425.5 KB)

With Neo4j 4.x, you have to use Java 11.
You can deactivate the retention of the txlogs in the neo4j.conf. It will reduce significantly the I/O. You'd better create all the nodes first and create the relationships in a second time.

You should not put the jdbc password of your production database in a forum.

mengjiakang2017
Node Clone

I was at the similar situation - I think apoc.periodic.iterate tends to load all the rows in the first query and then execute in batches based on neo4j - Difference between apoc.periodic.iterate and apoc.periodic.commit when deleting lots of node...
it's hard to find the best heap size - I got a memory recommendations of 459GB for a few million which is not practical. neo4j-admin seems to me the best solution for large datasets and loaded within 30s for 8million nodes and 15million edges for me.

nitinjkt
Node Link

Hi All,

Can anybody help me to find solution of problem ? as suggested to use neo4j admin import but we can not use this utility as this is the first phase of data load , post successful data load and live the application we will have to load other phase data that will be 3 times larger that mentioned data load and as mentioned in neo4j operation guide, we can only use neo4j ad min import utility only once at the time of creating new database . in that case it might be possible that my first phase problem will be handled by neo4j admin import but we will face the same problem in my rest phases . please suggest