Loading very a very large dataset

I am a novice. I have tried to create a small model with some test data.

The next step was to use a much larger set of input, about 300,000 rows.

I tried using the USING PERIODIC COMMIT but received an error, stating
"Executing queries that use periodic commit in an open transaction is not possible"

I am stumped concerning how to address my problem. I will assume I am doing something wrong or should use some other approach.
Using Windows 10 and Neo4j 4.1.3

Through what client are you submitting the cypher statement/load csv?

If using the Neo4j Browser, can you preface the statement with :auto similar to what is described at Importing CSV Data into Neo4j - Developer Guides and for example

:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS row

and does this address your issue

Hi @fredwat3

I think loading 300,000 records is still a small number.
I have loaded 30 million records in one case.
If processing speed is an issue for you, show us your Cypher.
And I can make some good comments.

Fantastic. This worked. Thank you very much.

Hi @koji , thank you for sharing your use case of 30 million rows. I have a similar large dataset and I kept seeing some fatal error or connection time out issue (also asked here VM stop-the-world pause for simple LOAD csv query for creating a graph - #26 by mengjiakang2017). it would be great if you could share your experience in loading large datasets, I'm posting my cypher here. Thanks!

CALL apoc.periodic.iterate('call apoc.load.jdbc($azure_url, 
"select distinct  m.measurement_id, m.visit_occurrence_id, m.measurement_concept_id, v.visit_concept_id, v.visit_start_date, v.visit_end_date
from omop.measurement m
join omop.visit_occurrence v 
on v.visit_occurrence_id = m.visit_occurrence_id
where measurement_id < 30000000")
yield row
','
match (m:Measurement {measurement_concept_id: row.measurement_concept_id})
match (v:VisitOccurrence {visit_concept_id:row.visit_concept_id})
merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id, 
visit_start_date:row.visit_start_date, 
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)
',{batchSize:1000000, iterateList:True, parallel:False, params:{azure_url:$azure_url}, retries:10});

@mengjiakang2017

when you use apoc.load.jdbc you need to make sure your batch size of data can be processed without having timeouts.

Also, would it be possible to post your schema statements. most of the time when you have so many GC stop the world statements, then the indexes may not exist.

Also this statement

merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id, 
visit_start_date:row.visit_start_date, 
visit_end_date:row.visit_end_date,
measurement_id:row.measurement_id}]->(v)

is going to be very expensive as more and more relationships are created between m and v. It has to check all the relationships before it can decide it can create the relationship.

Do you need to add the properties there?

Thank you @anthapu ! looks like I might want to decrease the batch size and give it a try

below is the schema and index I created (I didn't add all the attributes in the whiteboard so you might see fewer properties in the schema statements)

to answer your question -

Do you need to add the properties there?

Yes, I need these edge attributes to store the details of a healthcare visit. but I think updating to

merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}
on create set the rest edge attributes might be better

CREATE 
  (`0` :ObservationPeriod {period_type_concept_id:'Integer',period_type_concept_name:'String'}),
  (`1` :Gender {gender_concept_id:'Integer',gender_concept_name:'String'}),
  (`2` :Measurement {measurement_concept_id:'Integer',measurement_concept_name:'String'}),
  (`3` :Race {race_concept_id:'Integer',race_concept_name:'String'}),
  (`4` :Person {person_id:'Integer',year_of_birth:'Integer',SCRIPT_case_number:'Integer'}),
  (`5` :Ethnicity {ethnicity_concept_id:'Integer',ethnicity_concept_name:'String'}),
  (`6` :VisitOccurrence {visit_concept_id:'Integer',visit_concept_name:'String'}),
  (`7` :ProcedureOccurrence {procedure_concept_id:'Integer',procedure_concept_name:'String'}),
  (`8` :ConditionOccurrence {condition_concept_id:'Integer',condition_concept_name:'String',condition_type_concept_id:'Integer',condition_type_concept_name:'String'}),
  (`9` :Observation {observation_concept_id:'Integer',observation_concept_name:'String'}),
  (`10` :DrugExposure {drug_concept_id:'Integer',drug_concept_name:'String',drug_type_concept_id:'Integer',drug_type_concept_name:'String',route_concept_id:'Integer',route_concept_name:'String'}),
  (`4`)-[:HAS_OBSERVATION_PERIOD {observation_period_start_date:'Date',observation_period_end_date:'Date',observation_period_id:'Integer'}]->(`0`),
  (`4`)-[:HAS_ETHNICITY ]->(`5`),
  (`4`)-[:HAS_RACE ]->(`1`),
  (`4`)-[:HAS_MEASUREMENT {measurement_date:'Date',measurement_id:'Integer'}]->(`2`),
  (`4`)-[:HAS_RACE ]->(`3`),
  (`4`)-[:HAS_VISIT_OCCURRENCE {visit_occurrence_id:'Integer',visit_start_date:'String',visit_end_date:'String'}]->(`6`),
  (`4`)-[:HAS_PROCEDURE_OCCURRENCE {procedure_occurrence_id:'Integer',procedure_date:'Date'}]->(`7`),
  (`4`)-[:HAS_CONDITION_OCCURRENCE {condition_occurrence_id:'Integer',condition_start_date:'Date',condition_end_date:'Date'}]->(`8`),
  (`4`)-[:HAS_OBSERVATION {observation_id:'Integer',observation_date:'String'}]->(`9`),
  (`4`)-[:HAS_DRUG_EXPOSURE {drug_exposure_id:'Integer',drug_exposure_start_date:'String',drug_exposure_end_date:'String'}]->(`10`),
  (`0`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
  (`2`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
  (`8`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
  (`10`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
  (`9`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`),
  (`7`)-[:ASSOCIATED_DURING_VISIT {visit_occurrence_id:'Integer'}]->(`6`);




// -----------------------create constraints for import -----------------------//

CREATE CONSTRAINT ON (p:Person) ASSERT p.person_id IS UNIQUE;
CREATE CONSTRAINT ON (o:Observation) ASSERT o.observation_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (o1:ObservationPeriod) ASSERT o1.period_type_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (d1:DrugExposure) ASSERT d1.drug_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (v:VisitOccurrence) ASSERT v.visit_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (c:ConditionOccurrence) ASSERT c.condition_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (m:Measurement) ASSERT m.measurement_concept_id IS UNIQUE;
CREATE CONSTRAINT ON (p:ProcedureOccurrence) ASSERT p.procedure_concept_id IS UNIQUE;


CREATE INDEX obs_visit FOR ()-[r:HAS_MEASUREMENT]->() ON (r.measurement_id);
CREATE INDEX cond_visit FOR ()-[r:HAS_CONDITION_OCCURRENCE]->() ON (r.condition_occurrence_id);```

@mengjiakang2017 If visit_occurrence_id is unique and you want to use that on relation then you are better off doing

merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}

and create index/constraint on it.

If you do not have index it will do db scans for all relationships between those nodes and it might be costlier as the number of relationships grow.

Thanks @anthapu , actually I made a mistake in the example, measurement_id is unique so I'd better try index and merge on it.

CREATE INDEX meas_visit FOR (m:Measurement)-[r:ASSOCIATED_DURING_VISIT]->(v:VisitOccurrence) ON (r.measurement_id);

UPDATE

Looks like I cannot use (m:Measurement) label match in creating edge indexes, I got error of


since I have multiple associated_during_visit type of edges between various nodes, composite indexes won't work either. I will need to update the schema to 'meas_associated_during_visit'. Curious if there are other suggestions on loading large datasets.

Thanks!

You won't be able specify the start and end nodes for relationship index. It is on a given relationship type. That's the error you are seeing.

CREATE INDEX meas_visit FOR ()-[r:ASSOCIATED_DURING_VISIT]->() ON (r.measurement_id);

Also, we should be careful about how much work we are doing when we are ingesting work loads.

Cypher makes it easy to express traversal patterns. Sometimes that simplicity can hide the amount of work you have to do when you are ingesting data.

For example say you didn't have index on relationship

merge (m)-[r:ASSOCIATED_DURING_VISIT {visit_occurrence_id:row.visit_occurrence_id,}

when there were no relationships then it is quick.

When you have 10 relationships, then it has to traverse all the 10 relationships between those nodes and retrieve the property and compare and find out if it can create the relationship or not.

So, progressively you are doing more and more work as the data keeps increasing.

This is where relying on index makes life bit easier.

1 Like