Import plain file with a format

dbranco · January 30, 2020, 10:44pm

Hello everyone,

I have a question about loading data from a file that is plained and has some rules:This is my sample data

00001 A00 0 Cholera Cholera
00002 A000 1 Cholera due to Vibrio cholerae 01, biovar cholerae Cholera due to Vibrio cholerae 01, biovar cholerae
00003 A001 1 Cholera due to Vibrio cholerae 01, biovar eltor Cholera due to Vibrio cholerae 01, biovar eltor
00004 A009 1 Cholera, unspecified Cholera, unspecified
00005 A01 0 Typhoid and paratyphoid fevers Typhoid and paratyphoid fevers

The rules are the following

So the question is how can I tell neo4j to follow that space format?
also I would like to ignore maybe the first column (the id should be autogenerated

dbranco · January 30, 2020, 10:47pm

After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?

CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;

USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
  code: code,
    parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
    shortDescription:
    shortDescription,
    longDescription: longDescription
});

MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)

I must add that in step 3 I needed to add the label (n:PathologicalDisease) in order to speed the creation of the relations

ameyasoft · January 31, 2020, 7:37am

It's a brilliant idea!!.I really like this.

dbranco · January 31, 2020, 9:58am

Thanks, as I said I am newbie and I am still looking for the best solution when importing data; any new idea is very welcome.

Also, I have changed my import in order to separate the node from the description in order to create the translation node and get more flexibility when adding other languages. The problem is that the import took 2 hours and 6 minutes, so I probably need to create an index somewhere for the translation but haven't think about where.

This is my new modification:

// Load all PDs into CVS
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (pd:PathologicalDisease {
  code: code, 
  parent: (CASE isGroup WHEN "0" THEN true ELSE false END)
})
MERGE (pd_translation:PDTranslation{
  shortDescription: shortDescription, 
  longDescription: longDescription
})

// Link PD to its translation
MERGE (pd)-[:TRANSLATE_TO {language: "en"}]->(pd_translation);

I believe that the problem is related to the MERGE of the translation. Since there are previous matches between some nodes (690 nodes are reused for the ICD-10 revision 2020) so that might do that it takes more time for the importing process.

If someone has an opinion of indexing the translation nodes or the relation please help me.

dbranco · January 31, 2020, 9:58am

Don't forget to like the post that helps with the visibility I guess.

Topic		Replies	Views
Data Import Neo4j Website	0	428	August 5, 2020
Import csv file (dictionary format?) Cypher	6	804	January 12, 2021
LOAD CSV From a Local File Windows10 Import / Export import	13	5893	July 19, 2019
Can't LOAD CSV, error about undefined variable or whitespace Cypher	20	2228	December 17, 2019
Null (global id space) is missing data neo4j-admin import Import / Export import	1	508	July 30, 2021

Import plain file with a format

Related topics