So the question is how can I tell neo4j to follow that space format?
also I would like to ignore maybe the first column (the id should be autogenerated
After a while studying CYPHER i came to a conclusion. I needed to do it in several steps. I am not sure if this is the best one, can you give me an opinion?
CREATE CONSTRAINT ON (disease:PathologicalDisease) ASSERT disease.code IS UNIQUE;
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (node:PathologicalDisease {
code: code,
parent: (CASE isGroup WHEN "0" THEN true ELSE false END),
shortDescription:
shortDescription,
longDescription: longDescription
});
MATCH(n)
WHERE size(n.code) > 3
WITH n, reverse(range(3, size(n.code)-1)) as rangeIdx
UNWIND rangeIdx as idx
MATCH(n2) WHERE n2.code = substring(n.code, 0, idx)
WITH n, collect(n2)[0] as _1stNode
MERGE (_1stNode)-[:GROUPS]->(n)
I must add that in step 3 I needed to add the label (n:PathologicalDisease) in order to speed the creation of the relations
Thanks, as I said I am newbie and I am still looking for the best solution when importing data; any new idea is very welcome.
Also, I have changed my import in order to separate the node from the description in order to create the translation node and get more flexibility when adding other languages. The problem is that the import took 2 hours and 6 minutes, so I probably need to create an index somewhere for the translation but haven't think about where.
This is my new modification:
// Load all PDs into CVS
USING PERIODIC COMMIT 500
LOAD CSV FROM "file:////icd10cm_order_2020.txt" AS line FIELDTERMINATOR '$'
WITH line[0] as contentLine
WITH contentLine, trim(substring(contentLine, 6, 7)) as code
WITH contentLine, code, trim(substring(contentLine, 14, 1)) as isGroup
WITH contentLine, code, isGroup, trim(substring(contentLine, 16, 60)) as shortDescription
WITH contentLine, code, isGroup, shortDescription, trim(substring(contentLine, 77)) as longDescription
MERGE (pd:PathologicalDisease {
code: code,
parent: (CASE isGroup WHEN "0" THEN true ELSE false END)
})
MERGE (pd_translation:PDTranslation{
shortDescription: shortDescription,
longDescription: longDescription
})
// Link PD to its translation
MERGE (pd)-[:TRANSLATE_TO {language: "en"}]->(pd_translation);
I believe that the problem is related to the MERGE of the translation. Since there are previous matches between some nodes (690 nodes are reused for the ICD-10 revision 2020) so that might do that it takes more time for the importing process.
If someone has an opinion of indexing the translation nodes or the relation please help me.