Hello everyone, in our Neo4j 4.4.14 database we want to create relationships from a CSV.
We have two types of nodes (sources, with SourceName and SourceURL subtype) and (locations, with various subtypes, which can be addressed directly with Location).
Both object have a uri
property with a UNIQUE constraint.
Now we have a CSV that links a source to a location, e.g:
http://www.yukkalab.com/yukka-ontology#source_name.10b-5-daily-the http://www.yukkalab.com/yukka-ontology#municipality.estado_de_baja_california
Naturally my thought to insert this is:
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///source_locations.tsv' AS line FIELDTERMINATOR ' '
MATCH (source), (location:Resource:ns0__Location { uri: line[1] })
WHERE (source:ns0__SourceName OR source:ns0__SourceURL) AND source.uri = line[0]
MERGE (source)-[r:ns0__isLocatedIn]->(location)
ON CREATE
SET r.ns0__deduplicationCreatedDate = datetime()
ON MATCH
SET r.ns0__deduplicationUpdatedDate = datetime()
RETURN source.uri, location.uri
Running this however creates a cartesian product between sources and locations which is large. We have around 170k sources and 200k locations. In fact, on our 32GB RAM configuration (single instance), we can not run this query, at least not within a reasonable time (several hours).
A potential idea is to fold the input to a map as such:
sources_map[location, [list_of_sources]]
This would reduce the runtime from quadratic to O(n*m), and may allow us to insert the desired data.
Unfortunately, to me, there appears to be no straightforward way to implement this (from the given CSV) in Neo4j.
I am a bit surprised as one could argue this is a really common operation, and the query planner should recognise this and optimise accordingly (in fact, running a DISTINCT
on sources will return less than 300 sources :)).