Optimising Cartesian MERGE operations

Hello everyone, in our Neo4j 4.4.14 database we want to create relationships from a CSV.

We have two types of nodes: sources (with the subtypes SourceName and SourceURL) and locations (with various subtypes, all of which can be addressed directly via the Location label).

Both node types have a uri property with a UNIQUE constraint.
Now we have a CSV that links a source to a location, e.g.:

http://www.yukkalab.com/yukka-ontology#source_name.10b-5-daily-the	http://www.yukkalab.com/yukka-ontology#municipality.estado_de_baja_california

Naturally, my first thought to insert this is:

USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///source_locations.tsv' AS line FIELDTERMINATOR '	'
MATCH (source), (location:Resource:ns0__Location { uri: line[1] })
WHERE (source:ns0__SourceName OR source:ns0__SourceURL) AND source.uri = line[0]

MERGE (source)-[r:ns0__isLocatedIn]->(location)
ON CREATE
SET r.ns0__deduplicationCreatedDate = datetime()
ON MATCH
SET r.ns0__deduplicationUpdatedDate = datetime()

RETURN source.uri, location.uri

Running this, however, creates a Cartesian product between sources and locations, which is large: we have around 170k sources and 200k locations. In fact, on our 32 GB RAM configuration (single instance), we cannot run this query, at least not within a reasonable time (several hours).
A potential idea is to fold the input into a map such as:

sources_map[location, [list_of_sources]]

This would reduce the runtime from the quadratic O(n·m) node-pair comparison to a single linear pass over the input, and may allow us to insert the desired data.
Unfortunately, to me, there appears to be no straightforward way to implement this (from the given CSV) in Neo4j.
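For illustration, the grouping I have in mind is trivial to express outside of Cypher. A minimal Python sketch (the function name is my own; it assumes the two-column TSV layout shown above):

```python
import csv
from collections import defaultdict

def fold_sources_by_location(tsv_path):
    """One pass over the TSV: group source URIs by their location URI."""
    sources_by_location = defaultdict(list)
    with open(tsv_path, newline="") as f:
        # Each row is: source_uri <TAB> location_uri
        for source_uri, location_uri in csv.reader(f, delimiter="\t"):
            sources_by_location[location_uri].append(source_uri)
    return sources_by_location
```

The resulting map has one entry per distinct location, each holding the list of source URIs to link to it; the open question is how to express this folding directly in Cypher over the CSV.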

I am a bit surprised, as one could argue this is a really common operation that the query planner should recognise and optimise accordingly (in fact, running a DISTINCT on the sources returns fewer than 300 of them :)).

If you are saying that a single location can have multiple source uri values, then you could try the following. You will have to see whether performance is any better and whether memory becomes an issue, as the query has to group all the data by location first.

Also, you may want to try removing the RETURN statement to see how performance changes. It can add a lot of time and memory overhead for a large import, especially if you are running it in the Browser.

:auto
LOAD CSV FROM 'file:///source_locations.tsv' AS line FIELDTERMINATOR '	'
WITH line[1] AS location_uri, collect(line[0]) AS source_uris
CALL {
    WITH location_uri, source_uris
    MATCH (location:Resource:ns0__Location { uri: location_uri })
    UNWIND source_uris AS source_uri
    MATCH (source)
    WHERE (source:ns0__SourceName OR source:ns0__SourceURL) AND source.uri = source_uri
    MERGE (source)-[r:ns0__isLocatedIn]->(location)
    ON CREATE
    SET r.ns0__deduplicationCreatedDate = datetime()
    ON MATCH
    SET r.ns0__deduplicationUpdatedDate = datetime()
} IN TRANSACTIONS OF 1000 ROWS

Hi Garry,

Unfortunately, we did not see any performance improvement with your query, but the following recipe helped alleviate some of the pain (it is still very slow):

  • sort your input file by group; in our case this is done with sort -t$'\t' -k2,2
  • increase the transaction size (if your memory allows)
  • split your input into several files and run them concurrently (again, if your memory allows)
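For reference, the first and third steps can also be done in one go; a minimal Python sketch (the function name, chunk-file naming, and chunk count are made up for illustration, assuming the same two-column TSV):

```python
import csv

def sort_and_split(tsv_path, n_chunks, prefix="chunk"):
    """Sort rows by location (column 2), then write them out as n_chunks files.

    Sorting keeps rows for the same location adjacent, so each transaction's
    MERGEs touch only a few location nodes at a time.
    """
    with open(tsv_path, newline="") as f:
        rows = sorted(csv.reader(f, delimiter="\t"), key=lambda r: r[1])
    chunk_size = -(-len(rows) // n_chunks)  # ceiling division
    paths = []
    for i in range(n_chunks):
        path = f"{prefix}_{i}.tsv"
        with open(path, "w", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(
                rows[i * chunk_size:(i + 1) * chunk_size])
        paths.append(path)
    return paths
```

One caveat with concurrent imports: if a split boundary falls inside a location group, two transactions can MERGE against the same location node and contend for its lock, so splitting on a group boundary is preferable where possible.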

Having spent the better part of a day investigating this, it is absolutely bewildering to me. We can insert hundreds of thousands of nodes into our graph, with edges, in seconds, yet we cannot insert edges between lists of existing nodes in any comparable time. It appears this is indeed an edge case of Neo4j and not how it is designed to be used.
We will plan accordingly in the future.

Next question: do you have indexes on your uri properties for the relevant labels?

Hi Garry, thanks for following up!

Indeed we do have a constraint on the uri field:

| "n10s_unique_uri" | "CONSTRAINT ON ( resource:Resource ) ASSERT (resource.uri) IS UNIQUE" | "Constraint( id=4, name='n10s_unique_uri', type='UNIQUENESS', schema=(:Resource {uri}), ownedIndex=3 )" |

I should have noted this in the original question. The ns0__SourceName, ns0__SourceURL, and ns0__Location nodes all share the :Resource label.

Good to know. Your query does not match the source as a :Resource node, so the uniqueness index on :Resource(uri) cannot be used for it. Maybe the following will help use your index.

:auto
LOAD CSV FROM 'file:///source_locations.tsv' AS line FIELDTERMINATOR '	'
WITH line[1] AS location_uri, collect(line[0]) AS source_uris
CALL {
    WITH location_uri, source_uris
    MATCH (location:Resource:ns0__Location { uri: location_uri })
    UNWIND source_uris AS source_uri
    MATCH (source:Resource { uri: source_uri })
    WHERE (source:ns0__SourceName OR source:ns0__SourceURL)
    MERGE (source)-[r:ns0__isLocatedIn]->(location)
    ON CREATE
    SET r.ns0__deduplicationCreatedDate = datetime()
    ON MATCH
    SET r.ns0__deduplicationUpdatedDate = datetime()
} IN TRANSACTIONS OF 1000 ROWS

Have you looked at the query plan (prefix the query with EXPLAIN or PROFILE) to see if it is picking up the index?