Importing relationships from multiple CSV files

I am new to Neo4j. I have a problem statement where I need to import a CALL dataset. The dataset is in CSV format with the following columns: Source, Target, Timestamp, and Duration. For a similar small dataset (5,000 rows), I created Source and Target nodes (with uniqueness constraints on the source and target nodes) and a CALL relationship with timestamp and duration as properties.
Now, in the large dataset, I have ~70 million rows and 200,000 nodes. I have a separate CSV with the node ids, from which I have already created the nodes. I didn't completely understand how bulk import works, so I wrote a Python script to split my CSV into 70 CSVs, where each CSV has 1 million rows (saved as calls_0, calls_1, ..., calls_69). I then manually ran a Cypher query, changing the filename every time. It worked well for the first 10 files, but then I noticed that after adding the relationships from one file, the import gets slower for the next file. Now it is taking almost 25 minutes to import a single file.
Can someone point me to an efficient and easy way of doing this?
Here is the query:

:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH toInteger(line.Source) AS Source, 
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Target) AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)

Note: I am using Neo4j 4.0.3

Do you have an index on Person.pid?

I have created a uniqueness constraint on :Person(pid):

CREATE CONSTRAINT ON ( person:Person ) ASSERT (person.pid) IS UNIQUE

Have you tried using EXPLAIN on the Cypher statement? Look at the part where the two filters for :Person join; the toInteger() function may be stopping the query from using the index. If so, you will see a Cartesian product at that point in the plan.


I didn't know about EXPLAIN. Thanks.
Here is the output from EXPLAIN.

Yes, there is a cartesian product.

So in this case, what would be the optimal way to write this so that the index is used?

You could take out the toInteger() function on Source and Target, then look at EXPLAIN for that and compare it to the plan above. When I tried it, that changed the Cartesian product to a hash join.
There is also PROFILE, which reports the actual db hits for each step; it will tell you exactly where the problem is and confirm that it is in the Cartesian product step.
If you need Person.pid to be an integer, you would then have to convert it after the load using apoc.periodic.iterate().
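For reference, a conversion pass along these lines should work (a minimal sketch, assuming the APOC plugin is installed; the batch size is just illustrative):

CALL apoc.periodic.iterate(
  // driving query: stream all Person nodes
  'MATCH (p:Person) RETURN p',
  // action per batch: coerce pid to an integer (a no-op if it already is one)
  'SET p.pid = toInteger(p.pid)',
  {batchSize: 10000}
);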

Thanks for the quick response. Even after taking out toInteger(), it's still showing a Cartesian product.
(Note: when creating the nodes from the CSV containing all the node ids, I specified toInteger() for the pid.)
Also, I created a new test database where I didn't specify toInteger() while creating the nodes or while importing the relationships. EXPLAIN again gave me a CartesianProduct. In this database, I also ran PROFILE for two loads (the 2nd and 3rd CSVs) and found the db hits to be similar.
Here is the PROFILE while importing the 2nd CSV in the test database:

Here is the PROFILE while importing the 3rd CSV in the test database:

I also tried PROFILE on the initial database (with 30 CSVs imported); it looks the same.

I was wondering how removing toInteger() gave you a hash join while I am still getting a Cartesian product. Can you give me the steps of how you recreated it? Here is my sequence:

  1. Create a uniqueness constraint on Person.pid (the documentation says that pid gets indexed as well).
  2. Import the person nodes (I used toInteger() for the pid here; a sketch of what I ran is after this list).
  3. Import the relationships
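
For step 2, what I ran was roughly the following (a sketch; the file name and the pid column name here are placeholders):

LOAD CSV WITH HEADERS FROM 'file:///calls/nodes.csv' AS line
MERGE (:Person {pid: toInteger(line.pid)});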

This:
EXPLAIN LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH line.Source AS Source,
datetime(replace(line.Time,' ','T')) AS time,
line.Target AS Target,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: Source})
MATCH (p2:Person {pid: Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)
Gives a cartesian product, but this:
EXPLAIN LOAD CSV WITH HEADERS FROM 'file:///calls/calls_28.csv' AS line
WITH line,
datetime(replace(line.Time,' ','T')) AS time,
toInteger(line.Duration) AS Duration
MATCH (p1:Person {pid: line.Source})
MATCH (p2:Person {pid: line.Target})
MERGE (p1)-[rel:CALLS {time: time, duration: Duration}]->(p2)
RETURN count(rel)
gives me a hash join.


Thanks for the help. It worked. I also incorporated the suggestions from "cypher - Efficient way to import multiple csv's in neo4j - Stack Overflow", and the import is much faster now.

Just wondering if it's possible to iterate over the filenames and run the CSV import for all 70 files. I read that it is not possible with PERIODIC COMMIT.
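I was thinking of something along these lines to loop over the files (just a sketch, assuming the APOC plugin is installed; the file name pattern and batch size are illustrative and I haven't tested it):

UNWIND range(0, 69) AS i
WITH 'file:///calls/calls_' + i + '.csv' AS url
CALL apoc.periodic.iterate(
  // driving query: stream the rows of one file
  'LOAD CSV WITH HEADERS FROM $url AS line RETURN line',
  // action per batch: match the endpoints and merge the relationship
  'MATCH (p1:Person {pid: line.Source})
   MATCH (p2:Person {pid: line.Target})
   MERGE (p1)-[:CALLS {time: datetime(replace(line.Time, " ", "T")), duration: toInteger(line.Duration)}]->(p2)',
  {batchSize: 10000, params: {url: url}}
) YIELD batches, failedBatches
RETURN url, batches, failedBatches;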

At that scale, if you are starting from an empty database, you should really look at Neo4j Import.

You have to do a lot of the pre-processing yourself in your Python script, but the import speed will be remarkably fast. You would write out one (or more) person.csv files. The trick here is that you'll have to filter out duplicates when you write out the files, so each phone number appears only once, regardless of whether it was a source or a target. You should end up with 200,000 lines.

pid:ID
4035551212
4038675309
. . .

Then you create calls.csv to build all the relationships, with one row for each of the 70 million calls. The page I linked to above shows how you can have a single row of headers in one file, then calls01.csv, calls02.csv, etc., and import them all, even with a fileglob.

:START_ID,time,duration,:END_ID
4035551212,"2020-06-04 20:01:00",915,4038675309
4035551212,"2020-06-05 08:10:12",123,4038675309

Then the import is just a matter of running this command, with the database empty and stopped.

bin/neo4j-admin import --nodes=Person=import/person.csv  --relationships=CALLS=import/calls.csv

It just streams your input straight into the database store files. Then you start the database and all your data is loaded. Because we did the work of the database (de-duping phone numbers) ourselves and are now bypassing it, the advantage is (nearly) linear scaling as the data volume grows. As you noted in your post, even with constraints and indexes, imports slow down significantly as the database grows in size.

I hope that helps. Those documents seem much better than they were a couple years ago when I wrote our data loading routine.

Dave


Wow :smiley:. Thanks Dave. I need to create a similar graph for EMAILS data. I will surely try this.

You have to start from an empty database (in Community), so it is an all-or-nothing proposition.

The tipping point used to be 100,000 or 1 million(ish) rows. I can't remember which. It was a big effort to code up the file generation, but the speed of loading is well worth the effort.

We have eight types of nodes (300,000 nodes in total) and 14 types of relationships (400,000 rows). Building those files with Node from all our input takes about a minute. The import takes eight seconds.

I then start up the server and run a Cypher script through cypher-shell that creates indexes, constraints, etc.
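
For your data, a minimal version of that script could be just the constraint from earlier in this thread, piped into cypher-shell once the server is up (a sketch):

// post-import setup: recreate the uniqueness constraint (Neo4j 4.0 syntax)
CREATE CONSTRAINT ON (person:Person) ASSERT person.pid IS UNIQUE;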

It also interprets all fields as strings unless you tell it otherwise. In my previous example, you would need to change duration to duration:int to force the data type to integer.
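
With that change, the header row from the calls example above would be:

:START_ID,time,duration:int,:END_ID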

Dave
