Neo4j relationship creation gets stuck during import

performance
cypher
import

(Dti) #1

Hello guys,

I am facing an issue with my local Neo4j database. I am trying to insert around 100,000 nodes and 380,000 relationships using the official C# Neo4j driver. Most of the time the data import takes around 23 seconds (which is fine for me). However, from time to time (roughly every third run!) the import gets stuck. This means that I send a request to create new relationships (via HTTP, see the Wireshark screenshot below) but never get a response back from the database (after some time my application runs into a timeout). This always happens when I try to insert new relationships; creating new nodes seems to work fine. The strange thing is that sometimes the import finishes successfully in ~23 seconds without getting stuck!

The query that gets stuck is the following:

string query = "UNWIND {relations} AS relation " +
              $"MATCH (source:label1), (target:label2) " +
              $"WHERE source.ID = relation.ID AND target.ID = relation.ID " +
              $"CREATE (source)-[:newRelation]->(target)";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "relations", ParameterSerializer.ToDictionary(relationInfos) } });
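For context: for the MATCH above to use index seeks rather than full label scans, both ID properties need schema indexes. I create them up front before the import; sketched here with the 3.x syntax and the placeholder labels from above:

```cypher
// Schema indexes on the ID properties (Neo4j 3.x syntax);
// label1/label2 stand in for the real labels used above
CREATE INDEX ON :label1(ID);
CREATE INDEX ON :label2(ID);
```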

This is the query to create new nodes which doesn't get stuck:

string query = "UNWIND {nodes} AS node " +
              $"CREATE (n:label1) " +
               "SET n = node";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "nodes", ParameterSerializer.ToDictionary(elements) } });

I always process 1000 nodes or relationships at once. I fiddled around with different heap and page cache sizes, but it doesn't seem to have any effect on my problem. If you look at the second screenshot, the memory size appears to be sufficient anyway. I also tried using the Bolt protocol instead of HTTP, but the problem still persists.

Do you have any explanation for what is causing this behavior? Or do you have some ideas for what I could try in order to work around this problem?

Thank you very much for your help,
kind regards,
Daniel

Additional info:

  • neo4j is running on a Windows 10 VM with 4 CPU cores assigned and 6 GB of RAM
  • neo4j desktop version: 3.2.10
  • neo4j server version: 3.4.9 (community)
  • neo4j C# driver version: 1.7.0
  • Configuration:
    • dbms.memory.heap.initial_size=4G
    • dbms.memory.heap.max_size=4G
    • dbms.memory.pagecache.size=1G

This is what it looks like on the wire:

This is the workload of the VM:
(Task Manager screenshot)


(Michael Hunger) #2

Do you execute the queries concurrently?
Perhaps it makes sense to try smaller batches?

Would it be possible to take a thread dump with jstack <pid> of the server when it is stuck?


(Dti) #3

Thanks for your reply Michael!

No, at the moment I do the following:

using (ISession session = client.Session())
{
    session.Run(query, value);
}

I already tried using bigger and smaller batches, but the problem remains.

Good point. I will do that later and post the dump here.


(Dti) #4

Hello again,

I have some news: after some changes in my code the database doesn't get stuck anymore. It simply takes a little longer to process all the data.

However, what is strange is that the first batch I insert takes a whole lot longer than the rest. The first time I execute the following query, it takes much longer than the subsequent runs.

string query = "UNWIND {relations} AS relation " +
              $"MATCH (source:label1), (target:label2) " +
              $"WHERE source.ID = relation.ID AND target.ID = relation.ID " +
              $"CREATE (source)-[:newRelation]->(target)";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "relations", ParameterSerializer.ToDictionary(relationInfos) } });

Is there any explanation or some way to avoid that?

Kind regards and thanks for your help,
Daniel


(Michael Hunger) #5

How much longer?
It should only take a few milliseconds to compile the query.

What I could imagine is that it initially has to load the existing graph data from disk into the page cache and the index.


(Dti) #6

I would say around three times as long.

I first create all the nodes (around 100,000 in batches of 1000) and after that I create all the relationships (around 300,000 in batches of 1000). Node creation runs fine. It is the first query creating the first 1000 relationships (see the example in my other post) that takes three times as long as the following queries.


(Michael Hunger) #7

Do you create the indexes upfront?
Do you wait for them to become online?

Can you run an EXPLAIN of your relationship statement after you have created the nodes?

It could also help to call db.resampleOutdatedIndexes(); after you have created the nodes, so that the first query plan that gets compiled is already correct.
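The whole sequence would look roughly like this (a sketch assuming the Neo4j 3.x index syntax and built-in procedures; db.awaitIndexes blocks until every index is online or the timeout in seconds expires):

```cypher
// Before the import: create the indexes backing the MATCH
CREATE INDEX ON :PortEvent(ID);
CREATE INDEX ON :Trace(ID);

// Block until every index has finished populating (timeout: 300 s)
CALL db.awaitIndexes(300);

// After the node batches: refresh outdated index statistics so the
// first relationship query is already planned with index seeks
CALL db.resampleOutdatedIndexes();
```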


(Dti) #8

Do you create the indexes upfront?

Yes.

Do you wait for them to become online?

No. I create the indexes up front when the DB is empty, then I start node & relationship creation directly afterwards. I actually don't know how to check whether an index is online. Calling db.resampleOutdatedIndexes(); doesn't seem to help either.
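(For reference, the index state can be inspected with the built-in procedure below; assuming Neo4j 3.x, its state column reports ONLINE, POPULATING, or FAILED:)

```cypher
// List all indexes together with their current population state
CALL db.indexes();
```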

Can you run an EXPLAIN of your relationship statement after you have created the nodes?

This is the EXPLAIN of the query that takes extra long:

The query itself looks like this (shortened a bit):

UNWIND [
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 3,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  },
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 4,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  },
  ...
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 1002,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  }
] AS info
MATCH (source:PortEvent), (target:Trace)
WHERE source.ID=info.SourceID
AND target.ID = info.TargetID
CREATE (source)-[:SAVED_IN]->(target)
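If the EXPLAIN above shows NodeByLabelScan operators instead of NodeIndexSeek, one thing to try is forcing the planner with explicit index hints. A sketch, assuming indexes on :PortEvent(ID) and :Trace(ID) exist:

```cypher
UNWIND {relations} AS info
MATCH (source:PortEvent), (target:Trace)
USING INDEX source:PortEvent(ID)
USING INDEX target:Trace(ID)
WHERE source.ID = info.SourceID
  AND target.ID = info.TargetID
CREATE (source)-[:SAVED_IN]->(target)
```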