Neo4j relationship creation gets stuck during import

performance
cypher
import

(Dti) #1

Hello guys,

I am facing an issue with my local Neo4j database. I am trying to insert around 100,000 nodes and 380,000 relationships using the official C# Neo4j driver. Most of the time the data import takes around 23 seconds (which is fine for me). However, from time to time (roughly every third run!) the import gets stuck. This means that I send a request to create new relationships (via HTTP, see the Wireshark screenshot below) but never get a response back from the database (after some time my application runs into a timeout). This always happens when I try to insert new relationships; creating new nodes seems to work fine. The strange thing is that sometimes the import finishes successfully in ~23 seconds without getting stuck!

The query that gets stuck is the following:

string query = "UNWIND {relations} AS relation " +
              $"MATCH (source:label1), (target:label2) " +
              $"WHERE source.ID = relation.ID AND target.ID = relation.ID " +
              $"CREATE (source)-[:newRelation]->(target)";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "relations", ParameterSerializer.ToDictionary(relationInfos) } });
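For context: for the MATCH above to use index seeks rather than full label scans, both ID properties need schema indexes. I create them up front before the import; sketched here with the 3.x syntax and the placeholder labels from above:

```cypher
// Schema indexes on the ID properties (Neo4j 3.x syntax);
// label1/label2 stand in for the real labels used above
CREATE INDEX ON :label1(ID);
CREATE INDEX ON :label2(ID);
```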

This is the query to create new nodes which doesn't get stuck:

string query = "UNWIND {nodes} AS node " +
              $"CREATE (n:label1) " +
               "SET n = node";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "nodes", ParameterSerializer.ToDictionary(elements) } });

I always process 1000 nodes or relationships at once. I fiddled around with different heap and page cache sizes, but it doesn't seem to have any effect on my problem. If you look at the second screenshot, the memory size appears to be sufficient anyway. I also tried using the Bolt protocol instead of HTTP, but the problem still persists.

Do you have any explanation for what is causing this behavior? Or do you have some ideas for what I could try in order to work around this problem?

Thank you very much for your help,
kind regards,
Daniel

Additional info:

  • neo4j is running on a Windows 10 VM with 4 CPU cores assigned and 6 GB of RAM
  • neo4j desktop version: 3.2.10
  • neo4j server version: 3.4.9 (community)
  • neo4j C# driver version: 1.7.0
  • Configuration:
    • dbms.memory.heap.initial_size=4G
    • dbms.memory.heap.max_size=4G
    • dbms.memory.pagecache.size=1G

This is what it looks like on the wire:

This is the workload of the VM:
(Task Manager screenshot)


(Michael Hunger) #2

Do you execute the queries concurrently?
Perhaps it makes sense to try smaller batches?

Would it be possible to take a thread dump with jstack <pid> of the server when it is stuck?


(Dti) #3

Thanks for your reply Michael!

No, at the moment I do the following:

using (ISession session = client.Session())
{
    session.Run(query, value);
}

I already tried using bigger and smaller batches, but the problem remains.

Good point. I will do that later and post the dump here.


(Dti) #4

Hello again,

I have some news: after some changes in my code the database doesn't get stuck anymore. It simply takes a little longer to process all the data.

However, what is strange is that the first batch I insert takes a whole lot longer than the rest. The first time I execute the following query, it takes much longer than the subsequent runs.

string query = "UNWIND {relations} AS relation " +
              $"MATCH (source:label1), (target:label2) " +
              $"WHERE source.ID = relation.ID AND target.ID = relation.ID " +
              $"CREATE (source)-[:newRelation]->(target)";
return ExecuteQuery(client, query, new Dictionary<string, object> { { "relations", ParameterSerializer.ToDictionary(relationInfos) } });

Is there any explanation or some way to avoid that?

Kind regards and thanks for your help,
Daniel


(Michael Hunger) #5

How much longer?
It should only take a few milliseconds to compile the query.

What I could imagine is that it initially has to load the existing graph data from disk into the page cache and the index.


(Dti) #6

I would say around three times as long.

I first create all the nodes (around 100,000 in batches of 1000) and after that I create all the relationships (around 300,000 in batches of 1000). Node creation runs fine. It is the first query creating the first 1000 relationships (see the example in my other post) that takes three times as long as the following queries.


(Michael Hunger) #7

Do you create the indexes upfront?
Do you wait for them to become online?

Can you run an EXPLAIN of your relationship statement after you have created the nodes?

It could also help to call db.resampleOutdatedIndexes(); after you have created the nodes, so that the first query plan that gets compiled is already correct.
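The whole sequence would look roughly like this (a sketch assuming the Neo4j 3.x index syntax and built-in procedures; db.awaitIndexes blocks until every index is online or the timeout in seconds expires):

```cypher
// Before the import: create the indexes backing the MATCH
CREATE INDEX ON :PortEvent(ID);
CREATE INDEX ON :Trace(ID);

// Block until every index has finished populating (timeout: 300 s)
CALL db.awaitIndexes(300);

// After the node batches: refresh outdated index statistics so the
// first relationship query is already planned with index seeks
CALL db.resampleOutdatedIndexes();
```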


(Dti) #8

Do you create the indexes upfront?

Yes.

Do you wait for them to become online?

No. I create the indexes up front when the DB is empty, then I start node & relationship creation directly afterwards. I actually don't know how to check whether an index is online. Calling db.resampleOutdatedIndexes(); doesn't seem to help either.
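(For reference, the index state can be inspected with the built-in procedure below; assuming Neo4j 3.x, its state column reports ONLINE, POPULATING, or FAILED:)

```cypher
// List all indexes together with their current population state
CALL db.indexes();
```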

Can you run an EXPLAIN of your relationship statement after you have created the nodes?

This is the EXPLAIN of the query that takes extra long:

The query itself looks like this (shortened a bit):

UNWIND [
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 3,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  },
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 4,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  },
  ...
  {
    Type: "Create",
    SourceLabel: "PortEvent",
    SourceID: 1002,
    TargetLabel: "Trace",
    TargetID: 1,
    RelationName: "SAVED_IN"
  }
] AS info
MATCH (source:PortEvent), (target:Trace)
WHERE source.ID=info.SourceID
AND target.ID = info.TargetID
CREATE (source)-[:SAVED_IN]->(target)
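If the EXPLAIN above shows NodeByLabelScan operators instead of NodeIndexSeek, one thing to try is forcing the planner with explicit index hints. A sketch, assuming indexes on :PortEvent(ID) and :Trace(ID) exist:

```cypher
UNWIND {relations} AS info
MATCH (source:PortEvent), (target:Trace)
USING INDEX source:PortEvent(ID)
USING INDEX target:Trace(ID)
WHERE source.ID = info.SourceID
  AND target.ID = info.TargetID
CREATE (source)-[:SAVED_IN]->(target)
```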