How long should it take to create several millions of relationships?

I am using the latest Neo4j.Driver package (4.2.0) and the latest community edition of the Neo4j server (4.2.3).

I must be doing something wrong, because my query takes hours to complete.

I have 4 CSV files:

  1. XyzTypes.csv - defines 96,328 type nodes.
  2. XyzMethods.csv - defines 975,507 methods across all the types.
  3. XyzTypeTypeDependencies.csv - defines 121,834 type-type DEPENDS_ON relationships.
  4. XyzTypeMethods.csv - defines 973,972 type-method DECLARES relationships.

The code should be very simple: it just needs to load all the CSV files and create the respective Type and Method nodes and the relationships between them.

Here is my code:

var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
try
{
    Console.Write("[DI");
    await session.RunAsync("DROP INDEX type_id_index IF EXISTS");
    await session.RunAsync("DROP INDEX method_id_index IF EXISTS");

    Console.Write("][C");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync("match ()-[r]->() delete r");
        await tx.RunAsync("match (n) delete n");
        return default(object);
    });

    Console.Write("][T");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypes.csv' AS line
CREATE (:Type {
    typeId: toInteger(line.id),
    name: line.name,
    fullName: line.fullName,
    isCompilerGenerated: toBoolean(line.isCompilerGenerated),
    asmName: line.asmName
})");
        return default(object);
    });

    Console.Write("][M");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzMethods.csv' AS line
CREATE (:Method {
    methodId: toInteger(line.id),
    name: line.name,
    fullName: line.fullName,
    isCompilerGenerated: toBoolean(line.isCompilerGenerated)
})");
        return default(object);
    });

    Console.Write("][CI");
    await session.RunAsync("CREATE INDEX type_id_index FOR (t:Type) ON (t.typeId)");
    await session.RunAsync("CREATE INDEX method_id_index FOR (m:Method) ON (m.methodId)");
    
    Console.Write("][TT");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
MATCH (src:Type), (dst:Type)
WHERE src.typeId = toInteger(line.src) AND dst.typeId = toInteger(line.dst)
CREATE (src)-[:DEPENDS_ON]->(dst)
");

    Console.Write("][TM");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
MATCH (src:Type), (dst:Method)
WHERE src.typeId = toInteger(line.src) AND dst.methodId = toInteger(line.dst)
CREATE (src)-[:DECLARES]->(dst)
");

    Console.Write("] ... ");
}
finally
{
    await session.CloseAsync();
    await driver.CloseAsync();
}

The CREATE INDEX queries return immediately. That could be legitimate; I do not know how quickly Neo4j can index a numeric property on about 1M nodes. Running :schema in the browser confirms that the two indexes exist, but I have a feeling they are not being used.

Running the above code takes almost 3 hours. What am I doing wrong?

EDIT 1

So I changed the last two queries to use the MERGE clause:

    Console.Write("][TT");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
MERGE (src:Type {typeId: toInteger(line.src)})-[:DEPENDS_ON]->(dst:Type {typeId: toInteger(line.dst)})
");
    
    Console.Write("][TM");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
MERGE (src:Type {typeId: toInteger(line.src)})-[:DECLARES]->(dst:Method {methodId: toInteger(line.dst)})
");

This is supposed to be much better, because I think my earlier queries caused a Cartesian product between the nodes. Yet the last query is still running and I have no idea how long it will take, so it is still bad.

I also asked this question on Stack Overflow: "Creating several millions of relationships in Neo4j takes a very long time".

Did the reply on StackOverflow sort it out?

Yes, it did. Thank you very much.

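To tie this one up, the critical piece was running CALL db.awaitIndexes() after creating the indexes, so that we wait until the indexes are online before running the queries that rely on them. CREATE INDEX returns immediately and the index is populated in the background, so a load that starts right away may not be able to use it.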
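
A minimal sketch of that step, assuming the same session object and index names as in the question. The IndexSetup class, the method name, and the 600-second timeout are hypothetical choices for illustration, not something from the original code:

```csharp
using System.Threading.Tasks;
using Neo4j.Driver;

static class IndexSetup
{
    // Creates both lookup indexes, then waits until they are online.
    // Hypothetical helper: call it where the question prints "][CI".
    public static async Task CreateIndexesAndWaitAsync(IAsyncSession session)
    {
        var create1 = await session.RunAsync(
            "CREATE INDEX type_id_index FOR (t:Type) ON (t.typeId)");
        await create1.ConsumeAsync();

        var create2 = await session.RunAsync(
            "CREATE INDEX method_id_index FOR (m:Method) ON (m.methodId)");
        await create2.ConsumeAsync();

        // db.awaitIndexes takes a timeout in seconds; 600 is an arbitrary value.
        var wait = await session.RunAsync("CALL db.awaitIndexes(600)");
        await wait.ConsumeAsync();
    }
}
```

The relationship queries ("][TT" and "][TM") would then run only after this method returns.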

The Cartesian product warning can also be disregarded, as that pattern is required when you're matching on two nodes with the intent to create a relationship between them. With both lookups served by an index, it ends up being a 1 x 1 Cartesian product per row, so there are no issues with cardinality.
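
One way to check that the lookups really are index-backed is to EXPLAIN a single-row version of the query and print the operators in the plan. This is only a sketch: the PlanCheck class, the method names, and the parameter values are made up for illustration, and the exact operator names depend on the Neo4j version. With the indexes online you would expect index seek operators rather than label scans.

```csharp
using System;
using System.Threading.Tasks;
using Neo4j.Driver;

static class PlanCheck
{
    // EXPLAIN only plans the query; it does not execute it or create anything.
    public static async Task PrintPlanAsync(IAsyncSession session)
    {
        var cursor = await session.RunAsync(@"
EXPLAIN
MATCH (src:Type), (dst:Method)
WHERE src.typeId = $src AND dst.methodId = $dst
CREATE (src)-[:DECLARES]->(dst)",
            new { src = 1, dst = 1 }); // placeholder ids, never looked up

        var summary = await cursor.ConsumeAsync();
        if (summary.HasPlan)
        {
            Print(summary.Plan, 0);
        }
    }

    // Walks the plan tree and prints one operator per line, indented by depth.
    static void Print(IPlan plan, int depth)
    {
        Console.WriteLine(new string(' ', depth * 2) + plan.OperatorType);
        foreach (var child in plan.Children)
        {
            Print(child, depth + 1);
        }
    }
}
```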