How long should it take to create several millions of relationships?

mark.kharitonov · March 3, 2021, 6:49pm

I am using the latest Neo4j.Driver package (4.2.0) and the latest community edition of the Neo4j server (4.2.3).

I must be doing something wrong, because my query takes hours to complete.

I have 4 CSV files:

XyzTypes.csv - defines 96,328 type nodes.
XyzMethods.csv - defines 975,507 methods across all the types.
XyzTypeTypeDependencies.csv - defines 121,834 type-type DEPENDS_ON relationships.
XyzTypeMethods.csv - defines 973,972 type-method DECLARES relationships.

The following code should be very simple. It just needs to load all the CSV files and create the corresponding Type and Method nodes and the relationships between them.

Here is my code:

var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
try
{
    Console.Write("[DI");
    await session.RunAsync("DROP INDEX type_id_index IF EXISTS");
    await session.RunAsync("DROP INDEX method_id_index IF EXISTS");

    Console.Write("][C");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync("match ()-[r]->() delete r");
        await tx.RunAsync("match (n) delete n");
        return default(object);
    });

    Console.Write("][T");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypes.csv' AS line
CREATE (:Type {
    typeId: toInteger(line.id),
    name: line.name,
    fullName: line.fullName,
    isCompilerGenerated: toBoolean(line.isCompilerGenerated),
    asmName: line.asmName
})");
        return default(object);
    });

    Console.Write("][M");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzMethods.csv' AS line
CREATE (:Method {
    methodId: toInteger(line.id),
    name: line.name,
    fullName: line.fullName,
    isCompilerGenerated: toBoolean(line.isCompilerGenerated)
})");
        return default(object);
    });

    Console.Write("][CI");
    await session.RunAsync("CREATE INDEX type_id_index FOR (t:Type) ON (t.typeId)");
    await session.RunAsync("CREATE INDEX method_id_index FOR (m:Method) ON (m.methodId)");
    
    Console.Write("][TT");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
MATCH (src:Type), (dst:Type)
WHERE src.typeId = toInteger(line.src) AND dst.typeId = toInteger(line.dst)
CREATE (src)-[:DEPENDS_ON]->(dst)
");

    Console.Write("][TM");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
MATCH (src:Type), (dst:Method)
WHERE src.typeId = toInteger(line.src) AND dst.methodId = toInteger(line.dst)
CREATE (src)-[:DECLARES]->(dst)
");

    Console.Write("] ... ");
}
finally
{
    await session.CloseAsync();
    await driver.CloseAsync();
}

The CREATE INDEX queries return immediately. That could be legitimate; I do not know how fast Neo4j can index a numeric property on about 1M nodes. Running :schema in the browser confirms that the two indexes exist, but I have a feeling they are not working.
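
One way to check whether the indexes are actually usable is to list their state and population progress. The following is only a minimal sketch, assuming the same connection settings as the code above and a Neo4j 4.x server where the db.indexes() procedure is available; the class name is made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Neo4j.Driver;

public static class IndexStateSketch // hypothetical name, not part of the original code
{
    public static async Task Main()
    {
        // Connection details copied from the question; adjust as needed.
        var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
        var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
        try
        {
            // db.indexes() reports each index with its state (e.g. POPULATING or ONLINE)
            // and how far population has progressed.
            var cursor = await session.RunAsync(
                "CALL db.indexes() YIELD name, state, populationPercent " +
                "RETURN name, state, populationPercent");
            var records = await cursor.ToListAsync();
            foreach (var record in records)
            {
                Console.WriteLine(
                    $"{record["name"].As<string>()}: {record["state"].As<string>()} " +
                    $"({record["populationPercent"].As<double>()}%)");
            }
        }
        finally
        {
            await session.CloseAsync();
            await driver.CloseAsync();
        }
    }
}
```

An index that is still populating cannot be used by the planner, so a query started at that point may fall back to label scans for every CSV row.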

Running the above code takes almost 3 hours. What am I doing wrong?

EDIT 1

So I changed the last two queries to use the MERGE clause:

    Console.Write("][TT");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
MERGE (src:Type {typeId: toInteger(line.src)})-[:DEPENDS_ON]->(dst:Type {typeId: toInteger(line.dst)})
");
    
    Console.Write("][TM");
    await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
MERGE (src:Type {typeId: toInteger(line.src)})-[:DECLARES]->(dst:Method {methodId: toInteger(line.dst)})
");

It is supposed to be much better now, because I think what I did before caused a Cartesian product between the nodes. Yet the last query is still running and I have no idea how long it will take, so it is still bad.
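
A caveat on the MERGE form above: MERGE on a full pattern either matches the whole pattern or creates the whole pattern, so for any row whose relationship does not exist yet it creates new src and dst nodes instead of reusing the existing ones. The following is a minimal sketch of a variant that first matches both endpoints by their ids and merges only the relationship. It assumes the same file path and connection settings as the code above, and the class name is hypothetical.

```csharp
using System.Threading.Tasks;
using Neo4j.Driver;

public static class MergeRelationshipsSketch // hypothetical name, not part of the original code
{
    public static async Task Main()
    {
        var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
        var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
        try
        {
            // USING PERIODIC COMMIT requires an auto-commit transaction,
            // which is why this goes through session.RunAsync as in the original code.
            var cursor = await session.RunAsync(@"
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
MATCH (src:Type {typeId: toInteger(line.src)})
MATCH (dst:Type {typeId: toInteger(line.dst)})
MERGE (src)-[:DEPENDS_ON]->(dst)");

            // Wait for the server to finish the import before closing the session.
            await cursor.ConsumeAsync();
        }
        finally
        {
            await session.CloseAsync();
            await driver.CloseAsync();
        }
    }
}
```

The same shape would apply to the DECLARES relationships, with dst matched as (dst:Method {methodId: toInteger(line.dst)}).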

I also asked this question on Stack Overflow, under the .net tag: "Creating several millions of relationships in Neo4j takes a very long time".

markhneedham · March 4, 2021, 10:33am

Did the reply on StackOverflow sort it out?

mark.kharitonov · March 4, 2021, 3:18pm

Yes, it did. Thank you very much.

andrew_bowman · March 5, 2021, 5:02am

To tie this one up: the critical piece was adding CALL db.awaitIndexes() after the index creation, so that we wait until the indexes are online before running the queries that rely on them.
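
A minimal sketch of that fix, assuming the same connection settings and index definitions as the original post and that the indexes were dropped beforehand, as the original code does. The class name is hypothetical, and the explicit 300-second timeout is just the procedure's documented default written out.

```csharp
using System.Threading.Tasks;
using Neo4j.Driver;

public static class AwaitIndexesSketch // hypothetical name, not part of the original code
{
    public static async Task Main()
    {
        var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
        var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
        try
        {
            // CREATE INDEX returns as soon as the index is registered;
            // population continues in the background.
            await session.RunAsync("CREATE INDEX type_id_index FOR (t:Type) ON (t.typeId)");
            await session.RunAsync("CREATE INDEX method_id_index FOR (m:Method) ON (m.methodId)");

            // Block until all indexes are online (or the timeout in seconds expires).
            var cursor = await session.RunAsync("CALL db.awaitIndexes(300)");
            await cursor.ConsumeAsync();

            // Only now run the LOAD CSV queries that look nodes up by typeId / methodId.
        }
        finally
        {
            await session.CloseAsync();
            await driver.CloseAsync();
        }
    }
}
```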

The Cartesian product warning can also be disregarded: a Cartesian product is expected when you match two nodes with the intent of creating a relationship between them. With an index lookup on each side it ends up being a 1 x 1 Cartesian product per row, so cardinality is not an issue.
