I am using the latest Neo4j.Driver package (4.2.0) and the latest community edition of the Neo4j server (4.2.3).
I must be doing something wrong, because my query takes hours to complete.
I have 4 CSV files:
- XyzTypes.csv - defines 96,328 type nodes.
- XyzMethods.csv - defines 975,507 methods across all the types.
- XyzTypeTypeDependencies.csv - defines 121,834 type-type DEPENDS_ON relationships.
- XyzTypeMethods.csv - defines 973,972 type-method DECLARES relationships.
The following code should be very simple. It just needs to load all the CSV files and create the respective Type and Method nodes and the relationships between them.
Here is my code:
var driver = GraphDatabase.Driver("bolt://localhost:7687", AuthTokens.Basic("neo4j", "1"));
var session = driver.AsyncSession(o => o.WithDatabase("neo4j"));
try
{
    Console.Write("[DI");
    await session.RunAsync("DROP INDEX type_id_index IF EXISTS");
    await session.RunAsync("DROP INDEX method_id_index IF EXISTS");
    Console.Write("][C");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync("match ()-[r]->() delete r");
        await tx.RunAsync("match (n) delete n");
        return default(object);
    });
    Console.Write("][T");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
            LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypes.csv' AS line
            CREATE (:Type {
                typeId: toInteger(line.id),
                name: line.name,
                fullName: line.fullName,
                isCompilerGenerated: toBoolean(line.isCompilerGenerated),
                asmName: line.asmName
            })");
        return default(object);
    });
    Console.Write("][M");
    await session.WriteTransactionAsync(async tx =>
    {
        await tx.RunAsync(@"
            LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzMethods.csv' AS line
            CREATE (:Method {
                methodId: toInteger(line.id),
                name: line.name,
                fullName: line.fullName,
                isCompilerGenerated: toBoolean(line.isCompilerGenerated)
            })");
        return default(object);
    });
    Console.Write("][CI");
    await session.RunAsync("CREATE INDEX type_id_index FOR (t:Type) ON (t.typeId)");
    await session.RunAsync("CREATE INDEX method_id_index FOR (m:Method) ON (m.methodId)");
    Console.Write("][TT");
    await session.RunAsync(@"
        USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
        MATCH (src:Type), (dst:Type)
        WHERE src.typeId = toInteger(line.src) AND dst.typeId = toInteger(line.dst)
        CREATE (src)-[:DEPENDS_ON]->(dst)
        ");
    Console.Write("][TM");
    await session.RunAsync(@"
        USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
        MATCH (src:Type), (dst:Method)
        WHERE src.typeId = toInteger(line.src) AND dst.methodId = toInteger(line.dst)
        CREATE (src)-[:DECLARES]->(dst)
        ");
    Console.Write("] ... ");
}
finally
{
    await session.CloseAsync();
    await driver.CloseAsync();
}
The CREATE INDEX queries return immediately. Could be legit, I do not know how fast Neo4j can index a number property in about 1M nodes. Running :schema in the browser confirms the two indices, but I have a feeling they are not working.
Running the above code takes almost 3 hours. What am I doing wrong?
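One way to test the suspicion about the indexes is sketched below. It assumes Neo4j 4.2, where SHOW INDEXES reports each index's state and population progress and db.awaitIndexes blocks until index population finishes. The literal ids 1 and 2 are placeholders chosen for illustration, not values from the CSV files, and the 300-second timeout is an arbitrary choice.

```cypher
// List all indexes; type_id_index and method_id_index should report
// state ONLINE and populationPercent 100.0 before the relationship load starts.
SHOW INDEXES;

// Wait (up to 300 seconds, an arbitrary timeout) until all indexes are online.
CALL db.awaitIndexes(300);

// Ask the planner how it would run the lookup without executing it.
// The ids 1 and 2 are placeholders. NodeIndexSeek operators in the plan mean
// the indexes are used; NodeByLabelScan means every node with that label is scanned.
EXPLAIN
MATCH (src:Type), (dst:Method)
WHERE src.typeId = 1 AND dst.methodId = 2
CREATE (src)-[:DECLARES]->(dst);
```

If the plan shows label scans, each CSV row would be compared against every Type and Method node, which could account for a multi-hour load. Whether that is what happens here is not confirmed by anything above.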
EDIT 1
So I changed the last two queries to use the MERGE clause:
Console.Write("][TT");
await session.RunAsync(@"
    USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeTypeDependencies.csv' AS line
    MERGE (src:Type {typeId: toInteger(line.src)})-[:DEPENDS_ON]->(dst:Type {typeId: toInteger(line.dst)})
    ");
Console.Write("][TM");
await session.RunAsync(@"
    USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
    MERGE (src:Type {typeId: toInteger(line.src)})-[:DECLARES]->(dst:Method {methodId: toInteger(line.dst)})
    ");
This is supposed to be much better, because I think my earlier queries caused a Cartesian product between the nodes. Yet the last query still has not finished, and I have no idea how long it will take, so it is still bad.
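One thing to be aware of with the MERGE version: MERGE applied to a whole pattern either matches the entire pattern or creates the entire pattern. When the relationship does not exist yet, it creates new Type and Method nodes instead of reusing the ones already loaded. A common alternative, sketched here under assumptions, looks up both endpoints first and merges only the relationship. The batch size of 10000 is an arbitrary illustrative value, and whether this form fixes the slowness is not established by anything above.

```cypher
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/XyzTypeMethods.csv' AS line
// Look up each endpoint separately, so each lookup can use the
// index on typeId or methodId once the indexes are online.
MATCH (src:Type {typeId: toInteger(line.src)})
MATCH (dst:Method {methodId: toInteger(line.dst)})
// Merge only the relationship between the two existing nodes;
// no new nodes are created by this clause.
MERGE (src)-[:DECLARES]->(dst)
```

The same shape would apply to the DEPENDS_ON load, with both endpoints matched as Type nodes. Rows whose ids match no existing node are skipped rather than creating anything.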
I also asked this question on Stack Overflow: "Creating several millions of relationships in Neo4j takes a very long time".