Building a binary tree

Hi,
I'm kind of getting into neo4j, I saw that this tool is very scalable and I guess its not so clear how to create an efficient query for what I need.
What i'm trying to do is build a full binary tree. For that I have a .csv format as follows:

C1,SC1
C1,SC2
SC1,SSC1
SC1,SSC2
SC2,SSC3
SC3,SSC4
...etc...

Such that the first column are parents of the nodes in the second column.
At first, I tried creating the nodes+relationships at the same query but got a warning that its an Eager operator.
Then, I tried to settle for the next two queries:

LOAD CSV FROM 'file:///neonodesdistinct.csv' AS line 
CREATE (:f {name: line[0]});

Where neonodesdistinct.csv has all the distinct node names seperated by lines.
And then I ran:

USING PERIODIC COMMIT 						 		
LOAD CSV  FROM 'file:///neotree.csv' AS line 		
MATCH (n1:f {name: line[0]}), (n2:f {name: line[1]})
MERGE (n1)-[:Parent_off]->(n2);

Where neotree.csv has all the relationship data described at the top.
I started with a tree consisting of 8191 nodes.

The first query is very fast (adding nodes only), but the second one takes up to 2 minutes.
I'm running the latest neo4j-client version on a linux VM that has 49152 MB of RAM and the graph.db data was moved to an SSD with a lot of volume in it.

Here is the screenshot of EXPLAIN for second query:

These are file samples for my example (8191 nodes and relationships). Sorry for the .txt (couldn't upload .csv).
neonodesdistinct.txt (123.3 KB) neotree.txt (236.0 KB)

Any idea on how to build this tree efficiently?

Thanks a lot!

Goodevening,
I would add a index on the name property. And wait for the index to finish.
And then run the second query.

Yours Kindly
Omer

1 Like

You should also be aware that having a binary tree structure in a graph database may not be of much use with Cypher alone.

Neo4j MATCHes use dfs by default (only shortestPath() matches use bfs, but you need to pre-match the start and end nodes), and there's nothing in Cypher which allows triggering of a binary search on a segment of the graph, or insertion/deletion to a binary tree structure. That's best handled in actual code, not in the database.

If you could explain the need for a binary tree structure, maybe we can help you figure out an alternate means of meeting your goals.

1 Like

What do you mean by "add index on a name"?
The "index" is a new property to a node?
or a feature ?

Thanks for the reply!

My final goal is to describe a file system via a graph.
Neo4j is one direction that i'm checking.
I know that in reality the file system resembles something more of a forest data structure, but I wanted to have some similar graph in both query languages i'm checking, so I can compare benchmarks (other one is document store - MongoDB).

The target DB (to describe the file system) will consist of about ~1 billion nodes with every node having some file metadata.

Thank you for the reply!

Good morning,
Neo4j supports creating indexes. You could try this to speed up the second import proces. The syntaxis = CREATE INDEX ON :label(propertyname);

1 Like

Thank you very much! it surly did the trick.
Why is this optional and not made by default, though?

In your case with the first import you make all the Nodes which has no cost (linear).
But the second import has to search through the existing Node property name to find the node to make the relation with.
Neo4j is a Label Property Graph example: (x:Person{name: "Carol", age:"23", etc....}) "Person" is the label and "name" and "age" are properties and your free to create properties and create indexes on them. But creating a index has a price.

I think you should create indexes when needed example when you have search for some property value.

Yours kindly
Omer

1 Like

Indexes introduce an extra write time cost (when data is added or changed, any indexes that rely on that data must be updated in the transaction). If we indexed all properties of all nodes for all labels by default, we would be adding an excessive write time cost that would grow with the number of properties per node, and with the number of labels per node.

Also, indexing on every field is excessive, as not every field will be used for initial lookup of the nodes. The general rule is, if you intend to do initial lookups of nodes of certain labels by certain properties, add indexes for just those properties.

Here's our documentation on indexes.
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/