Causal Cluster Performance Impact

Hi,
I have a causal cluster and an HA instance of Neo4j. I am loading a huge dataset (probably around 1 TB) into each and comparing the metrics I get for loading the data and for some hop queries.
HA Instance: 1 AWS r4.2xlarge instance
Causal cluster config: 3 core servers (3 x r4.2xlarge)

While loading the data, the cluster takes much longer, and I can see that the same data is replicated to all 3 instances in the causal cluster. Is there any way to reduce the replication and avoid this performance impact in distributed mode?

My use case is to build a Neo4j causal cluster whose performance is comparable to that of a single HA Neo4j instance.

Note: I am using Neo4j Enterprise Edition 3.5.14

If you want to optimize for absolute highest throughput writes, you should do your mega load into a single instance, and then use the resulting database to seed a cluster.

Fundamentally -- in a cluster, before your write can be acknowledged, a majority of cluster members have to agree on the write. This implies network round-tripping between nodes, and slower overall write performance, but this is the same thing that also guarantees consistency and safety of your data.
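
To make the majority rule concrete, here is a rough sketch of the arithmetic. It is a simplified model, not Neo4j's actual replication code: it assumes the member taking the write counts as one vote and then waits for enough acknowledgements from the other members to reach a majority. The function names and the latency numbers are made up for illustration.

```python
from typing import Sequence


def majority(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def commit_latency_ms(local_ms: float, follower_rtts_ms: Sequence[float]) -> float:
    """Estimate the time until a majority of members holds the write.

    Simplified model: the member taking the write counts as one vote,
    so it needs (majority - 1) acknowledgements from the others.
    """
    cluster_size = len(follower_rtts_ms) + 1
    needed = majority(cluster_size) - 1
    if needed == 0:
        # Single instance: nothing to wait for beyond the local write.
        return local_ms
    # Wait for the `needed`-th fastest acknowledgement, not the slowest one.
    return local_ms + sorted(follower_rtts_ms)[needed - 1]


# Hypothetical numbers: 1 ms local write, two other core members.
print(commit_latency_ms(1.0, []))          # single instance
print(commit_latency_ms(1.0, [0.8, 2.5]))  # 3-core cluster
```

In this model a 3-core cluster only waits for the faster of the two other members, so one slow member does not hold up the write, but every write still pays at least one network round trip that a single instance never pays.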

There are various causal_clustering.* configuration parameters that can be tuned, and in your particular instance, you can probably get improved overall writes with causal cluster, but because of the fundamental technical approach, single instance loads will be fastest.

Hi David,

Thanks for explaining why an HA setup can be slow. However, I'm not sure if I understand the following sentence:

If you want to optimize for absolute highest throughput writes, you should do your mega load into a single instance, and then use the resulting database to seed a cluster.

This looks to me like a common requirement, and it's a bit disappointing that it does not come out of the box. Also, can you briefly explain what you mean by seeding the cluster?

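Seeding a cluster means starting a new cluster from an existing dataset. In practice, you load the data into a single instance, put a copy of the resulting store on each core member before the cluster forms, and then start the cluster, so the bulk load itself never goes through replication.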

Fundamentally, Neo4j is replicating your data in order to ensure that it's safe. If Neo4j did not do this, it would not be able to guarantee the durability of your data, or the causal consistency model itself. It should not be surprising that replicating the data to more than one machine takes longer than saving the data to just one machine.

Typically, when users encounter speed issues with importing data, the problem can be in a number of places other than cluster topology. So for example, you might check the Cypher queries you're using, heap/page cache settings, parallelism, and data model concerns before looking to cluster replication as the source of the issue.

While a load does take longer in a cluster, replication is usually a very small % of the total time, and write performance is typically dominated by other factors.
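
As one example of the "check your Cypher queries" point, a common thing to look at is whether the load sends one row per transaction or groups rows into batches. Below is a minimal sketch of the batching side only. The `Person` label, the property names, and the batch size are hypothetical, and the helper is plain Python that does not talk to Neo4j at all.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List

# Hypothetical query: the label and properties are placeholders for your own model.
LOAD_QUERY = """
UNWIND $rows AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name
"""


def batches(rows: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` rows from `rows`."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Example input; in a real load these rows would come from your source files.
rows = [{"id": i, "name": f"person-{i}"} for i in range(25)]
for chunk in batches(rows, 10):
    # Each chunk would be sent as the $rows parameter of one transaction.
    print(len(chunk))
```

With the official Python driver, each chunk could be passed along the lines of `session.run(LOAD_QUERY, rows=chunk)`. The right batch size depends on your data and heap settings, so treat the 10 above purely as a placeholder.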