GDS and Causal Cluster Lack of Integration

stu_v_kerr · April 26, 2021, 4:48pm

Now that we have been using the Neo4j graph database for a few years and have integrated and are using the GDS library, we are moving to a causal cluster. But Neo4j has not integrated the GDS library to run on a cluster. What? GDS algorithms are allegedly parallelized, but they are unable to run on a cluster? I remember that being the entire point of using Hadoop and Spark: eliminate the transport of data across the network. Yet here we are again? Starting to look at TigerGraph.

When will this integration issue be fixed?

alicia_frame1 · June 14, 2022, 6:33pm

Hi @taffyb - I think you're conflating a few different topics here. Let me break them down and see if I can help.

For GDS & causal clusters: we introduced a cluster compatible deployment option for GDS in 2.0 - https://neo4j.com/docs/graph-data-science/current/production-deployment/causal-cluster/

From the perspective of the core database and fabric, we use logical sharding which means that by design, there shouldn't be too many cross shard edges. However, you can match entities across shards in order to allow cross shard joins. This developer guide gives a great overview: https://neo4j.com/developer/neo4j-fabric-sharding/

For running graph algorithms on sharded data, we support the ability to run an algorithm on an individual shard - https://neo4j.com/docs/graph-data-science/current/production-deployment/fabric/
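
As a rough illustration of the per-shard approach, here is a minimal Python sketch that only builds the Cypher text for projecting one shard into the GDS graph catalog and streaming PageRank over it. The names `fabric`, `shard0`, and `shardGraph` are hypothetical placeholders, and the `CALL { USE ... }` wrapper is an assumption based on the GDS 2.x Fabric documentation linked above, so check it against the version you run.

```python
import re

# Conservative check so database/graph names can be interpolated into Cypher safely.
_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def shard_queries(fabric_db, shard, graph_name="shardGraph"):
    """Build Cypher for running GDS against a single Fabric shard.

    All three names are hypothetical placeholders; nothing is sent to a database.
    Returns a (projection_query, pagerank_query) tuple of strings.
    """
    for name in (fabric_db, shard, graph_name):
        if not _NAME.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")

    target = f"{fabric_db}.{shard}"

    # Project every node and relationship of this one shard into the graph catalog.
    project = (
        "CALL {\n"
        f"  USE {target}\n"
        f"  CALL gds.graph.project('{graph_name}', '*', '*')\n"
        "  YIELD graphName\n"
        "  RETURN graphName\n"
        "}\n"
        "RETURN graphName"
    )

    # Stream PageRank scores from the projection that lives on that shard.
    page_rank = (
        "CALL {\n"
        f"  USE {target}\n"
        f"  CALL gds.pageRank.stream('{graph_name}')\n"
        "  YIELD nodeId, score\n"
        "  RETURN nodeId, score\n"
        "}\n"
        "RETURN nodeId, score"
    )
    return project, page_rank


if __name__ == "__main__":
    for query in shard_queries("fabric", "shard0"):
        print(query, end="\n\n")
```

Each string could then be passed to a driver session connected to the Fabric database. The algorithm only ever sees the data in the shard named in the `USE` clause, which is the limitation being discussed in this thread.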

As I'm sure you understand, many graph algorithms don't distribute very well (e.g. https://www.cs.bgu.ac.il/~elkinm/book.pdf). That's why libraries like Spark's GraphX/GraphFrames only ever offered a few algorithms. Knowing this limitation, our focus has been on developing compression techniques and highly performant algorithms that can operate in a scale-up context. We've successfully benchmarked our algorithms on graphs with hundreds of billions of nodes and relationships using widely available cloud compute instances.

taffyb · June 14, 2022, 7:53am

I appreciate that this topic is a little dated now, so I was wondering if there has been any change or progress on it. As I understand it, a limitation of the Neo4j sharding approach (please correct me if I am wrong) is that it is not possible to create cross-shard relationships. @alicia_frame1 am I correct that the proposed approach (link above) requires that the entire graph being analysed is in a single shard? How does this scale?
