Is Neo4j the right solution for a large aggregation problem?

We are evaluating Neo4j as a potential solution to generate summaries/aggregations from a large data set that is somewhat hierarchical (employer => employees => records).

The specific purpose is to generate aggregate descriptors across a loosely hierarchical data set with millions to tens of millions of entities and hundreds of millions to billions of properties. This data is fed into a machine learning classifier for training, so latency is not a huge concern.

Example of a query we'd like to run:

Select companies of criteria X, select all employees of those companies of criteria Y, select all records submitted by those employees of type Z, and generate min/mean/median/stddev/count aggregates on all numeric properties for the selected set of records.
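In Cypher, we imagine the query would look roughly like this (the labels, relationship types, and property names are placeholders, since we haven't modeled the data yet):

```cypher
// Placeholder model: (:Company)-[:EMPLOYS]->(:Employee)-[:SUBMITTED]->(:Record)
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX        // "criteria X" on companies
  AND e.role     = $criteriaY        // "criteria Y" on employees
  AND r.type     = $typeZ            // record type Z
RETURN min(r.value)                  AS minValue,
       avg(r.value)                  AS meanValue,
       percentileCont(r.value, 0.5)  AS medianValue,
       stDev(r.value)                AS stddevValue,
       count(r)                      AS recordCount
```

Only a single numeric property (r.value) is shown; each numeric property on :Record would need its own set of aggregate expressions.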

We would like to run as many as millions of unique queries like the one above (many permutations of similar queries). The results could be cached, stored, or computed live. For example, the aggregates would be broken down by gender, age, and state.
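For those breakdowns, we assume many of the permutations could be collapsed into a single grouped pass rather than issued one at a time, since Cypher treats non-aggregated RETURN items as grouping keys (property names again are placeholders):

```cypher
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX AND e.role = $criteriaY AND r.type = $typeZ
RETURN e.gender       AS gender,     // non-aggregated items act as grouping keys
       e.ageBand      AS ageBand,
       e.state        AS state,
       avg(r.value)   AS meanValue,
       stDev(r.value) AS stddevValue,
       count(r)       AS recordCount
```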

The other option is not to use a graph DB, but instead to write custom logic that generates many denormalized relational tables and, from those, pre-computes aggregates into an aggregate table. That would require an engineering effort and designing an API. Neo4j has a powerful query DSL and an ecosystem of tools, so it looks like a better fit if the performance is acceptable.

Sounds like it'd be a good fit. You have a defined entry point (the :Company nodes matching criteria X), and from there you traverse paths to the other nodes, collecting information along the way. That entry point is found with a fast index seek, and after that you only touch the relationships attached to the nodes on your path. As the dataset grows, Neo4j will scale better than an RDBMS joining across a bunch of many-to-many bridge tables in SQL.
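Assuming the anchor property is something like c.industry from your sketch, create an index (or uniqueness constraint) on it so that first lookup is a seek rather than a label scan, and check the plan with PROFILE:

```cypher
// Index the property used to find the entry-point companies
CREATE INDEX company_industry IF NOT EXISTS
FOR (c:Company) ON (c.industry);

// Then verify the plan actually starts with an index seek
PROFILE
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX AND e.role = $criteriaY AND r.type = $typeZ
RETURN count(r);
```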

Give it a try and let us know how it works out for you.