Is Neo4j the right solution for a large aggregation problem?

We are evaluating Neo4j as a potential solution to generate summaries/aggregations from a large data set that is somewhat hierarchical (employer => employees => records).

The specific purpose is to generate aggregate descriptors across a loosely hierarchical data set with millions to tens of millions of entities and hundreds of millions to billions of properties. This data is fed into a machine learning classifier for training, so latency is not a huge concern.

Example of a query we'd like to run:

Select companies of criteria X, select all employees of those companies of criteria Y, select all records submitted by those employees of type Z, and generate min/mean/median/stddev/count aggregates on all numeric properties for the selected set of records.
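In Cypher, we imagine the query would look roughly like this (the labels, relationship types, and property names are placeholders, since we haven't modeled the data yet):

```cypher
// Placeholder model: (:Company)-[:EMPLOYS]->(:Employee)-[:SUBMITTED]->(:Record)
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX        // "criteria X" on companies
  AND e.role     = $criteriaY        // "criteria Y" on employees
  AND r.type     = $typeZ            // record type Z
RETURN min(r.value)                  AS minValue,
       avg(r.value)                  AS meanValue,
       percentileCont(r.value, 0.5)  AS medianValue,
       stDev(r.value)                AS stddevValue,
       count(r)                      AS recordCount
```

Only a single numeric property (r.value) is shown; each numeric property on :Record would need its own set of aggregate expressions.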

We would like to run as many as millions of unique queries like the one above (many permutations of similar queries). The results could be cached, stored, or computed live. For example, the aggregates would be broken down by gender, age, and state.
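For those breakdowns, we assume many of the permutations could be collapsed into a single grouped pass rather than issued one at a time, since Cypher treats non-aggregated RETURN items as grouping keys (property names again are placeholders):

```cypher
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX AND e.role = $criteriaY AND r.type = $typeZ
RETURN e.gender       AS gender,     // non-aggregated items act as grouping keys
       e.ageBand      AS ageBand,
       e.state        AS state,
       avg(r.value)   AS meanValue,
       stDev(r.value) AS stddevValue,
       count(r)       AS recordCount
```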

The other option is not to use a graph DB, but instead to write custom logic that generates many denormalized relational tables and, from those, pre-computes aggregates into an aggregate table. That would require an engineering effort and designing an API. Neo4j has a powerful query DSL and an ecosystem of tools, so it looks like a better fit if the performance is acceptable.

Sounds like it'd be a good fit. You have a defined entry point (the :Company nodes matching criteria X), and from there you traverse paths to the other nodes, collecting information along the way. That entry point is found with a fast index seek, and after that you only touch the relationships attached to the nodes on your path. As the dataset grows, Neo4j will scale better than an RDBMS joining across a bunch of many-to-many bridge tables in SQL.
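Assuming the anchor property is something like c.industry from your sketch, create an index (or uniqueness constraint) on it so that first lookup is a seek rather than a label scan, and check the plan with PROFILE:

```cypher
// Index the property used to find the entry-point companies
CREATE INDEX company_industry IF NOT EXISTS
FOR (c:Company) ON (c.industry);

// Then verify the plan actually starts with an index seek
PROFILE
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX AND e.role = $criteriaY AND r.type = $typeZ
RETURN count(r);
```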

Give it a try and let us know how it works out for you.