We are evaluating Neo4j as a potential solution to generate summaries/aggregations from a large data set that is somewhat hierarchical (employer => employees => records).
The specific purpose is to generate aggregate descriptors across a loosely hierarchical data set with millions to tens of millions of entities and hundreds of millions to billions of properties. This data is fed into a machine learning classifier for training, so latency is not a major concern.
Example of a query we'd like to run:
Select companies of criteria X, select all employees of those companies of criteria Y, select all records submitted by those employees of type Z, and generate min/mean/median/stddev/count aggregates on all numeric properties for the selected set of records.
We would like to run as many as millions of unique queries like the one above (many permutations of similar queries). The results could be cached, stored, or computed live. For example, the aggregates would be broken out by gender, age, and state.
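A rough sketch of how the query above might look in Cypher, assuming a hypothetical schema: the labels (`Company`, `Employee`, `Record`), relationship types (`EMPLOYS`, `SUBMITTED`), and property names (`industry`, `role`, `type`, `value`, `gender`, `ageBand`, `state`) are all placeholders for whatever the real model would use:

```cypher
// Sketch only — all labels, relationship types, and property names
// are assumptions; substitute the actual schema.
MATCH (c:Company)-[:EMPLOYS]->(e:Employee)-[:SUBMITTED]->(r:Record)
WHERE c.industry = $criteriaX        // criteria X on companies
  AND e.role     = $criteriaY        // criteria Y on employees
  AND r.type     = $typeZ            // record type Z
RETURN e.gender                       AS gender,
       e.ageBand                      AS ageBand,
       e.state                        AS state,
       min(r.value)                   AS minValue,
       avg(r.value)                   AS meanValue,
       percentileCont(r.value, 0.5)   AS medianValue,
       stDev(r.value)                 AS stddevValue,
       count(r)                       AS recordCount
```

Cypher has no explicit GROUP BY; the non-aggregated expressions in the RETURN clause (gender, ageBand, state here) implicitly define the grouping keys. Note also that aggregating over "all numeric properties" generically, rather than a known property like `value`, would require either one query per property or dynamic property access (e.g. via APOC procedures).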
The other option is not to use a graph database, and instead to write custom logic that generates many denormalized relational tables and, from those, pre-computes aggregates into an aggregate table. This would require a significant engineering effort and designing an API. Neo4j has a powerful DSL and ecosystem of tools, so it looks like a better fit if performance is acceptable.