Graph-OLAP - An attempt to model an OLAP cube with Neo4j

projects

(Mcaradec) #1

As a former OLAP developer, and recent adopter of graph databases, I was curious to check how Neo4j could handle Online Analytical Processing structures, dedicated to reporting.

These concepts will be experimented by creating an OLAP cube, with the only use of Cypher language.

Check more on Github: https://github.com/michelcaradec/Graph-OLAP


(Michael Hunger) #2

Really well done. Did you consider posting this as an article too? E.g. to medium (freecodecamp or neo4j publications)?

Did you see: http://blog.bruggen.com/2014/03/using-neo4j-to-manage-and-calculate.html

There is the discussion of computing the values live if the runtime is fast enough vs. storing aggregates (of any kind) on the higher level tree nodes (update path to root for new entries).

You mention the problem with PERIODIC commit, did you check out apoc.periodic.iterate which should handle this nicely also in combination with LOAD CSV as driving statement returning the rows to be processed.

There are also a bunch of new aggregation functions in APOC for statistics incl. percentiles and similarity functions (cosine, euclidean).

This could be interesting too: http://blog.bruggen.com/2015/07/hierarchies-and-google-product-taxonomy.html

On that OLAP cube one could also implement facetted search for later drill down / aggregation of the other dimensions, by e.g. also returning a collect of the other dimension’s values as a map.

Would love to hear if there are any missing statistics aggregation or other functions/procedures in APOC that would make the OLAP work easier, or if we could wrap some of your approaches into a procedure, like “window” functions.


(Michael Hunger) #3

Some more comments:

you should also add other indexes like :Year(year) to make the load scripts fast.

For your months, you can also use a local MERGE like where the month is created uniquely within the scope of the year.

MERGE (y:Year {year: toInteger(row.Year)})
MERGE (m:Month {month: toInteger(row.Month)})-[:OF_YEAR]->(y)

You can combine the ON CREATE / ON MATCH for measures with SET m.xxx = coalesce(m.xxx,0) + toFloat(row.xxx)

For your updates of Measures, can’t you just MERGE on the measure-id as you did before and then use the expression above? Instead of the two FOREACH.

For some reason I couldn’t explain, existing measures were never detected, leading to systematic measure node creation (i.e. last part of the script was never executed).

Do you have a reproducible example for the above?

The aggregate approach is definitely something that could be encoded in a procedure, both for writing/updating and reading.
I wonder if it would make sense to have a more Human-readable bitmask? Like Aggregate_PT_X_LP with a letter combination for each level and the levels separated by, e.g. an _. We could also use-multi-labels for aggregates, .e.g. :Aggregate:1_Product:1_Time to indicate where the aggregation happened. I like that even more.

Also it would be interesting to see at which data sizes the performance of regular neo4j (enterprise with slotted runtime) for aggregation across a sub-tree drops off to have non-realtime answers.

I’d love to see on-the-fly generation of aggregates based on needs, perhaps even combined with TTL (like materialized view).
And the use of other aggregation functions like min/max/avg/stdev/percentiles. But also computation of other sales measures like reach etc.

What do you plan to use for visualization? There is https://github.com/jexp/spoon-neo4j for quick ones or Jupyter notebooks, or something like: https://github.com/softvis-research/jqa-dashboard (Which I plan to generalize as a “dashboard” - graph-app.)