Big data and time series modeling for high performance querying

Hello,

I am working on a modelling problem for a time series requirement. Just to be clear, the time series is a type of information we have to display, but is not our primary use-case.

In my use-case, a person is a member of a department, and a person has a list of pieces of information in their possession. To give an idea of scale, there can be thousands of people in a department, and a person can have hundreds of thousands of pieces of information.

As you can see, there are two relations going from the person to the information.

  • HAS_INFO is a shortcut relation to know if the person has the information RIGHT NOW
  • HAS_INFO_HISTORY indicates if the person has ever had this information

So from this model, I have all the data necessary to create a time series of the total information a person had at any time.

Furthermore, I need to be able to calculate the average amount of information per person for the same department, so that both series can be compared in the same chart.

Here is a sample of an expected time series

I understand this is a tricky problem, but do you think Neo4j is an appropriate database for such calculations?

Since it is really easy and performant to calculate this at a given point in time (with the HAS_INFO relation), we believe we should calculate those metrics periodically and store them in a separate data source for time series. However, we would like to avoid having multiple data sources, so I'm asking here whether someone has an idea that could help us achieve this goal.
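
For reference, the point-in-time calculation mentioned above might look like the following. This is only a sketch: the `Person` and `Department` labels, the `MEMBER_OF` relationship, the `name` property and the `$department` parameter are assumed names, since only `HAS_INFO` and `HAS_INFO_HISTORY` are named in this post.

```cypher
// Assumed schema (hypothetical names):
//   (:Person)-[:MEMBER_OF]->(:Department {name})
//   (:Person)-[:HAS_INFO]->(info)
MATCH (d:Department {name: $department})<-[:MEMBER_OF]-(p:Person)
// OPTIONAL MATCH keeps people who currently hold no information (their count is 0)
OPTIONAL MATCH (p)-[r:HAS_INFO]->()
WITH p, count(r) AS total
// Average of the per-person totals across the department
RETURN avg(total) AS avgInfoPerPerson
```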

Thank you very much for your help!

Hello @sunny.pelletier :slight_smile:

I understand this is a tricky problem, but do you think Neo4j is an appropriate database for such calculations?

Yes, Neo4j is appropriate :slight_smile:

Since it is really easy and performant to calculate this at a given point in time (with the HAS_INFO relation), we believe we should calculate those metrics periodically and store them in a separate data source for time series. However, we would like to avoid having multiple data sources, so I'm asking here whether someone has an idea that could help us achieve this goal.

You could store the result in a new node with the value and the date. That way, when you need the metrics, you only have to query these nodes. For example:

MATCH (n:Metric)
RETURN n.value AS value,
       n.date AS date
ORDER BY date ASC
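
A minimal sketch of how a periodic job might write such nodes, one per person per run. The `Metric` label and its `value` and `date` properties come from the query above; the `Person` label and the `HAS_METRIC` relationship are assumptions made for illustration.

```cypher
// Snapshot the current total for every person (hypothetical Person label)
MATCH (p:Person)
// count(r) is 0 for people with no HAS_INFO relationship
OPTIONAL MATCH (p)-[r:HAS_INFO]->()
WITH p, count(r) AS total
// Store the snapshot with today's date; HAS_METRIC is an assumed relationship type
CREATE (p)-[:HAS_METRIC]->(:Metric {value: total, date: date()})
```

Note that running this more than once on the same day would create duplicate snapshots, so the job would need to guard against that.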

Regards,
Cobra

Yes, that's exactly what I thought about.

But ideally we wouldn't want to return all the metrics for a date range.

For example, we would want to divide a period into a number x of chunks and compute an average for every chunk. This is what time series databases offer with really high performance and scalability, and I wonder whether Neo4j would be at all comparable.
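
The chunked averaging described above could be sketched in Cypher roughly as follows. It assumes the snapshots are stored as `Metric` nodes with a numeric `value` and a `date` property of type Date, and that `$from`, `$to` (dates) and `$chunks` (an integer) are passed as query parameters; those parameter names are made up for illustration.

```cypher
// Keep only the snapshots inside the requested period [$from, $to)
MATCH (m:Metric)
WHERE $from <= m.date < $to
WITH m,
     duration.inDays($from, $to).days    AS spanDays,
     duration.inDays($from, m.date).days AS offsetDays
// Integer division maps each snapshot to a chunk index in 0 .. $chunks - 1
WITH offsetDays * $chunks / spanDays AS chunk, avg(m.value) AS avgValue
RETURN chunk, avgValue
ORDER BY chunk ASC
```

This requires `$to` to be later than `$from`, otherwise `spanDays` is zero and the division fails. Chunks that contain no snapshot are simply absent from the result. An index on the `date` property would likely help the range filter, but how this compares with a dedicated time series database would still have to be measured.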

But I will run tests on my side for this case and compare the results with a time series database before making my decision.

The query you will have to write will be tricky but you should be able to achieve it :slight_smile:

Just wondering if there is any good news on whether Neo4j can suit this situation :smiley: