Big data and time series modeling for high performance querying

sunny_pelletier · August 13, 2020, 7:06pm

Hello,

I am working on a modelling problem for a time series requirement. Just to be clear, the time series is a type of information we have to display, but is not our primary use-case.

In my use case, a person is a member of a department, and each person possesses a list of pieces of information. To give an idea of scale, a department can have thousands of people, and a person can have hundreds of thousands of pieces of information.

As you can see, there are two relations going from the person to the information.

HAS_INFO is a shortcut relation indicating that the person has the information RIGHT NOW
HAS_INFO_HISTORY indicates whether the person has ever had this information

So from this model, I have all the data necessary to create a time series of the total information a person had at any time.

Furthermore, I need to be able to calculate the average information per person for the same department, to be able to compare both of those inside a graph.

Here is a sample of an expected time series

I understand this is a tricky problem, but do you think Neo4j is an appropriate database for such calculations?

Since this is easy and fast to calculate at a given point in time (using the HAS_INFO relation), we believe we should calculate those metrics periodically and store them in a separate time series data store. However, we would like to avoid having multiple data sources, so I'm asking here whether someone has an idea that could help us achieve this goal.
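
For reference, the point-in-time calculation mentioned above might look roughly like the sketch below. The labels Person, Department and Information, the MEMBER_OF relation, the name property and the $department parameter are assumptions made for illustration; only HAS_INFO comes from the model described here.

```cypher
// Assumed model: (:Person)-[:MEMBER_OF]->(:Department {name}) and
// (:Person)-[:HAS_INFO]->(:Information). Only HAS_INFO is from the post.
MATCH (d:Department {name: $department})<-[:MEMBER_OF]-(p:Person)
// OPTIONAL MATCH keeps people who currently hold no information (count = 0)
OPTIONAL MATCH (p)-[r:HAS_INFO]->(:Information)
WITH p, count(r) AS total
RETURN avg(total) AS avgPerPerson, count(p) AS people
```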

Thank you very much for your help!

Cobra · August 15, 2020, 7:23am

Hello @sunny_pelletier

I understand this is a tricky problem, but do you think Neo4j is an appropriate database for such calculations?

Yes, Neo4j is appropriate

Since it is really easy and performance wise to calculate this at a given point in time (with relation HAS_INFO), we believe we should calculate those metrics periodically and store them in another source of data for time series, but we would like to avoid having multiple data sources, so I'm asking here if someone has an idea that could help us better to achieve this goal.

You could store each result in a new node holding the value and the date. That way, when you need the metrics, you only have to query these nodes. For example:

MATCH (n:Metric)
RETURN n.value AS value,
       n.date AS date
ORDER BY date ASC
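
The periodic snapshot that fills these nodes could be sketched as follows. This is only an illustration: the HAS_METRIC relation and the Person and Information labels are assumed names, and the query would have to be run on a schedule by whatever job runner you already use.

```cypher
// Assumed names: Person, Information, HAS_METRIC (not from the original post).
// Writes one Metric node per person with the current HAS_INFO count.
MATCH (p:Person)
OPTIONAL MATCH (p)-[r:HAS_INFO]->(:Information)
WITH p, count(r) AS total
CREATE (p)-[:HAS_METRIC]->(:Metric {value: total, date: datetime()})
```

Linking each Metric to its Person keeps per-person lookups local to that person's relationships instead of scanning every Metric node.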

Regards,
Cobra

sunny_pelletier · August 15, 2020, 1:00pm

Yes, that's exactly what I thought about.

But ideally we wouldn't want to return all the metrics for a date range.

For example, we would want to divide a period into x chunks and compute an average for every chunk. This is what time series databases offer with very high performance and scalability, and I wonder whether Neo4j would be comparable.

But I will run tests on my side for this case and compare with a time series database before making my decision.

Cobra · August 15, 2020, 1:05pm

The query you will have to write will be tricky, but you should be able to achieve it.
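
As a rough sketch of what such a chunked average might look like, assuming each Metric node is linked to its person through a hypothetical HAS_METRIC relation, stores date as a DateTime, and that $personId, $from, $to and $chunks are supplied as parameters:

```cypher
// Assumed model: (:Person {id})-[:HAS_METRIC]->(:Metric {value, date}).
WITH datetime($from) AS rangeStart, datetime($to) AS rangeEnd,
     toInteger($chunks) AS chunks
// Chunk width in whole seconds (integer division). The range must span
// at least `chunks` seconds, otherwise width is 0 and the division below fails.
WITH rangeStart, rangeEnd,
     (rangeEnd.epochSeconds - rangeStart.epochSeconds) / chunks AS width
MATCH (:Person {id: $personId})-[:HAS_METRIC]->(m:Metric)
WHERE rangeStart <= m.date < rangeEnd
// Index of the chunk each metric falls into
WITH m, rangeStart, width,
     (m.date.epochSeconds - rangeStart.epochSeconds) / width AS chunk
WITH m, chunk,
     datetime({epochSeconds: rangeStart.epochSeconds + chunk * width}) AS chunkStart
RETURN chunk, chunkStart, avg(m.value) AS avgValue
ORDER BY chunk
```

Chunks that contain no metrics simply do not appear in the result, so the caller has to fill those gaps. Whether this performs comparably to a dedicated time series database would need to be measured on real data.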

slai1988 · November 11, 2020, 1:48pm

Just wondering if there is any good news on whether Neo4j can suit this situation.
