How do you handle timestamps and UIDs in Neo4j?

What I mean is, how do you know whether you should create a new node, with its own unique timestamp and UID, or not?

For example, I want to use Neo4j to look at process explorer data. It's obvious that processes will have parents and children. That's the easy part. But how do I represent that a specific process on one host started a specific child process on that same host, just now? Not yesterday, and not on some other host. Today, I only know that a host ran a process that started a child at least once in the past.

Another example: how do I know that a single process on one host produced the same command line 10 times on that same host? Today, I only see one Process node related to one command line node. I need to know that all 10 happened, and when.

This is really difficult, and if I use each event's timestamp or UID, the number of nodes explodes.

Is this just not a good use case for Graph DBs, or can they handle this?

Hello Nodey!

Unfortunately, I would say the short answer is that it depends. High-velocity event data like this generates a lot of events, and the cost of storing them could get high fast. A few ways of tackling this would be to keep events only for a certain amount of time, to archive events that are no longer interesting, or to filter so that you record only the events you are interested in.

Not knowing your schema, I would assume that no data is annotated with host data since that is a networking concept and irrelevant to some processes. The best way to record host data would be to annotate your source with the appropriate data.

If you are working with many timestamps, consider using a time-based index and the native Neo4j temporal types.
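To make that last suggestion a bit more concrete, here is a minimal sketch in Python that holds the Cypher as plain strings. The `Event` label, the `timestamp` property, and the index name `event_timestamp` are assumptions for illustration, not your schema. As far as I know there is no index type literally called a "time index"; what I mean is an ordinary index on a property that stores a native `datetime` value.

```python
from datetime import datetime, timezone

# Hypothetical schema: an Event node with a `timestamp` property.
# An ordinary property index; the label and names are assumptions.
CREATE_INDEX = (
    "CREATE INDEX event_timestamp IF NOT EXISTS "
    "FOR (e:Event) ON (e.timestamp)"
)

# datetime($ts) parses an ISO 8601 string into a native temporal value,
# so range comparisons work on real datetimes rather than on strings.
CREATE_EVENT = "CREATE (e:Event {timestamp: datetime($ts)})"


def to_iso_utc(epoch_seconds: float) -> str:
    """Convert a Unix epoch timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Parameters you would pass alongside CREATE_EVENT.
params = {"ts": to_iso_utc(0)}
print(params["ts"])  # 1970-01-01T00:00:00+00:00
```

Passing the timestamp as a parameter rather than pasting it into the query text also lets the database reuse the query plan across events.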

Ok, that makes sense to me.

Are there any articles out there that discuss performance and cost based on the number of nodes? I really don't see a way around the timestamp/UID exploding the number of nodes at this point.

It's almost like graph DBs need a better way of handling time, that doesn't create a ton of new nodes per unique timestamp. Or, some type of additional data saved between nodes that remembers what the timestamp was for a specific node -> node relationship.

So that I could query something like "(Host:Win10David) -STARTED-> (Process:Malware.exe) BETWEEN -1d AND now", and feel confident that I'm only getting the relevant nodes and relationships from that time.
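In real Cypher, I imagine that wish would look roughly like the sketch below. It assumes the `STARTED` relationship carries a timestamp property, which I've called `at`, and that hosts and processes have a `name` property; all of those names are my guesses, and the fixed "now" is an arbitrary value so the example is repeatable.

```python
from datetime import datetime, timedelta, timezone

# Assumed schema: (:Host {name})-[:STARTED {at: datetime}]->(:Process {name}).
STARTED_IN_WINDOW = """
MATCH (h:Host {name: $host})-[s:STARTED]->(p:Process {name: $process})
WHERE s.at >= datetime($since) AND s.at <= datetime($until)
RETURN h, s, p
ORDER BY s.at
"""


def window_params(host, process, now, lookback):
    """Build query parameters for the window [now - lookback, now]."""
    return {
        "host": host,
        "process": process,
        "since": (now - lookback).isoformat(),
        "until": now.isoformat(),
    }


# Arbitrary fixed "now" for illustration; in practice use datetime.now(timezone.utc).
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
params = window_params("Win10David", "Malware.exe", now, timedelta(days=1))
print(params["since"])  # 2024-01-01T12:00:00+00:00
```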

And just for context: this data is coming from a time-based index already. I was hoping to use the power of Cypher queries on the data to draw new conclusions about relationships, where SQL joins are too massive, non-performant, or simply confusing. I'm hamstrung unless I can accurately represent unique instances of nodes based on timestamp or UID, though.

I'm also brand new to this, so let me know if I'm not understanding the terminology! For example, you mention a time-based index, but I don't see that as a choice in Neo4j: https://neo4j.com/docs/cypher-manual/current/indexes-for-search-performance/

Edit: Can time be an extra attribute on the relationships themselves? That would work, I think. Then there could be multiple relationships from one specific node to another specific node. You wouldn't need to store extra nodes, just the dates on the relationships between them, and the individual events could be reconstructed later when queried based on time.
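To convince myself the idea holds up, here is a tiny in-memory sketch in plain Python (not Neo4j itself). The names `record_start` and `started_between` and the property `at` are made up for illustration. Ten start events between the same host and process leave two nodes and ten timestamped relationships, and a time window picks out only the matching ones.

```python
from datetime import datetime, timedelta, timezone

nodes = set()        # unique (label, name) pairs
relationships = []   # (start_node, type, end_node, properties)


def record_start(host, process, at):
    """Reuse the two nodes, but add one new relationship per event."""
    start, end = ("Host", host), ("Process", process)
    nodes.add(start)  # adding to a set is a no-op if the node exists
    nodes.add(end)
    relationships.append((start, "STARTED", end, {"at": at}))


def started_between(since, until):
    """Return STARTED relationships whose timestamp is in [since, until]."""
    return [
        r for r in relationships
        if r[1] == "STARTED" and since <= r[3]["at"] <= until
    ]


# Ten hypothetical events, one per hour, from an arbitrary base time.
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
for hour in range(10):
    record_start("Win10David", "Malware.exe", base + timedelta(hours=hour))

recent = started_between(base + timedelta(hours=7), base + timedelta(hours=9))
print(len(nodes), len(relationships), len(recent))  # 2 10 3
```

In Cypher the same idea would be `MERGE` for the two nodes and `CREATE` for each relationship, since `CREATE` always adds a new relationship even when one already exists between the same pair.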

This still wouldn't save you if you need to save nodes based on their PIDs or UIDs, but it would work for everything else, like a file name.