Neo4j Use Cases

Just trying to understand appropriate use cases for Neo4j. Any general comments about the kinds of use cases that are, or are not, good for Neo4j would be helpful.

Is Neo4j intended to be used as an operational data store? Would it be common to sync a graph in Neo4j with an operational data store on a batch or near real-time basis? Or is it intended to be used more tactically as an analytical platform, where you load a specific data set of interest, run some queries and analytics, and then decommission that particular graph? All are of interest, but it would be helpful to understand the intent of the Neo4j designers and avoid doing things that are really not intended or well supported. No product is the right answer for everything...

Here are a couple of scenarios we have. If these are not anti-patterns for Neo4j in the first place, what would generally be the best way to accomplish these operations in Neo4j?

#1 - Sync with operational data store in batch
We have processes that run overnight that generate sizable batches of data representing new objects and relationships.

For example, a batch of committed orders for existing customers might be generated. I think that would require creating new Committed Order nodes and linking them to existing Customer nodes. For the sake of the example, let's say there could be 10,000 orders in a batch that may correlate to 3,000+ customers.

Obviously another related scenario would be orders for new customers in which case the Customer nodes would have to be created first and then linked to the orders.
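For concreteness, here is the kind of statement I imagine we would run for such a batch (the labels, property names, and parameter shape are just my guesses):

```cypher
// $orders is assumed to be a list of maps like
// {orderId: "O-1001", customerId: "C-42", total: 99.50}
UNWIND $orders AS row
// MERGE covers both cases: it matches an existing Customer
// or creates the node if it does not exist yet
MERGE (c:Customer {customerId: row.customerId})
CREATE (o:CommittedOrder {orderId: row.orderId, total: row.total})
CREATE (c)-[:HAS_ORDER]->(o)
```

I assume an index or uniqueness constraint on :Customer(customerId) would be needed to keep the MERGE lookups fast.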

#2 - Sync with operational data store in near real time
We have transactions that produce streams of events that might cause creation of new nodes or relationships. For the sake of understanding Neo4j, let's say the scenario is 3,000 events per minute each requiring the creation of a new node and a relationship between that new node and an existing node.
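Concretely, I picture each event turning into a small parameterized write like this (property names made up for illustration):

```cypher
// One write per event: look up the existing node,
// then create the new node and the connecting relationship
MATCH (c:Customer {customerId: $customerId})
CREATE (e:Event {eventId: $eventId, receivedAt: datetime()})
CREATE (c)-[:GENERATED]->(e)
```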

Again, aside from saying whether these are appropriate uses of Neo4j, please comment on good approaches for each so we can try them out as experiments. Thanks!

Hey, welcome to the community! I wanted to go ahead and say that the commonly accepted wisdom is that Neo4j isn't great for pure time-series data, with the caveat that this applies when you're not interested in adding additional context to the data. There are some good examples of people mapping out time-and-space problems with Neo4j, but they're the exceptions rather than the rule. Your proposed scenarios, though, seem like a perfect fit for Neo4j. Creating customer nodes, product nodes, relationships between them, etc. is very much in the Neo4j wheelhouse.

I would go for the near real-time scenario, but it depends on the use case of your database. If the application that sits on Neo4j is intended to provide (near) real-time information to users, that is the best fit. It can be implemented easily: for example, the Kafka connector can be hooked into your relational database on one side, and on the other side you can sink the messages and create nodes in Neo4j. If you already have Kafka streams for orders, then it is even easier.
If you want to implement a solution where there is heavy computation before the data reaches Neo4j, and the data is not changing rapidly, then you can consider the #1 (batch) approach. For example, if you are implementing a recommendation system with a complex similarity computation, it is possible that batch processing is more suitable. If the nature of the data, such as user similarity or communities in the graph, will not change significantly day by day, you can run those computations (PageRank, community detection algorithms, etc.) daily. But it depends on the use case and the nature of your data.

Thank you for those thoughts.

Does Neo4j perform well when creating large numbers of relationships between existing nodes? Let's say you want to create relationships in both directions between each Customer and their Orders. Would you suggest something like the expression below, or is there some better way to accomplish this, particularly when the number of Customers and Orders is very large?

MATCH (c:Customer), (o:Order)
WHERE o.customerId = c.customerId
CREATE (c)-[:HAS_ORDER]->(o),
       (o)-[:ORDER_OF]->(c)

"Very large" is relative. Neo4j can handle large, complex graphs, and the data model should support your use case, which means you have to know what questions your graph should answer. In general a large number of relationships shouldn't be a problem, but super nodes with a massive number of relationships can become a problem when you try to do traversals.
You can check this discussion and the stackoverflow link there:

About your "data model":
I don't recommend using this type of data model: one relationship implies the other.
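To illustrate: a single HAS_ORDER relationship can be traversed in either direction, so the inverse ORDER_OF relationship adds no information:

```cypher
// Orders of a customer (forward traversal)
MATCH (c:Customer {customerId: $id})-[:HAS_ORDER]->(o:Order)
RETURN o;

// Customer of an order (same relationship, traversed backwards)
MATCH (o:Order {orderId: $orderId})<-[:HAS_ORDER]-(c:Customer)
RETURN c;
```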
You can see good examples of this in this blog post:

Also check this out: How Do You Know If a Graph Database Solves the Problem? | by Jennifer Reif | Neo4j Developer Blog | Medium

I take your point on a relationship implying the inverse relationship. Thank you for that.

By "very large" I am thinking tens to hundreds of millions of nodes, each having tens to hundreds of relationships to other nodes (which obviously implies that there are billions of those other nodes).

In my experience, calculating (a CASE statement with a few clauses) and setting a new property on 60 million nodes takes hours. The same goes for creating relationships between a similar number of (existing) nodes. Example:

MATCH (c:Customer)
SET c.status = CASE
  WHEN c.spend >= 0 AND c.spend <= 50 THEN "BRONZE"
  WHEN c.spend >= 51 AND c.spend <= 250 THEN "SILVER"
  WHEN c.spend >= 251 AND c.spend <= 1000 THEN "GOLD"
  ELSE "PLATINUM"
END;

For 60 million nodes, with 32GB heap allocated to Neo4j, this operation ran out of heap and failed after hours.

Is there a better way (aside from allocating a lot more heap)?

You need to batch the requests. The best way to do it in Cypher is to use APOC.

CALL apoc.periodic.iterate(
"MATCH (c:Customer) RETURN c",
"SET c.status = CASE
  WHEN c.spend >= 0 AND c.spend <= 50 THEN 'BRONZE'
  WHEN c.spend >= 51 AND c.spend <= 250 THEN 'SILVER'
  WHEN c.spend >= 251 AND c.spend <= 1000 then 'GOLD'
  ELSE 'PLATINUM'", {batchSize:10000, parallel:true})

This will execute the SET statement in batches of 10,000 records at a time, so you don't need to increase the heap.
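As a side note, if you are on Neo4j 4.4 or newer, the same batching is possible without APOC using a CALL subquery with IN TRANSACTIONS (run it as an auto-commit query, e.g. prefixed with :auto in the Browser):

```cypher
MATCH (c:Customer)
CALL {
  WITH c
  SET c.status = CASE
    WHEN c.spend >= 0 AND c.spend <= 50 THEN 'BRONZE'
    WHEN c.spend >= 51 AND c.spend <= 250 THEN 'SILVER'
    WHEN c.spend >= 251 AND c.spend <= 1000 THEN 'GOLD'
    ELSE 'PLATINUM' END
} IN TRANSACTIONS OF 10000 ROWS
```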

Thank you @anthapu, I am trying this approach out. Makes lots of sense.

I note that when I run the above in cypher-shell, it returns immediately. It doesn't seem to matter whether I use the parallel config param or not. After running the command, I don't see any queries in process. It's hard to tell whether it errored out, is running in the background, or what.

Is there any way to get logging or debug output for apoc.periodic.iterate calls? And how can you even tell when the operation is complete?

FWIW, I do not see the status property added to any Customer at all based on running match (c:Customer) where exists(c.status) return c limit 1;

Sorry, there is a mistake in the code: the CASE needs an END for it to work. I'm assuming you have added it.

CALL apoc.periodic.iterate(
"MATCH (c:Customer) RETURN c",
"SET c.status = CASE
  WHEN c.spend >= 0 AND c.spend <= 50 THEN 'BRONZE'
  WHEN c.spend >= 51 AND c.spend <= 250 THEN 'SILVER'
  WHEN c.spend >= 251 AND c.spend <= 1000 THEN 'GOLD'
  ELSE 'PLATINUM' END", {batchSize:10000, parallel:true})

apoc.periodic.iterate will only give results at the end.

You should see an output like this

batches:              1
total:                3
timeTaken:            0
committedOperations:  3
failedOperations:     0
failedBatches:        0
retries:              0
errorMessages:        {}
batch:                {total: 1, committed: 1, failed: 0, errors: {}}
operations:           {total: 3, committed: 3, failed: 0, errors: {}}
wasTerminated:        false
failedParams:         {}

Even if there is an error it should return that error immediately.

While it is running, you can run ":queries" in the Browser to see if it is still going.

You should see two queries: one for the driving MATCH and one for the SET statement.

Hi Mike,

Just curious about your statement "I wanted to go ahead and say that most of the time the commonly accepted wisdom is that Neo4j isn't great for time series data". I'm just getting my feet wet with graph dbs in general. One of my potential use cases involves a lot of timestamps so I'm a little concerned. Can you expand your thoughts on this?

Here's a basic description of the kind of project I'm considering.
Right now I suppose the project could most easily be described as a system monitoring tool (yes, I know there are bazillions of them out there). More specifically, we're looking to monitor specific processes on each client across the network and take action on the data.

So as best as I can figure I'll have the following basic nodes:
:Person
:Computer
:Application
:ApplicationSession

The first three are simple enough, but I'm not sure how to map the sessions because each Person/Computer/Application will be related to many, many sessions (maybe 100,000s per year).

I have a good idea how I would approach the problem with a traditional RDBMS but am new to graph DBs. I'm not clear on the best ways to handle many instances of the same relationship (I think that's a good way to say it). I'm hoping to gain greater insight into the data by doing the analysis with a graph DB.
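If it helps, here is roughly how I picture one session in Cypher (labels and property names are just my guesses at this point):

```cypher
// Each session becomes its own node with timestamps,
// linked to the person, computer, and application involved
MATCH (p:Person {id: $personId}),
      (k:Computer {id: $computerId}),
      (a:Application {id: $appId})
CREATE (s:ApplicationSession {started: datetime($start), ended: datetime($end)}),
       (p)-[:STARTED]->(s),
       (s)-[:ON]->(k),
       (s)-[:OF]->(a)
```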

Hopefully that gives you a good general idea about my project.

The caveat being that if you need extra context and relationship tracking, then you can make it work. But for plain time-series data there are better options. That seems to be the consensus.