How do you know when the application you’re building is a good fit for a graph database?
Graph databases aren’t the right tool for every job. In this second part of the Will It Graph blog series, we’re going to show you some examples of good and poor fits for graph databases, how to identify a graph-shaped problem, and how the graph-native architecture helps solve graphy problems.
Last time, we discussed how graph databases work under the hood and how they’re different from relational databases. (Or listen to the GraphStuff.FM podcast!)
I think at a high level, graph databases are a good fit when we have the equivalent of lots of JOINs in our typical workload in relational databases. In this episode, we’re focusing on transactional use cases, so we’ll ignore the analytical use cases.
The canonical example that I always think of is the concept of personalized recommendations. If a customer is browsing an e-commerce store, we want to show them personalized recommendations based on their shopping history, the item that they’re currently looking at, etc.
One of the things I can do is traverse my graph of orders, users, and products to see people who bought this thing that the customer is currently looking at — and what other things those users bought — which might be a good recommendation for the current user.
Traversal through the graph is very performant in a graph database, whereas if I look at the massive order and product database in a relational database, it might need some very expensive JOINs to do that multi-hop traversal.
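To make the "people who bought this also bought" traversal concrete, here is a minimal sketch in plain Python. It is not a graph database and says nothing about performance; it only shows the shape of the two-hop walk from a product, to its other buyers, to their other products. The `purchases` data and the function names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: which user bought which products.
purchases = {
    "alice": {"tent", "stove", "lantern"},
    "bob": {"tent", "sleeping_bag", "stove"},
    "carol": {"tent", "stove"},
    "dave": {"kettle"},
    "erin": {"tent", "lantern"},
}


def build_buyers(purchases):
    """Reverse the edges: map each product to the users who bought it."""
    buyers = {}
    for user, products in purchases.items():
        for product in products:
            buyers.setdefault(product, set()).add(user)
    return buyers


def recommend(product, current_user, purchases, buyers, limit=3):
    """Two hops: product -> other buyers -> their other products."""
    already_owned = purchases.get(current_user, set())
    counts = Counter()
    for user in buyers.get(product, set()):
        if user == current_user:
            continue
        for other in purchases[user]:
            if other != product and other not in already_owned:
                counts[other] += 1
    # Most frequently co-purchased first; ties broken alphabetically.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return ranked[:limit]


buyers = build_buyers(purchases)
print(recommend("tent", "carol", purchases, buyers))
# ['lantern', 'sleeping_bag']
```

Ranking by raw co-purchase count is just one simple choice made for this sketch; a real recommendation query would likely weigh recency, stock levels, or promotions as well.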
Typically, those have been overnight batch processes — that’s how a lot of retail companies have dealt with the challenge around slow queries. You run this process overnight. You get some data. Those are your recommendations. That’s what you use the following business day.
But obviously, the big challenge behind that is you’re potentially dealing with stale data. What happens when products go out of stock, for example? The real game changer is to be able to do it in real time. You can adapt according to any promotions you have, or do some real-time dynamic pricing, and so forth.
Another really powerful one is fraud detection. We touched on this idea of looking for patterns. A great example of this would be looking for fraud rings. Think about retail banking, where typically you would have an account customer who holds a number of products with you: maybe a bank account, a loan, or credit cards. What do you do to look out for warning signs if somebody’s looking to request a loan? How do you know if it’s a typical situation or if it’s out of the blue?
To start looking for patterns of fraud rings, you take a step back and have a look at things such as details being recycled, social security numbers being recycled — pretty straightforward. What if we step back a little bit further to understand, for example, whether the same phone number for a landline for a house is being used by two different people. There’s nothing unusual there, but if you then keep traversing through the graph, you discover that the same phone number is being split across different properties, so that might trigger alarm bells.
There are lots of different things that you can start to piece together by being able to look for certain patterns. In the fraud example, you wouldn’t expect to find long chains of connections from a retail customer. You’d expect to find a fairly small, sparse, star-shaped graph.
When you start to see a long line of connections, you can query it as a pattern that may bring back something that you need to investigate further. You can start to think of it from a real-time perspective as people putting applications in, or as they’re trying to set up a new account. You can have a look at the data you have in your existing graph compared to the new data coming in to see if there’s something that needs to be investigated further.
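As a rough illustration of the "long line of connections" idea, the sketch below links people to the identifiers they supplied and measures how far a chain of shared identifiers stretches from a new applicant. Everything here is an assumption made for the example: the data, the `person:`/`phone:` naming scheme, and especially the `max_hops=2` threshold, which is an arbitrary cutoff rather than a recommended value.

```python
from collections import deque

# Hypothetical edges between people and the identifiers they supplied.
edges = [
    ("person:ann", "phone:555-0100"),
    ("person:ben", "phone:555-0100"),
    ("person:ben", "address:12 Elm St"),
    ("person:cat", "address:12 Elm St"),
    ("person:cat", "ssn:000-00-0001"),
    ("person:dan", "ssn:000-00-0001"),
    ("person:eve", "phone:555-0199"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def longest_chain(graph, start):
    """Breadth-first search; returns the hop count to the farthest reachable node."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return max(depth.values())


def needs_review(graph, applicant, max_hops=2):
    """Flag applicants whose connections reach further than max_hops (assumed cutoff)."""
    return longest_chain(graph, applicant) > max_hops


graph = build_graph(edges)
print(longest_chain(graph, "person:eve"), needs_review(graph, "person:eve"))
# 1 False  (a small star: one person, one phone number)
print(longest_chain(graph, "person:ann"), needs_review(graph, "person:ann"))
# 6 True   (ann -> phone -> ben -> address -> cat -> ssn -> dan)
```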
Focusing again on this concept of real-time query performance, fraud detection, just like personalized recommendations, is a great example where we need to have the results immediately. Is this a fraudulent transaction? Yes or no? The answer should be in milliseconds, because when someone swipes their credit card, they don’t want to be standing there waiting multiple minutes for their credit card transaction to be approved.
There’s a very small window of time where the bank, or the credit card processing company, needs to go through this process, just like in the personalized recommendations use case where I need to be able to serve those personalized recommendations to the user in real time, as they’re browsing the website. They’re not going to wait two minutes for me to go fetch and find relevant recommendations as they’re browsing my product catalog; I need to be able to show those in milliseconds to the user.
Another really graphy use case would be network and IT management. It’s all about the devices you’ve got in your network: servers, routers, load balancers, all the applications you’re running, virtual machines, etc.
If you have an outage or a series of events happening within your network that may be triggering an outage, you’re not necessarily going to know in advance that a certain server’s going to go out. You want to be able to determine what the impact is going to be.
Your network can have varied levels of depth. You’re not necessarily going to know how far you need to traverse into your network to find impacted resources. But if you can do this easily in real time, you can start to trigger services to deal with that.
If you start to see things happening in your network, you can do real-time root cause analysis — then avert or redirect resources. Eventually, you would even be able to predict outages. So that’s a really powerful graphy real-time application that would be quite hard to do in a relational database.
Now that we’ve talked through a few examples where graph databases make sense, we can start to see some general themes to help us identify if we have a graph-shaped problem to work with.
At a high level, any time that we’re trying to understand how our different entities are connected to each other, where the relationships in the data are just as important as the entities in the data, graph databases really offer an advantage.
In personalized recommendations, this is traversing the orders and users. The connections in the data are important to answer my question: What products would the current user looking at this product be interested in based on the orders of the other users who bought the same product?
Another aspect is we don’t necessarily know how many connections we’re interested in at query time. This is the concept of a variable length graph traversal. In the network management use case, let’s say we have a service that goes down and we want to know what are all of the applications that are somehow dependent on this service. There may be nested dependencies in the graph that represents the dependencies for this service and the different applications, maybe through different data providers or dependencies on applications that depend on other services.
The final applications and products that may be impacted on my site could be directly connected to the service that’s going down, or they may be the dependency of a dependency of this service. And I want to just traverse that piece of the graph connected to this service that is impacted at multiple depths. A graph database is going to be really efficient at finding all of those downstream impacted applications, where I don’t know ahead of time how deep, but I just want to know all of the children dependencies of this service.
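The variable-length traversal described above can be sketched as a breadth-first walk that keeps going until there is nothing left to visit, without knowing the depth ahead of time. The `dependents` map and the service names below are hypothetical, and this is plain Python rather than a query against a real graph database.

```python
from collections import deque

# Hypothetical dependency data: each key is depended on by the listed components.
dependents = {
    "auth-service": ["orders-api", "profile-api"],
    "orders-api": ["checkout-app"],
    "profile-api": ["checkout-app", "account-app"],
    "checkout-app": [],
    "account-app": [],
    "search-service": ["catalog-app"],
}


def impacted_by(failed, dependents):
    """Return everything downstream of `failed`, at any depth."""
    seen = {failed}  # also guards against cycles in the dependency data
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - {failed}


print(sorted(impacted_by("auth-service", dependents)))
# ['account-app', 'checkout-app', 'orders-api', 'profile-api']
```

Note that `checkout-app` is reached by two different paths but reported once, and `catalog-app` is never touched because it is not connected to the failing service.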
There’s also this idea of finding the pattern. In the fraud detection example, we’re looking for suspicious patterns in the graph that might represent a fraud ring. If we see multiple accounts sharing a social security number or an address, and we see suspicious transactions connected to those accounts, it might be a fraud ring. We probably wouldn’t have a specific starting point for our graph traversal. Rather, we’re looking for the bigger patterns. This is another case where the graph databases can be extremely helpful.
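Here is one way the "no specific starting point" pattern search might look in miniature: scan all accounts, group them by shared identifier, and keep only the groups that also touch a flagged transaction. The account records, field names, and the `min_size=2` cutoff are all assumptions for this sketch, not a description of how any particular fraud system works.

```python
from collections import defaultdict

# Hypothetical account records; flagged_txns counts suspicious transactions.
accounts = [
    {"id": "acc-1", "ssn": "000-00-0001", "address": "12 Elm St", "flagged_txns": 2},
    {"id": "acc-2", "ssn": "000-00-0001", "address": "9 Oak Ave", "flagged_txns": 0},
    {"id": "acc-3", "ssn": "000-00-0003", "address": "9 Oak Ave", "flagged_txns": 0},
    {"id": "acc-4", "ssn": "000-00-0004", "address": "77 Pine Rd", "flagged_txns": 1},
]


def suspicious_groups(accounts, keys=("ssn", "address"), min_size=2):
    """Find identifiers shared by several accounts, at least one of them flagged."""
    groups = defaultdict(set)
    for account in accounts:
        for key in keys:
            groups[(key, account[key])].add(account["id"])
    flagged = {a["id"] for a in accounts if a["flagged_txns"] > 0}
    return {
        identifier: ids
        for identifier, ids in groups.items()
        if len(ids) >= min_size and ids & flagged
    }


print(suspicious_groups(accounts))
# {('ssn', '000-00-0001'): {'acc-1', 'acc-2'}}  (set element order may vary)
```

The shared address at 9 Oak Ave is not reported because neither account there has a flagged transaction; sharing alone is not treated as suspicious in this sketch.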
We talked about this idea of index-free adjacency in a graph database in our previous blog post. In short, it means that we’re not using an index at query time to traverse relationships in the graph. But that doesn’t mean that we don’t use indexes at all in a graph database.
There is still a place for using indexes in a graph database. Rather than using them to find the nodes that are connected, a graph database typically uses an index to find the starting point for the traversal in our graph.
For example, we may create an index on the unique ID for a node. Going back to our person example, if we’re using their social security number as the unique identifier for the person node, we may create an index on social security numbers so that when we’re looking up an operation, we can quickly find the node with that social security number using an index. But once we start traversing out to maybe the address or whatever other pieces of our graph that we’re interested in, we’re not using the index to see where those relationships exist.
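The division of labor between the index and the traversal can be pictured with a toy model. In the sketch below, a dictionary plays the role of the index on social security numbers, and each node keeps direct references to its neighbors. The `Node` class, `ssn_index`, and the `LIVES_AT` relationship name are invented for illustration and do not reflect how any real graph database lays out its storage.

```python
class Node:
    """A toy node that holds direct references to its neighbors."""

    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties
        self.relationships = []  # (relationship_type, target_node) pairs

    def relate(self, rel_type, target):
        self.relationships.append((rel_type, target))


# Hypothetical index: social security number -> Person node.
ssn_index = {}


def add_person(name, ssn):
    person = Node("Person", name=name, ssn=ssn)
    ssn_index[ssn] = person
    return person


def streets_for(ssn):
    # Step 1: one index lookup finds the starting node.
    person = ssn_index[ssn]
    # Step 2: follow references stored on the node; no index involved.
    return [
        target.properties["street"]
        for rel_type, target in person.relationships
        if rel_type == "LIVES_AT"
    ]


ann = add_person("Ann", "000-00-0001")
ann.relate("LIVES_AT", Node("Address", street="12 Elm St"))
print(streets_for("000-00-0001"))
# ['12 Elm St']
```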
We’ve talked a bit about the good uses for graph databases, but there are other situations where the performance of graph databases may not be as good as you might hope.
There’s one example I always think about from colleagues of ours, Max & Ozzie, who always do this talk about Hollywood actors’ heights. There’s a list of the really tall actors down to the not-so-tall actors. This is a really great way to talk about the various strengths and weaknesses of different databases as well.
For example, things that are really great fits for graph databases would be things around who knows who, or which actors are friends with which other actors. If we’re trying to find new co-actors for the latest movie that’s going to be recorded, we could leverage those connections, like who’s worked with whom previously and what works well, and then try to generate some recommendations from that. Or if we want to figure out what films we should watch based on actors that we’ve liked, that’s another good fit, too. So this is all about using those connections and understanding the relationships between data.
But there are also other instances where the performance of a graph database will be less than great for the query. If you want to ask questions such as what is the average height or average salary of the actors, you can still run these queries in a graph database, but it wouldn’t be as performant as you’d expect from a relational database.
Why is that the case? A graph database has nodes and relationships, which can have labels as well as properties stored as key-value pairs, and pointers that point to where they connect. What happens is when you run a query and you want to bring back the properties of a node, the engine will go away and pick up all the nodes that are related to that query. And then, for each node, it has to go away and look up in the store the property that has been requested.
So in this example of average heights, we would bring back all of the nodes (all of the actors) for our query, and then for each node we would go off and do a lookup in the store to bring back the value for the property. Once we have gathered all of those up, we would then perform the operation. This is quite different from how a tabular database works, where all it needs to do is pull up that single table, which would have all of the actors’ heights, and then run down the column to do the aggregation or calculation required. This means that you are going to have a slower query doing that in a graph database than in a relational database.
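The difference in access pattern can be sketched with two small functions. The first mimics the description above: visit every actor node, then do a separate lookup in a property store for each one. The second just runs down a single column. The store layout, the `props_ref` field, and the placeholder actor names and heights are all invented for this example, and the code illustrates the shape of the work rather than measuring real database performance.

```python
from statistics import mean

# Hypothetical node store: each node points at a separate property record.
node_store = {
    1: {"label": "Actor", "props_ref": 101},
    2: {"label": "Actor", "props_ref": 102},
    3: {"label": "Actor", "props_ref": 103},
    4: {"label": "Movie", "props_ref": 104},
}
property_store = {
    101: {"name": "Actor A", "height_cm": 180},
    102: {"name": "Actor B", "height_cm": 170},
    103: {"name": "Actor C", "height_cm": 190},
    104: {"title": "Movie X"},
}


def average_height_per_node(node_store, property_store):
    """Gather the nodes, then do one property lookup per node."""
    heights = []
    for node in node_store.values():
        if node["label"] != "Actor":
            continue
        record = property_store[node["props_ref"]]  # extra hop for every actor
        heights.append(record["height_cm"])
    return mean(heights)


# Tabular style: the heights already sit together in one column.
height_column = [180, 170, 190]


def average_height_column(column):
    """Run straight down the column and aggregate."""
    return mean(column)


print(average_height_per_node(node_store, property_store))  # 180
print(average_height_column(height_column))                 # 180
```

Both functions return the same answer; the point is only that the first does an extra lookup for every actor before it can aggregate.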
But does this mean we should never do averages on a graph database? Not at all. This is just a reminder about how you’re using the database. Next week, we’ll move on to discuss using a graph database as a general purpose database.
Don’t miss Part 3 of this Will It Graph? series coming up next!
Will It Graph? Identifying A Good Fit For Graph Databases — Part 2 was originally published in Neo4j Developer Blog on Medium.