Graph Modeling: Labels

This article is the latest in a series on advanced graph modeling; the other articles in the series deal with keys, relationships, super nodes, and categorical variables.

Today we’re going to talk about labels in Neo4j: what they are, how to use them, how they get abused, and how to avoid that. Neo4j’s data model is called the “Labeled Property Graph (LPG) Model”, so labels are pretty important.

We will cover:

  1. What is a label? What do they mean?
  2. How Neo4j treats labels
  3. How should I use labels in my models?

What is a label? What do they mean?

Labels are a kind of naming that can be applied to any node in the graph. They are a name only — and so labels are either present or absent. From graph database concepts:

Labels are used to shape the domain by grouping nodes into sets where all nodes that have a certain label belongs to the same set.

If you’ve ever used Cypher you’ve seen them: CREATE (p:Person { name: "David" }) specifies a label “Person”. A label can be any UTF-8 string that you like. By using back-ticks to escape your string in Cypher, you could have a label with spaces:

(:`Longer Spaced Label`)

Or you could even use table flipping guy (╯°□°)╯︵ ┻━┻ or emoji as your labels. These wouldn’t be very convenient to type in Neo4j Browser, but they can make for some fun with visualizations like Neo4j Bloom. 😉
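
As a minimal sketch, an escaped label is used the same way in writes and reads. The name property and its value are only illustrative:

```cypher
// Create a node whose label contains spaces (back-ticks are required)
CREATE (:`Longer Spaced Label` { name: "example" });

// The same escaping is needed whenever the label is referenced again
MATCH (n:`Longer Spaced Label`)
RETURN n.name;
```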

Set membership

The quote above mentioned “grouping nodes into sets”, and this is really the key concept to understand about what labels do for you. Consider this simple, unlabeled graph:

![](upload://7DCQKSLNNHjwJoOs7FeuumNqgb5.png)An unlabeled graph

Now what if we were to divide all of the nodes into two groups?

![](upload://5POSVrdWe77O0p9XmgFqfy0AWm.png)Venn diagram of two kinds of nodes

OK, that’s already clearer. A label in Neo4j is just a “set membership marker”. Once we apply real Cypher labels, we get automatic coloring; Neo4j Browser shows us the same thing as the Venn diagram above, only it’s easier to see what’s going on.

![](upload://4vNPTuWH7OWnguBuAlo30Ems9XH.png)

Multiple set membership

Sets can have subsets, so we could further divide our (:Job) nodes into (:Job:Technology) or (:Job:Healthcare) if we wanted. Any node can be in any number of sets that you want. This gets harder to represent visually with colors, but the concept is no different than the Venn diagram above.
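
Here is a small sketch of multiple set membership in Cypher. The name values are made up for illustration:

```cypher
// One node can carry several labels at creation time
CREATE (:Job:Technology { name: "Data Engineer" });
CREATE (:Job:Healthcare { name: "Nurse" });

// Matching on :Job returns both nodes (the whole set)
MATCH (j:Job) RETURN j.name;

// Matching on both labels returns only the subset
MATCH (j:Job:Technology) RETURN j.name;

// Labels can also be added or removed after the fact
MATCH (j:Job { name: "Nurse" }) SET j:Technology;
MATCH (j:Job { name: "Nurse" }) REMOVE j:Technology;
```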

How Neo4j treats labels

Every node, when stored on disk, has a few slots where Neo4j can store identifiers that tell it which labels the node has. A fixed amount of storage is pre-allocated for this (I believe it’s 4 slots). As labels are applied to a node, the database puts each label reference into one of those slots. This serves two purposes: it provides a form of semi-free indexing, and it feeds the database statistics that inform the Cypher query planner.

“Semi-Free” indexing

Going back to the toy database I used for the images above, I created 10,000 extra nodes in addition to my handful of “Job” and “Person” nodes. Then I ran an EXPLAIN on the query MATCH (n { name: "A" }) RETURN n. What did the database do?

![](upload://cifuJ94fr8wzZVCLj8DTNUvgCB2.png)Note the AllNodesScan at the top

At the very top, AllNodesScan is exactly what it sounds like. The database looks through every single node in the entire database (10,005 estimated) to find the ones we wanted. That’s pretty inefficient. But what if we ask it to explain its plan for finding the (:Person) with the name “A”? EXPLAIN MATCH (p:Person { name: "A" }) RETURN p

![](upload://pGMMaOD7qKyoCmX0roEMqkwlxEf.png)NodeByLabelScan

The result is a NodeByLabelScan which needs to consider an estimated 10 rows. This is exactly what’s meant by using labels as indexes — every time you specify one in a read query, the database has far less to look at to get the job done for you.
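
If you want to try this yourself, something like the following should reproduce the experiment on an empty scratch database. This is a sketch, not my exact setup: the filler node names are invented, and the estimated row counts you see will depend on your own data.

```cypher
// Hypothetical setup: 10,000 unlabeled filler nodes
UNWIND range(1, 10000) AS i
CREATE ({ name: "filler-" + toString(i) });

// A handful of labeled nodes
CREATE (:Person { name: "A" }), (:Person { name: "B" }), (:Job { name: "C" });

// No label: the plan should start from AllNodesScan
EXPLAIN MATCH (n { name: "A" }) RETURN n;

// With a label: the plan should start from NodeByLabelScan
EXPLAIN MATCH (p:Person { name: "A" }) RETURN p;
```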

The reason I’m saying they’re semi-free and not totally free is that the database still has to maintain the label store. Writing those labels (like writing and maintaining an index) is more work than not doing it, so by applying labels to nodes you do more up-front work at write time in order to win at read time.

As a form of indexing, compare these two options and notice how they’re equivalent, but how the label is easier to read (it’ll probably also be more performant).

/* Option 1: bad */
CREATE (n:Everything { name: "A", nodeType: "Job" });
// Index syntax for Neo4j 4.x and later; 3.x used CREATE INDEX ON :Everything(nodeType)
CREATE INDEX nodeType_idx FOR (n:Everything) ON (n.nodeType);
/* Option 2: much better, ultimately the same thing */
CREATE (n:Job { name: "A" });
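
The difference also shows up on the read side. As a sketch using the same two options:

```cypher
// Option 1: every lookup goes through the generic label plus a property filter
MATCH (n:Everything { nodeType: "Job" })
RETURN n;

// Option 2: the label itself selects the set
MATCH (n:Job)
RETURN n;
```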

Database Statistics

![](upload://nJYnRu7R2HpnJN48iOZjXm82eqE.png)

Neo4j maintains a “count store” for holding count metadata for a number of things. The count store is used to inform the query planner so it can make educated choices on how to plan the query. Getting counts from the count store is just fetching a single number, so it is very fast. For you (or the Cypher query planner), the count store is handy. You can read more about it in this knowledge base article.

In that EXPLAIN plan above, Cypher knew roughly how many nodes there were going to be because of the labels and the count store. In Halin, you can see the basics of the count store quite simply — the same stats the database sees.
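
You can also lean on the count store directly. In the Neo4j versions I’m aware of, simple counts like the ones below are planned with a NodeCountFromCountStore operator rather than a scan; prefix them with EXPLAIN to check what your own version does.

```cypher
// Total number of nodes in the database
MATCH (n)
RETURN count(n) AS allNodes;

// Number of nodes carrying a single label
MATCH (p:Person)
RETURN count(p) AS people;
```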

OK, with all of that out of the way, it’s time to use this information and talk about how to build better labeled property graph data models.

How should I use labels in my models?

Sometimes the simplest advice is the best: 9 times out of 10, if you attach one and only one label to every node, you’ll do just fine and won’t need further advice. It really can be that simple. The rest of this article deals with more complicated situations, and how to think in terms of using data modeling principles.

The general principles are simple: Use labels to indicate semantic class of information and set membership, and use them to get the speed-ups of “semi-free” indexing described above. To make this more concrete, let’s look at a few best and worst practices. Summarized, they are:

  • Label every node, no exceptions
  • Always have a query use case for a label
  • Multiple labels should be semantically orthogonal
  • Avoid label overload

Label every node

That means using at least one label and avoiding unlabeled nodes wherever possible. Unlabeled nodes are semantically indistinct (what is a node like that even supposed to mean?), and they’re harder to differentiate from other nodes.
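
A quick way to check an existing database for unlabeled nodes is sketched below. Note that it has to look at every node, so it is a one-off audit rather than something to run often:

```cypher
// Count nodes that have no labels at all
MATCH (n)
WHERE size(labels(n)) = 0
RETURN count(n) AS unlabeledNodes;
```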

If you just stick with strictly one label per node, in 95% of cases, you’re good to go! Almost all of the rest of this article deals with the issues surrounding multiple labels per node. But if you don’t need them, you can stop here, you’re done.

Always have a query use case for a label

Don’t design a data model if you don’t know what queries you want to ask of the database. If you have an idea that you would like to use label Foo, make sure it connects to a real question you need to ask of the database.

One of the most important facets of data modeling:

Data models exist to facilitate answering questions from the database. They are not for creating pristine semantic models of domains.

This is a major mistake a lot of folks make, a source of a lot of data model errors and pain: Trying to create a great semantic model of a domain, instead of focusing on what questions they need the database to answer. Creating great semantic models is arguably futile, because a model is a map and the map is not the territory. We are data modelers, not philosophers, and we have questions we need this database to answer, yesterday.

If you can’t figure out which query will use the label you have in mind, don’t use it. You can always apply more labels later. This is the YAGNI principle.
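
Adding a label later is cheap to express. As a sketch, assuming your Customer nodes happen to carry a region property (a hypothetical property, purely for illustration):

```cypher
// Once a real query needs it, derive the label from existing data
MATCH (c:Customer)
WHERE c.region = "EU"
SET c:EU;

// If the use case goes away, the label can be dropped again
MATCH (c:Customer:EU)
REMOVE c:EU;
```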

Multiple labels should be semantically orthogonal

“Semantically orthogonal” is a fancy term that means the labels on a node should describe independent aspects of it, with neither one implying anything about the other. Imagine we had a tiny report about an international business, showing the number of customers and products it has in different markets.

![](upload://Z8JnmRXiS0hMbnp7DQdsu4ZNJH.png)

The “business region” (USA vs. EU) is visually and semantically orthogonal to the count (Customer vs. Product).

There’s no real relationship between the concepts of customer & geography; this is what might make them good intersecting set labels.

And so labeling nodes (:Customer:USA), (:Customer:EU), (:Product:USA), (:Product:EU) might make for good data partitions. Notice how orthogonal labels support two different use cases. First, the entire graph is partitioned by geography, so we can pull an entire sub-graph like this:

MATCH (n:USA) RETURN n

And yet the set intersection is also highly selective, and makes for faster queries:

MATCH (p:Product:EU) RETURN p

These properties come from the orthogonality of the label semantics.
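
To make that concrete, here is a sketch that rebuilds a report like the one above straight from the labels. The sample nodes are invented, and the CASE expressions assume every node has exactly one of the two labels on each axis:

```cypher
// Hypothetical sample data: one node per label combination
CREATE (:Customer:USA { name: "c1" }), (:Customer:EU { name: "c2" }),
       (:Product:USA { name: "p1" }), (:Product:EU { name: "p2" });

// Count nodes per (kind, region) cell of the report
MATCH (n)
WHERE (n:Customer OR n:Product) AND (n:USA OR n:EU)
RETURN CASE WHEN n:Customer THEN "Customer" ELSE "Product" END AS kind,
       CASE WHEN n:USA THEN "USA" ELSE "EU" END AS region,
       count(*) AS total
ORDER BY kind, region;
```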

Don’t overdo it — Avoid label overload

![](upload://3uM0x4sKsDEbygcm73biNfdLVms.png)

Above we mentioned that Neo4j pre-allocates slots for about 4 labels. This much you get for “semi-free” as we were describing; but if you go over that limit, Neo4j starts having to allocate extra space just to store your pile of labels. And remember that usually you get your fastest query speedups by using the most specific label when you query; so if you have 10 labels on a node, you’re usually just asking the database to do lots of unnecessary work that isn’t going to speed up your query.

As a general rule of thumb, past 4 labels per node, expect overall performance to get worse, not better

In the majority of well-thought-out models I’ve seen, one or two labels per node is sufficient. Some exotic cases with good reasons might need three or four; but above this, you should really question whether the model makes sense and is tuned to your actual query needs.
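
If you’re not sure where an existing model stands, a rough audit like this sketch shows how many nodes carry how many labels. It touches every node, so run it sparingly on large graphs:

```cypher
// Distribution of label counts across all nodes
MATCH (n)
RETURN size(labels(n)) AS labelCount, count(*) AS nodes
ORDER BY labelCount DESC;
```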

Avoid class hierarchies

An idea that people sometimes get about labels is to use them to model class hierarchies. This is often called “inheritance” or “IS-A” relationships. If you understood the section above about “semantic orthogonality” you should immediately spot that class hierarchies are not semantically orthogonal.

Imagine you have a zoo of information like this:

![](upload://gKY3lajrfEdmCINfMoq4A37bvVg.png)

It’s tempting to now go create a lot of nodes that are labeled (:Bat:Mammal:Animal), (:Pelican:Oviparous:Animal) and so on. Using our “set membership” argument above, folks who do this will point out how easy it is to MATCH (a:Animal) as a set, sweeping in all of the crocodiles & whales, and so forth.

This is generally not a good idea for a lot of reasons:

  • Neo4j doesn’t enforce co-label constraints, i.e. which labels can occur together. This means the database won’t give you any help in avoiding mammalian crocodiles, or non-animal whales, which are actually nonsense under your semantic model.
  • It will tend to overload the number of labels as your hierarchy grows.
  • You can do the same anyway with set union. If you just labeled everything with Mammal or Oviparous, you could always get all of the super-class members (Animal) anyway, by MATCH (n) WHERE n:Mammal OR n:Oviparous.
  • Most of the time, people like this model, but when they look into their actual query patterns, they don’t really have a use case for matching abstract super-classes in a real world query, and it’s just over-complicated. Class hierarchies are often a case of someone creating a semantic model as opposed to focusing on how to answer questions.

If a node has an IS-A relationship somewhere (for example, a whale IS-A mammal), use a relationship to a different node indicating Mammal. It should not be a label.
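
One way this could look is sketched below. The Species and Taxon labels and the IS_A relationship type are names I’m inventing for the example, not a prescribed schema:

```cypher
// Hypothetical names: Species, Taxon and IS_A are illustrative only
CREATE (animal:Taxon { name: "Animal" })
CREATE (mammal:Taxon { name: "Mammal" })-[:IS_A]->(animal)
CREATE (:Species { name: "Whale" })-[:IS_A]->(mammal)
CREATE (:Species { name: "Bat" })-[:IS_A]->(mammal);

// Everything that is, directly or indirectly, an Animal
MATCH (s:Species)-[:IS_A*]->(:Taxon { name: "Animal" })
RETURN s.name;
```

Each node keeps a single label, and the hierarchy can grow deeper without adding labels to anything.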

Avoid composition relationships

Alternatively, there are “HAS-A” relationships, which can indicate ownership. Again, the word “relationship” is right there in the name. If your model has two distinct classes of things (for example a “Person” and a “Car”), it can sometimes be tempting to model the relationship between them as a label, such as (:Person:CarOwner).

Notice the name of that label, CarOwner; a pattern to look for is the “noun verber” (noun = Car, verb = own). A “noun verber” term in a data model is expressing a relationship, so the better way to go is to use a relationship.
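
As a sketch, with OWNS and the property names chosen only for illustration:

```cypher
// Hypothetical names: OWNS, and the name/model properties
CREATE (:Person { name: "David" })-[:OWNS]->(:Car { model: "Roadster" });

// "Car owners" are simply people with an OWNS relationship to a car
MATCH (p:Person)-[:OWNS]->(:Car)
RETURN DISTINCT p.name;
```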

Often you’ll find composition relationships won’t be semantically orthogonal. The good example above (:Customer:EU), (:Customer:USA) wasn’t a “noun verber” example. It wasn’t composition (EU doesn’t “have” a customer, the business does), and it also wasn’t inheritance (EU isn’t a customer), which is part of why it was orthogonal.

Conclusion

Following these guidelines is a good way to get the best balance of clarity and performance in your model, and should help whether your model is just a simple toy, or a giant enterprise model.

Happy graph hacking!

Graph Modeling: Labels was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.