Performance of Node Labels VS attributes

Hello

I am trying to understand if the use of a node label as a "tag" has any better performance in querying than using an indexed attribute.

Consider for example a typical blogging platform where some Article node might have to be "tagged" as Draft, versus adding a status attribute (with an index) to serve the same purpose.

This has come up as a potential feature in neomodel but it has some potentially messy implications. Therefore, it would be good if we had some more information on this prior to proceeding with a feature.

My naive thinking is that there is no difference between indexing Labels and indexing attributes and, therefore, no reason why this functionality should not be implemented via an attribute. But if there is a reason for a Label search to be faster, this might add a "vote" for adopting this "Optional Labels" feature.

For more information, please see here and maybe here too.

All the best

For simple queries, node labels perform identically to indexes on properties. For more complex queries, you will find that node labels outperform an index on a property.

Elaine

Hi

Thank you for your reply.

The first part of your answer confirms my suspicion and I would not think that a different algorithm is used for indices on labels or properties (at least for strings).

For the second part, I do not think that it would be a "fair comparison" to compare a complex query with a "has / does not have a label" query. But just in case I misunderstood, would it be possible to give an example of "...a more complex query [where] node labels will outperform using an index on a property"?

Hi there,

Usually indexes and/or label scans are used to find starting places in the graph, then expansion and filtering is used to match to desired patterns.

If you aren't starting from draft articles, but instead traversing to article nodes via some more complex MATCH, then you would need to filter to ensure the article nodes are drafts. This would require either property access and comparison, or a label check, depending on the implementation.

In this case, label checks will outperform filtering on properties, but you'll usually only notice this if you're filtering nodes numbering in the hundreds of thousands or more. Property access and filtering is usually one of the more expensive operations in a Cypher query, so when dealing with a large number of rows/nodes, looking for ways to avoid property access (such as through usage of labels, or utilization of more specific relationship types with high selectivity) often pays off.

In this case you'll need to balance this with the additional complexity of label usage. In the case of a boolean status, draft or not a draft, this is fairly easy to work with in your query. If there are more options besides draft or not a draft, then performing filtering or getting the status may become more complex than what you'd want: you can't simply return article.status, you'd have to explicitly check all the possible labels that are involved, and it's not clear from the model which labels are options for the status. For this case it would probably be best to keep the status property, though you may want to use the :Draft label as an addition to this rather than a replacement.
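
To make the comparison concrete, here is a minimal sketch of the queries involved, held as Python strings so they can be passed to whichever driver you use. The (:Author)-[:WROTE]->(:Article) pattern, the name parameter, the :Published label, and the 'draft' / 'published' values are assumptions made up for illustration; only Article, :Draft, and status come from this thread. Prefixing a query with PROFILE lets you compare the db hits of the two filters on your own data.

```python
# Hypothetical model: (:Author)-[:WROTE]->(:Article). The names Author, WROTE,
# name and Published are illustrative assumptions, not part of the thread.

# Traverse to articles, then keep drafts by checking for an extra :Draft label.
LABEL_FILTER = """
MATCH (:Author {name: $name})-[:WROTE]->(article:Article)
WHERE article:Draft
RETURN article
"""

# Same traversal, but keep drafts by reading and comparing a property.
PROPERTY_FILTER = """
MATCH (:Author {name: $name})-[:WROTE]->(article:Article)
WHERE article.status = 'draft'
RETURN article
"""

# With several label-encoded statuses, reading the status back means
# testing each candidate label explicitly instead of returning article.status.
STATUS_FROM_LABELS = """
MATCH (article:Article)
RETURN article,
       CASE
         WHEN article:Draft THEN 'draft'
         WHEN article:Published THEN 'published'
         ELSE 'unknown'
       END AS status
"""


def profiled(query: str) -> str:
    """Prefix a Cypher query with PROFILE so the executed plan and db hits are reported."""
    return "PROFILE " + query.strip()
```

Running profiled(LABEL_FILTER) and profiled(PROPERTY_FILTER) against a realistic dataset is the simplest way to see whether the difference matters at your scale.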

That's great, thank you very much. As far as I am concerned, these two responses are very clear on how this would be handled by the DBMS.

Regarding your last paragraph, for neomodel, the added complexity is that treating the Label as a tag breaks the convention that establishes Labels as "Types", and this has a number of implications based on how things work at the moment.

Just out of curiosity: my perception is that if one adds .status as an indexed integer and treats the N-bit integer as N flags, then the performance hit cannot be that big. In fact, if you were to express your query with logic operations, then (I suppose) this would result in a scan, because the expression has to be evaluated before it can be decided whether it satisfies the criterion. But if you can express it via comparison or equality operators, then this can be resolved purely by looking at the index (Tree entries) and retrieving those nodes that are within the specified range. Another way to do this would be to assign every combination of "flags" to a separate "status" value, which then (in this particular case) translates the problem to a simple comparison that can be handled directly by the index. (This of course covers only part of the problem. I do hear what you are saying about queries with multiple nodes where it might be necessary to "jump" between node categories; in that case, the label will indeed be faster.) Is this more or less on the right path?
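
A toy model of the idea above, as a minimal Python sketch. A sorted list of (status, node_id) pairs stands in for an ordered index; this is an assumption for illustration and says nothing about how Neo4j actually implements or plans index lookups. The flag names and node ids are made up.

```python
from bisect import bisect_left, bisect_right

# Hypothetical flag layout for an integer status property.
DRAFT = 0b001
FEATURED = 0b010
ARCHIVED = 0b100

# Sorted (status, node_id) pairs standing in for an ordered index.
index = sorted([(0b001, 10), (0b011, 11), (0b010, 12), (0b101, 13), (0b100, 14)])


def seek_equal(index, status):
    """Equality lookup: binary search straight to the run of matching entries."""
    lo = bisect_left(index, (status, float("-inf")))
    hi = bisect_right(index, (status, float("inf")))
    return [node_id for _, node_id in index[lo:hi]]


def scan_flag(index, flag):
    """Bitwise test: every entry is evaluated, since values with the flag set
    are in general not contiguous in sort order."""
    return [node_id for status, node_id in index if status & flag]


def seek_any(index, statuses):
    """One equality seek per enumerated combination of flags."""
    found = []
    for status in sorted(statuses):
        found.extend(seek_equal(index, status))
    return found


# All 3-bit values that have the DRAFT bit set.
draft_values = [v for v in range(8) if v & DRAFT]

print(seek_equal(index, DRAFT))       # nodes whose status is exactly DRAFT
print(scan_flag(index, DRAFT))        # nodes with the DRAFT bit set, via full scan
print(seek_any(index, draft_values))  # same nodes, via equality seeks only
```

In this model the scan and the enumerated seeks return the same nodes, but the number of values to enumerate doubles with every extra flag, so the second approach only stays practical for a small number of flags.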

While this was written a while back, I think this is still very relevant as general reading: Modelling Data in Neo4j: Labels vs. Indexed Properties | GraphAware

Thanks @Christophe_Willemsen, I had come across this article and read it but, as you say, it is getting a bit "old" by now and I thought it would be better to confirm.

Thanks @Christophe_Willemsen for the article. A very interesting and important basis for understanding query optimisation.