Trying to understand some of the tradeoffs when refactoring duplicate data

Hi! I'm working my way through the courses, and I'm on the Refactoring Duplicate Data module in the Modeling Fundamentals course. I just learned about a great example where the languages in which a movie is available are listed as properties on the Movie nodes, but since most of them have "English" in the list, it's creating duplicate data. So this is refactored by making a Language node for "English" that can have a relationship with the Movie nodes. I understand the concept, but it brought two questions to mind:

  1. In this case, we are trading a property of "English" for an IN_LANGUAGE relationship to "English". Why is this better? I know it is situational and depends on my use cases, but in general, are "duplicate" relationships better than duplicate properties?

  2. Would this create the "super-node" issue cautioned against earlier in the courses? It would mean that we'd be creating a Language node for "English" and basically all the movies would point to it. Would that cause scalability problems? Or is it still better to do that than have "English" in the properties of basically every movie?

Hello Sara,

Welcome to the Neo4j Community!

It's always better to eliminate duplicate data, as we teach in the course.

Since Neo4j does not yet support indexes on elements of a list, the best solution is to create an English node that the Movie node points to.
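
As a rough sketch of that refactoring, something like the following Cypher could work. It assumes the list property on each Movie node is called `languages` and that the new nodes carry a `name` property; both names are assumptions here, so adjust them to match your graph.

```cypher
// Assumption: each Movie has a list property named `languages`
MATCH (m:Movie)
// Turn each element of the list into its own row
UNWIND m.languages AS language
// MERGE creates each Language node only once, however many movies share it
MERGE (l:Language {name: language})
// Connect the movie to its language; MERGE avoids duplicate relationships
MERGE (m)-[:IN_LANGUAGE]->(l)
```

Once the relationships exist and your queries use them, the old property can be dropped in a separate statement, for example with `MATCH (m:Movie) REMOVE m.languages`.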

If your data is such that you may have "super" nodes, you may want to model the data to avoid them, but a node needs to have millions of relationships for this to happen.
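
One way to see how close a node is to that territory is to count its relationships. This is only a sketch, assuming the Language nodes and IN_LANGUAGE relationships from the course example, with a `name` property on Language:

```cypher
// Count how many movies point at each Language node
MATCH (l:Language)<-[:IN_LANGUAGE]-(m:Movie)
RETURN l.name AS language, count(m) AS movies
// Busiest nodes first, so likely super nodes appear at the top
ORDER BY movies DESC
```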

The bottom line is that you should profile your important queries to make sure that they are covered by your data model.
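
For example, prefixing a query with `PROFILE` runs it and shows the execution plan along with the rows and db hits for each step. The query below is just an illustration, assuming the refactored model and a `title` property on Movie:

```cypher
// PROFILE executes the query and reports rows and db hits per operator
PROFILE
MATCH (m:Movie)-[:IN_LANGUAGE]->(:Language {name: 'English'})
RETURN m.title AS title
```

Comparing the db hits of this version with the equivalent query against the list property is a reasonable way to check which model serves your use case better.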

That's the beauty of Neo4j: it is easy to refactor the graph to support a data model.

Elaine
