My graph is, well, busy. What now?

Sorry if this is vague! I think I'm asking for brainstorming ideas...
This graph is a mySQL schema: tables, queries, attributes and the connections between them. When it loads in the browser it looks unintelligible. What now? Is it simply too much data? I can filter it, but are there other tips and tricks I can use?

I am very new to this, so hopefully my answer is OK, or someone more knowledgeable will come along and give you a better answer! It looks to me like you have attributes such as first_name as nodes, when they should in fact be adjectives to the nodes, or adverbs to the relationships. This would thin out your image significantly. My take on moving from an SQL RDBMS to a graph database is that the tables become nodes, and the fields become adjectives/adverbs. There is a good series of introductory videos over on YouTube which I found useful:

While Paul is still getting used to the terminology we use with property graphs and Neo4j, his advice is good, and the video series (they link to the next in the series at the end) is a great place to become familiar with graph db use cases, terminology, and function.

At the very general level, when translating from SQL to Cypher, you would translate this way:

A SQL table name would become a node label.
A row in a table would become a node (labeled appropriately as in the previous line)
A column in the table would become a property of the corresponding node for that label.

In place of a join table (or a column in a table with an FK relationship to an entry in another table) you would instead create a relationship between each pair of related nodes.
Any column on a join table aside from the FKs that correspond to some characteristic of the relationship would become a property of the relationship between the two nodes.

So for your graph, you wouldn't want :Table nodes or :Attribute nodes, but nodes like :Film, :Store, :Customer, :Address, :Actor, etc. And each of the nodes of those labels would have corresponding properties (instead of the :Attribute nodes you currently have).

I can see one of your nodes in the picture is film_actor. Instead of changing that to a :FilmActor node, we could recognize that this is a join table, which should instead be represented by relationships between :Film nodes and :Actor nodes. You would want to first make sure your nodes are in place, then process your data to match to those existing nodes and create relationships between them.

As far as the :Schema and :Query nodes, I'm not quite sure what these are supposed to be. If these are actual queries, and actual schema for your previous db, then they should not be modeled as nodes in your graph. We'd need more detail to see what indexes or constraints would be the equivalent for your schema, and you'd be creating actual Cypher queries in place of those query nodes.

You may want to check out our Movies Graph tutorial as a good example of how similar property graph data could be modeled.

In the Neo4j Browser, execute the following:
:play movies

This will bring up the tutorial that you can use to create and play with the movies graph, that should help.

1 Like

Great feedback, I appreciate the depth of your analysis! I am modeling the schema rather than the data so I think I'm ok with nodes and relationships. My quandary is how to make it manageable and useful to my users. Anyone looking at it is overwhelmed by the
volume of information and this is a small schema.

Bill Nicholson

Assistant Professor Educator, IT Program

University of Cincinnati, Clermont College

281 West Woods

nicholdw@ucmail.uc.edu

(513) 558-5342 (please email for faster response)

If you need an overview of a graph's schema, rather than creating a separate graph to represent the schema manually, you may want to take advantage of some existing procedures that can infer it and reveal a graph-level view of it from the data itself.

You can use CALL db.schema() to do this, or, if you have APOC Procedures installed, you can use CALL apoc.meta.graph(), which tends to work better.

I worked on a similar project for my company. There was 100's of tables and many thousands of attributes which had various transformations that were stored in a Data Warehouse, with data marts down stream, and analytics tables where models were being built. The goal of my database was to show the various transformations between variables. I.e. - given an analytics table attribute for a model, where did it come from, what were the transformations involved, etc.

If your graph is unintelligible, try to focus the graph on specific tables or types of attributes, so that you can answer or investigate subsets of the connections. Neo4j browser doesn't have the most elegant organization features, perhaps consider Bloom if you are on Enterprise. If not you can also use APOC to export as a graphml file, and then visualize in yEd which has more self -organization algorithms to show structure better. I have a python script that with basic edits can colorize your graphml file so that the output has colors and shapes to show differences like tables, vs attribute nodes etc: https://github.com/bsquire0128/Squire-Product-Catalog/blob/master/colorize_graphml.py

Hi Benjamin,

How do I use the script? I have Neo4j Community Edition and I'm doing everything locally. I just open Chrome, browse to 127.0.0.1:7474, and the graph loads.

Ah, Benjamin makes a great point, if the purpose of your graph is for data lineage or similar, where the dbs, tables, and properties involved are actual entities (for example, to query for which db/tables/columns were used to source certain data found in some report), then your initial approach of using :Table and :Attribute nodes and similar would make more sense.

Is this the type of graph and data you're trying to model? Or did I have it right before, and you're just trying to import SQL data (movie related data?) into Neo4j?

I am modeling the schema, not the data in the schema.

The way it works is add the APOC plugin as shown https://neo4j-contrib.github.io/neo4j-apoc-procedures/#installation

With APOC you can run apoc.export.graphml.all(file,config)
seen here https://neo4j-contrib.github.io/neo4j-apoc-procedures/#graphml

When you save the file mygraph.graphml, you input that into python colorize_graphml.py mygraph. First you need to edit cololrize_graphml.py to fit your data. This is done by changing visual_dict={'Node1':{'id':'node1_id','color':'#ff6666','shape':'ellipse'}, 'Node2':{'id':'node2_id','color':'#fff666','shape':'ellipse'} }
Node1 might be 'Table', Node 2 might be 'Schema' and you could add a "Attribute" so it looks like:visual_dict={'Table':{'id':'table_id','color':'#ff6666','shape':'ellipse'}, 'Schema':{'id':'schema_id','color':'#fff666','shape':'ellipse'}, 'Attribute':{'id':'attribute_id','color':'#ffff66','shape':'ellipse'} }
Note- I don't know what the id's in your graph are called, I just assumed the properties that define unique nodes were table_id, etc. Make sure they match waht your data actually has. Once you run python colorize_graph.py mygraph you will get an output file mygraph_colored. Download the free version of yEd here https://www.yworks.com/products/yed and then open the graphml file mygraph_colored in that software. It has lots of amazing layouts for different graph structures. I prefer circular or hierarchal.