JSON Output File Cypher Query Readable by D3 Sankey

Greetings,

Apologies if info around this topic exists. I'm using Neo4j Community 3.5.1 w/ apoc-3.5.0.3-all.

Goal
Here is what I am trying to achieve... I want to run a Cypher query that outputs a JSON file in the format consumable/readable by D3.js for displaying a Sankey diagram. This site displays the D3 source files. Here is the JSON structure D3 requires, which I am hoping APOC can output...

{
"nodes":[
{"node":0,"name":"node0"},
{"node":1,"name":"node1"},
{"node":2,"name":"node2"},
{"node":3,"name":"node3"},
{"node":4,"name":"node4"}
],
"links":[
{"source":0,"target":2,"value":2},
{"source":1,"target":2,"value":2},
{"source":1,"target":3,"value":2},
{"source":0,"target":4,"value":2},
{"source":2,"target":3,"value":2},
{"source":2,"target":4,"value":2},
{"source":3,"target":4,"value":4}
]}

This is what the properly formatted JSON file looks like when rendered by D3 Sankey...

What I'm trying
I've got a working Cypher query that returns source, target, and value, but I don't know how to extend the query so that it first groups all of the "nodes" (numbering them starting at 0) and then groups all of the "links", as seen above.

WITH "MATCH (bry:Brewery)-[:BREWS]->(b:Beer)<-[:IS_BARREL_AGED]-(ba:BarrelAged)
      WITH bry AS brewery, size(collect(b.name)) AS number
      WHERE number > 1
      MATCH (brewery)-[:BREWS]->(b2:Beer)<-[:IS_BARREL_AGED]-(ba2:BarrelAged)
      WITH brewery, ba2, count(b2) AS cnt
      RETURN brewery.name AS source, ba2.name AS target, cnt AS value ORDER BY target ASC
      UNION ALL
      MATCH (bry:Brewery)-[:BREWS]->(b:Beer)<-[:IS_BARREL_AGED]-(ba:BarrelAged)
      WITH bry AS brewery, size(collect(b.name)) AS number
      WHERE number > 1
      MATCH (brewery)-[:BREWS]->(b2:Beer)<-[:IS_BARREL_AGED]-(ba2:BarrelAged)
      OPTIONAL MATCH (brewery)-[:BREWS]->(b2)-[:IS_A]-(bt:BeerType)
      WITH brewery, ba2, count(b2) AS cnt, bt
      RETURN ba2.name AS source, bt.name AS target, cnt AS value ORDER BY source ASC"
     AS query
CALL apoc.export.json.query(query, "/filelocation/barrelsankey.json")
YIELD file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
RETURN file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data

Does anyone have a suggestion as to what I need to do to my query to make its JSON output the same structure required of D3 Sankey?

Hi Keith,

I'm a SANKEY fan, so I had to take a quick look at this. I'd like to see the answer too! :)

My first thought was that there are more than a couple of challenges here, and more than a few ways to solve them. I'd personally sketch out a strategy, break it into pieces, solve each one at a time, and finally string them together into a working solution. That still might be best. But as I started to pick apart my first idea (virtual nodes/relationships sent to the APOC JSON export), I saw the APOC JSON output

{"nodes(p)":[{"type":"node","id":"0","labels":["....

It looks quite verbose and graph specific, and it would require post-processing in JavaScript before use, so I then took a look at just generating the JSON directly.

Note: since I don't have your DB, I tried to do something generic with one of my own DBs that can be run as a starter / demonstration exercise. Just replace the MATCH clause to find the nodes and relationships you want to send to D3 (and remove the LIMIT 2, of course...).

MATCH p=(a)-[]->(b)
WITH p LIMIT 2
UNWIND nodes(p) AS n
UNWIND relationships(p) AS r
WITH collect(DISTINCT {node: ID(n), name: n.name}) AS nl,
     collect(DISTINCT {source: ID(startNode(r)), target: ID(endNode(r)), value: r.score}) AS rl
RETURN {nodes: nl, links: rl}

Here is the output

{
  "nodes": [
    {
      "node": 0,
      "name": "A"
    },
    {
      "node": 526658,
      "name": "B"
    },
    {
      "node": 526657,
      "name": "C"
    }
  ],
  "links": [
    {
      "source": 0,
      "value": null,
      "target": 526658
    },
    {
      "source": 0,
      "value": null,
      "target": 526657
    }
  ]
}
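If you want that document written straight to a file rather than returned to the client, the same query can be wrapped in apoc.export.json.query, along the lines of Keith's original call. A sketch, assuming the same generic MATCH, a writable export path, and apoc.export.file.enabled=true in neo4j.conf (the path and the r.score property are placeholders from my DB, not yours):

// Sketch: wrap the JSON-building query in APOC's file export.
WITH "MATCH p=(a)-[]->(b)
      WITH p LIMIT 2
      UNWIND nodes(p) AS n UNWIND relationships(p) AS r
      WITH collect(DISTINCT {node: ID(n), name: n.name}) AS nl,
           collect(DISTINCT {source: ID(startNode(r)), target: ID(endNode(r)), value: r.score}) AS rl
      RETURN {nodes: nl, links: rl} AS sankey" AS query
CALL apoc.export.json.query(query, "/filelocation/sankey.json")
YIELD file, rows
RETURN file, rows

One caveat: the export procedure writes one JSON line per returned row and nests the map under the returned column name ("sankey" here), so a small post-processing step may still be needed before D3 can read the file directly.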

Thank you so much Joel for your time and energy with this!
One issue remains and I am unable to solve it or find information about it... most of the D3 Sankey demos found on the web, based on Mike Bostock's version, use a zero-based index for organizing each unique JSON object and its source-to-target linkage. They expect the JSON objects to start at 0 (zero) and count up from there for each unique object.
Example:

{
"nodes":[
{"node":0,"name":"node0"},
{"node":1,"name":"node1"},
{"node":2,"name":"node2"},
{"node":3,"name":"node3"},
{"node":4,"name":"node4"}
],
"links":[
{"source":0,"target":2,"value":2},
{"source":1,"target":2,"value":2},
{"source":1,"target":3,"value":2},
{"source":0,"target":4,"value":2},
{"source":2,"target":3,"value":2},
{"source":2,"target":4,"value":2},
{"source":3,"target":4,"value":4}
]}

While it's true my nodes have both an ID and a Uid property, without a JSON file organized like this the Sankey rendering fails. There are versions of Bostock's Sankey out there that key on a unique name, but I was really hoping to base it on the numbered version. If that's not possible, no worries.
...
Are you using D3 for your Sankeys? If so, how have you solved this unique identifier issue?

I've not created SANKEYs from Neo4j, so I was eager to see you do it...

At the moment I'm thinking

  1. Personally, I'd try to figure out how to use the identifiers "as is", since this is a universal issue for every graph rendered from every source (the Sankey lib ought to do this for us; can't we fix it there?)
  2. If the (latest) Sankey library simply can't handle, or be updated to handle, typical graph identifiers, then renumber them in JavaScript? It may be easier than in Cypher.
  3. If it must be done on the Neo4j side, there is probably a clever way to do it using an in-memory graph projection (assign a new sequential _id or a new id property?), or by creating virtual nodes/rels, where the ids are renumbered automatically. However, having tinkered with virtual nodes/rels for a few minutes, it wouldn't be entirely straightforward: automatically assigned virtual nodes receive negative identifiers starting at -1 (not zero), so they'd still need an abs(ID()+1) transform (for both the nodes and the rel start/end identifiers). With some work, a virtual nodes/rels approach may be possible?

Aside from #3, what is required is a cross-reference table for the node IDs, applied to both the nodes and the rel start/end IDs, somewhere in the process. In the past I've often built tables like this on the fly while iterating over the input list: look for the orig_id entry in the table; if found, use its new_id; if not, add a new entry to the table and use that; repeat until done. If run over the unique node list first, the table should be fully populated by the time it's used to translate the rel IDs...

orig_id     new_id
123        0
432        1
31         2
7          3
9991       4
....

Note: it feels like there ought to be another simple, tricky way to do this in Cypher, maybe using an obscure APOC function, but I haven't found it yet...
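For what it's worth, one candidate for that trick is apoc.coll.indexOf: collect the distinct nodes once, then use each node's position in that list as its zero-based Sankey index, for nodes and rel endpoints alike. A sketch along the lines of my earlier generic query (again assuming a name property on nodes and a score property on rels; untested against your schema):

// Sketch: zero-based renumbering via list position (apoc.coll.indexOf).
MATCH p=(a)-[]->(b)
WITH p LIMIT 2
UNWIND nodes(p) AS n
UNWIND relationships(p) AS r
WITH apoc.coll.toSet(collect(n)) AS ns, collect(DISTINCT r) AS rs
WITH ns, [x IN ns | ID(x)] AS ids, rs
RETURN {
  nodes: [x IN ns | {node: apoc.coll.indexOf(ids, ID(x)), name: x.name}],
  links: [r IN rs | {source: apoc.coll.indexOf(ids, ID(startNode(r))),
                     target: apoc.coll.indexOf(ids, ID(endNode(r))),
                     value:  r.score}]
} AS sankey

Each indexOf lookup is a linear scan over the id list, so this is fine for diagram-sized result sets but not something I'd use for bulk export.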