Question: Dense Nodes with Millions of Relations?


(Alec Muffett) #1

There are perhaps 250 countries in the world.

  • If you have People nodes, and Country nodes
  • and if you have millions of people (e.g. each linked to the one or more countries they have ever visited)
  • then some of these Country nodes will have tens of thousands, perhaps hundreds of thousands or millions of relationships ...

What I am wondering is "I am sure that Neo can cope with this, but is this truly a sane way to model data"?

It strikes me that perhaps it is more sane / less likely to cause query explosions, if Country nodes are somehow "sharded", eg: with something like:

(p:Person {uid:42})-[:Visited]->(c:Country {name:"France", uid:42})

...so that (assuming we are most interested in the :Visited forward relationship) each Person has a small cluster of per-Person-sharded Country nodes associated with them.
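
To make the trade-off concrete, here is a toy Python model (not Neo4j code) that counts relationships per Country node under both layouts. The visit data is invented for illustration, and the "sharded" layout is assumed to mean one Country node per (person, country) pair, as in the pattern above.

```python
from collections import Counter

# Hypothetical visit data: (person uid, country name) pairs.
visits = [(1, "France"), (2, "France"), (3, "France"), (3, "Spain")]

# Shared model: one node per country, so its degree grows with every visitor.
shared_degree = Counter(country for _, country in visits)

# Sharded model: one Country node per (person, country) pair, so every such
# node has degree 1, but there are as many Country nodes as there are visits.
sharded_degree = Counter((uid, country) for uid, country in visits)

print(max(shared_degree.values()))   # degree of the densest shared Country node
print(max(sharded_degree.values()))  # degree of the densest sharded Country node
print(len(shared_degree), len(sharded_degree))  # number of Country nodes in each model
```

In this sketch the sharded layout bounds the degree of each Country node, at the cost of many more nodes and of losing the single "France" node from which all visitors could be reached by traversal.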

What do other folks think, please? Both approaches have pros/cons that I can see...


(Alec Muffett) #2

Apparently relevant link:


(Alec Muffett) #3

Further, a post from @mark.needham a few years ago touches on CREATE UNIQUE (which may be defunct now?) and describes an effect that I may be encountering: CSV-importing and merging relationships onto dense(ish) nodes may slow down as the relationship count rises.

Sample code:

USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:/foo.csv" AS batch
WITH batch WHERE (batch.FOO <> "")
MATCH (a:Person {uid:toInteger(batch.UID)}), (x:Foo {nonce:batch.FOO})
MERGE (a)-[ax:PersonToFoo]->(x)
ON CREATE SET ax.weight = toInteger(batch.WEIGHT)
ON MATCH SET ax.weight = ax.weight + toInteger(batch.WEIGHT)
; 

...where a and x have been pre-created in previous runs, per the suggestion from @michael.hunger in "Tip: Avoiding Slow & Messy Conditionals (or: splitting input) in Cypher for bulk import LOAD CSV?"

(edit: there are tens-of-millions of relations to create, from a multi-gigabyte CSV file; there are also UNIQUE constraints on a.uid and x.nonce)

Mark's Blog Link: https://markhneedham.com/blog/2015/07/28/neo4j-mergeing-on-super-nodes/


(Michael Hunger) #4

Actually, there is a specific Cypher operator for that, MERGE(INTO), which takes the two node degrees into account and starts from the smaller side to check whether a relationship exists between the two.

Your statement also has to perform a write for every line of the CSV.

Something that can be helpful in general for this kind of statement is to aggregate first
and then create the data after.
(But that would not work with USING PERIODIC COMMIT (for > 1M rels), so you'd have to use apoc.periodic.iterate)

call apoc.periodic.iterate('
LOAD CSV WITH HEADERS FROM "file:/foo.csv" AS batch 
WITH batch WHERE (batch.FOO <> "") 
RETURN toInteger(batch.UID) as person, batch.FOO as foo, sum(toInteger(batch.WEIGHT)) as weight
','
MATCH (a:Person {uid:person}), (x:Foo {nonce:foo}) 
MERGE (a)-[ax:PersonToFoo]->(x) 
SET ax.weight = coalesce(ax.weight, 0) + weight
',{batchSize:10000})
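
The "aggregate first" idea can also be pictured outside the database. The following is a minimal Python sketch, assuming a CSV with the UID, FOO and WEIGHT columns used above; the sample rows and the `aggregate` helper are made up for illustration. It collapses the input to one summed weight per (UID, FOO) pair, so that each relationship would need only one write.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for foo.csv; column names follow the thread.
SAMPLE = """UID,FOO,WEIGHT
42,abc,1
42,abc,2
42,,5
7,abc,4
"""

def aggregate(rows):
    """Sum WEIGHT per (UID, FOO) pair, skipping rows with an empty FOO."""
    totals = defaultdict(int)
    for row in rows:
        if row["FOO"] != "":
            totals[(int(row["UID"]), row["FOO"])] += int(row["WEIGHT"])
    return dict(totals)

totals = aggregate(csv.DictReader(io.StringIO(SAMPLE)))
print(totals)
```

Here three usable input rows shrink to two (person, foo) pairs, which mirrors what the sum(...) in the first statement does before any MERGE runs.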

(Alec Muffett) #5

Hi @michael.hunger - thank you for the response; I take your point about the aggregation, that makes sense, and I may try it. But where you say that there is a specific Cypher operator, MERGE(INTO), I don't really understand what you mean; is there a different kind of MERGE that I should be using?


(Michael Hunger) #6

Nope, that's automatic; you can see the difference in the query plan.
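
For readers wondering why starting "from the smaller side" matters, here is a toy Python model of the idea. It is only an illustration under simplifying assumptions (an in-memory, undirected adjacency list; the `relationship_exists` helper is hypothetical) and does not claim to reflect how Neo4j implements the operator.

```python
def relationship_exists(adjacency, a, b):
    """Check for an a-b relationship by scanning the lower-degree node's neighbours."""
    # Pick whichever endpoint has fewer relationships and scan only that one.
    small, other = (a, b) if len(adjacency[a]) <= len(adjacency[b]) else (b, a)
    scanned = 0
    for neighbour in adjacency[small]:
        scanned += 1
        if neighbour == other:
            return True, scanned
    return False, scanned

# One dense node ("foo") connected to 1000 sparse person nodes.
adjacency = {"foo": [f"p{i}" for i in range(1000)]}
for i in range(1000):
    adjacency[f"p{i}"] = ["foo"]
adjacency["p_new"] = []  # a person with no relationships yet

print(relationship_exists(adjacency, "foo", "p999"))
print(relationship_exists(adjacency, "foo", "p_new"))
```

In this model the check costs at most the degree of the sparser node, regardless of how many relationships the dense node has accumulated.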