Filtering by Relationship Type - Contains

We are designing a model, and for performance reasons, the idea has come up that we could filter by relationship type instead of storing these attributes in the end node.

For example

(User)-[HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_01]-(SensibleInfo)
(User)-[HAS_CREDIT_CARD_MASTER_CARD_REGULATION_GDPR_DETECTED_ON_2012_05_01]-(SensibleInfo)

Would it be OK to have this much information in the relationship name?
I am curious about the performance of running Cypher queries that filter on the relationship type, for example type(r) CONTAINS "VISA".
Is that a bad idea?
The idea was to not store the attributes in the "SensibleInfo" node, because we want fast lookup times for these kinds of queries:

  • Give me all the Users that have a Visa card.
  • Give me all the Users that have a Visa card between two dates.
  • Give me all the sensible info of User "John".

Yes, it is a valid technique to "embed" a value as part of the relationship name. The trade-off is that you now have many more relationship types to manage and keep straight, and the design is not flexible: if you decide to change your model at all, you'll be in for a lot of refactoring.

Max De Marzi has an excellent video explaining this technique in greater detail.


I have created two DBs to compare the two models.
It took quite a bit of time :)

User count: 3,000 users
neo4j.conf: default
Neo4j version: 4.0.4
PC: 2010 Mac, Core 2 Duo, 8 GB memory

You can certainly search faster if you embed the card type and date in the relationship name.

Long Relationship Name (41 ms)

(:User)-[:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_01]->(:SensibleInfo)

PROFILE MATCH (n:User {name:'Braylen Wilkinson'})-[r:HAS_CREDIT_CARD_AMEX_REGULATION_GDPR_DETECTED_ON_2020_03_18]->(s:SensibleInfo)
RETURN *
Cypher version: CYPHER 4.0, planner: COST, runtime: PIPELINED. 29 total db hits in 41 ms.

Short Same Relationship Name (91 ms)

(:User)-[:HAS_CREDIT_CARD]->(:SensibleInfo)

PROFILE MATCH (n:User {name:'Braylen Wilkinson'})-[r:HAS_CREDIT_CARD]->(s:SensibleInfo {type:'AmEx',gdpr:'2020-03-18'})
RETURN *

Cypher version: CYPHER 4.0, planner: COST, runtime: PIPELINED. 66 total db hits in 91 ms.

However, I think the long relationship names are at a disadvantage when you need to specify multiple cards or time periods.
The reason is that, when you search, the relationship name in the pattern has to be a perfect match.
If the User has both Visa and AmEx, I don't know a convenient way to specify that.
I think it will be slower than the model that uses the common [:HAS_CREDIT_CARD] relationship.
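For what it's worth, Cypher's type alternation can list several exact names in one pattern, though each name still has to be spelled out in full, date included. A minimal sketch, where the particular dates are only illustrative and not taken from my measurements:

```cypher
// Sketch: match one user who has a Visa OR an AmEx card with long relationship names.
// Every alternative must be a complete, exact relationship type; there is no wildcard.
// The two dates below are illustrative assumptions.
MATCH (n:User {name: 'Braylen Wilkinson'})-[r:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_01|HAS_CREDIT_CARD_AMEX_REGULATION_GDPR_DETECTED_ON_2020_03_18]->(s:SensibleInfo)
RETURN n.name, type(r) AS relType, s
```

With one type per card and date combination, a date range would need every matching type listed this way, which is why I expect the common relationship to be easier to work with.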

The data set with long relationship names was created for 3,000 users as follows.

CREATE INDEX name FOR (n:User) ON (n.name);

CREATE (:User {name: 'Nickolas Mahoney'})-[:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_01]->(:SensibleInfo),
(:User {name: 'Holden Esparza'})-[:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_02]->(:SensibleInfo),
(:User {name: 'Danny Vaughan'})-[:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_03]->(:SensibleInfo),
(:User {name: 'Denisse Cochran'})-[:HAS_CREDIT_CARD_VISA_REGULATION_GDPR_DETECTED_ON_2012_01_04]->(:SensibleInfo)
etc.

The data set with the short relationship name was created as follows.

CREATE INDEX name FOR (n:User) ON (n.name);
CREATE INDEX type_gdpr FOR (n:SensibleInfo) ON (n.type,n.gdpr);

LOAD CSV WITH HEADERS FROM 'file:///CreditCard.csv' AS line
CREATE (:User {name: line.User})-[:HAS_CREDIT_CARD]->(:SensibleInfo {type: line.Type, gdpr: line.Date})

With the Short Same Relationship Name, the queries are very easy!

  • Give me all the Users that have a Visa card.
MATCH (n:User)-[r:HAS_CREDIT_CARD]->(s:SensibleInfo {type: 'Visa'})
RETURN DISTINCT (n.name)
  • Give me all the Users that have a Visa card between two dates.
MATCH (n:User)-[r:HAS_CREDIT_CARD]->(s:SensibleInfo {type: 'Visa'})
  WHERE s.gdpr >= '2012-01-01' AND s.gdpr <= '2012-02-28'
RETURN DISTINCT (n.name)
  • Give me all the sensible info of User "John"
MATCH (n:User {name:'John'})-[r:HAS_CREDIT_CARD]->(s:SensibleInfo)
RETURN s

Thanks guys!

I guess it's a trade-off between keeping the model readable and making it overly complex just to gain some performance. I like the version with HAS_CREDIT_CARD and indexed node attributes better. I did not know the relationship name needs to be a perfect match; I thought you could use STARTS WITH or CONTAINS with good performance, but that does not look to be the case.

I might go with a hybrid: have HAS_CREDIT_CARD and also a HAS_CREDIT_CARD_NUMBER_XX,
so I can quickly check whether a user has a specific card without having to hop to the Info node.

Big thanks

I'm wondering whether having both HAS_CREDIT_CARD and HAS_CREDIT_CARD_NUMBER_XX would be duplicated info; just having HAS_CREDIT_CARD_NUMBER_XX would suffice.
There is only one kind of relationship between a User and Info anyway, so I can just check for any relationship between the nodes without specifying the name.
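Roughly what I have in mind, as a sketch: leave the relationship type out of the pattern and read it back with type(r). The alias relType is just a name I picked here.

```cypher
// Sketch: fetch all info nodes of one user without naming the relationship type.
// This relies on the assumption that User and SensibleInfo are only ever
// connected by the credit-card relationships.
MATCH (u:User {name: 'John'})-[r]->(s:SensibleInfo)
RETURN type(r) AS relType, s
```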

Hi @maxime.blais,

Based on the queries you want to run against the data, HAS_CREDIT_CARD with the type, number, and date in the info node would be the best method. Create an index for the things you're going to search by, and that should give you optimal performance.

You only want to make rels specific if they’re being used as an exact match.

For example:

(:Person)-[:DRIVES_RED_HONDA]->(:Car)
(:Person)-[:DRIVES_BLUE_HONDA]->(:Car)

That works very well if you want red Honda drivers.
But having to check the type and do a CONTAINS would be slower. Even worse is using an OR and having to include all the HONDA rels (what happens when someone buys a green one?).
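To make the three cases concrete, here is a minimal sketch using the example types above. The comments describe the general behaviour I would expect; run PROFILE on your own data to confirm.

```cypher
// 1) Exact match: only relationships of this one type are expanded.
MATCH (p:Person)-[:DRIVES_RED_HONDA]->(c:Car)
RETURN p;

// 2) CONTAINS: the pattern has no type, so every relationship from
//    Person to Car is expanded and its type string is checked afterwards.
MATCH (p:Person)-[r]->(c:Car)
WHERE type(r) CONTAINS 'HONDA'
RETURN DISTINCT p;

// 3) OR: each Honda type is listed explicitly, so the query must be
//    edited whenever a new colour (e.g. a green one) appears.
MATCH (p:Person)-[:DRIVES_RED_HONDA|DRIVES_BLUE_HONDA]->(c:Car)
RETURN DISTINCT p;
```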

We do use this pattern in our large data set to split it up along our processing boundaries. But when crossing a boundary we also have ways to dynamically build the query over those rels to keep the performance.
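As one possible starting point for building such a query (a sketch, not our actual implementation), the built-in db.relationshipTypes() procedure can list the types that currently exist, and the application can then join the result into a `[:A|B|...]` alternation. The 'DRIVES_' and '_HONDA' filters are assumptions based on the example names above.

```cypher
// Sketch: list the existing relationship types that follow the HONDA naming scheme.
// The result can be used client-side to generate an exact-type alternation.
CALL db.relationshipTypes() YIELD relationshipType
WHERE relationshipType STARTS WITH 'DRIVES_'
  AND relationshipType ENDS WITH '_HONDA'
RETURN relationshipType
ORDER BY relationshipType
```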
