Using Jaccard Similarity in enterprise architecture to define logical components of domains

Dear all,
In the 1980s, the Information Engineering Methodology used affinity analysis to produce a Process Affinity Matrix that defined logical applications. It would be great to see whether this can be reproduced, visually represented in graph form, and extended across more domains of enterprise architecture.

Enterprise architecture is broken down into various domains such as, depending upon the framework used, Process/Function, Data, Application, Technology, Security and Service. Domains have interactions with other domains and the way portions of one domain interact with portions of another domain can be helpful in grouping the contents of both portions. For example, the business processes that interact with the same data can be grouped together and considered to be a logical application. Conversely, data that is used by similar business processes can be grouped into logical subject areas. Analysing how similarly things interact can be extended to/between organisation unit, location, physical application, database, server and so on.
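
To make the grouping idea concrete, here is a minimal sketch in plain Python (not Cypher) of how processes that touch the same data end up with a high Jaccard score. The CRUD matrix and every process and entity name in it are made up for illustration; they are not taken from the test data.

```python
from itertools import combinations

# Hypothetical CRUD matrix: process -> {entity: access letters}. All names are made up.
crud = {
    "Take Order":    {"Customer": "R", "Order": "C", "Product": "R"},
    "Amend Order":   {"Customer": "R", "Order": "RU", "Product": "R"},
    "Pay Supplier":  {"Supplier": "R", "Invoice": "RU"},
    "Match Invoice": {"Supplier": "R", "Invoice": "R", "Order": "R"},
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Reduce each process to the set of entities it touches, ignoring the access type.
touched = {process: set(access) for process, access in crud.items()}

# Score every unordered pair of processes.
scores = {
    (p1, p2): jaccard(touched[p1], touched[p2])
    for p1, p2 in combinations(sorted(touched), 2)
}

# Print the pairs from most to least similar.
for (p1, p2), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{p1:14} {p2:14} {score:.2f}")
```

In this toy matrix the two order processes touch exactly the same entities and so score 1.0, which is the sense in which they could be considered one logical application. Swapping the roles (entities described by the set of processes that use them) gives the subject-area grouping instead.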

So what? Why bother? Firstly, it can unearth new insights or add greater definition and detail to already understood facets of a business. Secondly, and more practically, it can be used to scope a suite of projects in a programme of business change. How so? Very often the design and redesign of organisation structures, applications, even the use of office space, is done intuitively. Wouldn’t it be good for the components of ideal target operating models (TOMs) to be defined mathematically*? And, accepting that there are often practical and political considerations to be accommodated, for the projects needed to achieve the TOM to be equally clear?
(* Michael Hunger has advised that the Jaccard Similarity algorithm is the starting point)

It would be great to see if clusters in enterprise architecture can be visually represented in graph form where the greater the affinity the closer together the nodes are. Whilst many organisations invest a lot of information into enterprise architecture applications, none of them can provide the insight I am suggesting.

I can make available anonymised test data (it starts with a CRUD matrix), some basic load scripts and a basic Jaccard Similarity algo script that seems to work.

For later consideration:
a) some things are structured like a ragged hierarchy, such as an organisation hierarchy or process hierarchy (i.e. process comprising levels of sub-process) or a network (e.g. a sub-process that is invoked in more than one process); and
b) weightings. For example, a process that creates records/nodes has a greater significance than those only reading records/nodes. Greater clarity can be achieved if weightings can be accommodated.
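
On point (b), one possible way to accommodate weightings is the weighted form of Jaccard: the sum of the per-entity minimum weights divided by the sum of the per-entity maximum weights. The sketch below is plain Python, and the weight values and process names are assumptions chosen only to show the effect, not a recommendation.

```python
# Hypothetical weights: creating outranks updating/deleting, which outrank reading.
WEIGHTS = {"C": 4, "U": 3, "D": 3, "R": 1}

def access_weight(access):
    """Weight of an access string such as "RU": the heaviest letter wins."""
    return max(WEIGHTS[letter] for letter in access)

def weighted_jaccard(a, b):
    """Sum of per-entity minimum weights over sum of per-entity maximum weights."""
    entities = set(a) | set(b)
    lo = sum(min(a.get(e, 0), b.get(e, 0)) for e in entities)
    hi = sum(max(a.get(e, 0), b.get(e, 0)) for e in entities)
    return lo / hi if hi else 0.0

# Made-up processes: both touch the same two entities, but in different ways.
take_order = {"Order": access_weight("C"), "Customer": access_weight("R")}
view_order = {"Order": access_weight("R"), "Customer": access_weight("R")}

print(weighted_jaccard(take_order, view_order))  # 2 / 5
print(weighted_jaccard(take_order, take_order))  # identical profiles
```

Unweighted, these two processes would score 1.0 because they touch the same entities. With the assumed weights the score drops to 2/5, reflecting that one of them creates orders while the other only reads them.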

Frankly, I am not a developer, and the documentation for Cypher and the algos means my progress towards making this a useful public resource is dismal. Hints, tips, contributions or even a straightforward solution will genuinely have my gratitude and acknowledgement.
Yours aye,


Hi Douglas,

I am interested in playing with this concept. Please send me anonymised test data (it starts with a CRUD matrix), some basic load scripts and a basic Jaccard Similarity algo script.



And I'm interested to see where this goes! :smile:

Hi Kamal and Karin,
Nice to (virtually) meet you.
I need somewhere to put some CSV files for the base data (170 KB in total). If there is a way of doing it on this thread then I can't see it. Perhaps you can suggest a location.
The load scripts are:

//Load the set of entity types (sorry to use relational terminology in a graph forum :)
LOAD CSV WITH HEADERS FROM "file:///C:Logical_Entity_Types.csv" AS row
CREATE (:Entity_Type { Name: row.Logical_Entity_Type});

//Load the set of Logical Business Processes
LOAD CSV WITH HEADERS FROM "file:///C:Logical_Business_Process.csv" AS row
CREATE (:Logical_Business_Process { Name: row.Logical_Process_Name, Optional_Indicator: row.Optional_Indicator, Repeatable_Indicator: row.Repeated_Indicator});

//Load how Entity Types relate to (are accessed by) Logical Business Processes
LOAD CSV WITH HEADERS FROM "file:///C:2way_Process_Entity_Access_triple.csv" AS row
MERGE (e:Entity_Type {Name: row.Logical_Entity_Types})
MERGE (p:Logical_Business_Process { Name: row.Process})
WITH e, p, row
CALL apoc.create.relationship(e, row.Access_Type, {}, p) YIELD rel
RETURN count(rel);

//Load how business processes relate to each other, i.e. child of, succeeds, excludes and concurrent with. Note that some Business Processes have no relation to others and so are standalone.
LOAD CSV WITH HEADERS FROM "file:///C:2way_Logical_Business_Process_Relationships.csv" AS row
MERGE (p1:Logical_Business_Process {Name: row.node1})
MERGE (p2:Logical_Business_Process { Name: row.node2})
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN count(rel);

For the Jaccard Similarity algorithm, I use the following:

MATCH (p1:Logical_Business_Process)-[:read|:write|:read_write]-(e1:Entity_Type)
WITH p1, collect(id(e1)) AS p1entity_type
MATCH (p2:Logical_Business_Process)-[:read|:write|:read_write]-(e2:Entity_Type) WHERE p1 <> p2
WITH p1, p1entity_type, p2, collect(id(e2)) AS p2entity_type
RETURN p1.Name AS from,
p2.Name AS to,
algo.similarity.jaccard(p1entity_type, p2entity_type) AS similarity
ORDER BY similarity DESC
//Where the relation between Logical_Business_Process and Entity_Type is "read", "write" or "read_write" and p1 and p2 are not trying to reference the same node (i.e. the same Logical_Business_Process) …

The following links should allow the sample data to be open to everyone. I have not done this before so please let me know if it does not work.

Yours aye,

Hi Doug,

Thanks for sharing the scripts.

Maybe Karin can recommend a location for uploading the data files.


Hi there,

There are a few different places that we typically use to upload files:
a) Amazon S3 or Google Cloud Storage. Just make sure the files are public or you'll have to find a way to put auth tokens in the URLs.
b) GitHub - gives you a raw link to files that can be accessed from LOAD CSV
c) Google Spreadsheets. Upload the CSV to a spreadsheet (which converts it to Google Spreadsheets format) and then get the CSV export URL. More on Rik's blog.

I typically use (a), but all 3 should work.


Thanks, @ryan.boyd.
Also, you may have already seen these, but I wanted to share some resources and documentation on Jaccard in case you haven't. :blush:

Did this blog post and online meetup:

Hi Doug,

Please send me your email address so that I can send a link to one of my folders on OneDrive and you can upload your data files.

Hi Doug,

Any chance of getting data? Here is my email:

Thanks for sharing these. I found them interesting, but I was not able to translate them into something usable for me. That's my fault: I am struggling to grasp Cypher etc.

Anything we can help with on our end, Doug?

The following links should allow the data to be open to everyone. I have not done this before so please let me know if it does not work.

Yours aye and enjoy!


Thanks, Doug. The links worked and clicking on each link automatically downloaded the respective file.


Hi Doug,
To start with, I decided to use the data from Process_Entity_Access to find the business processes common to two or more entities. The business processes have a parent-child-grandchild hierarchy, and it was interesting to note that most of the entities have access to the child/grandchild processes. Only a few entities have access to all parent/child/grandchild processes.

Also, I changed the relationship 'child_of' to 'child' so as to create parent->child->grandchild as shown below.

Entities sharing processes:
MATCH (e:Entity)-[:read]->(p:Process)

Entity with access to parent/child processes:

MATCH (p:Process)-[:child]-(p1:Process)-[:read]-(e:Entity)
WHERE e.eid = "17" AND = "13"
RETURN p, p1, e LIMIT 10;

Ran the Jaccard similarity and the result is:

MATCH (e1:Entity)-[:read]-(p1:Process)
WITH {item:id(e1), categories: collect(id(p1))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.0})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).Name AS from, algo.asNode(item2).Name AS to, intersection, similarity
ORDER BY intersection DESC;

These are my observations so far. Let me know your thoughts on this.

ps: I added id's to Entities (eid) and Process (pid)

Good to hear from you. Thank you for taking the trouble to describe your process.
You have observed the ragged hierarchy of processes. Strictly, it is a network, because a small handful of business processes are invoked as a subprocess of more than one process (although I am sure I can easily find someone who will say that they would not be sub-processes!), but that’s beside the point. You have also spotted that entities can be created, read or updated by more than one process.

I think I got to a similar point but by a different means and that gives me confidence in what I have done so far. It is from this point, that what little ability and understanding I have fails me. My ambition would be to see a diagrammatic representation of the results where nodes of high similarity are close together, nodes of high dis-similarity are far apart, such that there would then be clusters of nodes with visible separation between clusters and, I daresay, there would be a few outliers dotted about the place.

I think the “similarity” score would be central to doing that but I can’t fathom the documentation/meet-ups/videos to get any further. If you can give me some hints on how I could achieve that then I should be grateful.
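
One simple way to get from pairwise similarity scores to visibly separate clusters, sketched here in plain Python rather than Cypher, is to treat every pair scoring at or above a chosen threshold as linked and then collect the connected groups. The scores, the item names and the 0.8 threshold below are all made up, and this is only one of several possible clustering approaches; it does not do the graph layout itself.

```python
def clusters(scores, threshold):
    """Group items linked, directly or indirectly, by a score >= threshold."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if it is new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        ra, rb = find(a), find(b)  # also registers items with no strong link
        if score >= threshold and ra != rb:
            parent[ra] = rb        # merge the two groups

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    # Largest cluster first, then alphabetical, so the output is stable.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))

# Hypothetical similarity scores between five made-up processes.
scores = {
    ("A", "B"): 0.9, ("B", "C"): 0.85, ("A", "C"): 0.4,
    ("D", "E"): 0.8, ("A", "D"): 0.1, ("C", "E"): 0.05,
}
print(clusters(scores, 0.8))
```

Note that A and C land in the same cluster even though their own score is only 0.4, because both are strongly linked to B. Whether that chaining is desirable is a design choice; raising the threshold splits clusters apart and eventually leaves every item as an outlier.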

As an aside, I am struggling to understand some of your results. That is a comment on the quality of the Cypher documentation and not on your notes. I read the words in the manual but they have no meaning to me. I don't understand what "count1", "count2" and "intersection" represent.
Yours aye,

Hi Doug,

Thanks for your detailed reply. Regarding your questions on count1, count2 and intersection, here is some info.

In the Jaccard similarity code, item1 and item2 refer to Entity1 and Entity2.

Count1 gives the total number of Processes associated with a given Entity. Example: Entity "Adjudication Outcome Option" has access to 20 Processes.

Count2 is the same as Count1, but for the second Entity.

Intersection is the number of Processes that are shared between the two Entities. In the second picture above you can see (in the center) 10 Processes shared by two Entities.

Similarity = (number of shared processes) / (total number of distinct processes across both entities), and in this case it is 10/30 ≈ 0.333
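
The same arithmetic can be checked with a few lines of plain Python. The process ids are made up; the only figures taken from the example are the 20 processes for the first entity, the 10 shared processes and the 30 distinct processes, which together imply that the second entity also has 20.

```python
# Made-up process ids: each entity is used by 20 processes and 10 are shared.
entity1_processes = set(range(1, 21))   # processes 1..20
entity2_processes = set(range(11, 31))  # processes 11..30

count1 = len(entity1_processes)
count2 = len(entity2_processes)
intersection = len(entity1_processes & entity2_processes)
# Distinct processes across both entities: count1 + count2 - intersection.
union = len(entity1_processes | entity2_processes)
similarity = intersection / union

print(count1, count2, intersection, union, round(similarity, 3))
# prints: 20 20 10 30 0.333
```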

Hope this answers your questions.

Hi, I am going to have another go at getting to the bottom of this! And I am still not a developer, I am a business architect now in a massive petrochem company wanting to show some extremely senior architects the value of graphs to help define aspects of their architecture.
So I should be grateful for more thoughts.

I still want to show the nodes with the label of 'Business Process', and in particular those which have a higher similarity (in their use of the other type of node, Entity Type). I think it needs relationships called 'Similarity' to be created, where the scores are assigned to the relationships between nodes and those scores are updated each time I run it (rather than creating a new relationship).
Thereafter, I can run a query selecting nodes above a value (e.g. 0.8).
As a nice-to-have, is there a way of translating the value into the relative thickness of the relationship line, such that the bigger the value the thicker, or shorter, the line, in order to see the stronger similarities at a glance?
I am still on v3.5.16.
Kind regards,
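
The three requests above (update a score in place rather than adding a second relationship, select only pairs above a value such as 0.8, and turn the value into a line thickness) can be sketched in plain Python as follows. This is not Cypher and not a Neo4j feature: the function names, the width range and the scores are all hypothetical. In Cypher, MERGE on the relationship followed by SET on its property is the usual way to get the same update-in-place behaviour.

```python
def pair_key(a, b):
    """Order-independent key so (a, b) and (b, a) share one edge."""
    return tuple(sorted((a, b)))

def upsert_similarity(edges, a, b, score):
    """Overwrite the score if the edge exists, otherwise create it."""
    edges[pair_key(a, b)] = score

def line_width(score, min_width=1.0, max_width=8.0):
    """Map a score in [0, 1] linearly onto a drawing width (assumed range)."""
    return min_width + (max_width - min_width) * score

# Made-up processes and scores.
edges = {}
upsert_similarity(edges, "Take Order", "Amend Order", 0.75)
upsert_similarity(edges, "Amend Order", "Take Order", 0.9)   # rerun: updates, no duplicate
upsert_similarity(edges, "Take Order", "Pay Supplier", 0.2)

# Keep only the strong similarities, then derive a width for each.
strong = {pair: s for pair, s in edges.items() if s >= 0.8}
for (a, b), s in strong.items():
    print(a, b, s, line_width(s))
```

Keying each edge on the sorted pair is what prevents a rerun from creating a second relationship in the opposite direction; the linear width mapping is the simplest choice and could be swapped for any other scale.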