Node Similarity Algorithm for second and third level relationships comparison

aneeshmonn · December 17, 2019, 10:46am

I am trying to find similar nodes based on second- and third-level relationships created using the GraphAware NLP annotate text API.

My nodes (say, News) have no direct relationships to one another; their relationships/similarity run through tags three levels down.

Can we use algo.nodeSimilarity for this purpose?

I am also trying to understand the graph parameter of this algorithm; there is not much info in the wiki.

```
neo4j> CALL algo.nodeSimilarity('Match(p:News) return id(p) limit 10', 'Match p1=((p:News)-[h:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[c:CONTAINS_SENTENCE]->(s:Sentence)-[h1:HAS_TAG]->(t:Tag)) return p1', {
         graph:'cypher',
         direction: 'OUTGOING',
         write: false,
         topK: 2,
         topN: 10,
         similarityCutoff: 0.7,
         concurrency: 2
       })
       YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty, p1, p50, p99, p100;
Failed to invoke procedure `algo.nodeSimilarity`: Caused by: java.lang.IllegalArgumentException: No column "id" exists
```

Any pointers on this would be helpful.

alicia.frame · December 18, 2019, 10:06am

Yes! Node similarity is intended to help you identify how similar nodes are based on their neighbors (using the Jaccard similarity scoring function). Although node similarity is intended to work on a bipartite graph, you can use a Cypher projection to compare second and third degree neighbors (or just add a relationship in the graph directly, if you need something that is performant on large datasets).
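To make the scoring concrete, here is a minimal Python sketch of Jaccard similarity over neighbor sets. It only illustrates the idea and is not the library's implementation; the `tags_by_news` data and the `similar_pairs` helper are made up for the example, with each News node represented by the set of Tag nodes it reaches through the multi-hop path.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: the tags reachable from each News node via
# AnnotatedText -> Sentence -> Tag.
tags_by_news = {
    "news1": {"election", "economy", "europe"},
    "news2": {"election", "economy", "sports"},
    "news3": {"weather"},
}


def similar_pairs(neighbors, cutoff=0.0):
    """Score every unordered pair and keep those that share at least one neighbor."""
    ids = sorted(neighbors)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = jaccard(neighbors[left], neighbors[right])
            if score > 0 and score >= cutoff:
                pairs.append((left, right, score))
    # Highest similarity first.
    return sorted(pairs, key=lambda pair: -pair[2])


pairs = similar_pairs(tags_by_news)
print(pairs)
```

Here news1 and news2 share two of four distinct tags, so they score 0.5, while news3 shares nothing with either and is dropped.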

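For the projection, the first Cypher statement defines the pool of nodes being considered (so it needs to include both source and target nodes) and must return a column named id, which is what your error message is complaining about. The second statement defines the relationships and must return source and target columns. Try something like: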

`MATCH (p) WHERE p:News OR p:AnnotatedText OR p:Sentence OR p:Tag RETURN id(p) AS id`,
`MATCH (p:News)-[h:HAS_ANNOTATED_TEXT]->(:AnnotatedText)-[c:CONTAINS_SENTENCE]->(s:Sentence)-[h1:HAS_TAG]->(t:Tag) RETURN id(p) AS source, id(t) AS target`

Specifying graph:'cypher' tells the procedure that you're using a Cypher projection instead of a named graph.

My only caution with the Cypher projection is that it might be rather slow on a large graph. If that becomes a problem, you can speed things up by adding a direct relationship between News and Tag and then using the huge graph loader.

aneeshmonn · December 30, 2019, 5:03pm

Thank you Alicia.

I did figure out the Cypher projection, but as you said, performance lags.

I have 20 million primary nodes and close to 1 million related nodes.

I am looking for a performance-oriented way to identify communities and then compute node similarity as well.

alicia.frame · December 30, 2019, 5:30pm

Instead of using a Cypher projection, create the relationship directly in your database, for example (p:News)-[a:AssociatedWith]->(t:Tag). Then you can bypass Cypher and use the huge graph loader directly: CALL algo.nodeSimilarity('News|Tag', 'AssociatedWith'). You'll notice that this can be orders of magnitude faster (for any algorithm, not just the similarity ones).
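As a rough illustration of what that direct relationship captures, the Python sketch below collapses the three hops into (News, Tag) pairs. The edge lists and the `collapse` helper are hypothetical stand-ins for the graph; in the database itself you would create the relationship with a Cypher write query instead.

```python
from collections import defaultdict

# Hypothetical edge lists, one per hop of the path
# News -> AnnotatedText -> Sentence -> Tag.
has_annotated_text = [("news1", "text1"), ("news2", "text2")]
contains_sentence = [("text1", "s1"), ("text1", "s2"), ("text2", "s3")]
has_tag = [("s1", "election"), ("s2", "economy"), ("s2", "election"), ("s3", "economy")]


def adjacency(edges):
    """Build a source -> set(targets) mapping from (source, target) pairs."""
    adj = defaultdict(set)
    for source, target in edges:
        adj[source].add(target)
    return adj


def collapse(*hops):
    """Follow the hops in order and return distinct (start, end) pairs."""
    adjs = [adjacency(hop) for hop in hops]
    direct = set()
    for start in adjs[0]:
        frontier = {start}
        for adj in adjs:
            # Advance every node in the frontier by one hop.
            frontier = {nxt for node in frontier for nxt in adj.get(node, ())}
        direct.update((start, end) for end in frontier)
    return sorted(direct)


associated_with = collapse(has_annotated_text, contains_sentence, has_tag)
print(associated_with)
```

Each pair corresponds to one AssociatedWith relationship. Duplicates are removed, so a tag that appears in several sentences of the same News item is counted once.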

Other performance tips include:

- For similarity algorithms, try preprocessing with algo.unionFind (aka WCC), which identifies all the disjoint subgraphs, and then run similarity over the separate partitions. This saves you from comparing nodes that have no neighbors in common.
- Load a named graph once (with algo.graph.load) and then run your algorithms against that named graph. This saves you the time spent loading each time you want to run an algorithm.
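The partitioning idea in the first tip can be sketched in plain Python with a small union-find. This is only an illustration of the concept, not how algo.unionFind is implemented, and the edge list is made up.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, shortening the path as we go (path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def components(edges):
    """Group nodes into weakly connected components (edge direction is ignored)."""
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components
    groups = defaultdict(set)
    for node in parent:
        groups[find(parent, node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical News -> Tag edges.
edges = [("news1", "election"), ("news2", "election"), ("news3", "weather")]
parts = components(edges)
print(parts)
```

Nodes in different components cannot share a neighbor, so similarity only needs to be computed within each component. In this example news3 is never compared with news1 or news2.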

If you're using Community Edition (CE), there will be some limits on how performant it can be. All the algorithms are implemented to run in parallel, but there is a four-core limitation for CE users.

aneeshmonn · December 30, 2019, 5:35pm

Yes, I am on CE. Once the PoC is a success, we will be moving to EE.

Though I have only mentioned News and Tag, I have a few more node types that these are connected to, such as location, reporter, news channel, etc.

As per your recommendation, creating a new relationship type that links all these node types together will improve performance, right? Let me take a look at that approach.

sinan · May 19, 2020, 5:43pm

I have a similar case with recipes and ingredients, plus a taxonomy of ingredients and recipe categories, and some more. @aneeshmonn Did you manage to solve your use case successfully? How did you finally do it? I am curious.
