Performance Issue with Recommendation Query

aneeshmonn · March 30, 2020, 10:05am

I have a node label Pats with 12M nodes. Each node's title has been annotated using ga.nlp.annotate, and a direct relationship IS_RELATED_TO has been created from each Pats node to its Tag nodes.

The task is to identify similar Pats nodes based on this IS_RELATED_TO relationship, so that the similarities can be used to cluster the data.

I tried using algo.nodeSimilarity as shown below, but the call did not finish even after 48 hours:

CALL algo.nodeSimilarity('Pats|Tag', 'IS_RELATED_TO', {
direction: 'OUTGOING',
write: true,
topK: 5,
similarityCutoff: 0.8,
concurrency: 4,
writeRelationshipType: 'IS_SIMILAR_WITH_TITLE'
})
YIELD nodesCompared, relationships, write, writeRelationshipType, writeProperty, p1, p50, p99, p100

Later, I wrote the query below to compare pairs one by one and compute their Jaccard similarity:

match(sp)-[:IS_RELATED_TO]->(t:Tag)
	set sp.simProcessed=True

	with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount

	match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
	where dp.pat_id>s_pat_id and id(t1) in sourceTags

	with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity

	with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5

	match (dp)-[:IS_RELATED_TO]-(dpt)
	with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags

	with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5

	with * 
	order by jaccardSimilarity desc 
	limit 10

	create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)

The query above works perfectly for my use case and the results look promising, but it does not scale to my 12M records: it can only process 100 records per minute.
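To make the two measures in the query concrete, here is a minimal Python sketch of what overlapSimilarity and jaccardSimilarity compute for one pair of nodes. It is an illustration only, not part of the original query: the tag ids are hypothetical, the function names are made up for this example, and it assumes each tag list contains no duplicates and that the Jaccard score is the usual intersection-over-union.

```python
def overlap_similarity(source_tags, dest_tags):
    """Share of the source's tags that also appear on the destination.

    Mirrors count(distinct shared tags) / sourceTagsCount in the query.
    Assumes source_tags is non-empty, which the MATCH guarantees there.
    """
    source, dest = set(source_tags), set(dest_tags)
    return len(source & dest) / len(source)


def jaccard_similarity(source_tags, dest_tags):
    """Size of the intersection divided by the size of the union."""
    source, dest = set(source_tags), set(dest_tags)
    return len(source & dest) / len(source | dest)


# Hypothetical tag ids for a source and a destination Pats node.
source_tags = [1, 2, 3, 4]
dest_tags = [2, 3, 4, 5, 6, 7]

print(overlap_similarity(source_tags, dest_tags))  # 3 shared / 4 source tags = 0.75
print(jaccard_similarity(source_tags, dest_tags))  # 3 shared / 7 distinct tags, about 0.43
```

With these values the pair passes the overlapSimilarity > 0.5 filter but is dropped by the jaccardSimilarity > 0.5 filter, because the destination carries several tags the source does not have.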

I use apoc.periodic.iterate to run the query as shown below.

CALL apoc.periodic.iterate(
	"MATCH (sp:Pats) 
	WHERE not exists(sp.simProcessed)
	RETURN sp",
	"match(sp)-[:IS_RELATED_TO]->(t:Tag)
	set sp.simProcessed=True

	with sp,sp.pat_id as s_pat_id,collect(id(t)) as sourceTags,count(t) as sourceTagsCount

	match (t1)<-[:IS_RELATED_TO]-(dp:Pats)
	where dp.pat_id>s_pat_id and id(t1) in sourceTags

	with sp,dp,sourceTagsCount,sourceTags,count(distinct id(t1)) as overlapTagCount,(count(distinct id(t1))/toFloat(sourceTagsCount)) as overlapSimilarity

	with sp,dp,sourceTags,overlapSimilarity where overlapSimilarity>0.5

	match (dp)-[:IS_RELATED_TO]-(dpt)
	with sp,dp,sourceTags,overlapSimilarity,collect(id(dpt)) as destTags

	with sp,dp,overlapSimilarity,algo.similarity.jaccard(sourceTags,destTags) as jaccardSimilarity where jaccardSimilarity>0.5

	with * 
	order by jaccardSimilarity desc 
	limit 10

	create (sp)-[:HAS_SIMILAR_TITLE {overlapSimilarity:overlapSimilarity,jaccardSimilarity:jaccardSimilarity}]->(dp) return count(*)",
	{batchSize:1, iterateList:true,parallel:true,concurrency:3});

I have created indexes on :Pats(pat_id) and :Pats(simProcessed).

I tried to understand the PROFILE output, but it was not much help to me.

System Configuration

OS: Linux v4.19.0-041900-generic (amd64 architecture)
Cores: 8
RAM: 61GB
dbms.memory.heap.initial_size: 20000m
dbms.memory.heap.max_size: 20000m
dbms.memory.pagecache.size: 30000m

Neo4j version: Cypher version: CYPHER 3.5, planner: COST, runtime: INTERPRETED
Screenshot of PROFILE: [image: plan]
Plugins / extensions / procedures: apoc, algo, apoc.periodic.iterate

Any help on this would be appreciated.

michael.hunger · April 2, 2020, 11:41pm

Can you try the new Graph Data Science library? Its node similarity algorithm has been reimplemented to be much faster.

see:
