Algo.similarity.jaccard.stream takes more than 3 minutes


(Alexpappasc) #1

Hi Michael,
I am trying to calculate Jaccard similarity, but it takes more than 3 minutes.

MATCH (u:User)-[:HAS]->(f:Feature) RETURN count(*);

count(*) : 106384

MATCH (u:User) RETURN count(*);

count(*) : 15198

:schema

Indexes
ON :Feature(id) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)

Constraints
ON ( feature:Feature ) ASSERT feature.id IS UNIQUE
ON ( user:User ) ASSERT user.id IS UNIQUE

dbms.memory.heap.initial_size=1g
dbms.memory.heap.max_size=8g

Following the documentation:

MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN algo.getNodeById(item1).id AS startNode, algo.getNodeById(item2).id AS endNode, similarity
ORDER BY similarity DESC limit 10

Started streaming 10 records after 225411 ms and completed after 225414 ms.

Is there any way to improve the performance?
This dataset is not even close to the real amount of data.


(Michael Hunger) made this topic public #3

(Michael Hunger) #4

How long does the query run to compute the data?

MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
RETURN size(data)

You could increase your cut-off or add a degree-filter or add a top-k (like top-5 pairs)

how many rows/pairs are returned in total and how long does this run?

MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN count(*)

How many cpu's do you have?


(Alexpappasc) #5

Hi Michael,

Thank you for the response. I needed to use the cut-off, degree-filter, and top-k to speed up the query.
But I also need to find similarity between nodes with different labels.

MATCH (m:User{id:"34er796-23sa2963-4235534hj34"})
MATCH (m)-[:HAS]->(g:Feature)<-[:HAS]-(other:Post)
WITH m, other, COUNT(g) AS interscn
MATCH (m)-[userHas:HAS]->(mg:Feature) 
WITH m, other, interscn, {item:id(m), categories: collect(id(mg))} as userData
MATCH (other)-[postHas:HAS]->(og:Feature)
WITH m, other, interscn, userData, {item:id(other), categories: collect(id(og))} as aData
WITH m, other, interscn, collect(userData) as udata, collect(aData) as adata
WITH udata + adata as data
CALL algo.similarity.jaccard.stream(data, {concurrency:8, similarityCutoff: 0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1) AS startNode, algo.getNodeById(item2) AS endNode, similarity
ORDER BY similarity DESC limit 100;

Is there any better approach? Should I add a common label to all the node types between which I need to compute similarity metrics?


(Michael Hunger) #6

Would you mind answering my other questions too? So we know more about your setup?


(Alexpappasc) #7

Hi Michael,

Currently the setup is not the same, because the Neo4j data is managed by other services.

MATCH (u:User) RETURN count(*);

"count(*)"││11804

MATCH (u:User)-[:HAS]->(f:Feature) RETURN count(*);

"count(*)"││64897

ID Allocation
Node ID: 21148
Property ID: 1273369
Relationship ID: 834959
Relationship Type ID: 5

MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.01})
YIELD item1, item2, similarity
RETURN count(*)

"count(*)"││28255867

MATCH (p:User)-[:HAS]->(f:Feature)
WITH {item:id(p), categories: collect(id(f))} as userData
WITH collect(userData) as data
RETURN size(data)

"size(data)"││11804

$ lscpu

Architecture: x86_64 
CPU op-mode(s): 32-bit, 64-bit 
Byte Order: Little Endian 
CPU(s): 8 
On-line CPU(s) list: 0-7 
Thread(s) per core: 2 
Core(s) per socket: 4 
Socket(s): 1 
NUMA node(s): 1 
Vendor ID: GenuineIntel 
CPU family: 6 
Model: 158 
Model name: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz

(Mark Needham) #8

Hey,

So this algorithm does a pairwise comparison of all the {item:id(p), categories: collect(id(f))} values that are passed in. So in this example it's doing:

11804^2 / 2 = 69,667,208 comparisons

And then it only returns the ones with a similarity > 0.01, so that's how we get down to 28m records.
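A quick back-of-the-envelope check in Python (a standalone illustration, not part of the procedure; it uses the exact n(n-1)/2 pair count rather than the n^2/2 approximation above):

```python
# Rough cost model for an all-versus-all similarity stream:
# every item is compared against every other item exactly once.

def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n*(n-1)/2 (~ n^2/2)."""
    return n * (n - 1) // 2

def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A intersect B| / |A union B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

print(pairwise_comparisons(11804))    # ~69.7M comparisons for 11,804 users
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 common / 4 total = 0.5
```

This is why the cut-off only reduces the output size, not the number of comparisons: all ~69.7M pairs are still evaluated before filtering down to 28M results.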

What do you want to do with the similarities once they're computed? I think the answer to that will make it easier to figure out what to suggest.

Cheers, Mark


(Alexpappasc) #9

Hi Mark,

Thank you for the details. In the main service I don't need similarity with a 0.01 cutoff; if I set it to 0.7, the query takes 1.5 seconds. But I also need to run a scheduled process in which I compute similarity with an arbitrary cutoff, in order to create certain paths and do some processing afterwards. In more detail, I have 2 cases:

  1. I need to compute similarity for a list of nodes against all other nodes, so I don't need to compute similarity all versus all. Any batch-processing approach, e.g. iterating with periodic commit, might help.

  2. I need to find similarity between nodes with different labels. I could set up a common label, but I want to avoid adding extra labels to my nodes.

Best,
Alex


(Mark Needham) #10

At the moment the only way to do this would be to use the function rather than the procedure, but then that's all working single threaded unless you parallelise with APOC, and even then I guess it wouldn't be as fast as if we can do it in a procedure.

Could you show me an example of how you'd like it to work and then maybe we can update the procedure to do it. Right now it assumes you want to compare all values in the list against each other - how would you specify the list of nodes that you want to compare with everybody else? I guess we could have the procedure take in a collection of ids, kind of like what we do for Personalized PageRank with the sourceNodes? https://neo4j.com/docs/graph-algorithms/current/algorithms/page-rank/#_personalized_pagerank

Can you explain this a bit more? It should be possible to do this - the procedure doesn't care what labels you have, it only cares about lists of numbers...


(Alexpappasc) #11

Thank you Mark, your example points to what I am looking for in the first case. It would be great to have this addition. A collection of ids, like Personalized PageRank's sourceNodes, would make it very easy.

For the second case:
we have users with relationships to tags, and posts with relationships to tags.
I need to compute similarity between users and posts with common tags (User and Post being the labels of the nodes), but not to compute similarity user-to-user or post-to-post.

If the similarity measures could take as a parameter two collections of ids, or two MATCH query results (any approach that can distinguish two sets of nodes), with similarity computed only between those collections, then it would only iterate:

For each a of collection1
For each b of collection2

like a bipartite graph. I know this makes it more complex, though. For now I can compute the results with the query written above, or calculate the Jaccard index with Cypher only between nodes with different labels.

The second case also covers the first one, where we have a collection of ids versus all (which is again a collection of ids).
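The bipartite iteration described above can be sketched in plain Python (purely illustrative; the function and data names are made up for this example, not part of the library):

```python
# Bipartite similarity: compare every source item against every target
# item, but never source-vs-source or target-vs-target.

def jaccard(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def bipartite_similarities(sources: dict, targets: dict, cutoff: float = 0.0):
    """sources/targets map item id -> set of category ids."""
    results = []
    for s_id, s_cats in sources.items():      # for each a of collection1
        for t_id, t_cats in targets.items():  # for each b of collection2
            sim = jaccard(s_cats, t_cats)
            if sim >= cutoff:
                results.append((s_id, t_id, sim))
    return results

users = {"u1": {1, 2, 3}, "u2": {3, 4}}
posts = {"p1": {2, 3}, "p2": {5}}
print(bipartite_similarities(users, posts, cutoff=0.5))
```

With two sets of sizes m and n this does only m*n comparisons instead of (m+n)^2 / 2, which is the whole point of the request.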


(Mark Needham) #12

Hey,

Yeh that makes sense. Sorry for the delayed reply - I've had this tab open and have been thinking about the best way to do this. If we did something like this, I think it'd work:

MATCH (user:User)-[:HAS_TAG]->(tag)
WITH {item:id(user), categories: collect(id(tag))} as data
WITH collect(data) AS userTags

MATCH (post:Post)-[:HAS_TAG]->(tag)
WITH userTags, {item:id(post), categories: collect(id(tag))} as data
WITH userTags, collect(data) AS postTags

WITH userTags, postTags, 
     [userTag in userTags | userTag.item] AS sourceIds,
     [postTag in postTags | postTag.item] AS targetIds

CALL algo.similarity.jaccard.stream(userTags + postTags, {topK: 1, similarityCutoff: 0.0, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY from

And then as you suggested - if you don't specify sourceIds or targetIds it'll assume you want to use all ids.
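That defaulting behaviour can be sketched in Python (my own illustration of the proposed semantics; `similarity_stream` and its parameter handling are assumptions about the design, not the library's implementation):

```python
# Proposed sourceIds/targetIds semantics: given one combined list of
# {item, categories} entries, restrict comparisons to source x target
# pairs; when either list is omitted, fall back to all item ids.

def jaccard(a, b):
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_stream(data, source_ids=None, target_ids=None, cutoff=0.0):
    by_id = {d["item"]: d["categories"] for d in data}
    sources = source_ids if source_ids is not None else list(by_id)
    targets = target_ids if target_ids is not None else list(by_id)
    for s in sources:
        for t in targets:
            if s == t:
                continue  # never compare an item with itself
            sim = jaccard(by_id[s], by_id[t])
            if sim >= cutoff:
                yield (s, t, sim)

data = [
    {"item": 1, "categories": [10, 11]},  # a "user"
    {"item": 2, "categories": [11, 12]},  # a "post"
]
print(list(similarity_stream(data, source_ids=[1], target_ids=[2])))
```

Omitting both id lists degrades gracefully to the current all-versus-all behaviour, so existing queries would keep working unchanged.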

For the weight based similarity procedures we have support for Cypher statements. e.g. for Cosine similarity - https://neo4j.com/docs/graph-algorithms/current/algorithms/similarity-cosine/#algorithms-similarity-cosine-cypher-projection

WITH "MATCH (person:Person)-[likes:LIKES]->(c)
      RETURN id(person) AS item, id(c) AS category, likes.score AS weight" AS query
CALL algo.similarity.cosine(query, {
  graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

In terms of keeping the functions reasonably uniform I guess we could make a version that takes in a single query + target and source ids like this:

WITH "MATCH (user:User)-[:HAS_TAG]->(tag)
      RETURN id(user) AS item, id(tag) AS category
      UNION
      MATCH (post:Post)-[:HAS_TAG]->(tag)
      RETURN id(post) AS item, id(tag) AS category" AS query
MATCH (p:Post)
WITH query, collect(id(p)) as targetIds
MATCH (u:User) 
WITH query, targetIds, collect(id(u)) AS sourceIds
CALL algo.similarity.cosine(query, {
  graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true, sourceIds: sourceIds, targetIds: targetIds
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

That doesn't actually look very nice though to be honest! Perhaps instead we could have the sourceIds and targetIds treated as a Cypher query too if you have graph: "cypher"?

WITH "MATCH (user:User)-[:HAS_TAG]->(tag)
      RETURN id(user) AS item, id(tag) AS category
      UNION
      MATCH (post:Post)-[:HAS_TAG]->(tag)
      RETURN id(post) AS item, id(tag) AS category" AS query,
      "MATCH (u:User) RETURN id(u) AS item" as sourceQuery,
      "MATCH (p:Post) RETURN id(p) AS item" as targetQuery
CALL algo.similarity.cosine(query, {
  graph: 'cypher', topK: 1, similarityCutoff: 0.1, write:true, sourceIds: sourceQuery, targetIds: targetQuery
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p95
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95

I guess to start we can just do the no Cypher query version though...


(Alexpappasc) #14

Hi Mark,

The no-Cypher way looks great and is easy to understand.
Thank you for all this effort... Keep up the great support!

Best,
Alex


(Mark Needham) #15

Hi Alex,

I've been working on this today and I've made some progress. The code around this area splits based on the concurrency value provided and whether you want to compute topK or not. So there are 4 code paths:

concurrency: 1, topK: null
concurrency: 1, topK: provided
concurrency: > 1, topK: null
concurrency: >1, topK: provided

I've got it working for the first option at the moment, and I'm going to work on the other 3 now. Those 3 all share one bit of code, so in theory once I've got that function updated it should just work (TM).

I gave it a try on a recipes dataset that I've been playing with. You can load that data by running:

:play recipes

in the Neo4j browser, if you want to.

This is what the query looks like:

MATCH (recipe:Recipe)-[:COLLECTION]->(collection)
WITH {item:id(collection), categories: collect(id(recipe))} as data
WITH collect(data) AS collectionRecipes

MATCH (recipe:Recipe)-[:CONTAINS_INGREDIENT]->(ingredient)
WITH collectionRecipes, {item:id(ingredient), categories: collect(id(recipe))} as data
WITH collectionRecipes, collect(data) AS ingredientRecipes

WITH collectionRecipes, ingredientRecipes, 
     [value in collectionRecipes | value.item] AS sourceIds,
     [value in ingredientRecipes | value.item] AS targetIds

CALL algo.similarity.jaccard.stream(collectionRecipes + ingredientRecipes, {similarityCutoff: 0.0, concurrency:1, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN labels(from) AS fromLabels, from.name AS from, labels(to) AS toLabels,  to.name AS to, similarity
ORDER BY similarity DESC
LIMIT 100

So we're computing similarities from collections -> ingredients based on their common recipes, but we shouldn't ever see collection -> collection or ingredient -> ingredient pairs.

I can send you a jar with the code as it is now, or you can play with it once I've got the other options working too. I'll probably not merge it into the library until next week, as I want @michael.hunger to review it and he's on vacation this week.

Cheers, Mark


(Mark Needham) #16

I'm working on it over here in case you wanted to follow - https://github.com/neo4j-contrib/neo4j-graph-algorithms/pull/820

I have a version which seems to work for all the variants that I described above. I've only implemented it for Jaccard at the moment, but will make it work for all the similarity algorithms.

Here's a fun version that computes the most similar ingredients to the Gooseberry collection:

MATCH (recipe:Recipe)-[:COLLECTION]->(collection)
WITH {item:id(collection), name: collection.name, categories: collect(id(recipe))} as data
WITH collect(data) AS collectionRecipes

MATCH (recipe:Recipe)-[:CONTAINS_INGREDIENT]->(ingredient)
WITH collectionRecipes, {item:id(ingredient), categories: collect(id(recipe))} as data
WITH collectionRecipes, collect(data) AS ingredientRecipes

WITH collectionRecipes, ingredientRecipes, 
     [value in collectionRecipes WHERE value.name = "Gooseberry" | value.item] AS sourceIds,
     [value in ingredientRecipes | value.item] AS targetIds

CALL algo.similarity.jaccard.stream(collectionRecipes + ingredientRecipes, {similarityCutoff: 0.0, topK:10, sourceIds: sourceIds, targetIds: targetIds})
YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.getNodeById(item1) AS from, algo.getNodeById(item2) AS to, similarity
RETURN labels(from) AS fromLabels, from.name AS from, labels(to) AS toLabels,  to.name AS to, similarity
ORDER BY similarity DESC
LIMIT 100

It'd be cool if you have a chance to test it out to make sure it does what you want. You can grab the jar from - https://github.com/neo4j-contrib/neo4j-graph-algorithms/files/2879346/graph-algorithms-algo-3.4.12.3.zip

I had to zip it so that GitHub was happy, but just unzip that and replace your existing jar in the plugins folder.


(Alexpappasc) #17

Hi Mark, that was really fast! Thank you once again.
Just tested it, and it works as expected with really good performance.
Query profile:
Old implementation: 190405 total db hits
Plain Cypher: 190632 total db hits
New implementation: 7079 total db hits

Best,
Alex