This is a Cypher query I have written against Neo4j's "movies" dataset. The dataset is a simple one, but it is enough for this use case. Please read the query below:
CALL {
match (m:Movie{title:$movie_name})
match (actors:Person)-[:ACTED_IN]->(m)
match (p:Person) WHERE p.name IN $person_names
CALL apoc.path.spanningTree(actors,{
labelFilter : "+Person",
relationshipFilter : "<FOLLOWS",
maxLevel : 4,
terminatorNodes : [p]
}) YIELD path
return [node in nodes(path) | node.name] as personNames
}
CALL {
with personNames
with reverse(personNames) as reversedPathOfPersonName
return reversedPathOfPersonName[1] as firstPersonContact
}
CALL {
with personNames
with reverse(personNames) as reversedPathOfPersonName
return reversedPathOfPersonName[2] as SecondPersonContact
}
return firstPersonContact,SecondPersonContact
For context: I am using APOC's built-in procedure "apoc.path.spanningTree" to find the connection between a "person" (or an array of "persons") and the actors who have acted in a specific movie, up to 4 hops. That is, the input "person" may be connected to a movie actor anywhere within 4 hops.
After finding such paths, I reverse them (not necessary, just for my convenience) and pick out the first contact, the second contact, and so on. Think of it as finding the chain of contacts in a LinkedIn contact list.
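To illustrate the indexing, here is a minimal sketch on a literal list instead of a real path. The names are made up: 'Tom Hanks' stands in for the movie actor where the path starts, and 'Carol' for the input person where it ends.

```cypher
// Hypothetical path as returned by the first subquery: actor first, input person last.
WITH ['Tom Hanks', 'Alice', 'Bob', 'Carol'] AS personNames
// Reverse the list so that the input person sits at index 0.
WITH reverse(personNames) AS reversedPathOfPersonName
RETURN reversedPathOfPersonName[1] AS firstPersonContact,   // 'Bob'
       reversedPathOfPersonName[2] AS SecondPersonContact   // 'Alice'
```

If the path is shorter than the index asked for, the list lookup yields null rather than an error.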
There are a couple of issues I would like solutions for:
- Optimization of the spanning tree procedure - APOC's spanning tree procedure works well for my use case, but as the number of nodes and edges in my graph increases, it takes a considerable amount of time to fetch results, because we fetch the connections for each record serially.
As an example, suppose we have 300000 names to check the contacts for and each name takes 4-5 seconds on average. Even at 4 seconds, 300000 × 4 s = 1,200,000 s, which is about 14 days.
So any suggestions on how to reduce the time for such computations, how to optimize the query, or even how to tune the APOC procedure itself are very welcome.
- Working of subqueries in Cypher - From my observation, the "CALL" subquery in Cypher does not always work as I intend. As an example, we return "personNames" from the first subquery, and "firstPersonContact" and "SecondPersonContact" from the other two.
If one of the subqueries returns "NULL", or there is no data in the database matching its filters, all the other values also come back null or empty in the final return statement. That is, if "SecondPersonContact" has no data, "firstPersonContact" will also be empty even when there is data for "firstPersonContact".
The above example might not be the most suitable one, as there is no way for "firstPersonContact" to have data when "SecondPersonContact" is empty, but I am just trying to give a general example.
We can also use the following as an example:
CALL{
MATCH (node1:LABEL_1)
where node1.name = 'something'
return node1
}
CALL{
MATCH (node2:LABEL_2)
where node2.name = 'anything'
return node2
}
return node1,node2
If either node1 or node2 is empty (i.e. there is no data for the given filter), the final return statement "return node1,node2" returns nothing. Any suggestion on how to solve this would be a big help.
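The same behaviour can be reproduced without any data. This is a minimal sketch that uses literal lists in place of the MATCH clauses; the variable names a and b are placeholders, not part of my real query:

```cypher
// First subquery produces one row.
CALL {
  UNWIND [1] AS a
  RETURN a
}
// Second subquery produces no rows, because the list is empty.
CALL {
  UNWIND [] AS b
  RETURN b
}
// No rows come back at all, even though `a` had a value.
RETURN a, b
```

So what I observe is less a null value and more the whole row disappearing once any one subquery yields nothing.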
NOTE: The queries provided are just prototypes, and I can't provide any query planner output such as "PROFILE" or "EXPLAIN". I am confident that I have stated the problem correctly, and any changes to the queries that solve it are welcome.
Also, the necessary nodes and their properties are already indexed, so please don't suggest indexing as a solution.