How best to do parallel processing

christian · January 6, 2020, 3:25am

Thanks Stefan, and happy New Year!

Thanks to your example above, as well as @simon's post "Using parallel queries to sum value of bitcoin outputs connected to bitcoin address nodes", which I found a little later, I got the following apoc.cypher.mapParallel2() Cypher to work:

MATCH (u:EffortUser) 
WITH collect(u) AS users 
CALL apoc.cypher.mapParallel2("
    MATCH (_)-[r]-(:EffortObject)
    WHERE r.Effort = 'yes' and r.TimeEvent >= '2016-01-01' and r.TimeEvent <= '2018-12-31' 
    RETURN _.Name as user, date(datetime(r.TimeEvent)).year as date, count(distinct r.IdUnique) as count",
    {}, users, 4) YIELD value 
RETURN value.user as user, value.date as date, sum(value.count) as count 
ORDER BY user, date
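
As a mental model only (this is not APOC's actual implementation), the query above follows a partition-then-aggregate pattern: the collected users are split into 4 partitions, the inner fragment runs on each partition in parallel, and the outer RETURN sums the partial counts. A minimal Python sketch of that pattern, using made-up sample data and hypothetical helper names (partition, fragment, map_parallel):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up sample data: (user, TimeEvent, IdUnique) for relationships with Effort = 'yes'.
EVENTS = [
    ("alice", "2016-03-01", 1),
    ("alice", "2016-07-15", 2),
    ("alice", "2017-01-09", 3),
    ("bob", "2016-05-20", 4),
    ("bob", "2019-02-02", 5),  # outside the date range, so it is filtered out
]


def partition(items, n):
    """Split items into at most n round-robin chunks (hypothetical helper)."""
    chunks = [items[i::n] for i in range(n)]
    return [chunk for chunk in chunks if chunk]


def fragment(users):
    """Stand-in for the inner query: count distinct ids per (user, year)."""
    ids = {}
    for user, time_event, id_unique in EVENTS:
        # ISO-8601 date strings compare chronologically as plain strings.
        if user in users and "2016-01-01" <= time_event <= "2018-12-31":
            ids.setdefault((user, int(time_event[:4])), set()).add(id_unique)
    return {key: len(values) for key, values in ids.items()}


def map_parallel(users, partitions):
    """Run fragment() on each partition in parallel, then sum the partial counts."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        for partial in pool.map(fragment, partition(users, partitions)):
            totals.update(partial)  # adds counts key by key
    return sorted(totals.items())


RESULT = map_parallel(["alice", "bob"], 4)
```

In this sketch each user lands in exactly one partition, so the final sum only matters when the same (user, year) key comes back from more than one partition, for example if two users share a Name.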

Regarding r.Effort = 'yes': I checked, and unfortunately we do need this additional filter, as not all relationships between EffortUser and EffortObject are true 'effort'. Either way, the query takes 16,622 ms with r.Effort = 'yes' and 16,429 ms without it, so the difference is not big enough to worry about for now.

Regarding your suggestion of ...

Embedding time is a good idea if you always do queries on fixed time ranges. If you always query for the same year, you could have :EFFORT_2017 instead of :EFFORT.

... I decided against implementing this, as the min and max TimeEvent inputs are very fluid, so I'm not sure it would do us much good. Besides, the above query only took 16,622 ms to complete, which is awesome! I'm making a note of this to revisit later, though; there might be something worth exploring when we need those extra few ms.

Anyway, the basic query is working now, which is great. However, when I start customizing it, the results quickly become erratic and I'm not sure why. It's most likely operator error, but I can't figure it out. Here are the two main issues I'm grappling with:

Issue 1: Query fails when making minor Cypher changes

I'm only removing the date range filter (and r.TimeEvent >= '2016-01-01' and r.TimeEvent <= '2018-12-31'); everything else stays the same (see below). But now the query fails and, for some reason, returns no results.

MATCH (u:EffortUser) 
WITH collect(u) AS users 
CALL apoc.cypher.mapParallel2("
    MATCH (_)-[r]-(:EffortObject)
    WHERE r.Effort = 'yes' 
    RETURN _.Name as user, date(datetime(r.TimeEvent)).year as date, count(distinct r.IdUnique) as count",
    {}, users, 4) YIELD value 
RETURN value.user as user, value.date as date, sum(value.count) as count 
ORDER BY user, date

Here are the PROFILE results ...

Right around the same time my CPU started going to 100% even though the previous queries that returned results never really exceeded 25% (see below).

To me this looks like I'm somehow overloading the process by removing the date range filter. Can that be? Are there any additional settings I should use to avoid this? I noticed your sample query passes {}, but I saw @simon using some additional parameters such as {parallel:true, batchSize:1000, concurrency:4}.

Issue 2: Additional filters are not working but should be

This might be related to issue 1 above, but I wasn't sure, so I thought I'd outline it anyway, as this is the real challenge in productionizing this query for our GraphQL API setup.

Our graph really looks like this ...

(:Employee)-[]-(:EffortUser)-[]-(:EffortObject)

... so one employee can have multiple users across multiple platforms. From left to right it looks a bit like this: one Employee LINKED_TO one EffortUser (in reality there are multiple here), which in turn connects to multiple EffortObject nodes (emails in this case).

When I try to incorporate the above structure into the Cypher query, I hit the same issue: the query suddenly returns no results, even though it should, as the link between Employee 'cbartens' and EffortUser clearly exists (see graph example above) ...

MATCH (e:Employee)-[]-(u:EffortUser) 
WHERE e.Name = 'cbartens' 
WITH collect(u) AS users 
CALL apoc.cypher.mapParallel2("
    MATCH (_)-[r]-(:EffortObject)
    WHERE r.Effort = 'yes' and r.TimeEvent >= '2016-01-01' and r.TimeEvent <= '2018-12-31' 
    RETURN _.Name as user, date(datetime(r.TimeEvent)).year as date, count(distinct r.IdUnique) as count",
    {}, users, 4) YIELD value 
RETURN value.user as user, value.date as date, sum(value.count) as count 
ORDER BY user, date

Maybe this is related to issue 1 above, in that the additional filter makes the query more complex and overloads it, but I just don't know. Any ideas?

I tried different Cypher variations (example below), but the outcome is always the same: the query returns no results even though it should, i.e. the nodes and relationships definitely exist.

MATCH (e:Employee) 
WHERE e.Name = 'cbartens' 
WITH collect(e) AS employees 
CALL apoc.cypher.mapParallel2("
    MATCH (_)-[]-(:EffortUser)-[r]-(:EffortObject) 
    WHERE r.Effort = 'yes' and r.TimeEvent >= '2016-01-01' and r.TimeEvent <= '2018-12-31' 
    RETURN _.Name as employee, date(datetime(r.TimeEvent)).year as date, count(distinct r.IdUnique) as count",
    {}, employees, 4) YIELD value 
RETURN value.employee as employee, value.date as date, sum(value.count) as count 
ORDER BY employee, date
