Using parallel queries to sum value of bitcoin outputs connected to bitcoin address nodes

simon · March 15, 2019, 1:23pm

We have imported the bitcoin blockchain into a neo4j graph database. The DB schema looks like this:

For a given entity (company), I would like to calculate the current bitcoin balance of each of its addresses.

To do this, we must sum the bitcoinValue property of all :Output nodes that are connected to each :Address node via a [:LOCKED_BY] relationship and that do not have an outgoing [:UNLOCKED_BY] relationship (i.e. the unspent outputs).
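As a minimal sketch of this rule for a single address, assuming a hypothetical `$address` parameter that holds the address string (it is not part of my actual workflow) and the same `address` property used elsewhere in this thread:

```cypher
// Sketch only: $address is a hypothetical parameter used for illustration.
match (a:Address {address: $address})<-[:LOCKED_BY]-(o:Output)
// keep only outputs that have not been spent
where not (o)-[:UNLOCKED_BY]->()
return
  a.address as address,
  // 1 BTC = 100,000,000 satoshi, so round the float sum to 8 decimal places
  round(sum(o.bitcoinValue)*100000000)/100000000 as balance
```

Because this uses a plain `match`, an address with no unspent outputs returns no row at all rather than a balance of 0.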

My current workflow is as follows:

// set the query parameter first (Neo4j Browser command).
:params {entity: "Binance"}

Then run the query:

// match addresses labelled with the entity of interest
match (a :Address)-->(e :Entity)
where e.name = $entity
// distinct must be included or the query will run for a very long time
with distinct a

// match outputs locked by address
match (a)<-[:LOCKED_BY]-(o: Output)
// exclude those outputs that have been subsequently spent
where not (o)-[:UNLOCKED_BY]->()

return
  a.address as address,
  round(sum(o.bitcoinValue)*100000000)/100000000 as balance

This works fine, but for a large number of addresses it can take some time. Is there a way to parallelise this using apoc.cypher.parallel() or some other APOC procedure? I could not find any documentation for these procedures.

Many thanks.

simon · March 18, 2019, 5:06pm

The answer I found was to use apoc.cypher.mapParallel2(). The documentation was a little hard to understand because there were no examples, but the procedure reduced the run time of the above query from about 12 seconds to around 2 seconds for 300k addresses. We have longer queries where this technique is proving useful. This is what I did:

// match addresses labelled with the entity of interest
match (a :Address)-->(e :Entity)
where e.name = $entity

// mapParallel2 iterates over a list, so we `collect`
with collect(distinct a) as addresses

// The first argument is the Cypher fragment we want to run, as a string
// (inside the fragment, `_` is the current list element)
// The second argument is a map of parameters, e.g. {parallel: true}
// The third argument is the list to iterate over
// (in this example `addresses` from above)
// The fourth argument is an integer number of partitions to split the list into
// (though I'm not sure how this relates to the batchSize or concurrency
// parameters from the second argument)
CALL apoc.cypher.mapParallel2("
   optional match (_)<-[:LOCKED_BY]-(o:Output)
   where not (o)-[:UNLOCKED_BY]->()
   return _.address as address,
   round(sum(o.bitcoinValue)*100000000)/100000000 as balance"
, {parallel:true, batchSize:1000, concurrency:20}, addresses, 20) yield value

// Extract the columns we want to return from each map (returned by `yield value`)
return
  value.address as address,
  value.balance as balance

I hope this is helpful to someone. I am open to suggestions if this is not the correct way to use parallelism.
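One difference to keep in mind when comparing the two versions: the parallel fragment uses `optional match`, so addresses with no unspent outputs come back with a balance of 0, whereas the original serial query drops them. A sketch of the serial query adjusted to return the same rows, assuming the same schema and `$entity` parameter as above:

```cypher
// Sketch only: serial query using optional match, for comparing results
// with the mapParallel2 version.
match (a:Address)-->(e:Entity)
where e.name = $entity
with distinct a
optional match (a)<-[:LOCKED_BY]-(o:Output)
where not (o)-[:UNLOCKED_BY]->()
return
  a.address as address,
  // sum() ignores nulls, so addresses without unspent outputs get 0
  round(sum(o.bitcoinValue)*100000000)/100000000 as balance
```

Running this alongside the parallel version should make it easier to check that both return the same addresses and balances before relying on the faster one.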
