We have imported the bitcoin blockchain into neo4j. I am trying to find the balance of every single bitcoin address on the network, however there are around 470 million addresses.
A transaction sends bitcoin to addresses, which may then send that bitcoin on in a new transaction.
From the left-most red (:Transaction)
node, we see two outputs created, each belonging to a seperate (:Address)
node in blue. One of these outputs is then spent in an onwards transaction, whilst the other is still locked to the address.
To find the balance of a given address we can use the following query:
MATCH
(a :Address)<--(o :Output)
WHERE a.address = "someBitcoinAddress"
AND NOT (o)-[:UNLOCKED_BY]->()
RETURN
a.address as Address,
sum(o.bitcoinValue) as balance
This works great for one address, but with 470 million addresses and around 1.5 billion outputs we start to see problems with DB responsiveness. Is there a good way to split a massive list into batches, operate on the batches and preferably stream results to csv in those batches as well? This is as far as I got, but I'm not sure this is doing what I expect as the DB still grinds to a halt:
// match all address nodes
MATCH
(a :Address)
WITH
collect(DISTINCT a) AS addresses
// try and batch the large list into chunks
CALL
apoc.coll.partition(addresses,1000000) YIELD value AS addressBatches
// work on each chunk
UNWIND
addressBatches AS batch
// apoc.mapParallel is expecting a list so collect the batch into a list
WITH
collect(batch) as addresses
// run multi-threaded cypher statement on this batch
CALL
apoc.cypher.mapParallel2(
"
MATCH (_)<--(o :Output)
WHERE NOT (o)-[:UNLOCKED_BY]->()
RETURN
_.address AS Address,
round(sum(o.bitcoinValue)*100000000)/100000000 AS Balance
",
{parallel:True},
addresses,
24,
7200
) YIELD value
RETURN
value.Address,
value.Balance
Does anyone have any suggestions of how I could do this more efficiently?
Many thanks,
Simon