Finding clusters in Ethereum transfers using Neo4j 3.5.16 and Cypher

I’m performing an analysis of token distributions in Ethereum using Neo4j and Cypher. In one of my use cases I need to find clusters. For example, if a token transfer for a particular token was made to 5 different accounts and then these 5 accounts resent all of these tokens to one particular account I need to find those 6 accounts.

The token address needs to be the same in both hops so I start my query by getting all of the token addresses for a particular transfer type. There are four transfer types, and for each transfer type there is a different number of distinct token addresses. This table shows how many transfers and token addresses I have per transfer type:
image
These are my specs and configuration settings: Neo4j Version is 3.5.16 Community, Database size is 77GB, 8 CPU Cores. 128 GB RAM (only 95 available for Neo).

dbms.memory.heap.initial_size=32g
dbms.memory.heap.max_size=32g
dbms.memory.pagecache.size=55g

And this is my query for the Airdrops transfer type:

MATCH (at:Transfer {eventName:'Airdrop'})
WITH collect(DISTINCT at.address) as tokenAddresses
UNWIND tokenAddresses as tokenAddress
MATCH (t1:Transfer {eventName:'Airdrop',address:tokenAddress})-[:sent_to]->(a2:active)<-[:sent_from]-(t2:Transfer 
{address:tokenAddress})-[:sent_to]->(a3:account)
USING INDEX t1:Transfer(eventName,address)
USING INDEX t2:Transfer(address)
WITH a3.accountId as a3Account, collect(DISTINCT a2.accountId) AS a2Accounts, count(DISTINCT a2) AS 
a2count,collect(DISTINCT t2.address) as tokenAddress
WHERE a2count>=2 AND a2count <=1000
RETURN a2Accounts,a3Account,tokenAddress,a2count+1 AS clusterCount
ORDER BY clusterCount DESC;

So when the transfer type is Airdrop or Distr. my query runs and finishes in 4 seconds and 10 minutes respectively. When the transfer type is Mint it just never ends or just hangs after a while. I haven’t even tried this with Transfer type=Transfer as I know it won’t run. As you can see both accounts and transfers are Nodes, this way I can use indexes on the transfers. Any thoughts or ideas would be greatly and very much appreciated. Thank you.

Hello @andrespra and welcome to the Neo4j community :slight_smile:

Just to know, did you use UNIQUE CONSTRAINTS on all your nodes?

Regards,
Cobra

Thank you for your quick reply. I have a Uniqueness constraint on the account nodes for the "accountId" property. Transfers Nodes also have a unique id on the property "TransferId'. That being said, I don't use these indexes on my query as I don't query accounts or transfers by their id. I query them by their "eventName + tokenAddress" properties or just by their "tokenAddress" property, and in nboth cases I'm using indexes as it can be seen on my query after the second MATCH statement:

USING INDEX t1:Transfer(eventName,address)
USING INDEX t2:Transfer(address)

I see, no problem :slight_smile:

Try to add UNIQUE CONSTRAINTS also on others Label node like Transfer, it will speed-up your query even if you don't use them, Neo4j will use them :slight_smile:

Regards,
Cobra

Ok, actually :Transfer(transferId) was a regular index, not a uniqueness constraint, unlike :account(accountId). I dropped the index, created the index again as a uniqueness constraint and launched my queries again but my issue remains. The problem starts as the amount of nodes I have to check increases. Any thoughts on my memory configuration? I ran Memrec and with my available RAM I tried to match the given values as much as I could. Do you think I need more resources for this? The thing is that the graph is only 77gb in size so I think it should work ok.

Ok good:)

dbms.memory.heap.initial_size=32g
dbms.memory.heap.max_size=32g

You can let heap.initial_size at the minimum (I think it's 500Mo or something) and maybe increase heap.max_size if you can:)

And I never touched dbms.memory.pagecache.size so I can't advice you on this one :confused:

Regards,
Cobra