What is the best AWS EC2 instance for a Neo4j database?

I am trying to determine the best AWS EC2 instance type for running a Neo4j database. Here are some details about my setup:

Neo4j: Community Edition (version: 5.19.0)

Deployment: AWS Marketplace Neo4j Template

Database Schema:

Dao {
    organizationId: String!
    daoId: String!
    name: String!
}

Proposal {
    proposalNativeId: String!
    proposer: String!
    (Proposal)-[:IN]->(Dao)
}

Vote {
    voteNativeId: String!
    voter: String!
    choice: String!
    (Vote)-[:BELONGS_TO]->(Proposal)
}

Indexes:

  • Dao - daoId column
  • Proposal - proposalNativeId column
  • Vote - voteNativeId column

Data Volume:

  • Dao - 1
  • Proposal - 62 (potentially more)
  • Vote - 483 (potentially more)

Query being executed (additionally, I have queries for finding coalitions with larger groups of people):

// Step 1: Match votes and collect votes per voter
    MATCH (d:Dao { daoId: '9ad9fb81-ad04-4830-85c2-212034072580' })<-[:IN]-(v:Vote)-[:BELONGS_TO]->(p:Proposal)
    WITH d, p, v.voter AS voter, v.choice AS choice
    ORDER BY voter, p.proposalNativeId
    
    // Step 2: Collect votes per voter
    WITH d, voter, COLLECT({proposal: p.proposalNativeId, choice: choice, proposal_choice: p.proposalNativeId + ":" + choice}) AS votes
    WITH d, COLLECT({voter: voter, votes: votes}) AS voterPatterns
  
    
    // Step 3: Compare voting patterns for groups of 8 voters
    UNWIND voterPatterns AS v1
    UNWIND voterPatterns AS v2
  UNWIND voterPatterns AS v3
  UNWIND voterPatterns AS v4
  UNWIND voterPatterns AS v5
  UNWIND voterPatterns AS v6
  UNWIND voterPatterns AS v7
  UNWIND voterPatterns AS v8
    // Step 4: Ensure unique groups of voters
    WITH v1,v2, v3, v4, v5, v6, v7, v8
    WHERE v1.voter < v2.voter AND v2.voter < v3.voter AND v3.voter < v4.voter AND v4.voter < v5.voter AND v5.voter < v6.voter AND v6.voter < v7.voter AND v7.voter < v8.voter  
    
    // Step 5: Find common proposals where choices match
    WITH v1,v2, v3, v4, v5, v6, v7, v8, apoc.coll.intersection([a IN v1.votes | a.proposal_choice], [b IN v2.votes | b.proposal_choice]) AS commonProposals
    WHERE SIZE(commonProposals) > 0
  
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals, apoc.coll.intersection(commonProposals, [c IN v2.votes | c.proposal_choice]) AS commonProposals2
    WHERE SIZE(commonProposals2) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals2 AS commonProposals, apoc.coll.intersection(commonProposals, [d IN v2.votes | d.proposal_choice]) AS commonProposals3
    WHERE SIZE(commonProposals3) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals3 AS commonProposals, apoc.coll.intersection(commonProposals, [e IN v2.votes | e.proposal_choice]) AS commonProposals4
    WHERE SIZE(commonProposals4) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals4 AS commonProposals, apoc.coll.intersection(commonProposals, [f IN v2.votes | f.proposal_choice]) AS commonProposals5
    WHERE SIZE(commonProposals5) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals5 AS commonProposals, apoc.coll.intersection(commonProposals, [g IN v2.votes | g.proposal_choice]) AS commonProposals6
    WHERE SIZE(commonProposals6) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals6 AS commonProposals, apoc.coll.intersection(commonProposals, [h IN v2.votes | h.proposal_choice]) AS commonProposals7
    WHERE SIZE(commonProposals7) > 0
    
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals7 AS commonProposals, apoc.coll.intersection(commonProposals, [i IN v2.votes | i.proposal_choice]) AS commonProposals8
    WHERE SIZE(commonProposals8) > 0
    
    // Step 6: Filter matching choices
    WITH v1,v2, v3, v4, v5, v6, v7, v8, commonProposals8 AS commonProposals
    
    RETURN [v1.voter,v2.voter, v3.voter, v4.voter, v5.voter, v6.voter, v7.voter, v8.voter] AS members, SIZE(commonProposals) AS votedTogether, commonProposals
    ORDER BY SIZE(commonProposals) DESC
    LIMIT 20

Currently, I am running this query on an AWS EC2 r6i.2xlarge instance. Memory usage sits at 91-96%, and the query has not completed after several days.

What EC2 instance would be ideal for my use case, and should I focus on getting more RAM or more CPU cores, etc.?

I believe your issue is the memory demand created by the 8-dimensional cross product. The eight UNWINDs generate (# of entries in voterPatterns, i.e. distinct voters for the selected daoId)^8 combinations before the filter removes the rows that have duplicate or out-of-order voters. This is a huge number even if the number of voters is modest.
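To get a feel for the numbers, here is a rough Python sketch of the arithmetic. The voter counts are hypothetical (the question does not say how many distinct voters there are), and the function names are made up for illustration. It compares the n^8 rows produced by eight UNWINDs with the C(n, 8) ordered groups that survive the `<` filters.

```python
from math import comb

def cross_product_rows(n_voters: int, group_size: int = 8) -> int:
    """Rows produced by `group_size` nested UNWINDs over n_voters entries."""
    return n_voters ** group_size

def surviving_groups(n_voters: int, group_size: int = 8) -> int:
    """Rows left after the strict v1 < v2 < ... filter (distinct voter names assumed)."""
    return comb(n_voters, group_size)

# Hypothetical voter counts, for illustration only.
for n in (20, 50, 100):
    print(f"{n} voters: {cross_product_rows(n):.3e} rows unwound, "
          f"{surviving_groups(n):.3e} groups kept")
```

Note that the surviving count itself grows quickly with the number of voters, so filtering earlier reduces the work but does not make it small.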

I tried refactoring the order of operations to eliminate the full 8-dimensional cross-product by performing it pair-wise, so the filtering can be applied incrementally.
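The idea can be sketched outside Cypher. The following Python snippet is only an illustration (the function names and sample voter ids are made up, and it counts generated rows rather than modelling how Neo4j actually executes the query). It builds ordered groups one voter at a time, discarding out-of-order rows after each step, and compares that with building the full cross product first.

```python
from itertools import product

def groups_full_product(voters, k):
    """Build every k-tuple first, then filter (k UNWINDs followed by one WHERE)."""
    rows = list(product(voters, repeat=k))  # len(voters) ** k rows generated
    kept = [r for r in rows if all(a < b for a, b in zip(r, r[1:]))]
    return kept, len(rows)

def groups_incremental(voters, k):
    """Extend tuples one voter at a time, filtering after each step."""
    rows = [(v,) for v in voters]
    largest = len(rows)  # largest intermediate row count seen so far
    for _ in range(k - 1):
        expanded = [r + (v,) for r in rows for v in voters]
        largest = max(largest, len(expanded))
        # Keep only rows whose newest voter sorts after the previous one.
        rows = [r for r in expanded if r[-2] < r[-1]]
    return rows, largest

voters = ["v%02d" % i for i in range(10)]  # hypothetical voter ids
full, full_rows = groups_full_product(voters, 4)
inc, inc_rows = groups_incremental(voters, 4)
print(sorted(full) == sorted(inc), full_rows, inc_rows)
```

Both functions return the same groups; the difference is how many intermediate rows are generated along the way.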

I also used an APOC collection function to count the number of common votes within each tuple of 8 voters. This may also be faster than the repeated intersections used in your query.

Give it a try and see whether it is faster. I don't have test data to verify that it returns the same results, so treat it as untested.

MATCH (d:Dao { daoId: '9ad9fb81-ad04-4830-85c2-212034072580' })<-[:IN]-(v:Vote)-[:BELONGS_TO]->(p:Proposal)
WITH v.voter AS voter, COLLECT(p.proposalNativeId + ":" + v.choice) AS votes
WITH COLLECT({voter: voter, votes: votes}) AS voterPatterns

UNWIND voterPatterns AS v1
UNWIND voterPatterns AS v2
WITH *
WHERE v1.voter < v2.voter

UNWIND voterPatterns AS v3
WITH *
WHERE v2.voter < v3.voter

UNWIND voterPatterns AS v4
WITH *
WHERE v3.voter < v4.voter

UNWIND voterPatterns AS v5
WITH *
WHERE v4.voter < v5.voter

UNWIND voterPatterns AS v6
WITH *
WHERE v5.voter < v6.voter

UNWIND voterPatterns AS v7
WITH *
WHERE v6.voter < v7.voter

UNWIND voterPatterns AS v8
WITH *
WHERE v7.voter < v8.voter

// apoc.coll.frequencies returns a list of {item, count} maps
WITH *, apoc.coll.frequencies(v1.votes + v2.votes + v3.votes + v4.votes + v5.votes + v6.votes + v7.votes + v8.votes) AS combinedVotes
WITH *, [i IN combinedVotes WHERE i.count = 8 | i.item] AS commonProposals
    
RETURN 
    [v1.voter, v2.voter, v3.voter, v4.voter, v5.voter, v6.voter, v7.voter, v8.voter] AS members,
     SIZE(commonProposals) AS votedTogether, 
     commonProposals
ORDER BY votedTogether DESC
LIMIT 20

Note: this approach assumes the 'votes' list contains distinct values per voter, so we can look for a frequency of 8 in the combined list for each tuple of 8 voters. If this assumption is not correct, you can add DISTINCT to the COLLECT to enforce it:

WITH v.voter AS voter, COLLECT(distinct p.proposalNativeId + ":" + v.choice) AS votes
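
To see why the distinctness assumption matters, here is a small Python sketch of the frequency-count idea, using collections.Counter as a stand-in for the APOC frequency count. The voters and votes are made up, and the group size is 3 rather than 8 to keep it short.

```python
from collections import Counter

def common_votes(vote_lists):
    """Return the "proposal:choice" strings that appear once per voter in the group.

    Counts frequencies over the concatenated lists and keeps items whose count
    equals the group size. Only correct if each voter's list has no duplicates.
    """
    combined = [vote for votes in vote_lists for vote in votes]
    counts = Counter(combined)
    return sorted(item for item, count in counts.items() if count == len(vote_lists))

# Hypothetical data: three voters, "proposalId:choice" strings.
alice = ["p1:yes", "p2:no", "p3:yes"]
bob = ["p1:yes", "p2:yes", "p3:yes"]
carol = ["p1:yes", "p2:no", "p3:no"]
print(common_votes([alice, bob, carol]))

# A duplicated entry inflates the count: "p2:no" now appears 3 times across
# the group even though bob never cast that vote.
alice_dup = ["p1:yes", "p2:no", "p2:no", "p3:yes"]
print(common_votes([alice_dup, bob, carol]))

# De-duplicating each voter's list first (what DISTINCT does) restores the result.
print(common_votes([sorted(set(alice_dup)), bob, carol]))
```

With the duplicate present, "p2:no" is wrongly reported as common to all three voters, which is the case the DISTINCT guards against.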