Building daily new graph


(Oded) #1

We have implemented the following: We want to create a fresh graph every day. But we want to keep the previous version as well. What we are doing is we split to CSV files the whole data and we do LOAD CSV. After the individual CSV was loaded - we deleted the 3rd version (previous’es previous) and we keep just the current + previous. The problem is that we get many Transaction Locking. This is not scaleable. What is the best way to: 1. Build a new graph quickly. 2. Maintain the previous graph as well. 3. Delete the oldest so we can maintain only 2 versions at a time. Thank you!


(Michael Hunger) #2

You have some options here:

  1. retry after the locking exception, if you use a driver it should be able to retry automatically if you use tx-callbacks
  2. mark the active graphs with a label, then you can just remove that label from the "previous-previous" one and delete it without issues after the new graph is imported

Can you describe more of your graph structure, the CSV and the load scripts? Also how they are executed.


(Oded) #3
  1. I am creating around 30k-40k jobs for each with CSV LOAD a different csv file. I am retrying but looks as if it is causing a lot of rejection and bottlenecks

  2. Are labels more efficient than putting a property on the relationship?


(Michael Hunger) #4

Yes labels are definitely more efficient.

How large are your CSV files each / in total?
Perhaps it's easier to just run this from a queue. If you import all of them concurrently you'd also get into deadlock issues even without removing the old graph data.


(M. David Allen) #5

Are your graphs from yesterday and today interconnected, or disjoint? For example, a node from 2 days ago, is it probably in today's graph as well?

Also it would be helpful to indicate when you're running into the locking problems, I guess it's when you're deleting the old stuff? Some sample cypher here could help.


(Oded) #6

So what we are doing is this.

  1. The nodes on the graphs are consistent and never deleted.

  2. Each day we calculate the relationships from scratch. The output will be divided into node's id separate CSV.

  3. Those CSVs will be fed into the LOAD CSV - each load will do:

USING PERIODIC COMMIT

LOAD CSV FROM "#{s3_object}" AS csv WITH csv

MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })

MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })

MERGE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)

ON MATCH SET r.weight = toInteger(csv[1])

ON CREATE SET r.weight = toInteger(csv[1]);

All of those CSV loads are held into queue of 10 parallel workers. We are getting the deadlocks every time, sporadically.


Mailtrack

Sender notified by

                [Mailtrack](https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality6&)
                09/02/18, 8:58:47 PM

(Oded) #7

And of course:
4. We delete the previous'es previous relationship afterwards by using a apoc.periodic.commit query


Mailtrack

Sender notified by

                [Mailtrack](https://mailtrack.io?utm_source=gmail&utm_medium=signature&utm_campaign=signaturevirality6&)
                09/02/18, 9:00:35 PM

(Michael Hunger) #8

If you have new data anyway you can also use CREATE instead of MERGE to be faster, or are there duplicate affinities across the files (on network/version keys) ?

Another option to handle the retries (without needing to reload the whole CSV for every issue).

Load the CSV in your client code, and send batches (e.g. 10k) to a statement like this:

UNWIND $rows as csv
MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
CREATE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
SET r.weight = toInteger(csv[1])

And if you run into a deadlock issue, just split the 10k batch into smaller ones (e.g. 1k etc). Until the tx succeeds.
This should also be working automatically if you use transaction callbacks in your driver.

apoc.periodic.iterate has also a retry option, but I have to improve its behavior (i.e. splitting batches and more exponential backoff)

call apoc.periodic.iterate('
    LOAD CSV FROM $url AS csv RETURN csv
','
    MATCH (a:AffinityChannel {program_id: toInteger(csv[3]) })
    MATCH (b:AffinityChannel {program_id: toInteger(csv[0]) })
    CREATE (a)-[r:AFFINITY { network: csv[2], version: toInteger(csv[4]) }]->(b)
    SET r.weight = toInteger(csv[1])
', {batchSize:10000, iterateList:true, retries:3, params:{url:$s3_object})

(Oded) #9

Thanks Michael.

Yes, the MERGE is because there could be another rel between the two, in the opposite direction.

Maybe I am tackling this the wrong way. Because initially I've loaded 1 single CSV file which is a few GB large, and it blew the machine memory.

Could I create 2 DBs on the same machine? so each version will go into a fresh one?

Otherwise I dont see this scale.


(Michael Hunger) #10

Yes you can run two servers on the same machine.
Even a multi-GB file shouldn't blow memory, do you still have the statement around that you used for that one?


(Oded) #11

The statement is the same. The data is split into files, instead of one big file


(Michael Hunger) #12

Then it should not blow up. Can you share the query plan for the statement?