How aggregations work with apoc.periodic.iterate

Hey community,

I am working with the following Cypher query:

CALL apoc.load.csv("file:///temp/cms_data/cms_data.csv", {header:true, sep : '|', ignore:['label']}) YIELD map as row
MATCH (patent:PATENT {app_num : row.app_num})
CREATE (file:FILE_NODE {id : row.id})
SET file += apoc.map.clean(row, [], [])
CREATE (patent)-[:HAS_FILE]->(file)
WITH collect(file.document_code) as docs, count(file) as file_count,patent
CREATE (patent)-[:UPDATED_CMS]->(up:UPDATE_META_CMS)
SET up.added_files = file_count , up.added_docs = docs

This works fine as long as the CSV is not too large.
For larger files, I want to use apoc.periodic.iterate to process the data in batches.

Here is my question:
The last node I create, UPDATE_META_CMS, gets its values by aggregating the details of the files added to a specific PATENT node, and is then attached to that PATENT node. Will this still work correctly now that the process runs in batches?

NOTE: The CSV is built so that all the files related to a patent come as a cluster, i.e. the files of a patent appear in succession, and once the patent number changes, no further files of the previous patent appear.
I don't think we can use this knowledge to enhance the process, but if we can, please do share how.

TIA,
Aman

In the first query of apoc.periodic.iterate, you could collect all the files for each patent and return the patent along with the collection of its files. Then, in the update query, unwind the files and create the file nodes and other relationships. This way, every batch item carries all of a patent's files when it is processed, so the aggregation cannot be split across batches.
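To illustrate why the grouping has to happen before the batching, here is a small Python sketch that simulates the two approaches outside Neo4j. It is only an analogy, not how APOC is implemented: the sample rows, the batch size of 2, and the helper names `batches` and `group_by_patent` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for the CSV: (app_num, document_code).
# Files of the same patent are clustered together, as in the question.
rows = [
    ("P1", "A"), ("P1", "B"), ("P1", "C"),
    ("P2", "D"), ("P2", "E"),
]

def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def group_by_patent(items):
    """Roughly mimic `WITH app_num, collect(...)`: one entry per patent."""
    grouped = defaultdict(list)
    for app_num, doc in items:
        grouped[app_num].append(doc)
    return list(grouped.items())

# Approach 1: batch the raw rows, then aggregate inside each batch.
# A patent whose rows straddle a batch boundary gets several meta records.
per_batch_meta = [
    (app_num, docs)
    for batch in batches(rows, 2)
    for app_num, docs in group_by_patent(batch)
]

# Approach 2: group first, then batch the (patent, files) pairs.
# Each patent shows up exactly once, with all of its files.
grouped_first_meta = [
    (app_num, docs)
    for batch in batches(group_by_patent(rows), 2)
    for app_num, docs in batch
]

print(per_batch_meta)
print(grouped_first_meta)
```

In the first approach P1 and P2 each end up with two partial records, which would correspond to more than one UPDATE_META_CMS node per patent. In the second approach each patent gets a single record with the complete file list.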

Hey @glilienfield, can you please give a rough idea of what the query would look like?

You can try this. Sorry, I don't have data to test it. I think you will be safe executing in parallel, since all of a patent's files are collected together with the patent and each patent is processed as a unit. Anyway, give it a try.

CALL apoc.periodic.iterate(
"
    CALL apoc.load.csv('file:///temp/cms_data/cms_data.csv', 
    {header: true, sep: '|', ignore: ['label']}) YIELD map as row
    WITH row.app_num as patent_num, collect(apoc.map.clean(row, [], [])) as file_data
    MATCH (patent:PATENT {app_num: patent_num})
    RETURN patent, file_data
",
"
    forEach(row in file_data | 
        CREATE (file:FILE_NODE {id : row.id})
        SET file = row
        CREATE (patent)-[:HAS_FILE]->(file)
    )
    CREATE (patent)-[:UPDATED_CMS]->(up:UPDATE_META_CMS)
    SET up.added_files = size(file_data) , up.added_docs = [i in file_data | i.document_code]
",
{batchSize:1000, parallel:true})

Hey @glilienfield, I tested it out. It worked, thank you so much.

Regards,
Aman


You are welcome. FYI, your use of apoc.map.clean does not do anything, since both lists are empty. Calling it like this will return the original map, 'row' in your case.
