Hello everyone,
I'm fairly new to Neo4j and Cypher. I'm very much enjoying the learning journey and the support this community provides. It's incredible how much documentation exists, and for that I'm very grateful. I'm now starting to step into uncharted territory, and I'm having trouble with some of the more advanced queries, relatively speaking.
I've been able to experiment with all sorts of Python code and have written a few helper libraries for my project. In short, I'm trying to run a large batch job (many GB of data). I prepopulated the database using the command-line CSV import tools, which worked quite well. I'm now at the data analysis / processing step of my project, and I thought my queries were going well until I discovered that I had pretty much ruined my dataset by carelessly performing two UNWIND operations back to back. That's no big deal, as I have the VM snapshotted. Here was my query:
UNWIND [['a1','b2'],['b2','b3']] as datapair
UNWIND datapair as dataelement
MATCH (n:some_node_label {some_property: dataelement})-->(c:cluster_node)
WITH DISTINCT(c.cluster_id) as cids,c
UNWIND cids as cid
MATCH (c)<--(a:random_node)
WITH count(a) as c_count,c
ORDER BY c_count DESC
WITH collect(c) as nodes
CALL apoc.refactor.mergeNodes(nodes, {properties:'discard'})
YIELD node
RETURN "success" as status
Basically, what I wanted to do was pass in thousands of these pairs in a single query but process each pair independently of the others; the pairs are in no way related to one another. Cue the sigh: the second UNWIND of course unwound the elements of every pair onto a single stream, and the collect at the end of the query merged every single cluster node together. Instead of issuing thousands of individual queries (in serial, mind you), I was hoping to process everything in a single request via a for loop of some sort, which I now know Cypher doesn't support. I also tried the FOREACH hack, but discovered that APOC procedures can't be called from inside FOREACH; I'm also not sure how, or whether, it would even help.
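To make the "single request" idea concrete, here is a minimal sketch of what I had in mind using the official neo4j Python driver, passing all the pairs as one parameter (the connection URI and credentials are placeholders; the labels and property names are the ones from my example above). The Cypher is essentially my query from above with the literal list swapped for $pairs, so it still has the same flaw: by the time collect(c) runs, the cluster nodes from every pair have been mixed together.

from neo4j import GraphDatabase

# Placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Thousands of independent pairs in reality; just the two from my example here
pairs = [['a1', 'b2'], ['b2', 'b3']]

# Same query as above, parameterized. One round trip for everything,
# but the final collect(c) still gathers cluster nodes across ALL pairs,
# so everything gets merged together instead of per pair.
query = """
UNWIND $pairs AS datapair
UNWIND datapair AS dataelement
MATCH (n:some_node_label {some_property: dataelement})-->(c:cluster_node)
WITH DISTINCT(c.cluster_id) AS cids, c
UNWIND cids AS cid
MATCH (c)<--(a:random_node)
WITH count(a) AS c_count, c
ORDER BY c_count DESC
WITH collect(c) AS nodes
CALL apoc.refactor.mergeNodes(nodes, {properties:'discard'})
YIELD node
RETURN "success" AS status
"""

with driver.session() as session:
    session.run(query, pairs=pairs)

driver.close()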
I've even tried experimenting with py2neo and tx.append, but that no longer seems to be supported, nor am I certain it would help. Is there an efficient way of sending a list of queries for the server to process without having to make a separate call for each one? Or, alternatively, is there a way to have the server process a list of queries in parallel via multiple sessions? Working from Python so far, I haven't had much luck figuring out how to speed up or parallelize my requests to the server.
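Just so it's clear what I mean by "multiple sessions", here's a minimal sketch of the kind of thing I've been picturing: the official Python driver with a thread pool, one session per worker. The connection details are placeholders and the query is just a trivial stand-in to show the mechanics; I honestly don't know whether this is a sane approach, which is partly why I'm asking.

from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def run_one(value):
    # Each worker opens its own session; the driver's connection pool is shared.
    with driver.session() as session:
        # Trivial stand-in query, just to show fanning work out across sessions.
        return session.run("RETURN $x AS x", x=value).single()["x"]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_one, range(100)))

driver.close()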
For now I'm stuck issuing thousands of these Cypher queries serially over a single Python Bolt session, which is incredibly slow. The query (below, followed by a sketch of the Python loop I'm using to drive it) essentially figures out which of the two cluster_nodes is the larger and merges the smaller one into it. I have verified that this part works by checking not only the relationship counts but also Neo4j's internal node IDs: the node with the largest number of relationships is the one that persists.
UNWIND ['a1','b2'] as dataelement
MATCH (n:some_node_label {some_property: dataelement})-->(c:cluster_node)
WITH DISTINCT(c.cluster_id) as cids,c
UNWIND cids as cid
MATCH (c)<--(a:random_node)
WITH count(a) as c_count,c
ORDER BY c_count DESC
WITH collect(c) as nodes
CALL apoc.refactor.mergeNodes(nodes, {properties:'discard'})
YIELD node
RETURN "success" as status
I would be incredibly appreciative of any help or guidance. Happy to pay it forward when the right time comes.
Regards,
Al