New merged nodes from existing nodes

Hi all,

Neo4j beginner here. This seems like it should be a simple task, but I really cannot figure it out. I want to create a new set of nodes based off an existing set, merging them based on a unique value for a certain variable.

I have 3.7 million nodes of type "person" successfully loaded into my database. They have several variables attached. I want to create a new set of nodes called "name", where there is one node per unique value of the variable "person_name_cluster_key" in the "person" nodes. (Based on working with the data in R, I know that this should result in 2.3 million "name" nodes.) I also want to bring over another variable called "person_name" for each of the new "name" nodes. Of the multiple "person" nodes merged, I don't care which "person" node this value is taken from. Then, I want to relate each new "name" node to the original "person" nodes with a (n:name)-[:name_of]->(p:person) relationship.

I need the process to be iterative and computationally efficient since the dataset is so large. I feel like this should be really simple, but I'm stumped.

Thanks so much.

You can try something like this. Let's see how it works.

Test data:

unwind [["santa","a"], ["pluto","b"], ["goofy","c"], ["rudolph","d"], ["micky","e"], ["rudolph","f"], ["pluto","g"], ["minnie","h"], ["micky","i"]] as person
create (:Person{person_name_cluster_key: person[0], person_name: person[1]})

Query:

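:auto
// Group persons by key first, then write each group inside a batched subquery.
MATCH (p:Person)
WITH p.person_name_cluster_key AS key, collect(p) AS persons_for_key
CALL (key, persons_for_key) {
    // One Name node per key; person_name comes from the first person in the group.
    CREATE (n:Name {person_name_cluster_key: key, person_name: head(persons_for_key).person_name})
    FOREACH (i IN persons_for_key |
        MERGE (n)-[:name_of]->(i)
    )
} IN CONCURRENT TRANSACTIONS OF 100000 ROWS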
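
To check the result, a quick sketch of two count queries (run each one separately). Against the test data above, I would expect 6 Name nodes (santa, pluto, goofy, rudolph, micky, minnie) and 9 name_of relationships, one per Person. On your real data, the numbers should match the 2.3 million and 3.7 million you mentioned.

```cypher
// Number of Name nodes: should equal the number of distinct cluster keys.
MATCH (n:Name)
RETURN count(n) AS name_nodes;

// Number of name_of relationships: should equal the number of Person nodes.
MATCH (:Name)-[r:name_of]->(:Person)
RETURN count(r) AS name_of_relationships;
```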

Note 1: You need the ":auto" prefix when executing the query in the browser or cypher-shell, but not otherwise.

Note 2: You definitely need indexes on the labels and properties you are matching and merging on. This is not an issue with this query, as you are not matching on a specific property.
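
If you later look up or merge on the cluster key, something like the following could work as a starting point. This is a sketch assuming Neo4j 5 syntax; the index and constraint names (person_cluster_key, name_cluster_key_unique) are made up for illustration.

```cypher
// Index to speed up lookups of Person nodes by cluster key.
CREATE INDEX person_cluster_key IF NOT EXISTS
FOR (p:Person) ON (p.person_name_cluster_key);

// Uniqueness constraint on Name: enforces one Name node per key
// and also provides an index for MATCH/MERGE on that property.
CREATE CONSTRAINT name_cluster_key_unique IF NOT EXISTS
FOR (n:Name) REQUIRE n.person_name_cluster_key IS UNIQUE;
```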

Note 3: I have not used the CONCURRENT TRANSACTIONS clause. It is relatively new.

Note 4: This may require a lot of memory due to the size of your database and the use of a COLLECT on your entire data set.
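
If memory does turn out to be a problem, one possible alternative is to skip the COLLECT and process the Person nodes one row at a time, merging the Name node on its key. This is only a sketch, and I have not run it on a data set of your size. It assumes an index or uniqueness constraint exists on :Name(person_name_cluster_key) so that each MERGE is a fast lookup, and it deliberately leaves out CONCURRENT so that two transactions never try to create the same Name node at the same time. The batch size of 10000 is an arbitrary starting value.

```cypher
:auto
MATCH (p:Person)
// MERGE cannot be used with a null property value, so skip persons without a key.
WHERE p.person_name_cluster_key IS NOT NULL
CALL (p) {
    // Create the Name node the first time a key is seen; reuse it afterwards.
    MERGE (n:Name {person_name_cluster_key: p.person_name_cluster_key})
      ON CREATE SET n.person_name = p.person_name
    MERGE (n)-[:name_of]->(p)
} IN TRANSACTIONS OF 10000 ROWS
```

Here person_name is taken from whichever Person with that key is processed first, which should be fine since you said you don't care which one is used.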

Let me know how it goes.