Fastest bulk merge via py2neo

dmitryd
Node

Hello Community! I have been exploring Neo4j's features for several months. I use the dockerized Neo4j Community edition (Browser version: [4.2.6], Server version: [4.1.9]) and py2neo==2021.2.3.

I have a problem importing a large amount of data, about 100-500k rows.

I wrote a wrapper around merge_relationships (py2neo -> bulk operations); the code is below. It splits the dataset into chunks and sends them to merge via threads. Problem: relationships between the nodes are lost, even though the nodes definitely exist (I left the node-creation code out to avoid clutter).
Code example:

from py2neo import Graph
from py2neo.bulk import merge_relationships

from concurrent.futures import ThreadPoolExecutor


MAX_WORKERS = 10

graph = Graph("http://neo4j:password@localhost:7474")


def batcher(iterable, n=1):
    """Yield consecutive slices of `iterable`, each at most `n` items long."""
    total = len(iterable)
    for ndx in range(0, total, n):
        yield iterable[ndx:ndx + n]


def upload_relations(graph, dataset):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for batch in batcher(dataset.relationships, dataset.batch_size):
            executor.submit(
                merge_relationships,
                graph.auto(),
                batch,
                dataset.rel_type,
                (tuple(dataset.start_node_labels), *dataset.fixed_order_start_node_properties), # start_node_key
                (tuple(dataset.end_node_labels), *dataset.fixed_order_end_node_properties) # end_node_key
            )

class DataSet:
    batch_size = 1000
    rel_type = "HAS_ACTIVITY"

    start_node_labels = ['Organization']
    fixed_order_start_node_properties = ('index',)

    end_node_labels = ['Activity']
    fixed_order_end_node_properties = ('name',)

    relationships = [
        ('1810003938', {}, 'Type1'),
        ('1710000665', {}, 'Type2'),
        ('1810002242', {}, 'Type3'),
        ('0310006089', {}, 'Type4'),
        ('0310005915', {}, 'Type5'),
        ('1810002325', {}, 'Type6'),
        ('5710001175', {}, 'Type7'),
        ('3610002514', {}, 'Type8'),
        ('3910000839', {}, 'Type9'),
        ...
    ]

dataset = DataSet()

upload_relations(graph, dataset)
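
One thing that may be worth checking in the wrapper above: `executor.submit` returns a `Future`, and an exception raised inside the submitted function is stored on that future rather than raised in the calling thread. Since the futures are never inspected, a failed batch would go unnoticed. The sketch below shows one way to collect results and errors. It uses `fake_merge`, a hypothetical stand-in for `merge_relationships`, so it runs without a database; the `upload` helper and its parameters are also assumptions made for illustration, not part of py2neo.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def batcher(items, n=1):
    """Yield consecutive chunks of at most n items."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


def fake_merge(batch):
    """Hypothetical stand-in for merge_relationships; fails on a marked batch."""
    if any(row[0] == "bad" for row in batch):
        raise ValueError("merge failed")
    return len(batch)


def upload(rows, batch_size, merge=fake_merge, max_workers=4):
    """Submit every batch, then collect results and errors instead of dropping them."""
    merged, errors = 0, []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(merge, b) for b in batcher(rows, batch_size)]
        for future in as_completed(futures):
            try:
                # result() re-raises the exception from the worker thread, if any
                merged += future.result()
            except Exception as exc:
                errors.append(exc)
    return merged, errors


rows = [
    ("1810003938", {}, "Type1"),
    ("bad", {}, "Type2"),  # made-up row that triggers the simulated failure
    ("1810002242", {}, "Type3"),
]
merged, errors = upload(rows, batch_size=1)
print(merged, len(errors))
```

With the real wrapper, the same pattern would mean keeping the futures returned by `executor.submit(merge_relationships, ...)` and calling `result()` on each one, so that any error from a worker thread becomes visible.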
1 REPLY

dmitryd
Node

If you merge nodes through threads (the same code with py2neo.bulk.merge_nodes instead of merge_relationships), duplicates of the Activity nodes are created, BUT no duplicates are created for the Organization nodes. Very strange behavior.
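
A possible explanation, not confirmed in this thread: when MERGE runs in several concurrent transactions, Neo4j does not guarantee uniqueness unless a uniqueness constraint exists on the merge key, so two threads can both fail to find a node and both create it. The sketch below only builds the Neo4j 4.x constraint statements for the labels and keys used in the post. `unique_constraint_statement` is a hypothetical helper, and the commented-out `graph.run` call assumes the py2neo `Graph` object from the original code.

```python
def unique_constraint_statement(label, prop):
    """Build a Neo4j 4.x uniqueness-constraint statement (hypothetical helper).

    Backticks quote the label and property so names such as `index` are
    treated as plain identifiers.
    """
    return f"CREATE CONSTRAINT ON (n:`{label}`) ASSERT n.`{prop}` IS UNIQUE"


statements = [
    unique_constraint_statement("Organization", "index"),
    unique_constraint_statement("Activity", "name"),
]

for stmt in statements:
    print(stmt)
    # graph.run(stmt)  # assumption: `graph` is the py2neo Graph from the post
```

Note that creating a uniqueness constraint fails if duplicate nodes already exist, so the existing duplicate Activity nodes would need to be removed first.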