Hello Community! I have been exploring Neo4j's features for several months. I use a dockerized Neo4j Community Edition (Browser version: [4.2.6], Server version: [4.1.9]) and py2neo==2021.2.3.
I have a problem importing a large amount of data, about 100-500k rows.
I wrote a wrapper around merge_relationships (py2neo -> bulk operations); the code is below. It splits the dataset into chunks and sends them to merge via threads. Problem: the relationships between the nodes are lost, even though the nodes definitely exist! (I left out the node-creation code to avoid cluttering the example.)
Code example:
from py2neo import Graph
from py2neo.bulk import merge_relationships
from concurrent.futures import ThreadPoolExecutor
MAX_WORKERS = 10
graph = Graph("http://neo4j:password@localhost:7474")
def batcher(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]
def upload_relations(graph, dataset):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for batch in batcher(dataset.relationships, dataset.batch_size):
            executor.submit(
                merge_relationships,
                graph.auto(),
                batch,
                dataset.rel_type,
                (tuple(dataset.start_node_labels), *dataset.fixed_order_start_node_properties),  # start_node_key
                (tuple(dataset.end_node_labels), *dataset.fixed_order_end_node_properties)  # end_node_key
            )
class DataSet:
    batch_size = 1000
    rel_type = "HAS_ACTIVITY"
    start_node_labels = ['Organization']
    fixed_order_start_node_properties = ('index',)
    end_node_labels = ['Activity']
    fixed_order_end_node_properties = ('name',)
    relationships = [
        ('1810003938', {}, 'Type1'),
        ('1710000665', {}, 'Type2'),
        ('1810002242', {}, 'Type3'),
        ('0310006089', {}, 'Type4'),
        ('0310005915', {}, 'Type5'),
        ('1810002325', {}, 'Type6'),
        ('5710001175', {}, 'Type7'),
        ('3610002514', {}, 'Type8'),
        ('3910000839', {}, 'Type9'),
        ...
    ]
dataset = DataSet()
upload_relations(graph, dataset)
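
One thing that may be worth checking: in the code above, the Future objects returned by executor.submit are discarded. An exception raised inside a worker thread is stored on its Future and only re-raised when .result() is called, so a failed batch would go unnoticed. Below is a minimal sketch of how the failures could be collected. run_batches and fake_merge are hypothetical names used only for illustration (they are not part of py2neo), and fake_merge merely stands in for the real merge call so the example runs without a database. Whether failed batches are actually the cause here is an assumption, not something I have confirmed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batches(func, batches, max_workers=10):
    """Run func(batch) for every batch in a thread pool.

    Hypothetical helper (not part of py2neo). Returns a tuple
    (succeeded, failures), where failures is a list of
    (batch_index, exception) pairs.
    """
    succeeded = 0
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Keep every Future so that worker exceptions can be inspected.
        futures = {
            executor.submit(func, batch): index
            for index, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            try:
                future.result()  # re-raises any exception from the worker
                succeeded += 1
            except Exception as exc:
                failures.append((futures[future], exc))
    return succeeded, failures


def fake_merge(batch):
    # Stand-in for the real merge call; fails on purpose for one batch.
    if "bad" in batch:
        raise RuntimeError("simulated failure")


succeeded, failures = run_batches(fake_merge, [["a"], ["bad"], ["c"]])
print(succeeded, [(index, str(exc)) for index, exc in failures])
# prints: 2 [(1, 'simulated failure')]
```

In the upload code, func could be a small function that calls merge_relationships(graph.auto(), batch, ...) with the same arguments as above. If failures turns out to be non-empty, the stored exceptions should show why those batches were not written.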