Upload mass data in Neo4j

Hi Team,
I have a JSON file with 1 million objects that I need to upload to Neo4j. When I upload these objects using the MERGE or CREATE command, it takes a lot of time. Please suggest other methods to upload the data in an optimized way.

@samarthakr

a. What version of Neo4j?

b. Do you have sample Cypher you are running to perform the CREATE / MERGE?

c. Are you trying to commit all 1 million new 'objects' in a single transaction?

d. For MERGE, do you have an index on the label/property used as part of the MERGE?

Hi @dana_canzano,
Please find below the code I'm using to upload data to Neo4j.

a. What version of Neo4j?
Ans: I am using a remote connection to Neo4j from Neo4j Desktop.

b. Do you have sample Cypher you are running to perform the CREATE / MERGE?
Ans: The Cypher query is provided in the code snippet below.

# Libraries
import json
from py2neo import Graph, Node

# Connect to Neo4j
graph = Graph("XXXX", auth=("XXX", "XXX"))

# Function to load JSON data
def load_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

# Function to create a person node
def create_person_node(node):
    if(node.get('schema') == 'Person' or node.get('schema') == 'Organization'):
        # Extracting properties
        id = node.get('id',"")
        caption = node.get('caption','Unknown')
        schema = node.get('schema','')
        properties = node.get("properties", {})
        name = properties.get("name", ["Unknown"])[0]
        firstName = properties.get("firstName", [""])[0]
        lastName = properties.get("lastName", [""])[0]
        alias = properties.get('alias',[''])[0]
        birthDate = properties.get("birthDate", [""])[0]
        birthPlace = properties.get("birthPlace", [""])[0]
        nationality = properties.get("nationality", [""])[0]
        gender = properties.get("gender", [""])[0]
        country = properties.get("country", [""])[0]
        sourceUrl = properties.get("sourceUrl", [""])[0]
        topics = properties.get("topics", [""])[0]
        datasets = node.get('datasets',[''])[0]

        # Create Person node
        record = Node("Record",
                    id=id,
                    schema = schema,
                    caption = caption,
                    name=name,
                    firstName=firstName,
                    lastName=lastName,
                    birthDate=birthDate,
                    birthPlace=birthPlace,
                    nationality=nationality,
                    gender=gender,
                    country=country,
                    sourceUrl=sourceUrl,
                    topics=topics,
                    datasets = datasets,
                    alias = alias)

        # Creating Record Node
        graph.merge(record,'Record','id')

# Function to create dataset nodes and relationships
def create_dataset_relationships(person):
    for dataset in person["datasets"]:
        dataset_query = f"""
        MERGE (d:Dataset {{name: "{dataset}"}})
        WITH d
        MATCH (p:Record {{id: "{person['id']}"}})
        MERGE (p)-[:BELONGS_TO]->(d)
        """
        graph.run(dataset_query)

# Function to create referent nodes and relationships
def create_referent_relationships(person):
    for referent in person["referents"]:
        referent_query = f"""
        MERGE (r:Referent {{name: "{referent}"}})
        WITH r
        MATCH (p:Record {{id: "{person['id']}"}})
        MERGE (p)-[:HAS_REFERENT]->(r)
        """
        graph.run(referent_query)

# Load data from JSON files
json_files = ["input_data.json"]

for file in json_files:
    data = load_json(file)
    for person in data:
        create_person_node(person)
        create_dataset_relationships(person)
        create_referent_relationships(person)

print("Data upload to Neo4j completed.")

I'm also open to any other alternative for uploading the data. Please help me with a bulk upload.

I agree with @dana_canzano about the indexes. You have three match/merge statements, so you need an index on each of the properties you are matching on.
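As a rough sketch of what those indexes could look like for the labels in the posted code (Record.id, Dataset.name, Referent.name): the statements below assume Neo4j 4.4 or later, where the `FOR ... REQUIRE` constraint syntax is available, and assume the values really are unique. A uniqueness constraint also creates a backing index. The constraint names and the `create_constraints` helper are made up for illustration, and `session` is assumed to be a session from the official Python driver (with py2neo, `graph.run(statement)` would play the same role).

```python
# Hypothetical constraint names; labels and properties come from the posted code.
CONSTRAINTS = [
    "CREATE CONSTRAINT record_id IF NOT EXISTS "
    "FOR (r:Record) REQUIRE r.id IS UNIQUE",
    "CREATE CONSTRAINT dataset_name IF NOT EXISTS "
    "FOR (d:Dataset) REQUIRE d.name IS UNIQUE",
    "CREATE CONSTRAINT referent_name IF NOT EXISTS "
    "FOR (r:Referent) REQUIRE r.name IS UNIQUE",
]


def create_constraints(session):
    """Run each constraint statement once and return how many were sent."""
    for statement in CONSTRAINTS:
        # consume() waits for the statement to finish before moving on
        session.run(statement).consume()
    return len(CONSTRAINTS)
```

This only needs to run once, before the load starts. On older 4.x versions the equivalent syntax is `ON (r:Record) ASSERT r.id IS UNIQUE`.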

The second observation is that you are matching on the same Record node twice, because you separated the Cypher into two create-relationship methods. It looks like you also created this Record node in your create_person_node method. You could eliminate the two matches by writing one Cypher script to replace the three separate methods, which would let you pass the created Record node to the two MERGE statements that create the relationships.
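To illustrate the idea, here is one possible shape for a single combined statement. It is only a sketch: it uses query parameters instead of f-strings, keeps just a few of the properties from the posted code, and the names `MERGE_RECORD_WITH_LINKS`, `record_params` and `merge_record` are hypothetical. `session` is assumed to be a session from the official Python driver.

```python
# One statement per record: the Record node is merged once and reused
# for both relationship types, so no extra MATCH is needed.
MERGE_RECORD_WITH_LINKS = """
MERGE (p:Record {id: $id})
SET p += $props
FOREACH (ds IN $datasets |
  MERGE (d:Dataset {name: ds})
  MERGE (p)-[:BELONGS_TO]->(d))
FOREACH (ref IN $referents |
  MERGE (r:Referent {name: ref})
  MERGE (p)-[:HAS_REFERENT]->(r))
"""


def record_params(node):
    """Flatten one JSON object into the parameters the query expects."""
    properties = node.get("properties", {})

    def first(key, default=""):
        # Each property is a list in the source JSON; take its first value
        return (properties.get(key) or [default])[0]

    return {
        "id": node.get("id", ""),
        "props": {
            "schema": node.get("schema", ""),
            "caption": node.get("caption", "Unknown"),
            "name": first("name", "Unknown"),
            "country": first("country"),
        },
        "datasets": node.get("datasets", []),
        "referents": node.get("referents", []),
    }


def merge_record(session, node):
    """Send the combined statement for a single JSON object."""
    session.run(MERGE_RECORD_WITH_LINKS, record_params(node)).consume()
```

Passing values as parameters also lets the server reuse the query plan instead of compiling a new query string for every record.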

@samarthakr

a. what version of Neo4j?
Ans: I am using Remote Connection of Neo4j on Neo4j Desktop.

That doesn't really indicate the version, but OK.
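If it helps, the server version can be read with the built-in `dbms.components()` procedure. A small sketch, where `neo4j_version` is a hypothetical helper and `run` is assumed to be any callable that executes a Cypher string and returns the rows as a list of dicts:

```python
VERSION_QUERY = (
    "CALL dbms.components() YIELD name, versions, edition "
    "RETURN name, versions[0] AS version, edition"
)


def neo4j_version(run):
    """Return a short 'name version (edition)' string for the server."""
    rows = run(VERSION_QUERY)
    row = rows[0]
    return f"{row['name']} {row['version']} ({row['edition']})"
```

With the py2neo `graph` object from the posted code, this could be called as `neo4j_version(lambda q: graph.run(q).data())`.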

Also, you are using py2neo, which is a community-driven Python library and not the officially supported Neo4j Python driver (neo4j on PyPI).

Hi @dana_canzano,

  1. Can you please suggest the official Neo4j Python driver, or an alternative to py2neo?
  2. Even just creating nodes takes a lot of time. In the code below I'm only creating a node for each of the 1 million records, and it is still slow (40K nodes took 15 min). Could you please suggest a tool or approach for faster upload?

# Libraries
import json
from py2neo import Graph, Node

# Connect to Neo4j
graph = Graph("XXXX", auth=("XXX", "XXX"))

# Function to load JSON data
def load_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

# Function to create a person node
def create_person_node(node):
    if(node.get('schema') == 'Person' or node.get('schema') == 'Organization'):
        # Extracting properties
        id = node.get('id',"")
        caption = node.get('caption','Unknown')
        schema = node.get('schema','')
        properties = node.get("properties", {})
        name = properties.get("name", ["Unknown"])[0]
        firstName = properties.get("firstName", [""])[0]
        lastName = properties.get("lastName", [""])[0]
        alias = properties.get('alias',[''])[0]
        birthDate = properties.get("birthDate", [""])[0]
        birthPlace = properties.get("birthPlace", [""])[0]
        nationality = properties.get("nationality", [""])[0]
        gender = properties.get("gender", [""])[0]
        country = properties.get("country", [""])[0]
        sourceUrl = properties.get("sourceUrl", [""])[0]
        topics = properties.get("topics", [""])[0]
        datasets = node.get('datasets',[''])[0]

        # Create Person node
        record = Node("Record",
                    id=id,
                    schema = schema,
                    caption = caption,
                    name=name,
                    firstName=firstName,
                    lastName=lastName,
                    birthDate=birthDate,
                    birthPlace=birthPlace,
                    nationality=nationality,
                    gender=gender,
                    country=country,
                    sourceUrl=sourceUrl,
                    topics=topics,
                    datasets = datasets,
                    alias = alias)

        # Creating Record Node
        graph.merge(record,'Record','id')

# Load data from JSON files
json_files = ["input_data.json"]

for file in json_files:
    data = load_json(file)
    for person in data:
        create_person_node(person)

print("Data upload to Neo4j completed.")

@samarthakr

All I know is that Neo4j has limited experience with and/or support for py2neo. If there is a bug in py2neo, it's not something we control.

https://neo4j.com/docs/api/python-driver/current/#
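For orientation, a minimal sketch of what using the official driver might look like, assuming it was installed with `pip install neo4j`. The URI and credentials are placeholders, and `open_driver` and `count_records` are hypothetical helper names, not part of the driver.

```python
COUNT_RECORDS = "MATCH (r:Record) RETURN count(r) AS n"


def open_driver(uri, user, password):
    """Create a driver and fail early if the server cannot be reached."""
    # Imported inside the function so the rest of this sketch loads even
    # where the package is not installed; normally this goes at module top.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    driver.verify_connectivity()
    return driver


def count_records(session):
    """Return the number of Record nodes currently in the database."""
    return session.run(COUNT_RECORDS).single()["n"]
```

Usage would be along the lines of `driver = open_driver("XXXX", "XXX", "XXX")`, then `with driver.session() as session: print(count_records(session))`, and `driver.close()` at the end.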

Hi @dana_canzano and @glilienfield ,

Even after trying the Neo4j Python driver, there is no significant improvement in speed while uploading the data. Please let me know of any tool, library, or approach that would help upload 1 million nodes in less time (1-2 min).

Do you have the indexes @dana_canzano recommended? Did you refactor the code to eliminate the two extra matches?
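Related to the earlier question about transactions: the posted loop sends one statement, and so one network round trip and one commit, per record. A common alternative is to send the records in batches and let Cypher iterate with UNWIND. The sketch below assumes a session from the official Python driver and rows shaped as `{"id": ..., "props": {...}}`; the batch size of 10,000 is an arbitrary starting point to tune, and `chunks` and `load_in_batches` are hypothetical helper names.

```python
from itertools import islice

# One statement handles a whole batch of rows passed as a parameter.
UNWIND_MERGE = """
UNWIND $rows AS row
MERGE (p:Record {id: row.id})
SET p += row.props
"""


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_in_batches(session, rows, batch_size=10_000):
    """Send rows to the server batch by batch; return how many were sent."""
    total = 0
    for batch in chunks(rows, batch_size):
        # Each run() here is its own auto-commit transaction
        session.run(UNWIND_MERGE, {"rows": batch}).consume()
        total += len(batch)
    return total
```

With 1 million rows and this batch size, that is 100 statements instead of 1 million. Whether it reaches the 1-2 minute target depends on the hardware, the network to the remote server, and whether the index on Record.id exists.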
