Why is my Neo4j insertion Python script slowing down over time?

Hi everyone,

I'm using Python to extract various datapoints out of a 100 GB+ JSON file. Using neomodel, every group of datapoints extracted is saved into a Neo4j DBMS running on the same system.

I monitor the extraction and saving by calculating how often the saving function is called per second. Depending on the system I'm running this on, the script starts to slow down either almost immediately or after a few minutes. I don't see any strain on the system, so I don't know what to make of this behaviour. This is the script that saves the data:

from neomodel import StructuredNode, StringProperty, Relationship, config, DateProperty, BooleanProperty

config.DATABASE_URL = 'bolt://neo4j:password@localhost:7687'


class Label1(StructuredNode):
    Prop1 = StringProperty(unique_index=True)
    Prop2 = StringProperty()
    Prop3 = StringProperty(unique_index=True)
    Prop4 = Relationship('Label2', 'REL1')
    Prop5 = Relationship('Label3', 'REL2')


class Label2(StructuredNode):
    Prop1 = DateProperty()
    Prop2 = BooleanProperty(default=False)


class Label3(StructuredNode):
    Prop1 = StringProperty()
    Prop2 = StringProperty()
    Prop3 = StringProperty()


def save_data(Data1, Data2, Data3):

    # LABEL 1

    send_data1 = Label1(Prop1=Data1[0], Prop3=Data1[1], Prop2=Data1[2]).save()

    # LABEL 2

    try:
        send_data2 = Label2.nodes.get(Prop1=Data2[0], Prop2=Data2[1])
    except Label2.DoesNotExist:  # raised by .get() when no matching node exists
        send_data2 = Label2(Prop1=Data2[0], Prop2=Data2[1]).save()

    # LABEL 3

    try:
        send_data3 = Label3.nodes.get(Prop1=Data3[0])
    except Label3.DoesNotExist:  # raised by .get() when no matching node exists
        send_data3 = Label3(Prop1=Data3[0], Prop2=Data3[1], Prop3=Data3[2]).save()

    # RELATIONSHIPS (connected via the relationship attributes defined on Label1)

    send_data1.Prop4.connect(send_data2)
    send_data1.Prop5.connect(send_data3)
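
For reference, the groups/s figure in the log further down is computed with a simple cumulative counter; a minimal sketch of the idea (the save_and_count wrapper and variable names are only illustrative, not my exact code):

import time

start = time.time()
saved = 0

def save_and_count(Data1, Data2, Data3):
    # call the saving function above and report the cumulative throughput
    global saved
    save_data(Data1, Data2, Data3)
    saved += 1
    if saved % 100 == 0:
        rate = saved / (time.time() - start)
        print(f"saved {saved} datagroups [groups/s: {rate:.2f}]")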

I'm very new to Neo4j, but from what I've found, this probably has something to do with doing a lot of writes in quick succession? Is there a better way to do this with neomodel? Thanks in advance!

Edit: Added the log. The decline begins at around 5500 saved datagroups. If I leave it running long enough, it drops to 1 group/s and below.

saved 100 datagroups [groups/s: 12.57]
saved 200 datagroups [groups/s: 16.25]
saved 300 datagroups [groups/s: 18.01]
saved 400 datagroups [groups/s: 19.22]
saved 500 datagroups [groups/s: 19.89]
saved 600 datagroups [groups/s: 20.40]
found 616 datapoints among 1000 entities
saved 700 datagroups [groups/s: 20.81]
saved 800 datagroups [groups/s: 21.05]
saved 900 datagroups [groups/s: 21.28]
saved 1000 datagroups [groups/s: 21.42]
saved 1100 datagroups [groups/s: 21.54]
found 1150 datapoints among 2000 entities
saved 1200 datagroups [groups/s: 21.69]
saved 1300 datagroups [groups/s: 21.86]
saved 1400 datagroups [groups/s: 22.00]
saved 1500 datagroups [groups/s: 22.00]
found 1565 datapoints among 3000 entities
saved 1600 datagroups [groups/s: 22.10]
saved 1700 datagroups [groups/s: 22.20]
saved 1800 datagroups [groups/s: 22.25]
saved 1900 datagroups [groups/s: 22.34]
saved 2000 datagroups [groups/s: 22.41]
found 2096 datapoints among 4000 entities
saved 2100 datagroups [groups/s: 22.46]
saved 2200 datagroups [groups/s: 22.45]
saved 2300 datagroups [groups/s: 22.47]
saved 2400 datagroups [groups/s: 22.50]
found 2467 datapoints among 5000 entities
saved 2500 datagroups [groups/s: 22.53]
saved 2600 datagroups [groups/s: 22.54]
saved 2700 datagroups [groups/s: 22.58]
saved 2800 datagroups [groups/s: 22.61]
saved 2900 datagroups [groups/s: 22.61]
saved 3000 datagroups [groups/s: 22.66]
saved 3100 datagroups [groups/s: 22.76]
found 3113 datapoints among 6000 entities
saved 3200 datagroups [groups/s: 22.84]
saved 3300 datagroups [groups/s: 22.91]
saved 3400 datagroups [groups/s: 22.99]
saved 3500 datagroups [groups/s: 23.06]
saved 3600 datagroups [groups/s: 23.07]
saved 3700 datagroups [groups/s: 23.06]
found 3714 datapoints among 7000 entities
saved 3800 datagroups [groups/s: 23.07]
saved 3900 datagroups [groups/s: 23.09]
saved 4000 datagroups [groups/s: 23.09]
saved 4100 datagroups [groups/s: 23.10]
saved 4200 datagroups [groups/s: 23.11]
saved 4300 datagroups [groups/s: 23.11]
saved 4400 datagroups [groups/s: 23.12]
saved 4500 datagroups [groups/s: 23.13]
saved 4600 datagroups [groups/s: 23.11]
saved 4700 datagroups [groups/s: 23.08]
found 4776 datapoints among 8000 entities
saved 4800 datagroups [groups/s: 23.07]
saved 4900 datagroups [groups/s: 23.05]
saved 5000 datagroups [groups/s: 23.05]
saved 5100 datagroups [groups/s: 23.03]
saved 5200 datagroups [groups/s: 23.02]
saved 5300 datagroups [groups/s: 23.03]
saved 5400 datagroups [groups/s: 23.02]
saved 5500 datagroups [groups/s: 23.02]
saved 5600 datagroups [groups/s: 22.99]
saved 5700 datagroups [groups/s: 22.98]
saved 5800 datagroups [groups/s: 22.98]
saved 5900 datagroups [groups/s: 22.98]
found 5902 datapoints among 9000 entities
saved 6000 datagroups [groups/s: 22.97]
saved 6100 datagroups [groups/s: 22.96]
saved 6200 datagroups [groups/s: 22.95]
saved 6300 datagroups [groups/s: 22.95]
saved 6400 datagroups [groups/s: 22.94]
saved 6500 datagroups [groups/s: 22.95]
saved 6600 datagroups [groups/s: 22.94]
saved 6700 datagroups [groups/s: 22.93]
saved 6800 datagroups [groups/s: 22.92]
saved 6900 datagroups [groups/s: 22.90]
found 6920 datapoints among 10000 entities

Hi @jonas1!

I'm not a neomodel expert, but from what I see, you eventually start querying on Label2 and Label3.

Do you have an index on Prop1 for Label3 and a composite one for Label2 on Prop1, Prop2?

Regards,

Bennu

Hi @bennu.neo,
thanks for your quick reply!

Yes, the queries on Label2 and Label3 are there to check whether a node with these properties already exists.

In this example Label2 would be a date: Prop1 is the actual date property and Prop2 is a bool that signals whether the date is BC. Every date should only exist once.

Same for Label3: I want to know if a node with Prop1 already exists so I can create a relationship to it, or if I need to create the node first.
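
For context, neomodel also has a get_or_create() helper that, if I read the docs correctly, does this check-then-create in a single MERGE (how it matches depends on which properties are unique/required, so treat this as a sketch rather than my actual code):

    # hypothetical replacement for the try/except blocks in save_data()
    send_data2 = Label2.get_or_create({'Prop1': Data2[0], 'Prop2': Data2[1]})[0]
    send_data3 = Label3.get_or_create({'Prop1': Data3[0], 'Prop2': Data3[1], 'Prop3': Data3[2]})[0]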

I hope this is what you wanted to know.

Thanks!

Jonas

Hi @jonas1 !

I see Label1 has some unique_index definitions, but the other two classes don't.

You may want to add some indexes to speed up those .get() calls. Otherwise every lookup is a label scan plus a filter, which gets slower as the cardinality of the label grows.
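
Something like this, roughly (a sketch, not tested; install_all_labels is neomodel's helper for creating the declared constraints/indexes in the database, and the composite index line assumes Neo4j 4.x Cypher syntax):

from neomodel import StructuredNode, StringProperty, DateProperty, BooleanProperty, db, install_all_labels


class Label2(StructuredNode):
    Prop1 = DateProperty(index=True)    # plain index so .get(Prop1=..., Prop2=...) avoids a label scan
    Prop2 = BooleanProperty(default=False)


class Label3(StructuredNode):
    Prop1 = StringProperty(unique_index=True)   # unique constraint: every value should exist only once
    Prop2 = StringProperty()
    Prop3 = StringProperty()


# create the indexes/constraints declared above in the running database
install_all_labels()

# optionally, a composite index on Label2 (raw Cypher, Neo4j 4.x syntax)
db.cypher_query(
    "CREATE INDEX label2_prop1_prop2 IF NOT EXISTS FOR (n:Label2) ON (n.Prop1, n.Prop2)"
)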

Bennu


Hi. I don’t know the nuts and bolts behind this, but I faced a similar issue. My solution was to commit in batches of 5000 (just like you have empirically found to be ideal!)

from neomodel import db

statement = """
UNWIND $rows AS row
MERGE (l1:Label1 {Prop1: row.Prop1, Prop3: row.Prop3})
  ON CREATE SET l1.Prop2 = row.Prop2
MERGE (l2:Label2 {Prop1: row.Prop4, Prop2: row.Prop5})
MERGE (l3:Label3 {Prop1: row.Prop6, Prop2: row.Prop7, Prop3: row.Prop8})
WITH l1, l2, l3
// relationship types taken from the Label1 model (REL1 -> Label2, REL2 -> Label3)
MERGE (l3)<-[:REL2]-(l1)-[:REL1]->(l2)
"""

params = []
for i in range(len(Data)):
    # map one extracted datagroup onto the row keys used in the statement above
    # (adjust the indices to match your Data layout)
    params.append({
        "Prop1": Data[i][0], "Prop2": Data[i][1], "Prop3": Data[i][2],
        "Prop4": Data[i][3], "Prop5": Data[i][4], "Prop6": Data[i][5],
        "Prop7": Data[i][6], "Prop8": Data[i][7],
    })
    if len(params) == 5000:  # commit a batch
        results, meta = db.cypher_query(statement, params={"rows": params})
        params = []
# flush whatever is left over
results, meta = db.cypher_query(statement, params={"rows": params})
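
(One caveat: the MERGE lookups in that statement still rely on the indexes/constraints discussed above to stay fast; without them each MERGE falls back to a label scan, so batching alone may not remove the slowdown.)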

Hi @bennu_neo,

thank you for pointing me in the right direction! This makes perfect sense now that I think about it. I also didn't know about the index functionality. It not only fixed the performance decline over time, it also led to a performance increase of around 80%. Thanks a lot!

Hi @sanjaysingh13, thanks for your solution! I will need to get more familiar with Cypher, but I will try it and see if it can boost performance even more.