
Slow Kafka-Neo4j sink

busymo16
Ninja

Hi everyone,

I have the following situation: I am sinking data from Kafka into Neo4j using the Neo4j Kafka connector. I read from 44 topics, and some of them intersect with each other, which means that locks can occur during data creation. The locks arise because I use multiple MERGE operations, since I need to handle null and duplicate values. An example query is shown below:

WITH event AS data
MERGE (cs:CustomerSite { id: data.id })
ON CREATE SET cs = data
ON MATCH SET cs = data
WITH data, cs
CALL apoc.do.when(data.customerId IS NOT NULL,
    " MERGE (c:Customer {id: data.customerId}) MERGE (c)-[:HAS_SITE]->(cs) ",
    " RETURN True ",
    {data: data, cs: cs}
) YIELD value
WITH data, cs
CALL apoc.do.when(data.mainAddressId IS NOT NULL,
    " MERGE (a:Address {id: data.mainAddressId}) MERGE (cs)-[:IS_LOCATED]->(a) ",
    " RETURN True ",
    {data: data, cs: cs}
) YIELD value
WITH data, cs
CALL apoc.do.when(data.partnerId IS NOT NULL,
    " MERGE (cp:Customer:Partner {id: data.partnerId}) MERGE (cp)-[:HAS_SITE]->(cs) ",
    " RETURN True ",
    {data: data, cs: cs}
) YIELD value
WITH data, cs
CALL apoc.do.when(data.symCsaId IS NOT NULL,
    " MERGE (e:ExternalID:SYM {value: data.symCsaId, type: 'CSA ID'}) MERGE (cs)-[:HAS_EXTERNAL_ID]->(e) ",
    " RETURN True ",
    {data: data, cs: cs}
) YIELD value
WITH data, cs
CALL apoc.do.when(data.deleted = 1,
    " DETACH DELETE cs ",
    " RETURN True ",
    {data: data, cs: cs}
) YIELD value
RETURN 'done'
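For reference, the same conditional branches can also be written in plain Cypher with the FOREACH-over-CASE idiom, which avoids the string sub-queries of apoc.do.when. This is only a sketch using the same labels and properties as the query above (two of the branches are shown; whether it changes the lock behaviour would need to be measured):

WITH event AS data
MERGE (cs:CustomerSite { id: data.id })
ON CREATE SET cs = data
ON MATCH SET cs = data
// Empty list when the condition is false, so the MERGE simply does not run
FOREACH (_ IN CASE WHEN data.customerId IS NOT NULL THEN [1] ELSE [] END |
    MERGE (c:Customer {id: data.customerId})
    MERGE (c)-[:HAS_SITE]->(cs))
FOREACH (_ IN CASE WHEN data.deleted = 1 THEN [1] ELSE [] END |
    DETACH DELETE cs)
RETURN 'done'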

Due to locking on nodes while the relationships are being created, I get very poor performance on the Neo4j side, and sometimes the Kafka consumer even disconnects because of the long processing time on Neo4j. The amount of data I am trying to load is around 80 million nodes and 80 million relationships. I can see that Kafka sends batches that mix records from different topics, which significantly increases the likelihood of lock contention. When I have only one topic per connector and the connectors are not all running at the same time, the performance is quite decent.

Do you have any ideas for improving performance on the Neo4j side? Or, on the Kafka side, is there a way to make Kafka send batches of only one topic at a time rather than a mix of topics? My setup is a cluster with 3 core replicas.

Heap size = 20G
Pagecache size = 60G
Transaction max size = 2G

Thanks in advance!

1 ACCEPTED SOLUTION

bennu_neo
Neo4j
Neo4j

Hi @busymo16!

Have you checked the indexes you are using?

😉

Bennu

Oh, y’all wanted a twist, ey?
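To expand on the index hint: every MERGE in the query looks a node up by property, and without an index or uniqueness constraint each lookup is a full label scan, which both slows the query and widens the window in which locks are held. A minimal sketch of the constraints the query would benefit from (Neo4j 4.4+ syntax; the constraint names are made up, and the composite NODE KEY constraint requires Enterprise Edition):

// One unique-property constraint per merged label/key
CREATE CONSTRAINT customer_site_id IF NOT EXISTS
FOR (cs:CustomerSite) REQUIRE cs.id IS UNIQUE;

CREATE CONSTRAINT customer_id IF NOT EXISTS
FOR (c:Customer) REQUIRE c.id IS UNIQUE;

CREATE CONSTRAINT address_id IF NOT EXISTS
FOR (a:Address) REQUIRE a.id IS UNIQUE;

// ExternalID is merged on (value, type); a composite NODE KEY covers both
CREATE CONSTRAINT external_id_key IF NOT EXISTS
FOR (e:ExternalID) REQUIRE (e.value, e.type) IS NODE KEY;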

