
Possible bug when running a heavy query

paolodipietro58
Graph Voyager

Running Community Edition 4.4.5 with Browser 4.4.3; 64 GB RAM, 32 GB allocated to Neo4j.

While running a heavy :auto LOAD CSV .... the browser lost its connection to the server.
This is the content of the .log file:

2022-04-29 13:27:29.512+0000 INFO  Starting...
2022-04-29 13:27:30.050+0000 INFO  This instance is ServerId{82eca200} (82eca200-c0fc-4d26-863f-8b517146c413)
2022-04-29 13:27:31.404+0000 INFO  ======== Neo4j 4.4.5 ========
2022-04-29 13:27:38.802+0000 INFO  Performing postInitialization step for component 'security-users' with version 3 and status CURRENT
2022-04-29 13:27:38.802+0000 INFO  Updating the initial password in component 'security-users'
2022-04-29 13:27:40.353+0000 INFO  Called db.clearQueryCaches(): Query cache already empty.
2022-04-29 13:27:40.404+0000 INFO  Bolt enabled on [0:0:0:0:0:0:0:0%0]:7687.
2022-04-29 13:27:41.203+0000 INFO  Remote interface available at http://localhost:7474/
2022-04-29 13:27:41.211+0000 INFO  id: C14F9C88D8145F5B0AFA874534F68CD593BA9750FDA5DF53D182C90AC1C2D54A
2022-04-29 13:27:41.212+0000 INFO  name: system
2022-04-29 13:27:41.212+0000 INFO  creationDate: 2022-04-21T10:39:11.124Z
2022-04-29 13:27:41.212+0000 INFO  Started.
2022-04-29 13:28:53.642+0000 ERROR LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.AbstractByteBufAllocator.compositeDirectBuffer(AbstractByteBufAllocator.java:224)
	io.netty.buffer.AbstractByteBufAllocator.compositeBuffer(AbstractByteBufAllocator.java:202)
	io.netty.handler.codec.MessageAggregator.decode(MessageAggregator.java:271)
	io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
	io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:327)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:299)
	io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.handler.codec.ByteToMessageDecoder.handlerRemoved(ByteToMessageDecoder.java:255)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:517)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	org.neo4j.bolt.transport.pipeline.AuthenticationTimeoutHandler.channelRead(AuthenticationTimeoutHandler.java:58)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
	io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
	io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Thread.java:829)

This is a screenshot of the browser, which is neither connected nor responding while the query appears to still be running (see the spinning icons in the lower right corner, which is misleading since the server is disconnected). The CPU is at 100% on all available threads. What does this mean? How can I work around this problem?

And this is after some hours:

And this is a fragment from debug.log

2022-04-29 16:14:50.680+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=54175, gcTime=30321, gcCount=6}
2022-04-29 16:14:30.276+0000 ERROR [o.n.b.t.p.ProtocolHandshaker] Fatal error occurred during protocol handshaking: [id: 0x6f6ed698, L:/127.0.0.1:7687 ! R:/127.0.0.1:57184]
org.neo4j.bolt.runtime.BoltConnectionFatality: Terminated connection '[id: 0x6f6ed698, L:/127.0.0.1:7687 ! R:/127.0.0.1:57184]' as the client failed to authenticate within 30000 ms.
	at org.neo4j.bolt.transport.pipeline.AuthenticationTimeoutHandler.channelIdle(AuthenticationTimeoutHandler.java:78) ~[neo4j-bolt-4.4.5.jar:4.4.5]
	at io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:506) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:478) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]
2022-04-29 16:15:00.868+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=30391, gcTime=10186, gcCount=2}



Cobra
Ninja

Hello @paolodipietro58

I think the database loaded too many things at the same time, or it was a connection issue.
Did you try to execute the query again?
Can you share the query please?

Regards,
Cobra

paolodipietro58
Graph Voyager

I reran the query several times, but the result has always been the same.
My idea is to try loading with cypher-shell instead of the Browser, because the issue looks like it's in the browser: it loses the connection to the DB while the DB is using all the available threads at 100%, so the browser can't answer the DB's liveness check (or vice versa).
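A cypher-shell invocation for a script file might look roughly like this (a sketch: the address, credentials, and the script name load_assets.cypher are placeholders, not taken from the post):

```shell
# Hypothetical invocation: run the Cypher script through cypher-shell
# instead of pasting it into the Browser. Address, credentials, and
# file name are placeholders.
cypher-shell -a bolt://localhost:7687 -u neo4j -p 'your-password' \
  --format plain -f load_assets.cypher
```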

Here is the query; the only difference from similar ones is the file size: 4 GB with at least 31,500,000 lines.

MATCH (n:PhysicalAsset)-[r]-() DETACH DELETE n,r;
MATCH (n:PhysicalAsset) delete n;
MATCH (n:InstalledAsset)-[r]-() DETACH DELETE n,r;
MATCH (n:InstalledAsset) delete n;

CREATE CONSTRAINT IF NOT EXISTS FOR (n:`PhysicalAsset`) REQUIRE n.`uuid` IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS FOR (n:`PhysicalAsset`) REQUIRE n.`id` IS UNIQUE;

CREATE CONSTRAINT IF NOT EXISTS FOR (n:`InstalledAsset`) REQUIRE n.`uuid` IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS FOR (n:`InstalledAsset`) REQUIRE n.`id` IS UNIQUE;

:auto USING PERIODIC COMMIT 500 LOAD CSV WITH HEADERS FROM 'file:///assets.csv' AS row FIELDTERMINATOR ';'

WITH row
	WHERE row.asset_type IS NOT NULL

CREATE (physicalAsset:PhysicalAsset { id : toInteger(row.id)})
		SET physicalAsset = row

WITH physicalAsset, row, CASE row.asset_type WHEN NULL THEN -1 ELSE toInteger(row.asset_type) END AS type

call {
	WITH physicalAsset, row
	WITH physicalAsset, row, toInteger(row.asset_type) as type      // we throw away 4 physicalAssets with undefined (NULL) asset_type

	WITH physicalAsset, row, type, trim(row.name) + trim(row.description) as fullName,
		CASE type 
			WHEN 0 THEN 'Type_0'
			WHEN 1 THEN 'Type_1'
			WHEN 3 THEN 'Type_2'
			WHEN 4 THEN 'Type_3'
			WHEN 5 THEN 'Type_4'
			WHEN 6 THEN 'Type_5'
			ELSE NULL
		END as type1

	WITH physicalAsset, row, type, type1, fullName,
		CASE 
			WHEN type = "Type_1" AND apoc.text.indexOf(fullName,"M10") > -1 THEN "M10"
			WHEN type = "Type_1" AND apoc.text.indexOf(fullName,"M20") > -1 THEN "M20"

			WHEN type = "Type_3" THEN "M40"

			WHEN type = "Type_4" THEN "Q50"
			
			ELSE NULL
		END as subType		

		WHERE type1 IS NOT NULL	 

	CALL apoc.create.setLabels(physicalAsset,[type1, subType, 'PhysicalAsset']) YIELD node

	return node
}

WITH row, physicalAsset, type, node
	WHERE node IS NOT NULL

MERGE (node)-[ownerhipPeriod:IS_INSTALLED_AS]->(installedAsset:InstalledAsset)
	ON CREATE
		SET installedAsset.uuid = apoc.create.uuid(),
			installedAsset.id = toInteger(row.id),
			installedAsset.start_date = datetime(replace(row.created_at," ","T")),
			installedAsset.updated_at = datetime(replace(row.updated_at," ","T")),
			installedAsset.end_date = datetime(replace(row.deleted_at," ","T"))

WITH row, installedAsset

OPTIONAL MATCH (organisation:Organisation { id: toInteger(row.current_organisation_id) })
OPTIONAL MATCH (store:Store { id: toInteger(row.store_id) })

CALL {
	WITH organisation, installedAsset
	WITH organisation, installedAsset
	WHERE organisation IS NULL
	MERGE (error:Error)
	MERGE (installedAsset)-[:HAS_NO_ORGANISATION]->(error)
}

CALL {
	WITH store, installedAsset
	WITH store, installedAsset
	WHERE store IS NULL
	MERGE (error:Error)
	MERGE (installedAsset)-[:HAS_NO_STORE]->(error)
}
		
CALL {
	WITH store, installedAsset
	WITH store, installedAsset
	WHERE store IS NOT NULL
    MERGE (installedAsset)-[:IS_INSTALLED_AT]->(store)	
}

RETURN COUNT(installedAsset);

Cobra
Ninja

Sorry for taking so long to respond.

You need to optimize all queries for operations of this size because there will never be enough RAM to do them all at once.

These queries should be compatible with Neo4j Community 4.4.5.

// Delete PhysicalAsset nodes and their relationships
MATCH (n:PhysicalAsset) 
CALL { 
	WITH n 
	DETACH DELETE n 
} IN TRANSACTIONS OF 10000 ROWS;
// Delete InstalledAsset nodes and their relationships
MATCH (n:InstalledAsset) 
CALL { 
	WITH n 
	DETACH DELETE n 
} IN TRANSACTIONS OF 10000 ROWS;

You should also use the MERGE clause instead of CREATE, especially since you are creating unique constraints first:

MERGE (physicalAsset:PhysicalAsset {id: toInteger(row.id)})
SET physicalAsset = row
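One caveat worth noting (an editorial aside, not from Cobra's reply): SET physicalAsset = row replaces every property with the string values from the CSV, including the id that the MERGE matched on as an integer. A hedged variant that keeps the integer id might look like:

```cypher
// Sketch: += merges the CSV columns in without dropping existing
// properties, then id is re-cast so it stays an integer
// (LOAD CSV delivers every value as a string).
MERGE (physicalAsset:PhysicalAsset { id: toInteger(row.id) })
SET physicalAsset += row,
    physicalAsset.id = toInteger(row.id)
```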

I'm also pretty sure you can reduce the workload where you compute the types (type, type1, subType).
Don't carry variables like type through the whole query; you don't need it anymore after you set the labels.

It's hard to do more without a data set to test.

To conclude, you don't need more RAM; you just need to optimize the queries.

@Cobra I never thought of buying more RAM!
Well, these queries are sequential, so I don't need the DETACH DELETE IN TRANSACTIONS; they run like a charm. The problem is with the main body, the one with LOAD CSV.

I'll surely change CREATE to MERGE and remove type but, anyway, the server still disconnects (I made another post for this) because it is overloaded - look at this image:

Thank you!

Cobra
Ninja

Could you try running your query directly through cypher-shell, to take the Browser connection out of the picture?

@Cobra Sure, I did it, but nothing changed.

Here is the complete offending query: it loads 8,999,000 rows. Overnight (without transactions) it ran out of memory!

So, I added transactions.

But, looking at the manual, I tried two different approaches before running this query:
I put the LOAD CSV just before the CALL {, but it returned an error.

MATCH (n:Transaction)-[r]-()
CALL {
    WITH n,r
    DETACH DELETE n,r
    } IN TRANSACTIONS OF 10000 ROWS;

MATCH (n:Transaction)
CALL {
    WITH n
    delete n
    } IN TRANSACTIONS OF 10000 ROWS;

CREATE CONSTRAINT IF NOT EXISTS FOR (n:`Transaction`) REQUIRE n.`uuid` IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS FOR (n:`Transaction`) REQUIRE n.`id` IS UNIQUE;

CALL {

LOAD CSV WITH HEADERS FROM 'file:///transactions.csv' AS row FIELDTERMINATOR ';'

OPTIONAL MATCH (organisation:Organisation {id: toInteger(row.organisation_id)})

OPTIONAL MATCH (bankAccount:BankAccount {id: toInteger(row.bank_account_id)})

MERGE (transaction:Transaction { id : toInteger(row.id) })
		SET transaction = row,
			transaction.uuid = apoc.create.uuid(),
            transaction.id = toInteger(row.id),
            transaction.consumer_id = NULL // remove the attribute, which is no longer used

    WITH transaction, row, bankAccount, organisation

CALL {
    WITH transaction, row, bankAccount
    WITH transaction, row, bankAccount
        WHERE bankAccount IS NOT NULL

    MERGE (transaction)-[h:HAS]->(bankAccount)
}

CALL {
    WITH transaction, row, bankAccount
    WITH transaction, row, bankAccount
        WHERE bankAccount IS NULL
    
    MERGE (error:Error)
    MERGE (transaction)-[h:HAS_NO_BANK_ACCOUNT]->(error)
}

CALL {
    WITH transaction, row, organisation
    WITH transaction, row, organisation
        WHERE organisation IS NOT NULL

    MERGE (transaction)-[h:BELONGS_TO]->(organisation)
}

CALL {
    WITH transaction, row, organisation
    WITH transaction, row, organisation
        WHERE organisation IS NULL
    
    MERGE (error:Error)
    MERGE (transaction)-[h:HAS_NO_ORGANISATION]->(error)
}

} IN TRANSACTIONS OF 500 ROWS;
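For reference, the pattern documented for Neo4j 4.4 keeps LOAD CSV outside the subquery and imports each row with WITH; a sketch using the file and key from the query above (only the Transaction MERGE is shown, as an illustration rather than a drop-in replacement):

```cypher
// Sketch of the documented 4.4 batching pattern: LOAD CSV drives the
// outer query, and CALL { ... } IN TRANSACTIONS commits every 500 rows.
// The relationship logic from the full query would go inside the same
// subquery.
:auto LOAD CSV WITH HEADERS FROM 'file:///transactions.csv' AS row FIELDTERMINATOR ';'
CALL {
    WITH row
    MERGE (transaction:Transaction { id: toInteger(row.id) })
        SET transaction = row,
            transaction.uuid = apoc.create.uuid(),
            transaction.id = toInteger(row.id)
} IN TRANSACTIONS OF 500 ROWS;
```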

Hi @paolodipietro58 ,

I suggest you read these two posts.

Your DBMS is dying due to garbage collection issues. The line "2022-04-29 16:15:00.868+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=30391, gcTime=10186, gcCount=2}" means that your DB paused for more than 30 seconds (and I guess it's not the only pause). This is mainly due to a lack of query tuning.

Can you split the transaction-creation logic from the other elements, so that the plan of each query doesn't contain any Eager operation?

Regards,

Bennu


I understand the concept of an eager operation, but it's not clear which operations are eager or how to spot them!
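One common way to look for them (an editorial sketch; the query below is illustrative, not from this thread) is to prefix the statement with EXPLAIN and scan the resulting plan for an Eager operator:

```cypher
// EXPLAIN shows the query plan without executing the query.
// An "Eager" operator in the plan means Neo4j must materialize all rows
// before continuing, which defeats periodic commit / batched transactions.
EXPLAIN
LOAD CSV WITH HEADERS FROM 'file:///transactions.csv' AS row
MERGE (t:Transaction { id: toInteger(row.id) })
SET t += row;
```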

glilienfield
Ninja

BTW, I think you can simplify the following snippet from your query as shown:

MATCH (n:Transaction)-[r]-()
CALL {
    WITH n,r
    DETACH DELETE n,r
    } IN TRANSACTIONS OF 10000 ROWS;

MATCH (n:Transaction)
CALL {
    WITH n
    delete n
    } IN TRANSACTIONS OF 10000 ROWS;

Suggested refactor.

MATCH (n:Transaction)
CALL {
    WITH n
    DETACH DELETE n
    } IN TRANSACTIONS OF 10000 ROWS;

Good suggestion! I'll just apply it!