Creating Relationships takes a very long time

NewGraphGuy · November 8, 2022, 3:06am

Hi,

I am working with Twitter data and creating nodes for Tweets. More specifically, each row in my CSV has 2 tweets, some tweet properties (likes, retweets, and such) and a similarity score created earlier in python. I am then connecting the tweets based on their score (since scores below a certain level were removed from the dataset, I am creating a relationship for every row in the CSV).

I do the following:

Delete everything to start clean
Create nodes for every tweet in column TweetA (5000 nodes, 50k properties, 20 mins)
Create nodes for every tweet in column TweetB (very little changes if any, 20 mins)
Finally, load the csv for a third time and draw a line based on TweetA and TweetBs token.

The very last step is taking insanely long. I've tried a few times now and I get a web socket error (I think my PC falls asleep and the Neo4J browser disconnects). Steps 2 and 3 take about 20 minutes each. The dataset has 230k rows and is 200 MB. Any ideas how to optimize my query?

//Delete Everything

MATCH (n)

DETACH DELETE n;

//Create Tweets from TweetA

LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row

WITH row where not row.UserA is null

MERGE (tweet:Tweet

{

TweetID:row.TweetTokenA,

Tweet:row.text_x,

AuthorToken:row.AuthorTokenA,

AuthorHandle:row.UserA,

CreatedAt:row.a_created_at,

Retweets:row.a_rt_cnt,

Replies:row.a_reply_cnt,

Likes:row.a_like_count,

Quotes:row. a_qt_count,

Aspect:row.AspectsA

});

//Create Tweets from TweetB

LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row

WITH row where not row.UserB is null

MERGE (tweet:Tweet

{

TweetID:row.TweetTokenB,

Tweet:row.text_y,

AuthorToken:row.AuthorTokenB,

AuthorHandle:row.UserB,

CreatedAt:row.b_created_at,

Retweets:row.b_rt_cnt,

Replies:row.b_reply_cnt,

Likes:row.b_like_count,

Quotes:row.b_qt_count,

Aspect:row.AspectsB

});

//Create Tweet to Tweet relationship

LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row

WITH row where not row.UserB is null

MATCH (a:Tweet {TweetID:row.TweetTokenA}),(b:Tweet {TweetID:row.TweetTokenB})

CREATE (a) -[:Similar{Score:row.SimilarityScore}]-> (b);

cobra · November 8, 2022, 8:30am

Hello @NewGraphGuy

You must have a UNIQUE CONSTRAINT on id property.

I also updated your queries:

// Delete Everything
CALL apoc.periodic.iterate("
	MATCH (n) RETURN n
	", "
	DETACH DELETE n
	", {batchSize: 1000, parallel: false});

// Create unique constraints
CREATE CONSTRAINT constraint_Tweet_id IF NOT EXISTS FOR (tweet:Tweet) REQUIRE tweet.TweetID IS UNIQUE;

// Create Tweets from TweetA
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE NOT row.UserA IS NULL 
MERGE (tweet:Tweet {TweetID: row.TweetTokenA}) 
SET tweet += {
	Tweet: row.text_x, 
	AuthorToken: row.AuthorTokenA, 
	AuthorHandle: row.UserA, 
	CreatedAt: row.a_created_at, 
	Retweets: row.a_rt_cnt, 
	Replies: row.a_reply_cnt, 
	Likes: row.a_like_count, 
	Quotes: row. a_qt_count, 
	Aspect: row.AspectsA
};

// Create Tweets from TweetB
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE NOT row.UserB IS NULL 
MERGE (tweet:Tweet {TweetID: row.TweetTokenB}) 
SET tweet += {
	Tweet: row.text_y, 
	AuthorToken: row.AuthorTokenB, 
	AuthorHandle: row.UserB, 
	CreatedAt: row.b_created_at, 
	Retweets: row.b_rt_cnt, 
	Replies: row.b_reply_cnt, 
	Likes: row.b_like_count, 
	Quotes: row.b_qt_count, 
	Aspect: row.AspectsB
};

//Create Tweet to Tweet relationship
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE not row.UserB IS NULL 
MATCH (a:Tweet {TweetID: row.TweetTokenA}) 
MATCH (b:Tweet {TweetID: row.TweetTokenB}) 
MERGE (a)-[r:SIMILAR_TO]-(b) 
SET r += {
	Score: row.SimilarityScore
};

Regards,
Cobra

NewGraphGuy · November 12, 2022, 5:39am

Thanks @Cobra. I got this to work before your response by adding this line prior to each merge statement.

:auto USING PERIODIC COMMIT 1000

I got it to create the database very quickly, however we removed a few duplicate rows from our dataset and I tried to rerun everything and it's incredibly slow now. Any idea why that could be? The dataset is smaller and the query is the exact same.

I tried your solution too. I needed to install APOC. However it errors on the constraint statement. Any ideas?

Neo.ClientError.Statement.SyntaxError: Invalid input 'F': expected whitespace, comment or ON (line 1, column 53 (offset: 52))
"CREATE CONSTRAINT constraint_Tweet_id IF NOT EXISTS FOR (tweet:Tweet) REQUIRE tweet.TweetID IS UNIQUE"

cobra · November 12, 2022, 11:15am

Make sure you have UNIQUE CONSTRAINTS on id properties. Moreover you should use my queries and add what you did front of them.

For the error, I don't see anything wrong so I don't know

Topic		Replies	Views
Query taking a very long time to create relationships Neo4j Graph Platform migrated	16	583	February 7, 2023
How long should it take to create several millions of relationships? Cypher	3	341	March 5, 2021
LOAD CSV taking time Import / Export cypher , import	6	694	September 25, 2021
Less efficiency while performing create relation with neo4j from python Operations relationship	15	510	October 27, 2021
Creating relationships in a not so big graph takes very long time Neo4j Graph Platform migrated	1	200	November 25, 2022

Creating Relationships takes a very long time

Related topics