Hi,
I am working with Twitter data and creating nodes for tweets. More specifically, each row in my CSV has two tweets, some tweet properties (likes, retweets, and such), and a similarity score computed earlier in Python. I then connect the tweets based on their score. Since scores below a certain threshold were already removed from the dataset, I create a relationship for every row in the CSV.
I do the following:
- Delete everything to start clean
- Create nodes for every tweet in column TweetA (5000 nodes, 50k properties, 20 mins)
- Create nodes for every tweet in column TweetB (very few changes, if any; 20 mins)
- Finally, load the CSV a third time and create a relationship between each pair of tweets, matched on the TweetA and TweetB tokens.
The very last step takes extremely long. I've tried a few times now, and each time I get a WebSocket error (I think my PC goes to sleep and the Neo4j Browser disconnects). The two node-creation steps take about 20 minutes each. The dataset has 230k rows and is 200 MB. Any ideas on how to optimize my query?
//Delete Everything
MATCH (n)
DETACH DELETE n;
//Create Tweets from TweetA
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE row.UserA IS NOT NULL
MERGE (tweet:Tweet
{
TweetID:row.TweetTokenA,
Tweet:row.text_x,
AuthorToken:row.AuthorTokenA,
AuthorHandle:row.UserA,
CreatedAt:row.a_created_at,
Retweets:row.a_rt_cnt,
Replies:row.a_reply_cnt,
Likes:row.a_like_count,
Quotes:row.a_qt_count,
Aspect:row.AspectsA
});
//Create Tweets from TweetB
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE row.UserB IS NOT NULL
MERGE (tweet:Tweet
{
TweetID:row.TweetTokenB,
Tweet:row.text_y,
AuthorToken:row.AuthorTokenB,
AuthorHandle:row.UserB,
CreatedAt:row.b_created_at,
Retweets:row.b_rt_cnt,
Replies:row.b_reply_cnt,
Likes:row.b_like_count,
Quotes:row.b_qt_count,
Aspect:row.AspectsB
});
//Create Tweet to Tweet relationship
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE row.UserB IS NOT NULL
MATCH (a:Tweet {TweetID:row.TweetTokenA}),(b:Tweet {TweetID:row.TweetTokenB})
CREATE (a) -[:Similar{Score:row.SimilarityScore}]-> (b);
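
For reference, one common direction for this kind of load is sketched below. It is not measured on this dataset and rests on several assumptions: Neo4j 4.4 or later (for the constraint and `CALL { ... } IN TRANSACTIONS` syntax), the Neo4j Browser (for the `:auto` prefix), and TweetID uniquely identifying a tweet. The constraint name `tweet_id` and the batch size of 1000 are illustrative choices. The idea is that, without an index on TweetID, each MATCH in the relationship step likely has to scan all Tweet nodes for every CSV row, and that merging on all ten properties is more expensive than merging on the key alone.

```cypher
// Assumption: Neo4j 4.4+ syntax; run on the empty database, before loading nodes.
// The uniqueness constraint also creates an index on TweetID.
CREATE CONSTRAINT tweet_id IF NOT EXISTS
FOR (t:Tweet) REQUIRE t.TweetID IS UNIQUE;

// Merge on the key only, then set the remaining properties when the node is created.
// Only three properties are shown; the others would follow the same pattern.
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE row.UserA IS NOT NULL
MERGE (tweet:Tweet {TweetID: row.TweetTokenA})
ON CREATE SET
  tweet.Tweet = row.text_x,
  tweet.AuthorToken = row.AuthorTokenA,
  tweet.AuthorHandle = row.UserA;

// Create relationships in batches so that no single transaction holds all 230k rows.
// Assumption: ':auto' is a Neo4j Browser command, needed there for IN TRANSACTIONS.
:auto LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE row.UserB IS NOT NULL
CALL {
  WITH row
  MATCH (a:Tweet {TweetID: row.TweetTokenA})
  MATCH (b:Tweet {TweetID: row.TweetTokenB})
  CREATE (a)-[:Similar {Score: row.SimilarityScore}]->(b)
} IN TRANSACTIONS OF 1000 ROWS;
```

The TweetB node step would mirror the TweetA one with the B columns. Note that merging on TweetID alone keeps the first set of property values seen for a tweet, whereas the original queries create a separate node whenever any property differs.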