Neo4j query really slow

(Dragotic) #1

So, I am running a local Neo4j database that has 5000 nodes and 5000 relationships and is 120 MB in size.

I am running a query:

MATCH p=(:Tweet)-[:REPLIED_TO|RETWEETED_FROM*]->(:Tweet)-[:REPLIED_TO|RETWEETED_FROM]->(:Tweet {type: 'TWEET'})
RETURN p LIMIT 150

With the LIMIT modifier, the result comes back in ~20 seconds. Without LIMIT, the query ran for at least 5 minutes before I stopped it.

Do you have any idea why it runs so slow?

Let me give you a little more insight into the db schema.

I have tweets, retweets, and replies. For each tweet, I'm creating a chain ordered by timestamp. The chain consists of replies and retweets and ends at the specific tweet.

The pattern above returns chains of this type.

(Andrew Bowman) #2

I think you may not fully understand what this query is doing.

It's finding every single possible path that matches this pattern, and you have no limit on the length of the relationships in this chain, and you're doing this for all possible :Tweets of type 'TWEET'. I think you'll find that the number of possible paths is skyrocketing into the hundreds of thousands if not higher.
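To get a feel for how quickly the number of paths grows, one option is to cap the variable-length part and count the matches per path length. This is only a rough sketch: the upper bound of 5 is an arbitrary value picked for illustration, and the pattern is otherwise the one from the original post.

```cypher
// Count matching paths grouped by their length in hops.
// The cap of 5 on the variable-length part is an arbitrary value for illustration;
// raise it step by step and watch how the counts change.
MATCH p=(:Tweet)-[:REPLIED_TO|RETWEETED_FROM*..5]->(:Tweet)-[:REPLIED_TO|RETWEETED_FROM]->(:Tweet {type: 'TWEET'})
RETURN length(p) AS hops, count(*) AS paths
ORDER BY hops
```

If the counts rise steeply as the cap goes up, the uncapped query is probably enumerating far more paths than you intend to look at.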

Even if this is what you want, I'm not sure what you would do with all of those rows that you're returning. For sure the browser can't handle that volume of data and display it.

If you can, please be more specific in what you're trying to do here, as what this is doing currently isn't efficient and doesn't seem useful as it is.

You say you're creating chains, but there are no CREATE or MERGE operations here. Also you mentioned specific tweets, but your query for a :Tweet with the 'TWEET' type doesn't seem specific to me.

(Dragotic) #3

The chain is created at some prior point. The :Tweet nodes are connected with :REPLIED_TO, :RETWEETED_FROM relationships.

What I want to achieve is to get these types of chains. The longest is probably 250-300 nodes. Is there a better way to get them, instead of the above pattern?

(Andrew Bowman) #4

One thing you could try is ensuring that both ends of the chain are true end nodes: the tweet the chain leads to doesn't itself reply to or retweet any other node, and the tweet at the other end has nothing replying to or retweeting it. Your query currently finds subchains of any length.

You could give this a try:

MATCH p=(end:Tweet)-[:REPLIED_TO|RETWEETED_FROM*]->(start:Tweet {type: 'TWEET'}) 
WHERE NOT (start)-[:REPLIED_TO|RETWEETED_FROM]->() AND NOT ()-[:REPLIED_TO|RETWEETED_FROM]->(end)
RETURN p LIMIT 150

Start with a lower limit to make sure it's working okay then scale up. Also you may want to add your PROFILE plan of the query (with all elements expanded) to take a look at how it's being planned and executed.
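As a sketch of what that could look like, here is the same query prefixed with PROFILE and run with a small limit. The limit of 10 is just an arbitrary starting value. Note that PROFILE actually executes the query, so it takes as long as the query itself.

```cypher
// Same query as above, profiled and with a small LIMIT.
// The LIMIT of 10 is an arbitrary starting value; scale it up once this returns quickly.
PROFILE
MATCH p=(end:Tweet)-[:REPLIED_TO|RETWEETED_FROM*]->(start:Tweet {type: 'TWEET'})
WHERE NOT (start)-[:REPLIED_TO|RETWEETED_FROM]->() AND NOT ()-[:REPLIED_TO|RETWEETED_FROM]->(end)
RETURN p LIMIT 10
```

The resulting plan shows the rows and db hits for each operator, which helps to spot where most of the work is happening.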

(Dragotic) #5

Hey @andrew.bowman, thanks a lot. The query did run way faster.

Below you can find the PROFILE plan of the query, as you asked.
