Getting a list of children runs really slowly

cypher

(Oleg) #1

Hi,

I have what should be a simple query to get a list of child nodes (and ideally grandchild nodes later) from a parent node. I have a document node that is connected to words with a :BOW_OF relationship. The result I'm looking for is a row with a document ID and a list of the words in that document.
If I specify the document ID, it is very fast:

MATCH (word)-[r:BOW_OF]->(doc:Desc{id:'12345'}) RETURN doc.id, collect(word.word)

but if I take the id out, and add a LIMIT 1 on the end, it doesn't finish, so I think I'm doing something wrong.

MATCH (word)-[r:BOW_OF]->(doc:Desc) RETURN doc.id, collect(word.word) limit 1

What I would like to get to is without the limit:

MATCH (stem)-[s:STEM_OF]->(word)-[r:BOW_OF]->(doc:Desc) RETURN doc.id, collect(stem.stem)

Is there something I'm doing wrong? Thank you very much!

Oleg


(Andrew Bowman) #2

Since you're using an aggregation (collect) with respect to the doc.id, ALL results need to be expanded out first before the collect(). Is doc.id unique per :Desc node? If so, your aggregation should instead be by the doc node and not by its id property. That way when you do property access at the end, it only does the access once per node instead of multiple times for every row for which the same node appears.
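
As a rough sketch of what aggregating by node looks like (assuming doc.id really is unique per :Desc node), the grouping key is doc itself and the id is only read in the final RETURN:

```cypher
// Group by the doc node; collect() builds one list of words per node.
MATCH (word)-[:BOW_OF]->(doc:Desc)
WITH doc, collect(word.word) AS words
// doc.id is read here, once per node, after the aggregation is done.
RETURN doc.id, words
```

This still has to expand every :BOW_OF relationship before it returns anything, so it is not a fix for the LIMIT 1 case on its own.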

For your LIMIT 1 approach try this instead:

MATCH (doc:Desc) 
WITH doc
LIMIT 1
MATCH (word)-[r:BOW_OF]->(doc)
WITH doc, collect(word.word) as words
RETURN doc.id, words

Alternately you could use pattern comprehension to get a list of results from a pattern:

MATCH (doc:Desc) 
WITH doc
LIMIT 1
WITH doc, [(word)-[r:BOW_OF]->(doc) | word.word] as words
RETURN doc.id, words

How many :Desc nodes are in your db, and how many word and stem nodes? If the result set is huge you may have some trouble executing this via the browser (especially if the browser is attempting to visualize it). You could try using cypher-shell instead.
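
If you don't have those numbers handy, something like the following should give them. The word and stem nodes have no label in your queries, so this sketch finds them through their relationships, which is an assumption on my part about how your graph is modeled:

```cypher
// Number of :Desc nodes
MATCH (doc:Desc) RETURN count(doc) AS docs;

// Number of distinct word nodes attached to any :Desc node
MATCH (word)-[:BOW_OF]->(:Desc) RETURN count(DISTINCT word) AS words;

// Number of distinct stem nodes attached to any word
MATCH (stem)-[:STEM_OF]->() RETURN count(DISTINCT stem) AS stems;
```

Run the three statements one at a time in the browser, or together in cypher-shell.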

For your full query, you would want to take a similar approach, but make sure to get only DISTINCT stems; I'm guessing there are a lot of duplicates there.

MATCH (stem)-[:STEM_OF]->()-[:BOW_OF]->(doc:Desc)
WITH doc, collect(DISTINCT stem) as stems
RETURN doc.id, stems
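
The query above returns the stem nodes themselves. If you only want the stem strings, as in your original query, and want to avoid pulling everything back in one go, a batched variant might look like this. The batch size of 1000 and the ordering by doc.id are arbitrary choices for illustration, not something your data requires:

```cypher
// Pick one batch of documents first, so the expansion only runs for those.
MATCH (doc:Desc)
WITH doc
ORDER BY doc.id
SKIP 0 LIMIT 1000
// Expand stems for this batch only; docs with no stems drop out here.
MATCH (stem)-[:STEM_OF]->()-[:BOW_OF]->(doc)
WITH doc, collect(DISTINCT stem.stem) AS stems
RETURN doc.id, stems
```

Increase the SKIP value by the batch size on each run to page through the rest of the documents.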

(Oleg) #3

Thanks for replying! :) I get it now about aggregating by node instead of property. Yes, every doc.id is unique. I have 200k doc nodes now, but eventually there will be a few million. Each doc node can have ~100-4000 words/stems. No, there shouldn't be any duplicate words/stems, but that will be something to check.

What I'd like to do next is add the classification(s) of each document to the query, so that the result is something I can train on: classifications and a list of stems. Does this seem like a reasonable query to do that? I don't necessarily need to visualize it, but I'll try cypher-shell; I've just never used it before.