Hi,
I have a large graph (on the order of a billion edges and 400M nodes) with few properties, loaded into neo4j, and I would like to run some operations on all edges of a given type (let's say 200M of them). I found the performance to be slow for this application, and I would like to know whether this is normal or whether there is a more efficient way to do what I want. Let's take the simplest example, to avoid a discussion about how to optimize queries:
If I want to get property p of all nodes of a given type, for instance with something like
match (n:Node) return n.p
then the process is very fast, as expected. If I check my hard drive activity, it says that it is reading at about 50Mb/s.
But if I do something similar with edges, something like
match (a:A)-[r:clusterOf]->(b:B) return a.param, b.param
(or even r.param), then it is at least 100 times slower, and I see that my disk is reading at about 1Mb/s (it's an HDD; I'll move to an SSD soon, but I think the problem is independent of that).
So, is this normal? Is there a more suitable way to do something like this, i.e., a query over a large number of edges? (Of course, what I want to do is more complex than that, but if this is already slow, what I want to do will be even slower.) One solution would be to load the graph in memory, but I have only 64G, so if I load the whole graph with attributes, it does not fit, and if I try to warm up by loading only the relations I need, then I hit the same slowness: it would take several days to load...
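A rough back-of-envelope estimate of how much memory the graph structure alone would need, leaving properties out. The per-record sizes below (15 bytes per node, 34 bytes per relationship) are assumptions based on neo4j's standard fixed-size store format; they were not measured on this database, and other store formats or versions may differ.

```python
# Assumed record sizes for neo4j's standard store format (not measured here).
NODE_RECORD_BYTES = 15
REL_RECORD_BYTES = 34


def store_size_gb(record_count, record_bytes):
    """Size of a fixed-record store in GB (1 GB = 10**9 bytes)."""
    return record_count * record_bytes / 10**9


# Counts taken from the description above: 400M nodes, about 1 billion
# edges in total, about 200M edges of the type of interest.
nodes_gb = store_size_gb(400_000_000, NODE_RECORD_BYTES)
all_rels_gb = store_size_gb(1_000_000_000, REL_RECORD_BYTES)
typed_rels_gb = store_size_gb(200_000_000, REL_RECORD_BYTES)

print(f"node records:              {nodes_gb:.1f} GB")
print(f"all relationship records:  {all_rels_gb:.1f} GB")
print(f"200M relationship records: {typed_rels_gb:.1f} GB")
print(f"nodes + all relationships: {nodes_gb + all_rels_gb:.1f} GB")
```

Under these assumptions the node and relationship records together come to about 40 GB, which would be consistent with the graph only exceeding 64G once the properties are included. It says nothing about how long it takes to read those records from an HDD.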
Of course, I'm using apoc.export.csv.query to get my results into a file.
To give an idea of the slowness, it writes about 200 lines/s to the file.
Hello remy, welcome to the community. A common way to speed up a query is to make sure you have indexes on the properties you're searching by. If you're trying to work with properties on relationships, you'll find that you can't index those, so for very large or complex cases it isn't helpful to structure your data that way. If you're curious about how your query is being carried out, you can put "explain" in front of it and you'll get a map of the operations being carried out and a breakdown of possible bottlenecks. If you haven't already found it, "How can I optimize my Cypher Queries?" is an excellent example. I hope that helps, and again, welcome to the community.
Thank you for the answer.
Below are more details, with an example.
1) I do have indexes on the two properties I want. But in any case I'm not filtering on them; I want them all.
2) I did play with explain, but most problems I've seen on the web arise when queries become complex and the ordering of the query makes a difference. In my example, I know that I'll have a lot of rows, since I want an answer for 100M edges. I'm ready to wait 2, 5 or even 10 hours... But at the current speed it would take more like 10 days, while doing it efficiently with ordinary code would take about 30 minutes on the same machine.
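To make the orders of magnitude concrete, here is a small calculation using only the figures quoted in this thread (about 200 rows/s, 100M to 200M edges, and a 30-minute target). The function names are just illustrative.

```python
SECONDS_PER_DAY = 86_400


def duration_days(rows, rows_per_second):
    """Time to emit `rows` rows at a constant rate, in days."""
    return rows / rows_per_second / SECONDS_PER_DAY


def required_rate(rows, seconds):
    """Rows per second needed to emit `rows` rows within `seconds`."""
    return rows / seconds


observed_rate = 200                                   # rows/s written by the export
days_100m = duration_days(100_000_000, observed_rate)
days_200m = duration_days(200_000_000, observed_rate)
target_rate = required_rate(100_000_000, 30 * 60)     # 30-minute target

print(f"100M edges at 200 rows/s: {days_100m:.1f} days")
print(f"200M edges at 200 rows/s: {days_200m:.1f} days")
print(f"rate needed for 100M edges in 30 min: {target_rate:,.0f} rows/s")
print(f"speed-up needed: {target_rate / observed_rate:.0f}x")
```

At 200 rows/s, 100M edges take a little under 6 days and 200M edges a little under 12, so the "10 days" figure is the right order of magnitude. Finishing 100M edges in 30 minutes would need roughly 55,000 rows/s, a few hundred times the observed rate.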
I nevertheless attach the plan given by explain for a simple query that I find too slow for my usage:
explain match (c:Cluster)-[:CONTAINS]->(a:Address) return c.cluster_id,a.address
So the question boils down to: is it normal that such a query, returning properties of both endpoints of all edges of a given type, is much slower than getting properties of all nodes of a given type, and produces answers at a rate of only about 200 rows/s?
If it is normal, is there a way to do this kind of thing faster? Or at least a way to load specific nodes and relations into memory quickly?