Help me to understand "EXPLAIN" output?

After wasting a lot of time with LOAD CSV, I succeeded in importing 176M records from Semantic Scholar into my Neo4j database using neo4j-admin import
I was able to complete the operation 2.7 hours, once I upgraded to an r5.2xlarge with 64GB RAM
Thanks for your help @stefan.armbruster!

Now I am trying to run benchmarking queries on the database, and I am running into more trouble.
Here are some stats on the data:
MATCH (p:Paper) return count(p); --> 176452537
MATCH (a:Author) RETURN count(a); --> 58748409
MATCH ()-[c:CITES]->() return count(c); --> 581057025
MATCH ()-[h:HAS_AUTHOR]->() return count(h); --> 452616106

I have copied the queries below that are giving me the most trouble.
I expect them to be quite costly, but I let the first one run all day and it still did not finish, so I aborted it for now...
Can anyone give me advice how to construct them better?

"Find which papers have the most citations"

EXPLAIN MATCH (:Paper)-[r:CITES]->(p:Paper)
RETURN p.title, p.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;

+-------------------+----------------+-------------------------+---------------+--------------------------+
| Operator          | Estimated Rows | Identifiers             | Ordered by    | Other                    |
+-------------------+----------------+-------------------------+---------------+--------------------------+
| +ProduceResults   |          24105 | COUNT(r), p.id, p.title | COUNT(r) DESC |                          |
| |                 +----------------+-------------------------+---------------+--------------------------+
| +Top              |          24105 | COUNT(r), p.id, p.title | COUNT(r) DESC | COUNT(r); 10             |
| |                 +----------------+-------------------------+---------------+--------------------------+
| +EagerAggregation |          24105 | COUNT(r), p.id, p.title |               | p.title, p.id            |
| |                 +----------------+-------------------------+---------------+--------------------------+
| +Filter           |      581057025 | anon[7], p, r           |               | `anon[7]`:Paper          |
| |                 +----------------+-------------------------+---------------+--------------------------+
| +Expand(All)      |      581057025 | anon[7], p, r           |               | (p)<-[r:CITES]-(anon[7]) |
| |                 +----------------+-------------------------+---------------+--------------------------+
| +NodeByLabelScan  |      176452537 | p                       |               | :Paper                   |
+-------------------+----------------+-------------------------+---------------+--------------------------+

"Find which authors have published the most papers"

EXPLAIN MATCH (:Paper)-[r:HAS_AUTHOR]->(a:Author)
RETURN a.name, a.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;
+-------------------+----------------+------------------------+---------------+-------------------------------+
| Operator          | Estimated Rows | Identifiers            | Ordered by    | Other                         |
+-------------------+----------------+------------------------+---------------+-------------------------------+
| +ProduceResults   |          21275 | COUNT(r), a.id, a.name | COUNT(r) DESC |                               |
| |                 +----------------+------------------------+---------------+-------------------------------+
| +Top              |          21275 | COUNT(r), a.id, a.name | COUNT(r) DESC | COUNT(r); 10                  |
| |                 +----------------+------------------------+---------------+-------------------------------+
| +EagerAggregation |          21275 | COUNT(r), a.id, a.name |               | a.name, a.id                  |
| |                 +----------------+------------------------+---------------+-------------------------------+
| +Filter           |      452616106 | anon[7], a, r          |               | `anon[7]`:Paper               |
| |                 +----------------+------------------------+---------------+-------------------------------+
| +Expand(All)      |      452616106 | anon[7], a, r          |               | (a)<-[r:HAS_AUTHOR]-(anon[7]) |
| |                 +----------------+------------------------+---------------+-------------------------------+
| +NodeByLabelScan  |       58748409 | a                      |               | :Author                       |
+-------------------+----------------+------------------------+---------------+-------------------------------+

"Find which authors have the most citations"

EXPLAIN MATCH (:Paper)-[r:CITES]-(:Paper)-[HAS_AUTHOR]-(a:Author)
RETURN a.name, a.id, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10;

+-------------------+----------------+-------------------------------------+---------------+--------------------------------+
| Operator          | Estimated Rows | Identifiers                         | Ordered by    | Other                          |
+-------------------+----------------+-------------------------------------+---------------+--------------------------------+
| +ProduceResults   |          54598 | COUNT(r), a.id, a.name              | COUNT(r) DESC |                                |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +Top              |          54598 | COUNT(r), a.id, a.name              | COUNT(r) DESC | COUNT(r); 10                   |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +EagerAggregation |          54598 | COUNT(r), a.id, a.name              |               | a.name, a.id                   |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +Filter           |     2980923603 | a, anon[26], anon[7], r, HAS_AUTHOR |               | `anon[7]`:Paper                |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +Expand(All)      |     2980923610 | a, anon[26], anon[7], r, HAS_AUTHOR |               | (anon[26])-[r:CITES]-(anon[7]) |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +Filter           |      452616107 | anon[26], HAS_AUTHOR, a             |               | `anon[26]`:Paper               |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +Expand(All)      |      452616107 | anon[26], HAS_AUTHOR, a             |               | (a)-[HAS_AUTHOR:]-(anon[26])   |
| |                 +----------------+-------------------------------------+---------------+--------------------------------+
| +NodeByLabelScan  |       58748409 | a                                   |               | :Author                        |
+-------------------+----------------+-------------------------------------+---------------+--------------------------------+

@mike.r.black suggested the following structure (Thanks!), but I am not sure what makes it better, and how I can use that lesson to improve my query design...
The fact that it only touches 176M rows instead of 581M rows seems like a good sign though.
I guess the idea is that you can just run a "SIZE()" query on each node individually, which is relatively cheap. (Is that just an O(1) operation on each node? Basically calculating the length of the list of relations?)

"Better way to find which papers have the most citations"

EXPLAIN MATCH (p:Paper)
RETURN p, SIZE( (:Paper)-[:CITES]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 10;
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| Operator         | Estimated Rows | Identifiers                    | Ordered by      | Other                                                                                                                    |
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults  |             10 | anon[33], cite_count, p        | cite_count DESC |                                                                                                                          |
| |                +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +Top             |             10 | anon[33], cite_count, p        | cite_count DESC | cite_count; 10                                                                                                           |
| |                +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +Projection      |      176452537 | anon[33], cite_count, p        |                 | {cite_count : SIZE(`anon[33]`)}                                                                                          |
| |                +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +RollUpApply     |      176452537 | anon[33], p                    |                 | anon[33]                                                                                                                 |
| |\               +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Projection    |              2 | anon[32],   NODE33,   REL41, p |                 | { : PathExpression(NodePathStep(Variable(  NODE33),SingleRelationshipPathStep(Variable(  REL41),OUTGOING,NilPathStep)))} |
| | |              +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Filter        |              2 |   NODE33,   REL41, p           |                 | `  NODE33`:Paper                                                                                                         |
| | |              +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Expand(All)   |              2 |   NODE33,   REL41, p           |                 | (p)<-[  REL41:CITES]-(  NODE33)                                                                                          |
| | |              +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| | +Argument      |              1 | p                              |                 |                                                                                                                          |
| |                +----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan |      176452537 | p                              |                 | :Paper                                                                                                                   |
+------------------+----------------+--------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------------------+

regarding

EXPLAIN MATCH (p:Paper)
RETURN p, SIZE( (:Paper)-[:CITES]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 10;
```

under the covers Neo4j keeps track in its metadata the following precomputed counts

# of Total Nodes in the graph  (i..e match (n) return count(n);)
# of Nodes for a given Label  (i.e. match (n:Paper) return count (n:Paper);)

in the above 2 examples whether you have 100 nodes or 100 million nodes Neo4j simply returns the value given the precomputed value in the metadata.   Neo4j does not iterate over those 100 or 100 million nodes and counts 1 by 1.


# of relationsships for a given type and node and direction  (i.e   match (n:Paper) return size ( (n)-[:CITES]->();   )

in this case for each node with label :Paper the underlying metadata will hold the precomputed number of relationships by type and their direction.   It does not however report number of relationships by type direction and destination label.   For example for a given node the metadata may include for this node there are
    12 relationships named :CITES which are of outgoing direction
      3 relationships named :CITES which are of incoming direction
      1 relationship named :ARCHIVED which is of incoming direction

usage of the SIZE( (:Paper)-[:CITES]->() ) will trigger in the plan a `GetDegree(
Variable(p),Some(RelTypeName,(CITES)),INCOMING)}`.   Note this is a slight change from the prior post.   which was `SIZE( (:Paper)-[:CITES]->(p) ) `
1 Like

Thanks @dana.canzano !
So, based on your (and @mike.r.black 's) advice, I have restructured my queries in the following form:

"Count all papers by an author":

MATCH (a:Author)
WHERE a.id = "144117798"
RETURN SIZE((:Paper)-[:HAS_AUTHOR]->(a));

"Count all papers that cite a paper":

MATCH (cited:Paper)
WHERE cited.id = "fbb11a841893d4b68fa2173226285ded4f7b04d6"
RETURN SIZE((:Paper)-[:CITES]->(cited));

"Find the most cited papers":

MATCH (p:Paper)
RETURN p.title, p.id, SIZE( (:Paper)-[:CITES]->(p) ) AS cite_count
ORDER BY cite_count DESC
LIMIT 10;

"Find which authors have published the most papers":

MATCH a:Author
RETURN a.name, a.id, SIZE((:Paper)-[:HAS_AUTHOR]->(a)) as pub_count
ORDER BY pub_count DESC
LIMIT 10;

This is the one I am not sure about:
"Find which authors have the most citations":

MATCH (p:Paper)
WITH SIZE((:Paper)-[:CITES]-(p)) as cite_count
MATCH (a:Author)
WHERE (p)-[HAS_AUTHOR]-(a:Author))
WITH SUM(cite_count) as total_cites
RETURN a.name, a.id, total_cites
ORDER BY total_cites DESC
LIMIT 10;

I am pretty sure I am doing this wrong, because I suspect that I will get the same value for every author... Is cite_count tied to each individual paper?
Maybe I need to GROUP BY a instead of SUM(cite_count)?

regarding my last response and the types of queries that use the underlying pre-calculated metadata, when running a EXPLAIN or `PROFILE' you can tell the pre-calculated metadata is used as

match (n) return count(n); will use an operator named +NodeCountFromCountStore

a

match (n:Paper) return count(n); will use and operator named NodeCountFromCountStore

and a

match (n:Paper) return size ( (n)-[:CITES]->() ); will use an operator named 'Projection' and the detail will report {size ( (n)-[:CITES]->() ) : GetDegree(Variable(n),Some(RelTypeName(CITES)),OUTGOING)} and specifically the 'GetDegree` clause is indicating it is using the metadata.

For your first query of

MATCH (a:Author)
WHERE a.id = "144117798"
RETURN SIZE((:Paper)-[:HAS_AUTHOR]->(a));

if this is rewritten as

MATCH (a:Author)
WHERE a.id = "144117798"
RETURN SIZE(()-[:HAS_AUTHOR]->(a));

and thus remove the target label of :Paper then GetDegree is utilized. As the metadata only records for this node there are N incoming/outgoing relationship with a relationship type of ':CITES` then the metadata can be utilized. The metadata does not record number of relationships for a node by type / by direction AND taget node label.

the only caveat with my rewrite is if your model is such that if both (:Author)-[:HAS_AUTHOR]->(:SciFi) and (:Author)-[:HAS_AUTHOR]->(:Fiction) then my rewrite would not work. Well the query would run but all it would do is tell you for this author there are N relationships named :HAS_AUTHOR but it would be the total of all :HAS_AUTHOR relationships regardless if the target node had a label of :SciFi or :Fiction

Hi @dana.canzano!

So, if I am understanding you correctly, I get faster performance if I don't specify the type of Node being referenced when I use SIZE:
SIZE((:Paper)-[:HAS_AUTHOR]->(a));

As you said yesterday, the count stores keep track of the number of incoming and outgoing relations, but not the type of node thy are each related to. So by specifying the type, I am forcing Neo4j to make sure they types match?
So long as I have properly assigned the [:HAS_AUTHOR] relationships to go from (:Papers) to (:Authors) all the time, then I can get away with not performing this check at query time, and speed things up.

So, how can I structure that last query? I am still a novice at SQL, and even more new to Cypher, so I am unsure how I am supposed to combine things...

yes
for a given node we record
N INCOMING RELATIONSHIPS with type XYZ
N1 OUTGOING RELATIONSHIPS with type XYZ
we do not provide further granularity to indicate the resultant label for said relationship types.

Regarding your last query of

MATCH (p:Paper)
WITH SIZE((:Paper)-[:CITES]-(p)) as cite_count
MATCH (a:Author)
WHERE (p)-[HAS_AUTHOR]-(a:Author))
WITH SUM(cite_count) as total_cites
RETURN a.name, a.id, total_cites
ORDER BY total_cites DESC
LIMIT 10;

some issues here. your traversals are undirected. Is that your model as well. For example, (p)-[HAS_AUTHOR]-(a:Author) if a :Paper-[:HAS_AUTHOR]->(:Author) such that the model is only a :Paper points to a :Author then I would specify the direction with ->. Or is your model such that you may also see :Author-[:HAS)_AUTHOR]-> :Paper You can do whatever way you want but if your model is such that the relationship direction always goes from :Author to :Paper then I would explicitly specify it in the query. Not specifying might add overhead as we will look for incoming as well as outgoing :HAS_AUTHOR relationship.

Does this rewrite work (though i havent added direction in the relationship traversals )

profile MATCH (p:Paper)
WITH p,SIZE(()-[:CITES]-(p)) as cite_count
MATCH (a:Author)
WHERE (p)-[:HAS_AUTHOR]-(a:Author)
WITH a,SUM(cite_count) as total_cites
RETURN a.name, a.id, total_cites
ORDER BY total_cites DESC
LIMIT 10;

Sorry! I just forgot to include the directions in my relationships...
It is always (:Paper)-[:HAS_AUTHOR)->(:Author)
There is also (:Paper)-[:CITES]->(:Paper)
and (:Paper)-[:IS_CITED_BY]->(:Paper)
I realized that [:CITES] and [:IS_CITED_BY] are basically redundant in Neo4j however...
In the original dataset, each individual paper record has a list of papers it cited as well as papers that cited it.
I kept both direction in my Postgres database, because I was doing some partial imports (only 10M or 50M records), and I had relaxed the foreign key constraints so that I could keep the entire list of citations (even if some pointed to papers that had not been imported yet). I assume that once the entire dataset was imported then they should all be valid.
However, in the case of Neo4j, it has foreign key constrainsts baked-in, since I cannot create a relationship that does not connect two existing nodes. So, I ended up with a giant import.report listing all of the invalid relations.

I tried running explain on the query you mentioned. In this case, I am testing on my 50M record database, rather than the full 176M

Here is what I got:

EXPLAIN MATCH (p:Paper)
WITH p,SIZE(()-[:CITES]->(p)) AS cite_count
MATCH (a:Author)
WHERE (p)-[:HAS_AUTHOR]->(a:Author)
WITH a,SUM(cite_count) as total_cites
RETURN a.name, a.id, total_cites
ORDER BY total_cites DESC
LIMIT 10;
+--------------------+------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| Operator           | Estimated Rows   | Identifiers                  | Ordered by       | Other                                                                   |
+--------------------+------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +ProduceResults    |               10 | a, a.id, a.name, total_cites | total_cites DESC |                                                                         |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +Projection        |               10 | a, a.id, a.name, total_cites | total_cites DESC | {a.name : a.name, a.id : a.id}                                          |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +Top               |               10 | a, total_cites               | total_cites DESC | total_cites; 10                                                         |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +EagerAggregation  |         34900113 | a, total_cites               |                  | a                                                                       |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +Apply             | 1218017853000000 | a, cite_count, p             |                  |                                                                         |
| |\                 +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| | +SemiApply       | 1218017853000000 | a, cite_count, p             |                  |                                                                         |
| | |\               +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| | | +Expand(Into)  |         50357056 |   REL87, a, p                |                  | (p)-[  REL87:HAS_AUTHOR]->(a)                                           |
| | | |              +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| | | +Filter        |  624245788499854 | a, p                         |                  | a:Author                                                                |
| | | |              +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| | | +Argument      | 1624023804000000 | a, p                         |                  |                                                                         |
| | |                +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| | +NodeByLabelScan | 1624023804000000 | a, cite_count, p             |                  | :Author                                                                 |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +Projection        |         51000000 | cite_count, p                |                  | {cite_count : GetDegree(Variable(p),Some(RelTypeName(CITES)),INCOMING)} |
| |                  +------------------+------------------------------+------------------+-------------------------------------------------------------------------+
| +NodeByLabelScan   |         51000000 | p                            |                  | :Paper                                                                  |
+--------------------+------------------+------------------------------+------------------+-------------------------------------------------------------------------+

I am seeing some pretty big numbers there... makes me nervous. Is that really going to work?

Estimated rows are more like very wide ballpark figures, and are of much more use to the planner rather than for manual query tuning. PROFILE plans on the other hand show actual rows and actual db hits, and you should prefer that when possible (when the query is a READ query, or if you otherwise have the ability to undo writes, as PROFILE does execute the query).

Hi andrew.bowman, @dana.canzano, @mike.r.black, @stefan.armbruster,

Thanks for all of your help over the last couple of days! I was able to complete all of the benchmark tests, and the modified query structure really helped!
I had to shut down all of the databases today because I basically finished the project (plus I got a bit too close to my budget limit).
Overall, I learned a lot, and look forward to my next opportunity to work with neo4j and cypher!