Creating a subset graph

Hi, newbie question:

I have a graph with "Literature" and "Keyword" nodes, and I want to display every Literature node that has, say, more than 200 keywords, together with its Keyword nodes. Can anyone help with the right Cypher query to achieve this?

Here's what I tried:

(1) match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk where nk>200 return k,r,l --> k and r are not defined error

(2) match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk where nk>200 return l --> I have 14 Literature nodes returned, but I want the keywords and relationships to be displayed as well

(3) match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk where nk>200 match (k:Keyword)-[r]-(l) return k,r,l --> only 2 Literature nodes were returned, but I was expecting the 14 from query (2)

Any help is appreciated.

Thanks!

Welcome to the Neo4j Community!

Because you want to return both Literature nodes and Keyword nodes, you cannot take advantage of the count store, which is not used when both nodes have their labels specified.

In the first MATCH statement, in order to return k and r, you must also include them in the WITH clause. This might work for you:

MATCH (k:Keyword)-[r:is_keyword]->(l:Literature) WITH l, count(k) as nk, k, r WHERE nk>200 RETURN k,r,l

MATCH (k:Keyword)-[:is_keyword]->(l:Literature)
WITH k, count(k) AS nk, collect(l.id) as lits
WHERE nk > 200
RETURN k.name, lits

Elaine

Thanks, Elaine. The first query seems to be what I needed. However, I don't think I am getting the right results. If I include k and r in the WITH clause, I get 0 results, but if I exclude them (still with nk>200) I get 14 hits. The difference appears to be in the filter step, where there are no pagecache hits when k and r are included. Do you know why this is so?

profile match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk, k, r where nk>200 return k,r,l

profile match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk where nk>200 return l

Thanks!
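
A likely explanation for the 0 results: in Cypher, every non-aggregated expression in a WITH clause becomes a grouping key. With l, k and r all listed, each group is a single (k, r, l) row, so count(k) is always 1 and nk>200 can never be true. A minimal sketch of one way around this, assuming it is acceptable to receive the keywords and relationships as lists (the names ks and rs are made up for illustration), is to group by l alone and collect the rest:

```cypher
// Group by l only; k and r are aggregated into lists instead of acting as grouping keys
MATCH (k:Keyword)-[r:is_keyword]->(l:Literature)
WITH l, collect(k) AS ks, collect(r) AS rs
// size(ks) is the number of matched Keyword rows for this Literature node
WHERE size(ks) > 200
RETURN l, ks, rs
```

If one row per keyword is preferred, the lists can be unwound again after the filter.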

Another thing you can do is filter on the degree of :is_keyword relationships on the node, which should be more efficient than doing the expand first:

match (l:Literature)
where size((l)<-[:is_keyword]-()) > 200
match (k:Keyword)-[r:is_keyword]->(l)
return k, r, l

Thanks for the suggestion, Andrew. I did that and got only 2 Literature node hits in the end, even though the first filter returned 14 nodes. Somehow the next MATCH eliminated 12 Literature nodes! In fact, this returns the same result as query (3):
match (k:Keyword)-[r:is_keyword]->(l:Literature) with l, count(k) as nk where nk>200 match (k:Keyword)-[r]-(l) return k,r,l

Perhaps the direction or type of relationship is the problem? Of your 14 :Literature nodes, do they all really have incoming :is_keyword relationships? Or are some relationships outgoing instead? Or is the type different (check for extra whitespace)? Or are they connected to other nodes besides :Keyword nodes?
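
Those questions can be checked with a single query. This is only a sketch, assuming the same labels and relationship type as in the queries above; the column names are made up for illustration. For the Literature nodes that pass the degree filter, it lists every relationship type, its direction and the labels of the node at the other end:

```cypher
MATCH (l:Literature)
WHERE size((l)<-[:is_keyword]-()) > 200
// Undirected and untyped on purpose, so nothing is filtered out by direction, type or label
MATCH (l)-[r]-(n)
RETURN type(r) AS relType,
       startNode(r) = l AS startsAtLiterature,
       labels(n) AS neighbourLabels,
       count(*) AS rels
ORDER BY rels DESC
```

A type with stray whitespace, an unexpected direction, or a neighbour without the Keyword label would show up as its own row.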

My schema is very simple, as below:
[image: graph schema]

I did a couple of checks on the direction and type of relationships:

  1. MATCH (k:Keyword)-[r:is_keyword]->(m:Literature) with m, count(r) as nr, count(k) as nk where nk>200 match (m)-[r]-(k) return *
    This still just returned 2 Literature nodes.

  2. MATCH (k:Keyword)-[r:is_keyword]->(m:Literature) with m, count(r) as nr, count(k) as nk where nk>200 return m,nr,nk
    This returned 14 Literature nodes, and nr=nk

Those are puzzling results. Just as a sanity check, if you change the return on both to be RETURN DISTINCT m, do you also see differing results?

And if so, if you change the RETURN in the first query to be RETURN m, r, k, nr, nk do you also see a difference?

Yes it is very puzzling. Return distinct gave the same results. On the 14 returned nodes, I can verify that they are 14 different titles in the "table" view.

With
MATCH (k:Keyword)-[r:is_keyword]->(m:Literature) with m, count(r) as nr, count(k) as nk where nk>200 match (m)-[r]-(k) return m,r,k,nk,nr
I still get 2 Literature nodes.

Alright then, next is to perform a consistency check. Also, it would help to know what version of Neo4j you're using.

You're right. There was a problem with consistency. The error that I got was

2019-11-15 04:45:41.324+0000 WARN [o.n.c.ConsistencyCheckService] Label index was not properly shutdown and rebuild is required.

What is the best way to do this? I dumped the graph.db and created a new graph with this dump but still got the same problem. Also, when I CALL db.indexes there doesn't seem to be any index.

If you're using 3.3.x or above, then this should do the trick:

Rebuild the labelscanstore. To do this stop the database using bin/neo4j stop and then remove graph.db/neostore.labelscanstore.db. Restarting Neo4j, via bin/neo4j start, will rebuild these files.

I neglected to mention that I am using Neo4j 3.5.6 on Windows 10. I could not stop or start Neo4j with bin\neo4j because bin\neo4j status reports that the Windows service is not installed. I deleted the labelscanstore.db file anyway and reran the query, but got the same problem. The consistency check reported many nodes with the following error:

ERROR: This node record has a label that is not found in the label scan store entry for this node

There was a "partition" key on the "Keyword" nodes which wasn't supposed to be there, so I removed it and reran the query, but that still didn't solve the problem. I tried to rerun the consistency check but got this message:

C:\Users\User.Neo4jDesktop\neo4jDatabases\database-b7b68788-7a76-4923-a39a-9cf353f20eba\installation-3.5.6>bin\neo4j-admin check-consistency --database=graph.db
command failed: Active logical log detected, this might be a source of inconsistencies.
Please recover database.
To perform recovery please start database in single mode and perform clean shutdown.

I tried exiting Neo4j and running check-consistency again, but the same warning appeared. Do you know why this is happening? Every time I shut down the graph, I exit the browser by clicking the 'x' in the browser window and then stop the graph in the Desktop console.