I'm transitioning from SQL to Neo4j and am accustomed to indexing foreign key columns and filtering on them in relational databases. In SQL, if a table has two foreign keys, I can filter on both, and thanks to indexing, this filtering happens very quickly.
Now, I'm working with a graph database for the first time and have encountered a scenario where my initial filtering feels like it's performing a full table scan, which is obviously very inefficient.
I want to find all Person nodes who work at a specific Company and live in a specific City. I know I can start the MATCH pattern from either the Company or City nodes and then filter down to the Person nodes, but the number of Person nodes is still very large, making this approach slow.
Question:
Is there a technique in Neo4j to index these relationships to make the query more efficient? Any advice or best practices would be greatly appreciated.
In Neo4j, traversing relationships (e.g., from a Company to a Person, or a City to a Person) is essentially a pointer lookup, which is even faster than an index lookup. You can create an index on City (say, on the name property) and also on Company to help Neo4j find the starting point quickly. Have you tried using EXPLAIN or PROFILE to see how Neo4j is actually processing your query?
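For example, assuming both Company and City carry a name property (as the queries below do), the anchor indexes could be created like this, and PROFILE used to inspect the resulting plan - a sketch, not a definitive setup:

```cypher
// Index the anchor properties so the planner can locate the
// starting nodes without scanning all Company/City nodes.
CREATE INDEX company_name IF NOT EXISTS FOR (c:Company) ON (c.name);
CREATE INDEX city_name IF NOT EXISTS FOR (c:City) ON (c.name);

// Then inspect how the optimizer actually executes a query:
PROFILE
MATCH (p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p);
```

With the index in place, the plan should show an index seek on Company followed by a relationship expansion, rather than a scan over Person.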
Hi John, thanks for your reply. Indexing Company or City isn't the problem; it's the case where I want to filter Person based on its relationships with both Company and City. As far as I can tell, this would require testing every Person node to see whether such relationships exist, i.e., something similar to performing a full table scan. That is the scenario I want to avoid. Thank you.
I get your point, but that's likely not how Neo4j will execute the query. If you start from, say, a Company, Neo4j can traverse the [:WORKS_AT] relationships to find the people. Even though relationships are directed, Neo4j can traverse them in both directions. Maybe you can try a query (with EXPLAIN) and check the results.
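As an illustration (a sketch reusing the labels and relationship types from this thread), anchoring the pattern on the Company node lets the planner expand only that node's WORKS_AT relationships instead of scanning Person:

```cypher
// The planner resolves the Company via an index seek, then expands
// its incoming WORKS_AT relationships - no scan over Person nodes.
EXPLAIN
MATCH (c:Company {name: $company_name})<-[:WORKS_AT]-(p:Person)
RETURN p.name;
```

The EXPLAIN output should show the expansion starting from the Company side, which is the behavior described above.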
Thanks for your insights. I understand that Neo4j’s relationship traversal is more efficient than an index lookup in a relational database. However, my concern isn't about the traversal itself, but rather about the potential inefficiency when filtering Person nodes by both Company and City relationships.
To clarify, my worry is that in order to find Person nodes that both WORK_AT a specific Company and LIVE_AT a specific City, Neo4j might need to check all Person nodes to see if they have these relationships, which seems analogous to performing a full table scan in SQL.
I'm looking for advice on how to avoid what feels like this exhaustive filtering process. Is there a way to optimize this type of query so it doesn’t involve checking all Person nodes for these two relationships?
I hope this clears up my concern. I appreciate any guidance you can offer!
MATCH (p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
WHERE EXISTS { (p)-[:LIVES_AT]->(:City {name: $city_name}) }
RETURN count(p)
Mainly for my own benefit when I read my own code later: I make the assumption that fewer persons work for a given company than live in a given city. I expect the part (p:Person)-[:WORKS_AT]->(:Company {name: $company_name}) to be more "selective", thus reducing "wasted traversing of relationships".
However, the query planner will use statistics about your graph to decide which leg of the path (:City)<-[:LIVES_AT]-(:Person)-[:WORKS_AT]->(:Company) to expand first, and that choice may change as your graph grows.
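The same query can also be written as a single two-leg pattern with both anchors in one MATCH; the planner is then free to start from whichever end its statistics say is cheaper (a sketch reusing the parameters above):

```cypher
// Equivalent formulation: one path with both anchor nodes.
// The planner picks the more selective end as the starting point,
// then filters the expanded persons against the other anchor.
MATCH (:City {name: $city_name})<-[:LIVES_AT]-(p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p)
```

Running this under EXPLAIN on your own data would show which leg the planner currently chooses.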
Thanks for the input. One approach I’m considering is adding companyId and cityId as properties directly on the Person node and then creating a composite index on these properties. This would allow me to efficiently query Person nodes based on both relationships.
However, this approach would require manually updating these properties whenever the WORKS_AT or LIVES_AT relationships change. Alternatively, do you know if I could use APOC triggers to automate this process?
Do you think this is a viable solution for optimizing the query, or do you see any obvious downsides to it?
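For what it's worth, the composite-index half of that idea is straightforward; the trigger half is where the maintenance cost lives. A rough sketch, assuming APOC is installed (apoc.trigger.add and the $createdRelationships parameter are from the APOC trigger API; the companyId/cityId property names are hypothetical, as is the assumption that Company nodes carry a companyId):

```cypher
// Composite index over the denormalized, foreign-key-style properties.
CREATE INDEX person_company_city IF NOT EXISTS
FOR (p:Person) ON (p.companyId, p.cityId);

// APOC trigger (APOC 4.x style) that copies the company's id onto the
// Person whenever a WORKS_AT relationship is created. A LIVES_AT
// counterpart, plus handlers for deleted relationships, would also be
// needed to keep the properties consistent.
CALL apoc.trigger.add(
  'sync-company-id',
  'UNWIND $createdRelationships AS r
   WITH r WHERE type(r) = "WORKS_AT"
   SET startNode(r).companyId = endNode(r).companyId',
  {phase: 'after'}
);
```

The downside is exactly the trade-off discussed below: every relationship write now carries extra trigger work, and the denormalized properties break down if a person has more than one company or city.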
For the type of query where you anchor on two nodes and traverse two "legs", there is always a worry that it becomes a lot of work for supernodes, like "Google" and "Mountain View". But just as you describe, optimizing the model for read queries may have an impact on write queries (why is everything in life a trade-off?). EDIT: I also see a problem with your suggestion if persons work for more than one company or live in more than one place.
My recommendation (not how to solve it, but how to approach it) is to wait and see for a while, until you have more evidence (or more queries) for how your application will interact with the graph.
With some luck, the query "which persons live in x and work for y" will not be important for your application, unless your application is mainly analytical/reporting. And if that is the case, just throw it at the parallel runtime.
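If it does turn out to be a heavy analytical query, the parallel runtime can be selected per query with a CYPHER prefix (available in recent Neo4j 5 Enterprise versions; a sketch reusing the pattern from this thread):

```cypher
// Opt this one query into the parallel runtime; the rest of the
// workload keeps using the default runtime.
CYPHER runtime=parallel
MATCH (:City {name: $city_name})<-[:LIVES_AT]-(p:Person)-[:WORKS_AT]->(:Company {name: $company_name})
RETURN count(p)
```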
Apologies for my wild guesses and assumptions about your application.