Cypher query traversal evaluator


(Petercheng) #1

Hi everyone,

We have a requirement to accept any user-defined Cypher query and execute it in an embedded Neo4j server, while having every traversal step evaluated against business logic in our server to determine whether the user has access rights to the current node.

After investigating the Cypher query documentation, I haven't been able to find a way to add a traversal evaluator without using the traversal API. The issue with the traversal API is that it's used through the Java API, and we cannot expect users to write Java to query data; an imperative query language would be too complex for them.

For those familiar with Apache TinkerPop, what we are trying to achieve is similar to executing a user-defined Gremlin script with GremlinExecutor.eval() while also registering a traversal strategy, via GraphTraversalSource.withStrategies(), that performs an access control check on every traversal step. That strategy would be analogous to an evaluator in the Neo4j traversal API.

We are currently evaluating Neo4j to see whether it can serve as the underlying schema-less database for our products. This will be a deal breaker if there isn't a good way for us to do this.

Has anyone encountered this situation before? If so, how did you solve this issue? Or is there any other approach for us to solve our requirement?

Thanks so much,


(M. David Allen) #2

Hi Peter,

The best way to blend use of both the traversal API and the declarative Cypher query style is to use procedures and functions in Neo4j.

https://neo4j.com/docs/java-reference/current/extending-neo4j/procedures/

Basically, you should probably use the declarative Cypher query style for as much as you possibly can, to locate the various paths you're interested in, and then write a Cypher function or procedure to help you evaluate or filter those.

At a very high level, I'm thinking about something like this:

MATCH (start:Node { id: 'A' }), (end:Node { id: 'B' })
WITH start, end
MATCH p=(start)-[:relationship*]->(end)
// rename the matched path so the yielded p does not clash with it
WITH p AS candidate
CALL myLibrary.customPathFilter(candidate) YIELD p, status
WHERE status = 'OK'
RETURN p;

In this case myLibrary.customPathFilter can use the Traversal API or anything else it wishes.
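
As a rough illustration of the kind of check such a procedure might perform, here is a minimal Python sketch of the filtering logic only. It is not Neo4j code: a real procedure would be written in Java against the procedure API linked above. The function name `path_status`, the `readable_nodes` set, and the 'DENIED' status value are all hypothetical and stand in for whatever access rules the application actually has.

```python
from typing import Iterable, Set


def path_status(path_node_ids: Iterable[str], readable_nodes: Set[str]) -> str:
    """Return 'OK' only if every node on the path is readable by the user.

    Hypothetical stand-in for the business logic inside
    myLibrary.customPathFilter; real access rules would replace the set lookup.
    """
    for node_id in path_node_ids:
        if node_id not in readable_nodes:
            return "DENIED"
    return "OK"


# Example: the user may read A, X and B, but not Y.
readable = {"A", "X", "B"}
print(path_status(["A", "X", "B"], readable))  # OK
print(path_status(["A", "Y", "B"], readable))  # DENIED
```

The query above would then keep only the paths whose status is 'OK'.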

On the question of what the users would write, or what they'd be expected to write though, I think we need more detail. What is the nature of what users would have to write there? If you say that you need an imperative traversal, then it stands to reason you have 2 options:

  1. Create a library of simpler imperative traversals and let the user choose which to use
  2. Expose some sort of language where the user can define their own imperative traversal

If you want to go the route of using some DSL (option 2) then we could discuss options for that, but it would seem that either way the users would be coding if you can't offer a pre-defined library of traversals that covers the needed use cases.


(Michael Hunger) #3

In the declarative cypher query you can already use expressions like ALL/NONE/ANY/SINGLE to express conditions over the path (e.g. over nodes(path) or rels(path) ).

If needed you can also iterate over those in an indexed manner and compare predecessor with current.
There are a number of APOC functions to help with that.
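
To make the semantics of those predicates concrete, here is a small Python sketch that mirrors them over a list of path nodes. In Cypher itself they are written along the lines of all(n IN nodes(path) WHERE ...); the Python below is only an analogy, and the `level` property and helper names are made up for the example.

```python
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]


def all_match(nodes: List[Node], pred: Callable[[Node], bool]) -> bool:
    # Mirrors ALL: true when every node satisfies the predicate.
    return all(pred(n) for n in nodes)


def any_match(nodes: List[Node], pred: Callable[[Node], bool]) -> bool:
    # Mirrors ANY: true when at least one node satisfies the predicate.
    return any(pred(n) for n in nodes)


def none_match(nodes: List[Node], pred: Callable[[Node], bool]) -> bool:
    # Mirrors NONE: true when no node satisfies the predicate.
    return not any(pred(n) for n in nodes)


def single_match(nodes: List[Node], pred: Callable[[Node], bool]) -> bool:
    # Mirrors SINGLE: true when exactly one node satisfies the predicate.
    return sum(1 for n in nodes if pred(n)) == 1


def adjacent_ok(nodes: List[Node], ok: Callable[[Node, Node], bool]) -> bool:
    # Indexed iteration: compare each node with its predecessor on the path.
    return all(ok(nodes[i - 1], nodes[i]) for i in range(1, len(nodes)))


# Hypothetical path with a made-up "level" property on each node.
path_nodes = [
    {"id": "A", "level": 1},
    {"id": "X", "level": 2},
    {"id": "B", "level": 2},
]
print(all_match(path_nodes, lambda n: n["level"] <= 2))     # True
print(none_match(path_nodes, lambda n: n["level"] > 2))     # True
print(single_match(path_nodes, lambda n: n["level"] == 1))  # True
print(adjacent_ok(path_nodes, lambda prev, cur: prev["level"] <= cur["level"]))  # True
```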

For sub-expressions, also have a look at pattern comprehensions, which can express [(a)-[r:TYPE]->(b) WHERE predicates(a,b,r) | expr(a,b,r) ] while introducing new identifiers (e.g. r and b).

It would be good to see some example queries on your dataset.

In APOC there is apoc.path.expand and friends, which give a lot of power but are probably not the best fit for end-users.

We've already discussed passing arbitrary expressions into those functions to be used as callbacks in the underlying evaluators/expanders. Some concrete use-cases could help drive the implementation there.

User based permissions are something that will be extended in the next version of Neo4j.


(Petercheng) #4

Hi David and Michael,

To explain our use case: users can use our application to auto-generate reports, graphs, PDFs, spreadsheets, dashboard widgets, etc. for any information they are interested in from our collaboration life-cycle management systems index. The data stored in our index is secured through access control rules, and the selected configurations govern and scope which versions of resources are valid for any given query. The user makes selections through the UI on what they want to report on, and the system auto-generates the queries used to extract the requested information. 99% of the time the user will never need to know or even look at the query.

However, there will be times when the user may need to hand-tweak queries to optimize performance, depending on their needs. Therefore we expose the generated queries to the users and provide a UI that lets them test, tweak, and save the queries. We also have a REST servlet, which is used by dashboard widgets to execute remote queries in order to extract the information needed to build the widgets.

When users never have to touch the queries, using a Cypher function/procedure would work perfectly, as we could make our query generator call our custom function/procedure to enforce our access control and configuration scope business logic. But when a user hand-tweaks a query in our UI, or a hacker submits a GET request that omits those calls to our access control function/procedure, the user can get access to information they are not authorized to see, which would leave a security hole. Every query that gets executed on our server must follow our access control and configuration rules 100% of the time.

I suppose the other option, as David suggested, is to use a custom DSL from which we would generate the Cypher query with our access control business logic enforced. Our server would then only ever expose this custom DSL and never expose Cypher to the end user directly. However, this DSL seems like a lot more work to create, manage, document, and maintain.

I was hoping there would be some kind of hook I could set in Neo4j that gets executed on every node traversal of every query, calling back into additional business logic that says yes or no to proceeding further. Other database frameworks offer this feature, but I haven't found such a hook for Neo4j yet.

Please confirm whether such a hook exists, or whether there are any other approaches you can think of to solve our use case without leaving any possibility of bypassing our access control and configuration rules.

Thanks so much,
Peter


(M. David Allen) #5

As for the kind of hook you're asking for, I don't know of one in Neo4j, but I also don't think it's strictly necessary to accomplish your goals.

Suppose we have a user query Q which results in fields A, B, and C. Most of the time, Q will be system generated, and so there (hopefully) isn't a security concern or at least that concern is at the level of the software which generates Q, not from direct user input. Sometimes, Q will be generated by the user.

If this is the case, then we can separate the query into two phases:

  1. Go get stuff
  2. Check that the results are OK before giving them back.
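
The separation between the two phases can be sketched in a few lines of Python. This only illustrates the idea in application code; the `run_checked` helper, the `fake_fetch` stand-in for running the query, and the `visible` set are hypothetical, and nothing here talks to a real database.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def run_checked(
    fetch: Callable[[], Iterable[Row]],
    is_allowed: Callable[[Row], bool],
) -> List[Row]:
    """Phase 1: fetch candidate rows. Phase 2: drop rows the user may not see."""
    rows = fetch()
    return [row for row in rows if is_allowed(row)]


def fake_fetch() -> List[Row]:
    # Stand-in for executing the user's query; returns rows with columns a, b, c.
    return [
        {"a": "alice", "b": "bob", "c": "carol"},
        {"a": "alice", "b": "bob", "c": "secret"},
    ]


# Hypothetical rule: a row is allowed only if every value is visible to the user.
visible = {"alice", "bob", "carol"}
result = run_checked(fake_fetch, lambda row: all(v in visible for v in row.values()))
print(result)  # [{'a': 'alice', 'b': 'bob', 'c': 'carol'}]
```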

In this case I would implement a stored procedure with any non-Cypher business logic I needed. Then I would constrain Q and say that it always has to have the same return type. Let's say that it only returns paths, or only nodes, or so on, and it must return columns A, B, and C.

At the app layer then, I would issue a composite query, which would be like this:

<Q>
WITH a AS qa, b AS qb, c AS qc
CALL myPackage.filter(qa, qb, qc) YIELD a, b, c
RETURN a, b, c;

Imagine now that Q=MATCH (a:Person)-[:FRIENDS]->(b:Person)--(c:Person).

Now, if <Q> is a cypher query that is always guaranteed to produce a, b, c, you can do this every time and know that it's impossible to evade the final filter myPackage.filter.

Now imagine Q=MATCH (something:Else)-[:unauthorized]->(secret). The composite query cannot leak secret, and in fact the query will fail, because a, b, and c are never defined. If the query returns something other than a, b, c, then it still can't get through. The purpose of myPackage.filter is just to apply business logic and remove anything the user isn't permitted to see.

This will work both in the case where Q is system generated, and when Q is user provided.
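
At the app layer, building the composite query could look roughly like the Python sketch below. It is a sketch under assumptions, not a hardened implementation: the `compose` helper is hypothetical, the suffix renames Q's variables before the call so that the yielded names do not clash with variables already in scope, and the rejection of RETURN and UNION is an added assumption meant to stop Q from finishing the query on its own terms. The keyword check is a plain text match, so it would also reject a query that merely contains those words inside a string literal.

```python
import re

# Appended to every Q; myPackage.filter is the procedure named in this thread.
FILTER_SUFFIX = (
    "WITH a AS qa, b AS qb, c AS qc\n"
    "CALL myPackage.filter(qa, qb, qc) YIELD a, b, c\n"
    "RETURN a, b, c;"
)


def compose(user_query: str) -> str:
    """Wrap a user-supplied Q so that its results always pass through the filter."""
    body = user_query.strip().rstrip(";").rstrip()
    if not body:
        raise ValueError("empty query")
    # Assumption: Q may only bind a, b, c; it must not return or union by itself.
    if re.search(r"\b(RETURN|UNION)\b", body, flags=re.IGNORECASE):
        raise ValueError("Q must not contain RETURN or UNION")
    return body + "\n" + FILTER_SUFFIX


print(compose("MATCH (a:Person)-[:FRIENDS]->(b:Person)--(c:Person);"))
```

A real deployment would more likely inspect Q with a proper Cypher parser than with a regular expression.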

From a strict security perspective, any time you let users specify a query, you may have potential security problems. What if they MERGE/DELETE? What if they call other procedures? This should probably be sandboxed to a user in the database with absolute minimum necessary permissions to accomplish the purpose. And constraining what the user can do without permitting them to write arbitrary queries would always be a better choice (in any DB) if you could swing it.
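
As one extra layer on top of a minimally privileged database user, the application could screen user-supplied queries for write clauses and procedure calls before running them. The Python sketch below is a deliberately coarse, hypothetical pre-check: the `forbidden_keywords` helper and the keyword list are assumptions, it matches bare words rather than parsing Cypher, and it will reject legitimate queries that happen to use one of these words as a label or property name. It is not a substitute for database-level permissions.

```python
import re
from typing import Set

# Assumed deny-list of clause keywords that write data or invoke other code.
FORBIDDEN = {
    "CREATE", "MERGE", "DELETE", "DETACH", "SET",
    "REMOVE", "DROP", "CALL", "FOREACH", "LOAD",
}


def forbidden_keywords(query: str) -> Set[str]:
    """Return the deny-listed keywords that appear as words in the query."""
    words = {w.upper() for w in re.findall(r"[A-Za-z_]+", query)}
    return words & FORBIDDEN


print(sorted(forbidden_keywords("MATCH (a:Person)-[:FRIENDS]->(b:Person)")))  # []
print(sorted(forbidden_keywords("MATCH (a:Person) DETACH DELETE a")))  # ['DELETE', 'DETACH']
```

A query would be passed on for execution only when the returned set is empty.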