Cypher grammer, syntax, and documentation -- List, Collect, avg, etc are impenetrable to me

I find Cypher, or at least its documentation, to be extraordinarily frustrating because it tells me obvious things and is silent about the things that matter. I am new to Cypher, that's why I depend on the docs. Let me give an example.

As a baby step, I just want to see "avg" work in a live bolt browser (mumble:7474). The doc says "avg returns the average of a set of numeric values". Great. I guessed that from its name. The docs say that the syntax is "avg(expression)". Duh.

So I need to understand how to write an expression in Cypher that evaluates to a set of numeric values. That's EASY (in the bolt browser):

RETURN [1, 2, 3, 4, 5]

Push the little button and I get:

[1, 2, 3, 4, 5]

Ok, I just want to pass that to avg and return the result, right?

RETURN avg([1, 2, 3, 4, 5])

Wrong. ERROR: Neo.ClientError.Statement.SyntaxError --

Type mismatch: expected Float, Integer or Duration but was List<Integer> (line 1, column 12 (offset: 11))
"RETURN avg([1, 2, 3, 4, 5])"
            ^

Hmmm. Well, maybe I need to wrap it in something. But what?

RETURN COLLECT([1, 2, 3, 4, 5])

Nope, that's a list of lists.

Well, the complaint seemed to squawk about the square bracket, so maybe the following will work:

RETURN avg(1, 2, 3, 4)

Nope, no joy.

And so suddenly I find myself lost in the Cypher manual.

If a list literal -- [1, 2, 3], or range(0, 3) -- doesn't work as an argument to "avg", then what DOES work? What TYPE is an "expression" in the documentation of avg?

I sort-of understand how paths work. Once I understand how to talk to cypher, I THINK what I want to do is straightforward -- I want to collect a set of data values from an existing graph and compute something that looks a lot like a standard deviation (full, not sample).

My intuition says, perhaps incorrectly, that I ought to be able to do all of this in Cypher so that once I have a query I can apply it to my database (with several hundred thousand datapoints).

I know what a list comprehension is in Python, and I'm perfectly comfortable with its counterpart in Javascript/nodejs/React. I can write what I want in about 10 minutes in any of those. I think I can probably do it in MySQL if I had to (perish the thought).

I think this is almost surely a documentation issue and/or pilot error.

Does this REALLY have to be so hard? Is there some resource that walks through stuff like this that I don't know about?

I want to know, quite specifically (as in a BNF grammar or something similar) EXACTLY what type avg expects. Then I want to find out exactly what type these various clauses emit.

A phone-book style enumeration of names and natural-language descriptions just barely scratches the surface.

Half of Cypher's magic is in transformation between different groupings of data. This really boils down to one big matrix, which can be thought of like a table.

  • Any operation will act on one "row" at a time, until it has done all rows.
  • Any variable or property in that row can be a list.
  • Aggregation functions, and things like COLLECT take multiple rows, and turn them into one variable in one row.
  • An expression is a part of a Cypher command which "chooses" specific variables from a row.
  • WITH chooses specific variables to pull from all rows, and use in following commands.
  • UNWIND turns a list (or collection) into rows.

So, to answer your question, if you're working with a list, or collection, you have two choises:

  • Use UNWIND to turn the list into rows, then you can apply aggregation functions like avg().
  • Use list functions like REDUCE() to collapse the list into a single variable.

UNWIND

WITH [1, 2, 3, 4, 5] AS list1
UNWIND list1 AS vals
RETURN avg(vals) AS average

In this example, we're turning list1 into a bunch of "rows," where each entry in the list is now referenced by the vals variable. The expressions passed to the avg function is selecting all instances of vals as input to avg.

REDUCE

WITH [1, 2, 3, 4, 5] AS list1
RETURN REDUCE(sum = 0, val IN list1 | sum + val) / SIZE(list1) AS average

Instead of turning the list into a set of rows, we're stepping through all entries in the list, doing some math, and spitting out the result. Essentially creating our own avg function for operating on lists.

Apoc wins everything

No matter what you're doing, APOC probably has a clean, fast, tool that will perfectly fit your need. This case included. Enter, APOC Collection Functions

WITH [1,2,3,4,5] AS list1
RETURN apoc.coll.avg(list1) AS average

See Also: cypher - How to use average function in neo4j with collection - Stack Overflow

Regarding Cypher and aggregations, you may find this tutorial useful:
https://neo4j.com/developer/aggregation-returns-functions/

There should be links on the side for other tutorials as well, that provide more worked out examples and explanations.

1 Like

This is very helpful, I'll play with it a bit. In particular, your six bullets really help. So if we're talking about a table with rows, columns, and cells, then this makes more sense. I get that each cell can be either a value or collection of values. Your explanation of UNWIND and REDUCE is really helpful.

I can't use APOC (yet) for unrelated reasons, that's a topic for a different thread (I run neo4j "headless" in a Centos 7 guest VM, and so there's no way to run a GUI installer on the system I care about. I haven't found a bash command-line installer for APOC yet).

FWIW, the context of this is that I'm using Neo4j to ingest large quantities of data were each datapoint is associated with a geospatial feature (like a county, state, town, or whatever). I've already got the features connected together with an adjacency matrix, so that each county has a relationship with each of its neighbors. I'm now coding the first of several analysis components, where in this case the component needs to perform a calculation for each feature using the datapoint of that feature together with each of its neighbors.

The schema works great, and the query to collect the relevant datapoints (one feature at a time) is looking good so far. What drove me to write this flame is that I've been arguing with Cypher about how to code up the actual MAP/REDUCE code (in functional programming terms).

Your comment is a big help and I greatly appreciate it.

I'm glad to be able to help!

Installing APOC in linux really doesn't need a GUI.

  • Put the apoc jar in Neo4j's /plugins director
  • Make one adjustment in the config file
  • Restart neo4j

Note: In linux, the Neo4j plugin directory is usually at /var/lib/neo4j/plugins
Note2: Find the latest version of apoc here: Release 5.3.1 · neo4j-contrib/neo4j-apoc-procedures · GitHub (if you're using an older version of Neo4j, check apoc's pom.xml to find the closest matching neo4j version).

The simplest would just be to use wget

sudo apt-get update
sudo apt-get install wget
sudo service neo4j stop
cd /var/lib/neo4j/plugins
wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.0.0.12/apoc-4.0.0.12-all.jar
sudo chown neo4j:neo4j apoc-4.0.0.12-all.jar
sudo echo "dbms.security.procedures.unrestricted=apoc.*" >> /etc/neo4j/neo4j.conf
sudo service neo4j start

Heh, I guess I had already done that along the way and forgotten about it. I found v4.0.0.6 already there, so I just pulled down the new one, restarted, and all is well.

I also, today, got the query working that I was looking for. I appreciate all your help!

2 Likes

Haha, I see what you mean... :slight_smile:

They introduced lists, patterns, maps etc. That's great, but this is also a double-edged sword now, because some functions do not work on them (not directly anyway). You frequently have to UNWIND lists into rows, COLLECT row values into lists etc.

To me, it is frustrating CREATE/MERGE cannot take a list[i] element in its syntax, you have to use a hack (like in FOREACH (elem3 in list[3])) just to provide an alias for it...

There are also many limitations in nesting expressions directly, you frequently need a WITH clause to separate them like into pipelines. I don't complain, as the query may look cleaner this way. But sometimes you're just surprised you cannot do it "the SQL way"...

Indeed. I've been a professional software developer nearly forty years and I certainly understand that rough edges come with aggressive technology like Neo4J.

I'm primarily asking for documentation that describes the current state of the world. Since this is Java under the covers, it seems as though it ought to be possible to enumerate the types (in a Java sense) that are accepted as arguments and the types answered by the various operators.

I'm really fuzzy about how the stages of a query chain together. I'm getting where I need to go through trial-and-error. Since I suspect the development team has walked this way before, I'm hoping to provide encouragement for them to find someone who can document what's happening as they go.

Just one more note here: make no mistake, it is developed in Java, but do not expect anything like in Java from Cypher! Java is a procedural programming language, while Cypher is a declarative querying language. If anything else comes closer to Cypher, it is SQL.

I know you are referring to the doc, but... Their Cypher Manual is actually great: https://neo4j.com/docs/cypher-manual/4.0/

Lists and most other "confusing" parts could be much better understood if you read something about functional programming:

  • list (and pattern) comprehensions are borrowed from Python
  • map-reduce techniques are pure functional concepts (intensively used in Hadoop)
  • switching lists with rows are also common transpositions in the functional world (but not only)
  • chaining operations using intermediate WITH is somehow similar (not identical!) to the WITH nested subquery approach in SQL...

Cheers,
-C

1 Like

I've been coding in Python for fifteen years. I used the published information about Neo4J to write something similar in Python. I've coded in Haskell, and I understand the differences between a functional, imperative, and declarative language. I've been buried in the 4.0 manual since mid March, so I know it pretty much chapter and verse. :slight_smile:

List and pattern comprehensions in Python have very well structured and documented inputs and semantics. The same is actually true of SQL, at least to some extent. That's what I'm looking for here.

When I add a "WITH <> AS <>" statement in Cypher, I want to KNOW what ends up being bound in subsequent statements. Not just the name, but the structure. I don't want to have to discover, through trial and error, when I need "UNWIND", "COLLECT", "LIST", and so on. That's what I'm missing from the cypher docs.

BTW, I'm also familiar with jQuery, Javascript, node JS, and so on. So I think I have a reasonable grasp of the concepts involved and how they're implemented. It is the Cypher language itself that I'm struggling with, more than the abstractions and behaviors it describes.

I DO appreciate your help, BTW. This is very cool technology, and I want to help increase its acceptance.

2 Likes

Ok, here's a very specific example of what I mean.

I have a query that is working and very slow. I'm following guidance and using "profile" to identify and somehow remove "eager" operators. Since it takes a LONG time (several hundred thousand nodes in a MATCH on a VM with limited resources), I wanted to follow other guidance and introduce a LIMIT to the query being profiled. But where? I check the docs, and find nothing that describes where, in a Cypher query, I'm allowed to use LIMIT. The part that's there provides great detail about the argument -- in my case, that's going to be trivial (100). All the examples in the section on LIMIT show its use in RETURN. I can't use RETURN unless I introduce a sub-query because I'm in the middle of a much longer query.

It isn't until I go digging in blog posts that I find I can also use LIMIT as part of a WITH clause.

Aha. Once I know to look at WITH in the docs, I'm all set. I lost about an hour digging and experimenting, rather than optimizing. Even an index of the docs would have avoided this for me.

Once more, I'm not trying to flame or bust anybody's chops, I'm instead hoping to provide feedback for how to make the documentation -- and therefore Neo4J -- easier to learn.

Oh, and by the way -- as so many others have apparently learned, avoiding the EAGER operator is so far my hardest single obstacle. I find the "path" concept easy and intuitive, and it dramatically speeds my creation of an initial schema. The price I'm apparently paying is that my use of this intuitively appealing schema introduces EAGER at multiple steps. At least so far, finding and working around those makes actual query construction in Neo4J far more challenging for me.