How to more efficiently store and search nodes in dealing with string mixed cases?

Ideally, I want to keep the original form of strings in the graph, so I want to store the case sensitive form. However, at query time. I want to ignore the case sensitivity by using regular expression search as below:

WHERE m.name =~ '(?i)neo'

The problem with regular expression search is that it doesn't take usage of indexes, thus reducing speed a lot. I tested one query and the time spent are 1ms VS 120 ms respectively observed in the browser.

If I always store lowercase forms of string and at query time I convert query string into lower case, this works best for performance, but it looks bad in the graph, i.e. netbase instead of NetBase.

Is there any suggestion on handling case more efficiently?

Try this:

Use apoc.text.clean

apoc.text.clean(x.value) = apoc.text.clean('Hxb')

apoc.text.clean('Hxb') = 'hub'

This converts lower case to upper case and store the lower case in the graph. My question is whether there is a way to store the original form and at query time still do case insensitive matching without using regular expression matching?

Say you have a node property as: a.name = "NetBase".
You can query this:

WHERE apoc.text.clean(a.name) = apoc.text.clean("net base")

apoc.text.clean(a.name)  = "netbase"
apoc.text.clean("net base") = "netbase"

apoc.text.clean removes blank spaces, special characters and gives the result all in lower case as a continuous string.

I see. But the testing shows that adding a function apoc.text.clean() or ToLower() significantly degraded the same query performance compared with the pure equality checking in the Where clause, almost the same as using regular expression. So probably it doesn't make use of the index either.

Another approach that is working for me is to add another property and store the lower case value. Then query against this lower case value.
a.name = "NetBase", a.lowerName = "netbase"

Yes. That's probably is a good approach to go.

Alternately you can use fulltext schema indexes, which you can create and query using procedures, which support case insensitive lookup. That way you won't have to store multiple properties.

@andrew_bowman The fulltext search schema indexes use Lucene to create indexes and return results based on probabilistic ranking. That's a keyword based search and is useful for a lot of cases. However, if I just want to do exact match except for case insensitivity, I can't use fulltext search. Right, because fulltext does partial matches?

You can almost get there. If we quote the string you're searching for, it will do a phrase search, case insensitive, but in the field there may be words before or after the phrase.

So you could use this to quickly narrow down possibilities, and then do one additional round of filtering on the results to find your exact match:

CALL db.index.fulltext.queryNodes("title", '"the matrix"') YIELD node, score
WHERE toLower(node.title) = 'the matrix'
RETURN node

@andrew_bowman That's a smart way to do this, and thanks. But when I tested it in the browser, the warning says that :slight_smile:
One of the property names in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing property name is: NameIndex)

'NameIndex' is the index name I created for the 'name' property. And the full query is below:

CALL db.index.fulltext.queryNodes("NameIndex", '"the matrix"')  YIELD node, score 
WHERE toLower(node.NameIndex) = “the matrix"
return node, score

If I remove the WHERE clause, I will get names containing the keyword 'matrix'.

Anything wrong in my query?

@andrew_bowman Please take a look at my issue (comment) above and see whether there is a solution to that. Thanks!

Regarding " If we quote the string you're searching for, it will do a phrase search, case insensitive, but in the field there may be words before or after the phrase.". In Lucene, isn't a phrase search for strings with double quotes an exact search? If yes, why 'in the field there may be words before or after the phrase'?

According to this documentation, the quotes should return exact matches, not partial matches:

https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-full-text-search/

But my test shows that even if the string is quoted, it still returns non-exact matches.

You're confusing the name of the index (NameIndex) with the name of the property on the node (name) which is used to populate the NameIndex.

The CALL is against the NameIndex, that's done properly.

The WHERE clause is acting on the node, and you want to filter on the node's property, so you should use node.name here.

However, my NameIndex is built on a few properties, 'name', 'alias_name', 'foreign_name', just like this example, in which the index is built upon both 'title' & 'description':

fulltext.createNodeIndex("titlesAndDescriptions",["Movie", "Book"],["title", "description"])

In such a case, the filtering based on WHERE clause may be impossible?

It's possible, but you may need to make use of coalesce() or have more complex boolean logic on your filtering.

If, on a node, only one of those properties exists, but not the others, then you can use coalesce() across the properties, and the first non-null found will be used.

Otherwise, you can use a more complex boolean logic with OR'd terms, or use a list predicate, something like:

...
WHERE any(prop IN ['name', 'alias_name', 'foreign_name'] WHERE toLower(node[prop]) = 'the matrix')
...

I will try the 2nd option and it should work. Thanks!

I received this message:

Expected a string value for toLower, but got: StringArray[Basire]; consider converting it to a string with toString().

I think the reason is that, my 'name' and 'foreign_name' property is string, but 'alias_name' is created as an array property. Then how to handle this situation? I tried to add another embedded WHERE clause but I failed, as below:

CALL db.index.fulltext.queryNodes("NameIndex", "the matrix")
YIELD node
WHERE any(prop IN ['name', 'foreign_name', 'alias_name''] 
   WHERE any (p in node[prop]) 
       WHERE p ='the matrix')
RETURN properties(node) as properties LIMIT 1

Also, the above query assumes all are array properties, but 'name' is a string property.

@andrew_bowman I just realized that if I created array properties, i.e.

node.alias_name = ['US', 'USA', 'U.S.A']
node.name = 'United States'

Even if I created NameIndex to include the alias_name, it won't work because full_text only works with string values, not array. Is that right?

CALL db.index.fulltext.createNodeIndex('NameIndex', ['Country'], ['name', 'foreign_name', 'alias_name'])

To take advantage of the full_text index, I need to change my model, for example, to create separate nodes for array values, instead of treating them as properties.

I believe that's so.