How to more efficiently store and search nodes in dealing with string mixed cases?

lingvisa · January 30, 2021, 1:50am

Ideally, I want to keep the original form of strings in the graph, so I want to store the case sensitive form. However, at query time. I want to ignore the case sensitivity by using regular expression search as below:

WHERE m.name =~ '(?i)neo'

The problem with regular expression search is that it doesn't take usage of indexes, thus reducing speed a lot. I tested one query and the time spent are 1ms VS 120 ms respectively observed in the browser.

If I always store lowercase forms of string and at query time I convert query string into lower case, this works best for performance, but it looks bad in the graph, i.e. netbase instead of NetBase.

Is there any suggestion on handling case more efficiently?

ameyasoft · January 30, 2021, 5:40am

Try this:

Use apoc.text.clean

apoc.text.clean(x.value) = apoc.text.clean('Hxb')

apoc.text.clean('Hxb') = 'hub'

lingvisa · January 31, 2021, 4:08pm

This converts lower case to upper case and store the lower case in the graph. My question is whether there is a way to store the original form and at query time still do case insensitive matching without using regular expression matching?

ameyasoft · January 31, 2021, 6:38pm

Say you have a node property as: a.name = "NetBase".
You can query this:

WHERE apoc.text.clean(a.name) = apoc.text.clean("net base")

apoc.text.clean(a.name)  = "netbase"
apoc.text.clean("net base") = "netbase"

apoc.text.clean removes blank spaces, special characters and gives the result all in lower case as a continuous string.

lingvisa · February 1, 2021, 6:46pm

I see. But the testing shows that adding a function apoc.text.clean() or ToLower() significantly degraded the same query performance compared with the pure equality checking in the Where clause, almost the same as using regular expression. So probably it doesn't make use of the index either.

ameyasoft · February 1, 2021, 8:42pm

Another approach that is working for me is to add another property and store the lower case value. Then query against this lower case value.
a.name = "NetBase", a.lowerName = "netbase"

lingvisa · February 1, 2021, 9:28pm

Yes. That's probably is a good approach to go.

andrew_bowman · February 2, 2021, 1:54am

Alternately you can use fulltext schema indexes, which you can create and query using procedures, which support case insensitive lookup. That way you won't have to store multiple properties.

lingvisa · February 2, 2021, 2:50am

@andrew_bowman The fulltext search schema indexes use Lucene to create indexes and return results based on probabilistic ranking. That's a keyword based search and is useful for a lot of cases. However, if I just want to do exact match except for case insensitivity, I can't use fulltext search. Right, because fulltext does partial matches?

andrew_bowman · February 2, 2021, 3:06am

You can almost get there. If we quote the string you're searching for, it will do a phrase search, case insensitive, but in the field there may be words before or after the phrase.

So you could use this to quickly narrow down possibilities, and then do one additional round of filtering on the results to find your exact match:

CALL db.index.fulltext.queryNodes("title", '"the matrix"') YIELD node, score
WHERE toLower(node.title) = 'the matrix'
RETURN node

lingvisa · February 2, 2021, 3:33am

@andrew_bowman That's a smart way to do this, and thanks. But when I tested it in the browser, the warning says that
One of the property names in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing property name is: NameIndex)

'NameIndex' is the index name I created for the 'name' property. And the full query is below:

CALL db.index.fulltext.queryNodes("NameIndex", '"the matrix"')  YIELD node, score 
WHERE toLower(node.NameIndex) = “the matrix"
return node, score

If I remove the WHERE clause, I will get names containing the keyword 'matrix'.

Anything wrong in my query?

lingvisa · February 2, 2021, 11:21pm

@andrew_bowman Please take a look at my issue (comment) above and see whether there is a solution to that. Thanks!

lingvisa · February 3, 2021, 1:13am

Regarding " If we quote the string you're searching for, it will do a phrase search, case insensitive, but in the field there may be words before or after the phrase.". In Lucene, isn't a phrase search for strings with double quotes an exact search? If yes, why 'in the field there may be words before or after the phrase'?

According to this documentation, the quotes should return exact matches, not partial matches:

https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-full-text-search/

But my test shows that even if the string is quoted, it still returns non-exact matches.

andrew_bowman · February 3, 2021, 9:48pm

lingvisa:

'NameIndex' is the index name I created for the 'name' property. And the full query is below:
CALL db.index.fulltext.queryNodes("NameIndex", '"the matrix"')  YIELD node, score 
WHERE toLower(node.NameIndex) = “the matrix"
return node, score
If I remove the WHERE clause, I will get names containing the keyword 'matrix'.

You're confusing the name of the index (NameIndex) with the name of the property on the node (name) which is used to populate the NameIndex.

The CALL is against the NameIndex, that's done properly.

The WHERE clause is acting on the node, and you want to filter on the node's property, so you should use node.name here.

lingvisa · February 3, 2021, 10:11pm

However, my NameIndex is built on a few properties, 'name', 'alias_name', 'foreign_name', just like this example, in which the index is built upon both 'title' & 'description':

fulltext.createNodeIndex("titlesAndDescriptions",["Movie", "Book"],["title", "description"])

In such a case, the filtering based on WHERE clause may be impossible?

andrew_bowman · February 3, 2021, 10:18pm

It's possible, but you may need to make use of coalesce() or have more complex boolean logic on your filtering.

If, on a node, only one of those properties exists, but not the others, then you can use coalesce() across the properties, and the first non-null found will be used.

Otherwise, you can use a more complex boolean logic with OR'd terms, or use a list predicate, something like:

...
WHERE any(prop IN ['name', 'alias_name', 'foreign_name'] WHERE toLower(node[prop]) = 'the matrix')
...

lingvisa · February 3, 2021, 10:56pm

I will try the 2nd option and it should work. Thanks!

lingvisa · February 4, 2021, 1:38am

I received this message:

Expected a string value for toLower, but got: StringArray[Basire]; consider converting it to a string with toString().

I think the reason is that, my 'name' and 'foreign_name' property is string, but 'alias_name' is created as an array property. Then how to handle this situation? I tried to add another embedded WHERE clause but I failed, as below:

CALL db.index.fulltext.queryNodes("NameIndex", "the matrix")
YIELD node
WHERE any(prop IN ['name', 'foreign_name', 'alias_name''] 
   WHERE any (p in node[prop]) 
       WHERE p ='the matrix')
RETURN properties(node) as properties LIMIT 1

Also, the above query assumes all are array properties, but 'name' is a string property.

lingvisa · February 4, 2021, 3:13am

@andrew_bowman I just realized that if I created array properties, i.e.

node.alias_name = ['US', 'USA', 'U.S.A']
node.name = 'United States'

Even if I created NameIndex to include the alias_name, it won't work because full_text only works with string values, not array. Is that right?

CALL db.index.fulltext.createNodeIndex('NameIndex', ['Country'], ['name', 'foreign_name', 'alias_name'])

To take advantage of the full_text index, I need to change my model, for example, to create separate nodes for array values, instead of treating them as properties.

andrew_bowman · February 5, 2021, 5:12pm

I believe that's so.

Topic		Replies	Views
Case Insensitive General migrated	3	344	October 4, 2022
I want to disable case sensitive matching when running a cypher query Aura & Cloud	0	258	October 31, 2023
Case insensitive query for a user filter Neo4j Graph Platform migrated	1	240	July 3, 2022
How to make case-insensitive filtering using neo4j-graphql-js? GraphQL & GRANDstack	4	1680	November 20, 2020
Is there a more efficient version of this query? Neo4j Graph Platform	20	1488	May 31, 2020

How to more efficiently store and search nodes in dealing with string mixed cases?

Related topics