Filtering Non-English Characters In Neo4J

Hello everyone!

I have a database with non-English characters (like accents, namely: “é, í, â, à”) and I'm having issues when trying to filter fields containing these characters. For instance, imagine a node called “Ángeles Martínez”. I attempted the following:

MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.Name =~ '(?i)angeles martinez' RETURN p

The issue is I want to filter those names having non-English characters in the database without explicitly writing them on the query (i.e.: I would like to write “angeles martinez” and, then, Neo should retrieve the node called “Ángeles Martínez”).

I have implemented the following solutions with no success at all:

  1. Implement a regex like: (?iu), (?u)… (it didn’t work)
  2. Concatenate several “replace” functions aiming to remove accents from the original nodes (this works but it’s not the ideal solution I’m looking for)
  3. Implement the index fulltext search functionality (it didn’t work)
  4. Include in apoc.text.regreplace() a regex to remove those accents (it didn’t work)

I have recently seen that a user defined function (UDF) can be created and it may solve the issues with the filtering. However, I’m planning to use Python to query the Neo4J database and these UDFs seem to work only for Java.

Does anyone know how to address this issue?

Many thanks in advance!

Hi @trinidad.gonzalez

How about storing a separate, accent-free copy of the person's name when you CREATE the node?
The Cypher would look like this:

CREATE (:Person {
  name: $name,
  englishName : replace(replace(replace(replace(replace($name,'é','e'),'í','i'),'â','a'),'à','a'),'Á','A')
})

You can then search on englishName.

I would store a "DisplayName" property holding the original Unicode string and a "NormalizedName" property holding the same string with the diacritical marks removed (and lowercased, so that the comparison below matches).

Then you can query:
MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = 'angeles martinez' RETURN p

If you don't know which form the search term is in, you can check both:

MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = $searchname OR p.DisplayName = $searchname RETURN p

There are functions in various languages to remove diacritical marks from Unicode strings. Unfortunately, these functions aren't in APOC (yet).

You may have to write a UDF using the Java function.

I have made a PR for a function that will remove diacritical marks.

Storing a separate normalized property does have the disadvantage of taking up more storage space, but it will be faster and more flexible.
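Since the question mentions querying from Python, the accent-free value could also be computed on the client before it is sent to Neo4j. Here is a minimal sketch, assuming Python's standard unicodedata module. The helper names strip_diacritics and normalized_name are made up for illustration, and the parameter name searchname just follows the query above:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents) while keeping the base letters."""
    # NFD splits a precomposed letter such as "Á" into "A" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn"; drop only those.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


def normalized_name(text: str) -> str:
    """Accent-free, lowercased form, suitable for an exact-match lookup."""
    return strip_diacritics(text).lower()


# Hypothetical parameter dict to pass along with the Cypher query.
search_params = {"searchname": normalized_name("Ángeles Martínez")}
print(search_params)
```

The same helper could be applied when the nodes are written, so that the stored value and the search term are normalized in exactly the same way.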

Use apoc.text.clean:

WITH "é, í, â, à" AS s1
RETURN apoc.text.clean(s1) AS s2

Result: "eiaa"

WITH "Ángeles Martínez" AS s1
RETURN apoc.text.clean(s1) AS s2

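The result is "angelesmartinez": the accents are gone, the space is removed, and the text is lowercased.
That works well for search, as long as the search term is cleaned the same way.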

I've tried Japanese katakana as well.

WITH "アンジェルス・マルティネス" AS s1
RETURN apoc.text.clean(s1) AS s2

The result is "アンシェルスマルティネス".
Note that the voicing mark (dakuten) is dropped from the name (ジ becomes シ), and the middle dot (・) is removed as well.

WITH "はひふへほ ぱぴぷぺぽ ばびぶべぼ" AS s1
RETURN apoc.text.clean(s1) AS s2

The result is "はひふへほはひふへほはひふへほ".
Both the voicing mark (dakuten) and the p-sound mark (handakuten, the small circle) are stripped, and the spaces are removed.

So apoc.text.clean can be used when processing Japanese text as well.
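The kana behaviour above is consistent with ordinary Unicode decomposition. Whether apoc.text.clean works this way internally is an assumption here, but the effect can be reproduced in Python with the standard unicodedata module (the function names below are illustrative only):

```python
import unicodedata


def code_points_nfd(text: str) -> list[str]:
    """List the code points of the canonically decomposed (NFD) form."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


def strip_marks(text: str) -> str:
    """Decompose, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# ジ decomposes into シ (U+30B7) plus the combining voiced sound mark (U+3099).
print(code_points_nfd("ジ"))
# Removing the combining marks leaves only the unvoiced base kana.
print(strip_marks("ジ"), strip_marks("ぱば"))
```

In other words, the dakuten and handakuten are treated the same way as Latin accents, so ジ and シ, or ぱ, ば and は, become indistinguishable after cleaning.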