Filtering Non-English Characters In Neo4J

trinidad.gonzalez · December 4, 2020, 11:39am

Hello everyone!

I have a database with non-English characters (like accents, namely: “é, í, â, à”) and I'm having issues when trying to filter fields containing these characters. For instance, imagine a node called “Ángeles Martínez”. I attempted the following:

MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.Name =~ '(?i)angeles martinez' RETURN p

What I want is to match names that contain non-English characters without typing those characters in the query (i.e., I would like to write “angeles martinez” and have Neo4j return the node called “Ángeles Martínez”).

I have tried the following approaches, none of which gave a satisfactory result:

Implement a regex like: (?iu), (?u)… (it didn’t work)
Concatenate several “replace” functions aiming to remove accents from the original nodes (this works but it’s not the ideal solution I’m looking for)
Implement the index fulltext search functionality (it didn’t work)
Include in apoc.text.regreplace() a regex to remove those accents (it didn’t work)

I have recently seen that a user defined function (UDF) can be created and it may solve the issues with the filtering. However, I’m planning to use Python to query the Neo4J database and these UDFs seem to work only for Java.

Does anyone know how to address this issue?

Many thanks in advance

koji · December 4, 2020, 1:41pm

Hi @trinidad.gonzalez

How about storing a second, accent-free copy of the person's name at CREATE time?
The Cypher looks like this.

CREATE (:Person {
  name: $name,
  englishName : replace(replace(replace(replace(replace($name,'é','e'),'í','i'),'â','a'),'à','a'),'Á','A')
})

You can then search on englishName.
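
If the names are prepared in Python before being sent to Neo4j, the same fixed mapping could be applied client-side instead of nesting replace() calls. This is a minimal sketch, assuming only the five characters from the query above need mapping; ACCENT_MAP and to_english_name are hypothetical names, not part of any Neo4j API.

```python
# Hypothetical helper mirroring the nested replace() calls above.
# Only the five characters listed in that query are mapped;
# any other character passes through unchanged.
ACCENT_MAP = str.maketrans("éíâàÁ", "eiaaA")


def to_english_name(name: str) -> str:
    """Return name with the mapped accented characters replaced."""
    return name.translate(ACCENT_MAP)


# Parameters that could accompany the CREATE statement if englishName
# were supplied as a second parameter (an assumption, not shown above).
name = "Ángeles Martínez"
params = {"name": name, "englishName": to_english_name(name)}
print(params["englishName"])
```

Like the replace() chain, this only handles the characters listed in the table, so every new accented character has to be added by hand.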

clem · December 4, 2020, 4:32pm

I would store a "DisplayName" property, which holds the original Unicode string, and a "NormalizedName" property, which holds the same string with the diacritical marks removed.

Then you can query:
MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = 'angeles martinez' RETURN p

If you don't know which form the search term is in, you can check both:

MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = $searchname OR p.DisplayName = $searchname RETURN p

There are functions in various languages to remove diacritical marks for Unicode. Unfortunately, these functions aren't in APOC (yet):

You may have to write a UDF using the Java function.

I have made a PR for a function that will remove diacritical marks:

Storing both properties does have the disadvantage of taking up more storage space, but queries will be faster and more flexible.
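
For the Python side mentioned in the question, the normalized value could also be computed before the query is sent. This is a minimal sketch using the standard-library unicodedata module; normalize_name is a hypothetical helper, and lowercasing is an assumption based on the 'angeles martinez' example above.

```python
import unicodedata


def normalize_name(text: str) -> str:
    """Lowercase text and drop combining marks such as accents."""
    # NFD splits "Á" into "A" plus a combining acute accent (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left, then lowercase for comparison.
    return unicodedata.normalize("NFC", stripped).lower()


# The same helper would be used when writing the stored value
# and when building the search parameter, so both sides agree.
params = {"searchname": normalize_name("Ángeles Martínez")}
print(params["searchname"])  # angeles martinez
```

Because the work happens in Python, no Java UDF is needed for this step; the database only ever compares plain strings.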

ameyasoft · December 4, 2020, 6:05pm

Use apoc.text.clean:

with "é, í, â, à" as s1
return apoc.text.clean(s1) as s2

Result: "eiaa"

koji · December 4, 2020, 7:59pm

WITH "Ángeles Martínez" AS s1
RETURN apoc.text.clean(s1) AS s2

The result is "angelesmartinez".
It’s good for search.

I've tried Japanese katakana as well.

WITH "アンジェルス・マルティネス" AS s1
RETURN apoc.text.clean(s1) AS s2

The result is "アンシェルスマルティネス".
I found that the voicing diacritic mark (little dash) is gone from the name.
(from ジ to シ)

WITH "はひふへほ ぱぴぷぺぽ ばびぶべぼ" AS s1
RETURN apoc.text.clean(s1) AS s2

The result is "はひふへほはひふへほはひふへほ".
The voicing diacritic (little dash) and the p-sound mark (little circle) are both removed.

So "apoc.text.clean" can be used for Japanese text as well.
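
The kana results are consistent with Unicode canonical decomposition, in which the voicing marks are separate combining characters. Whether apoc.text.clean works this way internally is an assumption here; the Python sketch below only shows the decomposition itself, and decompose is a hypothetical helper name.

```python
import unicodedata


def decompose(kana: str):
    """Split one voiced kana into its base kana and the mark's category."""
    # Under NFD, a voiced kana becomes its base kana plus one combining mark.
    base, mark = unicodedata.normalize("NFD", kana)
    # "Mn" is the nonspacing-mark category, the same class that holds
    # Latin accents such as the acute in "é".
    return base, unicodedata.category(mark)


for kana in ("ジ", "ぱ", "ば"):
    print(kana, "->", decompose(kana))
```

Dropping the mark means ジ and シ become indistinguishable after cleaning, which may or may not be what a given search needs.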
