Encoding issue with space characters

mojo2go · September 8, 2021, 9:32pm

I have pulled in data from separate data sources. I am of course de-silo-ing the data by making connections across them. However I found a problem. I ran a DISTINCT on the names and got 2 seemingly identical results. Well, okay, to the eye they are identical:

But their space characters are encoded differently. In the example image, the upper name's spaces all have the byte id of 32, and the lower name's spaces all are constructed of the two byte ids of 194 followed by 160.

Here is an article on the exact problem: UTF-8 encoded space (194 160) problem - Programmer All

I need to homogenize these two datasets so that these names are actually comparable. Is there an APOC or other approach, like using apoc.text.replace() to find one kind of space and convert it to the other to homogenize these?

Note: I try to copy and paste the spaces from the Neo4j desktop back into a replace function but I think the encoding gets homogenized when rendered to my screen.

One more thing: I have loaded all the data from the two datasets from csv files that came from large Excel files. Possibly I can change something in Excel or in my Visual Code editor to make them comparable.

koji · September 9, 2021, 3:06am

Hi @mojo2go

Created a node for Keanu Reeves with 1 byte space and 2 bytes space.

CREATE (:Person {name:'Keanu Reeves', code:'ascii'});
CREATE (:Person {name:'Keanu Reeves', code:'utf8'});

This query will only retrieve nodes that have a space of 1 byte.

MATCH (n:Person)
  WHERE n.name = 'Keanu Reeves'
RETURN n;

This query will retrieve both.
It's hard to see the contents of the replace function, but the first character is a UTF8 space, and the next is an ASCII space.

MATCH (n:Person)
  WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;

If I were you, I would create a separate name for the search.

MATCH (n:Person)
SET n.searchName = replace(n.name,' ',' ');

MATCH (n:Person)
  WHERE n.searchName = 'Keanu Reeves'
RETURN n;

Is this answer helpful?

mojo2go · September 9, 2021, 1:33pm

Hi Koji, I finally was able to find the one instance of the name from my source file that had a different encoding for 2 of its 3 spaces. That name, when compared with an equal sign ( = ) is considered different when compared to all other instances of the name. I learned this through extensive trial and error, copyng and pasting from the original source file into Neo4j. Once I found it, then I was able to build the replace query in the Neo4j browser as you described. I was hoping that there could be a way for me to use Neo4j alone to locate the problem character and fix it...like by using a conversion function. But I didn't find that for UTF8. I tried the only reasonable conversion function I could find: convert to string:

RETURN apoc.convert.toString(freakyspace) = apoc.convert.toString(' ')

But it did not homogenize the two spaces ; they did not become equal so the above query returned false.

Microsoft SQL Server has a lot of conversion options https://docs.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver15 . I don't know if those would help here.

I think what has caused this is a source Microsoft Excel file where a value with a different encoding was pasted into that file, so that there were differnent encodings in the same document...if that's even possible? I exported the excel to a .csv file, and that anomaly was preserved. I would love to be corrected.

koji · September 10, 2021, 9:42pm

Hi @mojo2go

I noticed that c2a0 is not a 2-byte space in UTF8, but a Non-breaking Space.
Non-breaking space can be entered on a Mac with option + space.
The first space in the following is option + space
The second one is space.

MATCH (n:Person)
  WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;

mojo2go · September 30, 2021, 7:23pm

I didn't even know that existed. I had to test in in my editor and it worked perfectly, gluing two adjacent words together. Thanks @koji

Topic		Replies	Views
Filtering Non-English Characters In Neo4J Cypher operations	4	1131	December 4, 2020
MINOR BUG: Output of U+00A0 (non-breaking space) in Browser Table is a space (U+0020) Browser browser	3	447	January 19, 2021
How to get TEXT functions like apoc.text.camelCase(text) in neo4j 3.5.4 Procedures & APOC	5	635	March 27, 2020
To merge similar named nodes in neo4j Cypher	1	1876	July 23, 2019
How to deal with case sensitive problem when we load the csv file? Neo4j Graph Platform	2	565	April 27, 2020

Demystifying Neo4j UX Research

Encoding issue with space characters

Related topics