I have pulled in data from separate data sources, and I am de-siloing it by making connections across them. However, I ran into a problem: I ran a DISTINCT on the names and got two seemingly identical results. To the eye, at least, they are identical:
But their space characters are encoded differently. In the example image, every space in the upper name is the single byte 32 (an ordinary space), while every space in the lower name is the two-byte sequence 194 160, which is the UTF-8 encoding of the non-breaking space (U+00A0).
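For reference, here is a minimal Python sketch of how the two kinds of space differ at the byte level. The name is a made-up example, not one from my data:

```python
# Hypothetical example name; the real names come from the two datasets.
regular = "Ann Lee"        # contains U+0020, the ordinary space
nbsp = "Ann\u00a0Lee"      # contains U+00A0, the non-breaking space

regular_bytes = list(regular.encode("utf-8"))
nbsp_bytes = list(nbsp.encode("utf-8"))

print(regular_bytes)       # the space appears as the single byte 32
print(nbsp_bytes)          # the space appears as the two bytes 194, 160
print(regular == nbsp)     # False, even though both render the same on screen
```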
Here is an article on the exact problem: "UTF-8 encoded space (194 160) problem" (Programmer All).
I need to homogenize these two datasets so that the names are actually comparable. Is there an APOC function or another approach, such as apoc.text.replace(), that finds one kind of space and converts it to the other?
Note: I tried copying the spaces from Neo4j Desktop and pasting them back into a replace function, but I think the encoding gets homogenized when the text is rendered to my screen.
One more thing: I loaded all the data for the two datasets from CSV files that were exported from large Excel files. Possibly I can change something in Excel or in my Visual Studio Code editor to make them comparable.
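If cleaning the CSV files before loading turns out to be the easier route, something like the following Python sketch might work. It is only a rough idea under some assumptions: the function names `normalize_spaces` and `clean_csv` and the file paths are hypothetical, and I am assuming the files are UTF-8 encoded, which may not hold for every Excel export.

```python
import csv

NBSP = "\u00a0"  # non-breaking space; its UTF-8 bytes are 194 160


def normalize_spaces(value: str) -> str:
    """Replace every non-breaking space with an ordinary space (byte 32)."""
    return value.replace(NBSP, " ")


def clean_csv(src_path: str, dst_path: str) -> None:
    """Copy a CSV file, normalizing the spaces in every cell.

    Both paths are hypothetical placeholders. The utf-8 encoding is an
    assumption about how the files were exported from Excel.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([normalize_spaces(cell) for cell in row])
```

Writing the character as the escape `\u00a0` instead of pasting it sidesteps the copy-and-paste issue, since the escape does not depend on how the space is rendered on screen. A call such as `clean_csv("names_raw.csv", "names_clean.csv")` (file names made up) would then produce a copy in which both datasets use the ordinary space.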