Do several string replace functions BEFORE performing match

(Paul Drangeid) #1

I'm importing a list of software from CSV, and it has the Software Vendor, software title, version, and install date.

Part of the import I am trying to match aliases of Software Vendor names to a parent name, so that I can use some fuzzy matches and create proper aliases:
(:Swvendor {name:'Adobe'})-[:HAS_ALIAS]->(:Swvalias {name:'Adobe Systems, Inc.'})

In order to better match I have an array of useless words that's I'd like to filter out by performing a replace function to improve match likelihood:
['systems','industries',' ltd',' inc','networks','technologies,','company','development','l.p.',' llc',' corp',' ltd'] as uselesswords

But if I perform an UNWIND and then a match like this:

NWIND uselesswords as uselessword
WITH *,replace(vendor,uselessword,'') as cleanedvendor,replace(vendor,'?','') as vendor2
MATCH (sv:Swvendor) where not(sv.name =~ vendor2) and sv.name =~ vendor2

it will perform the match on EACH replace string, instead of performing ALL of the replace before attempting the MATCH which is what I want.

Is there a good way in cypher to perform a bulk string replace with ALL the values in an array before performing the MATCH?

0 Likes

(Andrew Bowman) #2

What you're looking for is the reduce() function, which will take an accumulator (your original string, lowercase) and for each useless word perform a replacement on the accumulator. Lastly we can use apoc.text.clean() from APOC Procedures to get rid of whitespace and punctuation.

Here is an example to show how this can work:

WITH ['systems','industries',' ltd',' inc','networks','technologies,','company','development','l.p.',' llc',' corp',' ltd'] as uselesswords, 'Adobe Systems, Inc.' as vendor
WITH reduce(v = toLower(vendor), word in uselesswords | apoc.text.replace(v, word, '')) as filtered
RETURN apoc.text.clean(filtered) as cleaned
0 Likes

(Michael Hunger) #3

Or apoc.text.regreplace should work too.

apoc.text.regreplace(text, regex, replacement) - replace each substring of the given string that matches the given regular expression with the given replacement.

0 Likes

(Paul Drangeid) #4

Andrew - That worked quite well! Thanks for the reply (and on the weekend no-less).

And as an extra bonus (surprise, surprise) the answer is I got to learn another new apoc function...

You guys are rock stars!

PS - anyone looking to dummy down (to alias) a list of company names, here's the words I filtered out to improve matching:

WITH [' a ',' america',' apps','canada',' communications',' communication',' company',' corporations,',' corporation',' corp',' development',' division',' electronics',' enterprise',' entertainment',' global',' incorporated',' inc',' industries',' innovations',' international',' limited' ,' llc' ,' l.p.',' ltd',',ltd',' microelectronics',' networks',' network',' online',' of ',' products',' productions',' project',' s.a.',' security',' semiconductors',' semiconductor',' semiconduct',' services',' software',' solutions',' systems',' system',' team',' the ',' technologies',' technology',' unified',' utilities'] as uselesswords,vendor

WITH vendor,reduce(v = toLower(vendor), word in uselesswords | apoc.text.replace(v, word, ' ')) as filtered

WITH vendor,trim(replace(replace(replace(replace(replace(replace(replace(replace(filtered,'?',''),' co.',''),',',''),'-',''),'(r)',''),'.',''),' ',' '),"'","")) as cleaned
0 Likes

(Paul Drangeid) #5

Update...
I wanted to share my example using a combination of reduce, apoc.text.replace, and apoc.text.regreplace. I first use reduce and apoc.text.replace to replace the list of "helper" words in the company names.

Next I use apoc.text.regreplace to filter the remaining text to only alpha-numeric, and &-. characters.
Finally I replace all multi-spaces with a single space, and trim the final string.

When doing the string comparisons I perform lowercase comparisons with all spaces removed. This allowed consolidation a LARGE variety of different ways to present a company name:
ie: Microsoft, Microsoft corporation, Microsoft corp, inc., Microsoft inc, Microsoft(r) etc etc..
Thanks again for the help gentlemen, and here's the syntax I finally settled on:


WITH [' a ',' ab ',' north america',' america',' apps','canada',' communications',' communication',' company',' corporations,',' corporation',' corp',' cz ',' development',' division',' electronics',' enterprise',' entertainment',' global',' gmbh',' holdings',' incorporated',' inc',' industries',' innovations',' international',' limited' ,' llc' ,' l.p.',' ltd',',ltd',' microelectronics',' networks',' network',' online',' of ',' products',' productions',' project',' s.a.',' security',' semiconductors',' semiconductor',' semiconduct',' services',' software',' solutions',' s.r.o.',' switzerland',' systems',' system',' team',' technologies',' technology',' tech',' unified',' utilities','www.','.com'] as uselesswords,vendor
WITH vendor,reduce(v = toLower(vendor), word in uselesswords | apoc.text.replace(v, word, ' ')) as filtered
with vendor,filtered,"[^a-zA-Z\\d&\\-.]" as myregex
WITH vendor,apoc.text.regreplace(filtered, myregex, ' ') as cleaned2
WITH vendor,trim(replace(cleaned2,' ',' ')) as cleaned
0 Likes

(Paul Drangeid) #6

I am running into a strange result using apoc.text.regreplace
I'm trying to identify date format (U.S.) to strip from version fields. I'm trying to find ##/##/## or ##/##/####
I replaced matches with string:'(daterepl) for easier debugging. Here's the cypher:

WITH *,"[(\\d{2}[\\/]\\d{2}(?:[\\/]\\d{2}(\\d{2})?)?)]" as dateregex
WITH *,apoc.text.regreplace(vers1, dateregex, '(daterepl)') as vrs

this should give a regex pattern like this: (\d{2}[/]\d{2}(?:[/]\d{2}(\d{2})?)?) which I tested using https://regexr.com/ and it worked fine (only matches date strings, and not version strings.
But strangely it is also converting non-date version strings like "7.96.6401.11"
and converting it to: "(daterepl).(daterepl)(daterepl).(daterepl)(daterepl)(daterepl)(daterepl).(daterepl)(daterepl)"

Not sure why dotted version numbers are even meeting the regex pattern.

0 Likes

(Michael Hunger) #7

There are several online regexp testers that you can use.

Usually it's about the greedyness of the pattern. And as you have several "optional" parts that can make a difference too.

0 Likes

(Paul Drangeid) #8

Hmm couldn't find a regex tester that showed the same results I was getting, but I did manage to tweak it a bit and get it less greedy. Here's the date regex that seems to be working for me:

\d{2}\/\d{2}(?:\/\d{2}(\d{2})?)

Thanks Michael!

0 Likes