Neo4j Query to Compare Lists

Greetings,

I have a query that is not working as expected. I'll first show the query in isolated parts that return expected results and counts and then I'll show the troublesome query.

Query 1 of 2 that Works

Call apoc.load.jdbc("jdbc:sqlserver://edw.s.uni;databaseName=dbname;","select TABLE_NAME from INFORMATION_SCHEMA.TABLES
Where TABLE_CATALOG = 'dbname'
ORDER BY TABLE_NAME ASC") Yield row
Return row.TABLE_NAME

This returns 440 rows of table names

Query 2 of 2 that Works

Match (db:Database {Title: 'dbname'})-->(t:Table)-->(tv:TableVersion)
Return tv.Title as TableName

This returns 404 rows of table names

Query that Doesn't Work

What I want is for this query to return the 36 tables that are in the edw database (the 1 of 2 query) that don't yet exist in the Neo4j data set (the 2 of 2 query).

Call apoc.load.jdbc("jdbc:sqlserver://edw.s.uni;databaseName=dbname;","select TABLE_NAME from INFORMATION_SCHEMA.TABLES
Where TABLE_CATALOG = 'dbname'
ORDER BY TABLE_NAME ASC") Yield row
With row.TABLE_NAME as SQLTable
Match (db:Database {Title: 'UWODS'})-->(t:Table)-->(tv:TableVersion)
Where NOT SQLTable = KNTVTitle //or Where SQLTable <> KNTVTItle... is one better?
Return SQLTable

What gets returned are 177760 rows with a lot of rows duplicated.

What must I do to see the expected 36 rows of SQL table names?

Try this:

match (db:Database)
return apoc.coll.sort(collect(distinct db.dbname)) as dbnames
use this list to compare with tables in edw database

Thanks @ameyasoft

I've updated my original 2 of 2 query, I misled you when I typed "Return dbname". I do not want the database name in the 2 of 2 query, I want the names of all the tables in that specific db.

So are you suggesting that the following...

Match (db:Database {Title: 'dbname'})-->(t:Table)-->(tv:TableVersion)
Return apoc.coll.sort(collect(distinct tv.Title)) as TableNames

If so, I don't understand how to incorporate this bit into the greater query.
I'm trying to compare 2 lists of tables (1 from a SQL edw and 1 from a neo4j data set) and return the names of 36 tables sourced from the SQL edw query that don't exist in my neo4j data set.

I've managed to return the 36 tables using the following...

Call apoc.load.jdbc("jdbc:sqlserver://edw.s.uni.;databaseName=dbname;","select TABLE_NAME from INFORMATION_SCHEMA.TABLES
Where TABLE_CATALOG = 'dbname'
ORDER BY TABLE_NAME ASC") Yield row
with row.TABLE_NAME as SQLTable

Match (db:Database {Title: 'dbname'})-->(t:Table)-->(tv:TableVersion)
With COLLECT(DISTINCT SQLTable) AS SQLTables, COLLECT(DISTINCT tv.Title) AS TVTitles

With [n IN SQLTables WHERE NOT n IN TVTitles] as missingTab
Unwind missingTab as missingTables
Return missingTables

The result is 36 rows of the expected tables.

Can anyone confirm whether this is a best practice?

1 Like

This is a very neat solution! Congratulations you solved your problem.

This cypher will end of with a cross product join between the SQLTable and the tv.Titles before they are collected. It would be better, especially if you have 1000's of tables to collect the SQLTables before the match and then collect the tv.Titles. Your cypher would performance much faster.

Call apoc.load.jdbc("jdbc:sqlserver://edw.s.uni.;databaseName=dbname;","select TABLE_NAME from INFORMATION_SCHEMA.TABLES
Where TABLE_CATALOG = 'dbname'
ORDER BY TABLE_NAME ASC") Yield row
with COLLECT(row.TABLE_NAME) as SQLTables

Match (db:Database {Title: 'dbname'})-->(t:Table)-->(tv:TableVersion)
With SQLTables, COLLECT(tv.Title) AS TVTitles

With [n IN SQLTables WHERE NOT n IN TVTitles] as missingTab
Unwind missingTab as missingTables
Return missingTables

Thanks for the improvement @myron_higerd !