TECHNIQUE: Methods of simplifying long and complex queries


(Sam) #1

Hi all,

I searched for some answers both on the board and via google, but I'm not finding exactly what I'm looking for. This has come up for me a bunch of times in the past and I never walk away from even successful queries feeling totally satisfied that I 'got it right' in terms of best practices like I can do with SQL (most of the time).

I'm working on a query now that does things in steps and I'm not sure if the details are super important so much as the general question of how to approach situations like this. The query below doesn't work YET, so don't worry too much about de-cyphering (haha, sorry), but what I'd LOVE to do is break it out into really clear steps where what's happening in each step is super easy to see. In a SQL script, I'd probably create temp tables and have intermediary queries so I can inspect the output at each stage. But here, because I am not doing that, I'm struggling with a monolithic query that is going to eat me at some point, I'm sure of it.

Thanks for the ideas and help!
Sam

match (prgs:PROGRAM_STATUS)
with prgs limit 100
with collect(distinct prgs.MBR_DK) as mbr_dks
unwind mbr_dks as mbr_dk
optional match (s:PROGRAM_STATUS {MBR_DK:mbr_dk})-[:UPDATED_TO*]->(e:PROGRAM_STATUS)
where not (()-[]->(s)) and not ((e)-[]->()) 
with mbr_dk, 
	case when s is not null
    	then collect({MBR_DK:s.MBR_DK, CM_CASE_ID:s.CM_CASE_ID, ENROLL_START_DT:s.ENROLL_START_DT, ENROLL_END_DT:coalesce(e.ENROLL_END_DT,date('9999-01-01'))})
    	else [] end 
	as records 
optional match (s:PROGRAM_STATUS {MBR_DK:mbr_dk})
where not (()-[]->(s)) and not ((s)-[]->())
with
	case when s is not null
    	then records + collect({MBR_DK:s.MBR_DK, CM_CASE_ID:s.CM_CASE_ID, ENROLL_START_DT:s.ENROLL_START_DT, ENROLL_END_DT:coalesce(s.ENROLL_END_DT,date('9999-01-01'))})
        else records end
	as allRecords
unwind allRecords as records
with records order by records.ENROLL_START_DT, records.ENROLL_END_DT
with collect(records)[0..count(records)-1] as r_collection_head, tail(collect(records)) as r_collection_tail
unwind r_collection_head as r1
with r1, r_collection_tail, apoc.coll.indexOf(r_collection_head, r1) as i
with r1, r_collection_tail[i] as r2
with r1,r2, max(r1.ENROLL_START_DT, r2.ENROLL_START_DT) < min(r1.ENROLL_END_DT, r2.ENROLL_END_DT) as has_overlap
with collect ([{
			MBR_DK:r1.MBR_DK
            ,CM_CASE_ID:r1.CM_CASE_ID + ':' + r2.CM_CASE_ID
			,R1_ENROLL_START_DT:r1.ENROLL_START_DT
            ,R1_ENROLL_END_DT:r1.ENROLL_END_DT
            ,R2_ENROLL_START_DT:r2.ENROLL_START_DT
            ,R2_ENROLL_END_DT:r2.ENROLL_END_DT
            ,HAS_OVERLAP:has_overlap}]) as results
return results

(Andrew Bowman) #2

Can you PROFILE this query and add the query plan (with all elements expanded)?

Also quick tip, if there is a label you can use in your first MATCH, please add one, otherwise this has to look through all nodes in your graph to find those with the given SOURCE_TABLE property.


(Sam) #3

Hi Andrew,
Yep, agree. This graph only has one type of node now, but it's a good practice. I updated it and see profile below.
Sam


(Andrew Bowman) #4

Also, is :PROGRAM_STATUS(MBR_DK) unique (or should it be)? If so, then there should be no need to collect, unwind, and rematch using the MBR_DK property.


(Sam) #5

MBR_DK is not unique. It is a primary key for a different object, but there are multiple PROGRAM_STATUS records per MBR_DK, so therefore, not unique in this case. The object is to find programs that overlap in time per MBR_DK, so I started with gathering all of the unique MBR_DKs so I could iterate through each to compare their PROGRAM_STATUS record dates. Also may be good to call out that a program can have multiple PROGRAM_STATUS records that chain together. This is why I'm finding the first and last record in each chain and taking the dates from them.


(Andrew Bowman) #6

Is there some way you can alter your approach to only execute over distinct MBR_DK properties? You said this is a primary key for a different object, so it should be unique on nodes of that label. If you start by matching to nodes of this label, then that should make your subsequent matches on :PROGRAM_STATUS nodes more efficient, as you'll never be repeating properties in your matches.


(Sam) #7

That's a good idea. I didn't import those because I didn't think I'd need them for the logic. But my thinking is still not right for cypher, I suppose. I'll do it, though, and try again.

But... there is one overarching question here... if I saw a SQL statement that big, I'd be in dismay. Reason is that it's hard to build and maintain code when it's this big and unwieldy. So either I'm not getting how cypher works well enough to break it down into chunks within the one statement or there is a better way of splitting out statements.

Regardless, I'll bring in the objects that the MBR_DK is unique to. Good idea!


(Andrew Bowman) #8

There's still more we can do here to simplify.

Your two OPTIONAL MATCHes (and mapping the results) can actually be collapsed into just one (I'm working on a knowledge base article on this techniqu, and we can use pattern comprehension in place of your OPTIONAL MATCH + collect() operations, and we can use map projection to simplify the projection of the map values.

Here's that snippet using these enhancements:

...
WITH mbr_dk, [(s:PROGRAM_STATUS {MBR_DK:mbr_dk})-[:UPDATED_TO*0..]->(e:PROGRAM_STATUS)
       where not (()-->(s)) and not ((e)-->()) | s {.MBR_DK, .CM_CASE_ID, .ENROLL_START_DT, ENROLL_END_DT:coalesce(e.ENROLL_END_DT,date('9999-01-01'))}] as records 
...

The *0.. will allow e to possibly be the same as s (so in cases where there is no :UPDATED_TO relationship), and in that case e and s will be referring to the same node.

I am still somewhat confused about your processing of records past this point. since r_collection_head seems to be the entire list. Could you elaborate a bit more on what this part is attempting to do? Also keep in mind that apoc.coll.indexOf() actually only returns the first index of the value, not all indexes where it occurs.

One other thing to note is that max() and min() are aggregation functions that apply across multiple rows, but it seems like your usage here is to try to find the maximum value between multiple values for a single row. You may want to use apoc.coll.max() (and its associated min() function) instead, using a list of the two values.


(Andrew Bowman) #9

Looking at your query plan, you don't seem to have an index on :PROGRAM_STATUS(MBR_DK). By creating this, you'll switch from a label scan and filter to an index lookup, which should be much faster.


(Sam) #10

Very cool, Andrew. This does shrink the query by a lot. Maybe half, even. I also created the index on MBR_DK as you suggested.

I'm posting the new query below, but I can't run it because the apoc.coll.indexOf() won't take a date and I haven't figure out how to convert it to apoc (or something) yet. (And by the way, good catch on the fact that I was using min/max wrong. It would have taken some time to realize what was going on there.

r_collection_head is the entire list minus the last element. This way, when I iterate through it, it stops on the penultimate element in the list. What it's doing comparing the start/end date windows for each element against the previous element per mbr_dk to check for overlap. The ask was to find out how many people (mbr_dks) have programs that overlap.

I'll try it and post the profile results as soon as I figure out the date thing. Thanks so much!

match (m:MEMBER)
with m limit 100
with m, [(s:PROGRAM_STATUS {MBR_DK:m.MBR_DK})-[:UPDATED_TO*0..]->(e:PROGRAM_STATUS)
    where (not (()-[:TRANSITIONS_TO]->(s)) and not (()-[:UPDATED_TO]->(s))) 
      and (not ((e)-[:TRANSITIONS_TO]->()) and not ((e)-[:UPDATED_TO]->())) |
      s {.MBR_DK, .CM_CASE_ID, .ENROLL_START_DT, ENROLL_END_DT:coalesce(e.ENROLL_END_DT,date('9999-01-01'))}]
    as allRecords
unwind allRecords as records
with records order by records.ENROLL_START_DT, records.ENROLL_END_DT
with collect(records)[0..count(records)-1] as r_collection_head, tail(collect(records)) as r_collection_tail
unwind r_collection_head as r1
with r1, r_collection_tail, apoc.coll.indexOf(r_collection_head, r1) as i
with r1, r_collection_tail[i] as r2
with r1,r2, apoc.coll.max([r1.ENROLL_START_DT, r2.ENROLL_START_DT]) 
    < apoc.coll.min([r1.ENROLL_END_DT, r2.ENROLL_END_DT]) as has_overlap
with collect ([{
		MBR_DK:r1.MBR_DK
          ,CM_CASE_ID:r1.CM_CASE_ID + ':' + r2.CM_CASE_ID
		      ,R1_ENROLL_START_DT:r1.ENROLL_START_DT
          ,R1_ENROLL_END_DT:r1.ENROLL_END_DT
          ,R2_ENROLL_START_DT:r2.ENROLL_START_DT
          ,R2_ENROLL_END_DT:r2.ENROLL_END_DT
          ,HAS_OVERLAP:has_overlap}]) as results
return results

(Sam) #11

Andrew,
I think I have something that should work, however, I think that the *0.. isn't sticking the same results in both s and e. Here's the full script, but below is a simple test that isolates the issue I think I'm seeing.

Full Script:

match (m:MEMBER)
with m limit 100
with m, [(s:PROGRAM_STATUS {MBR_DK:m.MBR_DK})-[:UPDATED_TO*0..]->(e:PROGRAM_STATUS)
    where (not (()-[:TRANSITIONS_TO]->(s)) and not (()-[:UPDATED_TO]->(s))) 
      and (not ((e)-[:TRANSITIONS_TO]->()) and not ((e)-[:UPDATED_TO]->())) |
      s {.MBR_DK, .CM_CASE_ID, .ENROLL_START_DT, ENROLL_END_DT:coalesce(e.ENROLL_END_DT,date('9999-01-01'))}]
    as allRecords
unwind allRecords as records
with records order by records.ENROLL_START_DT, records.ENROLL_END_DT
with collect(records)[0..count(records)-1] as r_collection_head, tail(collect(records)) as r_collection_tail
unwind r_collection_head as r1
with r1, r_collection_tail, apoc.coll.indexOf(r_collection_head, r1) as i
with r1, r_collection_tail[i] as r2
with r1, r2,
	case when r1.ENROLL_START_DT > r2.ENROLL_START_DT then r1.ENROLL_START_DT else r2.ENROLL_START_DT end as MAX_ENROLL_START_DT
    ,case when r1.ENROLL_END_DT < r2.ENROLL_END_DT then r1.ENROLL_END_DT else r2.ENROLL_END_DT end as MIN_ENROLL_START_DT
with r1,r2, 
	MAX_ENROLL_START_DT < MIN_ENROLL_START_DT 
    AND r1.ENROLL_START_DT <> r1.ENROLL_END_DT // IF A CASE STARTS AND ENDS ON THE SAME DAY,
    AND r2.ENROLL_START_DT <> r2.ENROLL_END_DT // IT COUNTS AS OVERLAP
    as has_overlap
with collect ([{
		MBR_DK:r1.MBR_DK
          ,CM_CASE_ID:r1.CM_CASE_ID + ':' + r2.CM_CASE_ID
		      ,R1_ENROLL_START_DT:r1.ENROLL_START_DT
          ,R1_ENROLL_END_DT:r1.ENROLL_END_DT
          ,R2_ENROLL_START_DT:r2.ENROLL_START_DT
          ,R2_ENROLL_END_DT:r2.ENROLL_END_DT
          ,HAS_OVERLAP:has_overlap}]) as results
return results

Isolated test:

match (m:MEMBER {MBR_DK:565})
with m limit 100
match (s:PROGRAM_STATUS {MBR_DK:m.MBR_DK})-[:UPDATED_TO*0..]->(e:PROGRAM_STATUS)
	where (not (()-[:TRANSITIONS_TO]->(s)) and not (()-[:UPDATED_TO]->(s))) 
      and (not ((e)-[:TRANSITIONS_TO]->()) and not ((e)-[:UPDATED_TO]->()))
return s,e

The MBR_DK:565 should return two paths; the first is for one set of PROGRAM_STATUS records that connect by the UPDATED_TO relationships and the second for a single record that where we'd expect e to refer to the same node as s.

I'm using DB version 3.4.9, in case that's relevant.

Thanks again! Feel like I'm close.
Sam


(Sam) #12

Hey Andrew,
Just wanted to follow up so you can see how things ended. Not that there isn't always room for improvement, but... this seems to work.

You called out the fact that the logic on the bottom half of the query was confusing. That's because it wasn't doing exactly what I wanted and it wasn't until I sorted out the top part that I was able to test and work with the bottom half. Every set of memberRecords grouped by member needed to be compared, not just the ones next to each other in the two collections. So I removed the head/tail logic and used apoc.coll.combinations() instead. I'm a little shaky on the 'min' parameter, but I experimented until I got a cartesian product that I needed. Seems 2 is the magic number because if I had a list of 3 records, I wound up with 6 combinations.

Anyway, always open to more suggestions, but this seems to work and I learned a lot on this one. Thanks very much for the help!
Sam

match (m:MEMBER)
with m,
	[
    	(s:CASE_STATUS {MBR_DK:m.MBR_DK})-[:UPDATED_TO*0..]->(e:CASE_STATUS)
          where (not (()-[:TRANSITIONS_TO]->(s)) and not (()-[:UPDATED_TO]->(s)))
            and (not ((e)-[:TRANSITIONS_TO]->()) and not ((e)-[:UPDATED_TO]->())) |
         s{.MBR_DK, .CM_CASE_ID, .ENROLL_START_DT, ENROLL_END_DT:coalesce(e.ENROLL_END_DT,date('9999-01-01'))}
    ]
    as allRecords
unwind allRecords as records
with m, records order by m, records.ENROLL_START_DT, records.ENROLL_END_DT
with m, count(records) as numRecords, collect(records) as memberRecords
where numRecords > 1
with m, apoc.coll.combinations(memberRecords,2) as combos
unwind combos as combo
with combo[0] as case1, combo[1] as case2
with case1, case2,
	case when case1.ENROLL_START_DT > case2.ENROLL_START_DT then case1.ENROLL_START_DT else case2.ENROLL_START_DT end as MAX_ENROLL_START_DT
    ,case when case1.ENROLL_END_DT < case2.ENROLL_END_DT then case1.ENROLL_END_DT else case2.ENROLL_END_DT end as MIN_ENROLL_START_DT
with case1, case2,
	MAX_ENROLL_START_DT < MIN_ENROLL_START_DT
    as has_overlap
FOREACH(one_to_run IN
	CASE WHEN has_overlap = TRUE
		THEN [1] ELSE [] END |
    MERGE (c1:CASE {CM_CASE_ID:case1.CM_CASE_ID})
    MERGE (c2:CASE {CM_CASE_ID:case2.CM_CASE_ID})
		MERGE (c1)-[:OVERLAPS_WITH]->(c2)
    )

(Andrew Bowman) #13

Hi Sam, this looks much better! Glad apoc.coll.combinations() seems to be working for you, I think you're using that right. I wanted to use a single function to cover both cases of combinations of only a single size, as well as combinations covering a range of sizes, so the description is a bit more complicated than I'd like.

There's a few more improvements we can make on this one to streamline the complexity.

For the pattern comprehension at the beginning, we should be able to specify multiple relationship types at once using the | between the types:

where not ()-[:TRANSITIONS_TO | UPDATED_TO]->(s) and not (e)-[:TRANSITIONS_TO | UPDATED_TO]->() |

We can also avoid the case statements and use apoc.coll.min() and max() to get the minimum and maximum values from each pair:

with case1, case2, apoc.coll.max([case1.ENROLL_START_DT, case2.ENROLL_START_DT]) as MAX_ENROLL_START_DT, 
case2, apoc.coll.max([case1.ENROLL_END_DT, case2.ENROLL_END_DT]) as MIN_ENROLL_START_DT

And as for the last FOREACH, if this is the final operation, you can just use a filter on has_overlap so you only process the subsequent MERGE operations on the desired rows:

with case1, case2,
	MAX_ENROLL_START_DT < MIN_ENROLL_START_DT
    as has_overlap
WHERE has_overlap
MERGE (c1:CASE {CM_CASE_ID:case1.CM_CASE_ID})
MERGE (c2:CASE {CM_CASE_ID:case2.CM_CASE_ID})
MERGE (c1)-[:OVERLAPS_WITH]->(c2)

(Sam) #14

Hi Andrew -- great stuff.
I tried the apoc.coll.min() and max() functions, but it was throwing cast exception.

Neo.ClientError.Procedure.ProcedureCallFailed: Failed to invoke function `apoc.coll.min`: Caused by: java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.lang.Number

That said, not only does it work (and quickly at that), but I learned a lot from your help in this one use case.
Much appreciated!
Sam


(Andrew Bowman) #15

Got it, we'll have to fix that up to play nice with date types.

Glad to help!