How to improve the performance of a query that compares properties relative to other nodes?


(Shane A Husson) #1

I have a simple model in which a Patient has many events. Each event has a start date and a category like 'medication'.

It is normal for a patient to have 10s of events of each category.

I would like to find patients where:

a measurement event happened after an admission
and an operation event happened after the same admission event
and the admission event itself happened after a medication event

I've modelled the data in neo4j like:

(:Patient {id: "ABCD"})-[:HAS]->(:Event {startDate: '2018-01-01', category: 'medication'})

Which produces a graph like:

The query I've come up with is:

MATCH p1 = (measurement:Event {category: 'measurement'})<-[:HAS]-(patient:Patient)
MATCH p2 = (admission:Event {category: 'admission'})<-[:HAS]-(patient)
MATCH p3 = (operation:Event {category: 'operation'})<-[:HAS]-(patient)
MATCH p4 = (medication:Event {category: 'medication'})<-[:HAS]-(patient)
WHERE date(measurement.startDate) > date(admission.startDate)
    AND date(operation.startDate) > date(admission.startDate)
    AND date(admission.startDate) > date(medication.startDate)
RETURN collect(distinct patient.pseudoId);

However this query times out with my test dataset of 1000 patients on my machine. The Explain produces:

Is there more efficient way to model or query the data in neo4j?

I've uploaded my test data here.


Find multiple crimes in the same location but happened at a relative time to each other
(Stefan Armbruster) #2

Some ideas to improve the data model:

  • convert your category property into a second label for :Event nodes, so you have e.g. :Event:Measurement nodes.
  • consider adding [:AFTER] relationships between Events of the same person. So you have a timeline of that patient explicitly and don't need to compare dates - that approach is what I call "graph thinking" in contrast to "relational thinking" (comparing dates)

(Shane A Husson) #3

Thanks for the ideas @stefan.armbruster.

Can you point to any examples of a model that has :AFTER? This approach sounds promising but we also have queries that want to know events that happened before or on the same day/month/year as another event. So I'm concerned that it will be complicated to create the graph and the cost will also be high.


(Stefan Armbruster) #4

On important factor for a decision is IMO if you need to do cross-patient comparisions on dates.

If you only do this in a non-cross-patient way I'd maintain for each patient a chain of events. If "same day" is an important aspect of your domain, you can overlay the rather generic AFTER relationship with SAME_DAY relationships.


(Shane A Husson) #5

Yeah we only need to do this in a non-cross-patient way.

If you only do this in a non-cross-patient way I'd maintain for each patient a chain of events. If "same day" is an important aspect of your domain, you can overlay the rather generic AFTER relationship with SAME_DAY relationships.

Ok, I'll give it a shot.


(Michael Hunger) #6

You can also drive the query with a USING SCAN ON patient so it doesn't do the index lookups but instead filters.


(Michael Hunger) #7

Hmm I played around a bit with it, my original ideas (turning the categories into labels or rel-types) didn't help so much due to the 4x expand that the planner does.

It's much better to do the predicate on a collection per patient.

profile
MATCH (patient:Patient)
WITH patient, [(patient)-[:HAS]->(e:Event) WHERE e.category IN ['measurement','admission','operation','medication'] | e] as events
WITH patient, apoc.map.groupByMulti(events,'category') as grouped
WHERE all(pair IN [['admission','measurement'],['admission','operation'],['medication','admission']] WHERE ANY(first IN grouped[pair[0]] WHERE ANY(second IN grouped[pair[1]] WHERE first.startDate < second.startDate)))
RETURN size(collect(distinct patient.pseudoId));
// Started streaming 1 records after 85 ms and completed after 85 ms.
// 959

TODO does it have to be the same admission or not?
If yes then the predicate has to be adapted a bit.

profile
MATCH (patient:Patient)
WITH patient, [(patient)-[:HAS]->(e:Event) WHERE e.category IN ['measurement','admission','operation','medication'] | e] as events
WITH patient, apoc.map.groupByMulti(events,'category') as grouped
WHERE any(admission IN grouped['admission'] WHERE 
          ANY(medication IN grouped['medication'] WHERE admission.startDate > medication.startDate) AND 
          ANY(operation IN grouped['operation'] WHERE admission.startDate < operation.startDate) AND 
          ANY(measurement IN grouped['measurement'] WHERE admission.startDate < measurement.startDate)
      )
RETURN size(collect(distinct patient.pseudoId));
// Started streaming 1 records after 127 ms and completed after 127 ms.
// 875

(Shane A Husson) #8

Thanks @michael.hunger, the query you suggested is much faster and seems to scale as more filters are added. I'll continue to investigate it.