Return collection of unique property values from different nodes

Hello,

How to best use MATCH across 5 different node labels to collect in a list unique values found on all 5 nodes with the same property label? I'm writing a somewhat functional query here but it's getting quite long so I don't know if I'm taking the right approach.

As a hypothetical scenario, I have these 5 nodes with these uniqueness constraints:

Actor, actor_id
Director, director_id
Producer, producer_id
Writer, writer_id
Designer, designer_id

And all of these 5 nodes also contain a business_id as an additional node property for example. I created indexes on these 5 node properties too for faster search performance. Indexes were created as follows:

CREATE INDEX ActorBusID FOR (a:Actor) ON (a.bus_id);
CREATE INDEX DirectorBusID FOR (d:Director) ON (d.bus_id);
CREATE INDEX ProducerBusID FOR (p:Producer) ON (p.bus_id);
CREATE INDEX WriterBusID FOR (w:Writer) ON (w.bus_id);
CREATE INDEX DesignerBusID FOR (dg:Desginer) ON (dg.bus_id);

This is what I'm trying:

MATCH(a:Actor)
WITH a.bus_id AS actorids
ORDER BY actorids ASC

MATCH(d:Director)
WITH  d.bus_id AS directorids
ORDER BY directorids ASC

MATCH(p:Producer)
WITH p.bus_id AS producerids
ORDER BY producerids ASC

MATCH(w:Writer)
WITH w.bus_id AS writerids
ORDER BY writerids ASC

MATCH(dg:Desginer)
WITH dg.bus_id AS designerids
ORDER BY designerids ASC

RETURN collect( DISTINCT [actorids, directorids, producerids,  writerids, designerids]) AS `Business_IDs`

I'm hoping to return a list of ordered and ascending unique values

My end result should be something like:

Business_IDs
[1, 2, 3, 4, 5]

It seems to me like the query I wrote is super long and it could be a lot shorter or simpler.

Try this:

MATCH(a:Actor)
WITH collect(distinct a.bus_id) AS actorids

MATCH(d:Director)
WITH  collect(distinct d.bus_id) AS directorids, actorids
with apoc.coll.sort(apoc.coll.union(actorids, directorids)) as actdir


MATCH(p:Producer)
WITH collect(distinct p.bus_id) AS producerids, actdir
with apoc.coll.sort(apoc.coll.union(actdir, producerids)) as actdirpd

MATCH(w:Writer)
WITH Collect(distinct w.bus_id) AS writerids, actdirpd
with apoc.coll.sort(apoc.coll.union(actdirpd, writerids)) as actdirpdwr

MATCH(dg:Desginer)
WITH collect(distinct dg.bus_id) AS designerids, actdirpdwr
with apoc.coll.sort(apoc.coll.union(actdirpdwr, designerids)) as final
RETURN final
1 Like

The simplest approach would be to come up with some label that could be applied to all of these nodes, such as :BusinessEntity or something, and add the label to all of those (remember nodes can be multi-labeled) as well as create a corresponding index (which might replace the individual indexes on bus_id on the other labels).

Then your query simplifies to:

MATCH (n:BusinessEntity)
WITH DISTINCT n.bus_id as bus_id
ORDER BY bus_id ASC
RETURN collect(bus_id) as `Business_IDs`

And if you don't want to do that, then you should be able to get Cypher to perform unions of each label scan under the hood with something like:

MATCH (n)
WHERE n:Actor OR n:Director OR n:Producer OR n:Writer OR n:Designer
WITH DISTINCT n.bus_id as bus_id
ORDER BY bus_id ASC
RETURN collect(bus_id) as `Business_IDs`

Use an EXPLAIN of the plan to verify this. You should see a series of branching at the start, with NodeByLabelScan and Union operations. You do NOT want to see AllNodesScan operators.

1 Like

@ameyasoft Thank you for your reply! I did not know about those two APOC functions. I'll give those a try.

@andrew.bowman
Oh ok, yeah I can definitely add another label to those 5 nodes. These 5 nodes get updated on a daily basis so adding a new label to these nodes when they get updated shouldn't be a problem.

Thank you Neo4j community! I think this is what I was looking for.

Just to make sure, when I create this index on a multi-labeled node, is this the correct syntax so that the index is applied to all of the bus_id on all 5 nodes?

CREATE INDEX BusinessIDs FOR (x:BusinessEntity) ON (x.bus_id);

Then, will this index apply to all of the 5 bus_id properties on each node when I run MATCH(n:BusinessEntity)?

n:Actor:BusinessEntity
n:Producer:BusinessEntity
n:Director:BusinessEntity
n:Writer:BusinessEntity
n:Designer:BusinessEntity

If I'm reading the documentation correctly, it seems like Indexes on multi-labeled nodes are not allowed. No?

A named index on a single property for all nodes that have a particular label can be created with CREATE INDEX index_name FOR (n:Label) ON (n.property) .

I'm trying this out right now, and will report my findings here.

Thanks!

You will need to add the new label on all the nodes of all the other labels:

MATCH (n)
WHERE n:Actor OR n:Director OR n:Producer OR n:Writer OR n:Designer
SET n:BusinessEntity

After that, you can use the simpler query on just :BusinessEntity nodes for the bus_id and it will use the index.

However this does mean you will have to make sure for every new node of those labels that you also add the :BusinessEntity label.