Distribute Children Relationships across Parents

nazevedo3 · May 31, 2024, 3:27pm

I'm working on a project where I need to try to evenly distribute children nodes across parent nodes of the same type along with another requirement I'll explain as I setup the scenario.

Let's say I my starting point is this graph(s)

The orange Server nodes represent servers and the goal is to evenly distribute the Config nodes across the servers so that the customer configs do not land on the same server IF multiple servers are available.

The goal would to have the graph look something like this:

If however, there is only one server available, then I need to assign both Config nodes to the single Server;

If a customer were to add additional config but there's not a server where they DON'T already have config applied, the new config should be applied to a server that has the least amount of Configs applied to it.

Additionally, when the config nodes are applied, there needs to be two attributes on the config node set. Let's say customer_id and config_id.

The customer ID's range from let's say 1-100 and config ID's from 1000-2000. These are unique to the server the configs get applied to.

I'm about 90% there, but I think my query can be optimized and I the way I'm handling the ID's isn't the best because I'm essentially just looking for what's already configured, grabbing the max and adding 1, but if a config is deleted, I potentially lose an ID because I'm always grabbing the max.

CALL {
    MATCH(config:Config)-[:CONFIG_FOR]->(c:Customer) WHERE config.state = 'SCHEDULED'
    RETURN COLLECT(config) AS configs
}
UNWIND configs AS config
CALL {
   WITH config
   WITH config
   // LOOK FOR AVAILABLE SERVERS IN THE SAME LOCATION THAT DON'T HAVE
  // CONFIG FROM THE SAME CUSTOMER
   Match(server:Server) WHERE server.location = config.location
    AND NOT EXISTS((:Config{uuid: c.uuid})-[:CONFIG_FOR]->(server))
   WITH server, config
    LIMIT 1
    MERGE(config)-[:CONFIG_FOR]->(server)
  // SEARCH TO SEE IF THERE ARE OTHER CONFIGS APPLIED TO DETERMINE THE ID ASSIGNMENTS
    OPTIONAL MATCH(c:Config)-[:CONFIG_FOR]->(server) WHERE c.uuid <> config.uuid
    CALL {
     WITH c, config
     WITH c, config
    WHERE c IS NOT NULL
    WITH MAX(c.customer_id) + 1 AS customerId, MAX(c.config_id) + 1 AS configId, c
    SET config.customer_id = customerId
    SET config.config_id = configId

   UNION
    WITH  c, config
    WITH c, config
    WHERE c IS NULL
   SET config.customer_id = 1
   SET config.config_id = 1000

}
}
// THIS NEXT SUBQUERY WAS ADDED TO HANDLE THE CASE WHERE THERE NO SERVERS AVAILABLE THAT DON'T ALREADY HAVE A CUSTOMER CONFIG APPLIED.
CALL {
  OPTIONAL MATCH(config:Config)-[:CONFIG_FOR]->(c:Customer) WHERE config.state = 'SCHEDULED'
  WHERE NOT EXISTS((config)-[:CONFIG_FOR]->(:Server))
  Match(server:Server) WHERE server.location = config.location
  WITH server, config
  LIMIT 1
  MERGE(config)-[:CONFIG_FOR]->(server)
  WITH config, server
  OPTIONAL MATCH(existingConfig:Config)-[:CONFIG_FOR]->(server) WHERE existingConfig.customer_id = config.customer_id
  SET config.customer_id = existingConfig.customer_id
 WITH config, server
 MATCH(allConfigs:Config)-[:CONFIGURED_FOR]->(server)
 WITH config, allConfigs ORDER BY allConfigs.config_id
 WITH MAX(allConfigs.config_id) + 1 AS new_config_id, config
SET config.config_id = new_config_id
}

As mentioned at the start, the above query does seem to be assigning the configs in the way I would like, but I'm wondering if there's an easier way to go about it (less query statements), however, the generating of the ID's works in that I get the behavior I want, but let's say a server has configs with IDs 1 and 2, but then 1 is deleted, the next config will get 3 and I'm not able to re-use 1.

glilienfield · June 1, 2024, 4:41pm

Does this query work? I think the problem has been complicated by your model. You are mixing graph and relational concepts. You should not use foreign keys to relate entities, like you are doing with the uuid, customer_id, and config_id attributes. Relationships should be used to relate entities. Each entity should have a primary key so you can query for them, but all queries should use relationships to identify related entities.

You should have a mechanism for generating the primary keys, so they are guaranteed unique in a multiuser environment. What you have done here to determine the keys based on a 'max' value of all existing keys could result in a race condition where concurrent executions of this query determine the same id's.

I would do something like this, if there were no id's involved. The server and configuration should already have unique identifiers created when the nodes were created. There should be no id's added to the configuration, as you can determine the customer's id from its relationship to its customer and the configuration should already have one.

Basically, the query finds a server with no existing customer configurations (if at least one exists) and a server with the minimum number of existing customer configurations (if one exists), and takes the preferred one if it exists (server with no existing customer configurations).

MATCH(config:Config{state: 'SCHEDULED'})-[:CONFIG_FOR]->(customer:Customer)

Call {

    //Scenario: Find one empty server in same location if exists
    WITH config, customer
    MATCH(server:Server{location: config.location})
    WHERE NOT EXISTS((customer)<-[:CONFIG_FOR]-(:Config)-[:APPLIED_TO]->(server))
    RETURN 1 as scenario, server
    LIMIT 1

UNION

    //Scenario: find one server with the least number of configurations for customer if it exists 
    WITH config, customer
    MATCH(server:Server{location: config.location})
    Match(customer)<-[:CONFIG_FOR]-(:Config)-[:APPLIED_TO]->(server)
    WITH server, count(*) as numberOfConfigurations
    RETURN 2 as scenario, server
    ORDER By numberOfConfigurations asc
    Limit 1

}

WITH config, scenario, server
ORDER BY scenario asc
LIMIT 1

MERGE(config)-[:CONFIG_FOR]->(server)

In the first match, do you need to worry about the configuration being applied to a server already, or is implied by the state equal to 'SCHEDULED'. Something like this:

MATCH(config:Config{state: 'SCHEDULED'})-[:CONFIG_FOR]->(customer:Customer)
WHERE NOT EXISTS ( (config)-[:APPLIED_TO]->(server) )

nazevedo3 · June 1, 2024, 7:19pm

Hey Gary,

I hope you're doing well!

Let me try to clarify the customer_id and config_id I mentioned previously. I don't think I did a great job explaining what they are used for or why I need them.

They are not meant to be used to determine any relationships whatsoever. These ID's only come into play after a config has been assigned/scheduled to a server.

What they are used for is for the downstream service that actually does the configuring of the customer configs on the server.

My thought process was, once the config has been assigned to a server, I then need to look at all of the existing configs on the server. If the customer already has a config assigned to that server, the customer_id attribute needs to be the same on the new config node. If this is the first config for the specific customer on the server that was a selected, I need to find the next available ID in a range between X-X.

The config_id is similar but slightly different. I still need to know what existing configs are applied to the server, but the difference is, even if the same customer already has a config on the server, the config_id for each config needs to be unique. I need to find the next available config_id between a range of X-X.

Regarding your last question, I don't need to worry about a configuration already being applied to a server because I'm leveraging the 'SCHEDULED' state to determine what needs to be assigned.

glilienfield · June 1, 2024, 8:16pm

I am still confused. Each customer should have a unique id that is assigned when the customer node is created. The same for configuration and server nodes.

As such, the downstream system can figure out the customer id from the given configuration node as follows, identified to the downstream service with its config_id:

MATCH(config:Config{id: $config_id})
MATCH(customer:Customer)<-[:CONFIG_FOR]-(config)
RETURN config, customer

glilienfield · June 1, 2024, 8:22pm

Here is a thread were I discussed creating unique ids. Can’t you create these ids at creation of your entities, then use them in your assignment query?

nazevedo3 · June 1, 2024, 8:33pm

These ID's are not related to the Customer or Config nodes themselves. I'm not using them like a primary/foreign key in a relational model.

These ID's are to be determined only after a config has been assigned to a server. The ID's are not unique to a Customer/Config node, they are unique to a Customer/Config node per Server node it's assigned to.

glilienfield · June 1, 2024, 9:22pm

Ok then. Will the query be called concurrently by multiple threads/processes? If so, you could possible get a race condition where multiple threads calculate the same max customer_id or config_id value for a given server before the newly attached config node is updated.

If this is a concern, you could lock the server node, calculate the max value of with id among all the config nodes attached to the locked server node, then unlock the sever node. This will force multiple threads that matched on the same server node to perform their operation sequentially, instead of in parallel.

I can suggest a potential if this is a concern.

Do you really care if you have gaps in your if values?

nazevedo3 · June 1, 2024, 9:27pm

No, the query will not be called concurrently.

Regarding the gaps in values, not a huge concern right now, my range is pretty wide and I can't imagine I'd reach a limit anytime soon. If it was trivial to be able to re-use id's after they were deleted, that would be great, but if that's a headache I can make that a future me problem

glilienfield · June 1, 2024, 10:05pm

I don’t think the juice is worth the squeeze

nazevedo3 · June 2, 2024, 1:14am

Same.

I would still like to hear your idea for handling the situation where multiple processes may be running the query concurrently if you don’t mind.

glilienfield · June 2, 2024, 4:10am

Sure. I provided a detailed explanation of the issue with trying to create a unique identifier in a multi-threaded environment in the link I provided in an earlier reply in this thread. I also provided a potential approach. It was premised on locking a node which is used to synchronize multiple threads from executing the same block of code concurrently.

The issue with your calculation of the customerId and clientId as the maximum across a list of nodes is that multiple threads can executing this code concurrently and could calculate the same maximum values before one of them updates the new configuration node.

For this discussion, assume the code I provided worked, so I can provide conceptual example. In this code multiple threads could end up with the same server node in which they want to calculate the maximum values across its related configuration nodes. The approach I outlined in the other post is to lock the node each thread needs, perform the processing, and then unlock the node. This will serialize all the threads, so only one at a time executes the critical block of code that is not thread-safe. Below is a modified version of my query that locks the server node as soon as it is identified, so a unique max can be calculated. Locking occurs as soon as a SET is done on the node. I used a dummy property on the server node to SET, since we don't want to update the server node. The property is removed at the end. There server node will be unlocked once the query ends, allowing other threads to proceed. That point the max would have been set on the config node and a new max will be calculated for the next thread.

Beware, I have not tested any of the code.

MATCH(config:Config{state: 'SCHEDULED'})-[:CONFIG_FOR]->(customer:Customer)

Call {

    //Scenario: Find one empty server in same location if exists
    WITH config, customer
    MATCH(server:Server{location: config.location})
    WHERE NOT EXISTS((customer)<-[:CONFIG_FOR]-(:Config)-[:APPLIED_TO]->(server))
    RETURN 1 as scenario, server
    LIMIT 1

UNION

    //Scenario: find one server with the least number of configurations for customer if it exists 
    WITH config, customer
    MATCH(server:Server{location: config.location})
    Match(customer)<-[:CONFIG_FOR]-(:Config)-[:APPLIED_TO]->(server)
    WITH server, count(*) as numberOfConfigurations
    RETURN 2 as scenario, server
    ORDER By numberOfConfigurations asc
    Limit 1

}

WITH config, scenario, server
ORDER BY scenario asc
LIMIT 1

SET server._LOCK = true

WITH 
    config, 
    server, 
    [(c:Config)-[:APPLIED_TO]->(server) | c] as serverConfigs

WITH 
    config, 
    server,
    reduce(s=-1, i in serverConfigs | CASE WHEN i.customer_id > s THEN i.customer_id ELSE s END) as max_customerId,
    reduce(s=-1, i in serverConfigs | CASE WHEN i.config_id > s THEN i.config_id ELSE s END) as max_configId

WITH 
    config,
    server,
    CASE WHEN max_customerId = -1 THEN 100 ELSE max_customerId END as customerId,
    CASE WHEN max_configId = -1 THEN 100 ELSE max_configId END as configId

MERGE(config)-[:CONFIG_FOR]->(server)

SET config.customer_id = customerId
SET config.config_id = configId

REMOVE server._LOCK

NOTE: I believe there is also another race condition in this query. It occurs because multiple threads could execute the code in the CALL concurrently looking for the Server node based on your business rules. It is possible that multiple threads would pick the same server when no configuration exists or pick the same server based on it having the minimum number of existing configurations. Of course, if these threads would not executing the code concurrently, only the first one would find a server with no existing configurations, or the server with the minimum number of configurations for the first thread may not be the server with the minimum number of configurations when the subsequent threads execution this code for the same given configuration. Since the server selection criteria in each part of the UNION is dependent on the customer you may be able to resolve this concurrency issue by locking the Customer node once it is returned from the first Match.

As a final issue, You could also have an issue with multiple threads executing this query concurrently with in the first Match. In this case, multiple threads could determine a set of Config nodes with the state of SCHEDULED that overlap each other. This would pose a problem since it could result in the same Config node being processed multiple times. BTW- when do you change the status from SCHEDULED to PROCESSED or something else?

nazevedo3 · June 2, 2024, 5:14am

Thanks for that info, that’s good to know as up to this point I haven’t really considered concurrency issues.

To give some more context, this specific query will not be called my multiple services concurrently, but it is called by a single service in a loop, once every 3-5 seconds.

What I was seeing just now was inconsistent id’s being assigned. One time if I have three different configs, the ids get assigned as I expect. The other time, two of the configs get the same max value, etc.

Right now, I’m changing the state from SCHEDULED to READY at the very end of the query.

Topic		Replies	Views
How to spread the path to childs if relation of parents exists Cypher cypher	2	95	May 31, 2024
TO increase the process time of getting data for data mutliple realtionship Neo4j Graph Platform performance , cypher , operations , relationship	7	277	May 23, 2022
Dynamically Create Nodes & Relationships Cypher	5	971	November 22, 2023
Operations between related nodes of the same relation type Cypher data-modeling	2	415	April 12, 2021
Creating a level attribute for speed Cypher cypher	11	329	December 21, 2020

Distribute Children Relationships across Parents

Related topics