Are virtual nodes/relationships the best option here?

I'm building software for animal breeders that allows them to create a "hypothetical mating" between two animals so they can view potential traits of their would-be offspring. We store very basic information in Neo4j right now -- just Animal nodes with an animal_id property and the species, plus HAS_SIRE and HAS_DAM relationships between the nodes, to model the lineage of a more complex DB stored in an RDBMS.

Our current hypothetical mating tool creates an Animal node with a negative animal_id and deletes it at the end of each request (we create a temporary node and relationships -> run queries against the DB -> detach delete the temp node). I'm wondering if there's a better way. I just came across virtual nodes, relationships, and graphs in the APOC library, which look very similar to what I'm doing internally.

I've been playing around with them with a little success but can't quite get what I'm looking for yet. Here's a sample query I have so far that actually returns something (the hypothetical offspring with the sire, dam, and the HAS_SIRE/HAS_DAM relationships):

MATCH (hypoSire:Animal {animal_id: 333})
MATCH (hypoDam:Animal {animal_id: 321})
WITH hypoSire, hypoDam, apoc.create.vNode(['Animal'], {animal_id:-333321}) AS hypo
CALL apoc.create.vRelationship(hypo,'HAS_SIRE',{},hypoSire) YIELD rel as hypoSireRel
CALL apoc.create.vRelationship(hypo,'HAS_DAM',{},hypoDam) YIELD rel as hypoDamRel
RETURN *;

Now I'd like to take this a little further and query a 4-generation family lineage with this. Something like the following:

MATCH (hypoSire:Animal {animal_id: 333})
MATCH (hypoDam:Animal {animal_id: 321})
WITH hypoSire, hypoDam, apoc.create.vNode(['Animal'], {animal_id:-333321}) AS hypo
CALL apoc.create.vRelationship(hypo,'HAS_SIRE',{},hypoSire) YIELD rel as hypoSireRel
CALL apoc.create.vRelationship(hypo,'HAS_DAM',{},hypoDam) YIELD rel as hypoDamRel
MATCH ped = (hypo)-[:HAS_SIRE|HAS_DAM*0..4]->(ancestor:Animal) 
RETURN ped;

In my final MATCH I've also tried replacing the (hypo) variable with the newly-created virtual node (:Animal {animal_id: -333321}), but it always returns an empty result.

So a few questions...

  1. Am I on the right track with using virtual nodes/relationships to solve this problem?
  2. Would virtual graphs be a better solution? I wasn't sure how to utilize them, but it seems like it might be.
  3. If nodes/relationships are the answer, could someone help point me in the right direction for my 2nd query?
  4. How does Neo4j handle indexes and constraints with virtual nodes? If I have two completely separate requests simultaneously creating a node with a duplicate animal_id (which has a unique index/constraint), will one of them throw an error?

Thanks in advance!

Virtual nodes and relationships were created primarily for visualization: they let the visualizer display nodes and relationships that don't actually exist in the graph. Querying over them and using them in calculations isn't well supported at this time, though we are keeping an eye on this for future development.

Virtual nodes don't actually exist in the graph, and as such are not indexed or subject to constraint checks.

At this point I don't think we have general virtual graph projection available (this is used in graph algos under the hood), but when implemented this may be more helpful for your case.
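
For the 4-generation lineage specifically, there may be a workaround that avoids virtual nodes entirely. MATCH only sees nodes stored in the graph, which is why the pattern starting at the virtual hypo comes back empty. Since the hypothetical offspring's ancestors are just the sire, the dam, and their ancestors, the same pedigree can be read by starting from the two parents and going 3 generations up instead of 4. A minimal sketch, assuming the animal_id values from your example, with the offspring itself added on the application side:

```cypher
// Sketch only: read the pedigree from the two real parents,
// so no virtual or temporary node is needed.
MATCH (parent:Animal)
WHERE parent.animal_id IN [333, 321]
// Depth 0 is the parent itself; depths 1..3 are its ancestors,
// i.e. generations 2..4 as seen from the hypothetical offspring.
MATCH ped = (parent)-[:HAS_SIRE|HAS_DAM*0..3]->(ancestor:Animal)
RETURN parent.animal_id AS parent_id, ped;
```

Each returned path is one generation short of what the original query was after, so the application would prepend the hypothetical offspring to it.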

Ahh bummer, thanks for the info. So it sounds like the current method of creating a temporary node/relationships at runtime might be the best option? I'm very new to Neo4j so I might be overlooking something completely obvious.

Bearing in mind that I don't know the complete usage of your graph, one option to consider would be to not get into virtual nodes and relationships at all, but model the offspring as a regular node. It can be differentiated with a new label (or an additional label depending on impact to queries) indicating that it is a hypothesis.
The advantage of this would be that the hypotheses are persisted, so they can be "shared" or worked on for a longer period of time, and you could have more than one should you want to compare them for a given breeder. It would then be quite trivial to delete them once the perfect match is found (as you do currently).
The disadvantage would be having to make sure that your existing queries do not unintentionally traverse the HAS_SIRE or HAS_DAM relationships to the hypothetical offspring. Modifying these queries will be easy; it's just an additional factor to keep in mind when writing new ones.
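
A rough sketch of what that could look like. The Hypothetical label and the created_at property are names made up for illustration, and the ids are the ones from the question:

```cypher
// Create the hypothetical offspring as a regular node with its own label.
// It deliberately does not carry the Animal label, so queries that match
// on :Animal never start from it.
MATCH (sire:Animal {animal_id: 333})
MATCH (dam:Animal {animal_id: 321})
CREATE (hypo:Hypothetical {animal_id: -333321, created_at: timestamp()})
CREATE (hypo)-[:HAS_SIRE]->(sire)
CREATE (hypo)-[:HAS_DAM]->(dam)
RETURN hypo;

// An existing "offspring of" query, guarded so that it skips hypotheticals
// when it walks the relationships in the incoming direction.
MATCH (a:Animal {animal_id: 333})<-[:HAS_SIRE|HAS_DAM]-(child)
WHERE NOT child:Hypothetical
RETURN child;
```

With a separate label, a uniqueness constraint defined on Animal nodes would not apply to the hypotheticals; adding Animal as a second label would bring the constraint back into play.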

Thanks for your input. I like this approach and will explore the idea with our application.

We're currently creating regular nodes and relationships for the hypotheticals. The main difference is that we're using the same label as our regular nodes and we're not persisting these nodes or relationships longer than a single request within our application. The reason is that breeders will often create hundreds of temporary hypotheticals that we don't want to store in the database due to a tight memory budget. However, to your point, we do have a feature that allows them to save a hypothetical once they've found one that they like. In that case we persist the animal_ids of the sire and dam in our RDBMS and continue to perform the temporary create->delete lifecycle on each hypothetical request. The requests don't happen very often, so this hasn't been a problem, but obviously it isn't ideal, which is why I'm in search of a better method.

I might use a hybrid of the two ideas. We might be able to persist all hypotheticals for X days and run a script that deletes all non-saved ones to "garbage collect", so to speak. We can modify our existing queries to only traverse non-hypotheticals; I don't think that'll be too much of an issue. I'm just thinking out loud, but let me know if anything in that sounds unreasonable.
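
A sketch of what that cleanup could look like. It assumes hypotheticals carry a Hypothetical label, a created_at property holding a millisecond timestamp, and a saved flag, none of which exist in our graph yet; 7 days stands in for X:

```cypher
// Sketch only: delete unsaved hypotheticals older than 7 days.
// Label and property names are assumptions, not part of the current model.
MATCH (h:Hypothetical)
WHERE coalesce(h.saved, false) = false   // treat a missing flag as "not saved"
  AND h.created_at < timestamp() - 7 * 24 * 60 * 60 * 1000
DETACH DELETE h;                         // also removes HAS_SIRE / HAS_DAM
```

This would run from a scheduled job outside the request cycle.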

Thanks again for your input :smiley:

Makes sense.
As an aside, you might want to look at the neo4j-expire module (disclaimer: I work at GraphAware)