Bitemporal modeling with millions of nodes: VM stop-the-world


Hi all!

I am trying to ingest a large amount of data that arrives continuously at a Java application. From there, we send queries to Neo4j through your driver for ingestion. The machine is a MacBook Pro with 16 GB of RAM, and we set both the heap size and the page cache to 5 GB in the settings.

We have two labels: Person and Building, where each Person belongs to a Building. For now, a Building contains nothing other than its "buildingId", and we create directed relationships from persons to buildings. A Person is always connected to at least one PersonState that contains the person's information for a specific visit. In our case, the first state contains the person's entry time, received from our Java application.

The relationships are always stamped with a transaction time, txStart, and a valid time, vtStart. When the person has exited, we create a new person state and logically delete the previous state by setting its txEnd to the current time. The new person state then carries vtStart, vtEnd and txStart (but no txEnd, since it is the active person state). Please see the example graph below.
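To make the convention concrete, here is a minimal sketch of how the currently active state of a person could be read under this model. It assumes the HAS_PERSON_STATE relationship type that appears in the load query further down; $personId is a placeholder parameter used only for illustration.

```cypher
// Active state = the HAS_PERSON_STATE relationship that has no txEnd yet.
MATCH (p:Person {personId: $personId})-[r:HAS_PERSON_STATE]->(ps:PersonState)
WHERE r.txEnd IS NULL
RETURN ps, r.txStart AS txStart, r.vtStart AS vtStart, r.vtEnd AS vtEnd
```

Logically deleted states are the ones where r.txEnd is set, so the same pattern with IS NOT NULL would return the history instead.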

At first, these were the queries used to load each person:

WITH datetime.transaction('Z') AS timestamp
MERGE (p:Person{personId: '5921'})
MERGE (pz:Building{buildingId: '127'})
MERGE (p)-[rp:PRESENT_AT]->(pz)
ON CREATE SET rp.txStart = timestamp

MERGE (ps:PersonState{personId: '5921', name: 'Thomas'})
WITH p, ps, timestamp
MERGE (p)-[r:HAS_PERSON_STATE]->(ps)
ON CREATE SET r.txStart = timestamp, r.vtStart = date("2020-01-01")
ON MATCH SET r.txEnd = timestamp

WITH p, ps, timestamp
MATCH (p)-[r:HAS_PERSON_STATE]->(ps)
WHERE r.txEnd = timestamp
CREATE (p)-[r2:HAS_PERSON_STATE{txStart: timestamp, vtStart: date("2020-01-01"), vtEnd: r.vtEnd}]->(ps);

This, however, gave us many VM stop-the-world warnings, and the database crashed shortly after. We saw that this was due to an Eager operator caused by the last rows (13-16), so instead we added a RETURN (p, ps) on row 12. We then run a second transaction, in the same session, that uses the returned node IDs to match the nodes we want to connect faster. Yes, we use ID(n) and not an id property of the label. The split also includes a check: if r.txEnd does not exist, we do not run rows 14-16 to match and create the relationship.
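Roughly, the split looks like the sketch below. It is not the exact pair of queries we run: the parameter names ($personId, $buildingId, $name, $vtStart, $pId, $psId, $txEnd) are illustrative placeholders, and passing the returned txEnd back as a parameter is an assumption about how the check can be carried into the second transaction.

```cypher
// Transaction 1: merge nodes and relationships, then hand the internal
// node IDs and the (possibly null) txEnd back to the Java application.
WITH datetime.transaction('Z') AS timestamp
MERGE (p:Person {personId: $personId})
MERGE (pz:Building {buildingId: $buildingId})
MERGE (p)-[rp:PRESENT_AT]->(pz)
ON CREATE SET rp.txStart = timestamp
MERGE (ps:PersonState {personId: $personId, name: $name})
MERGE (p)-[r:HAS_PERSON_STATE]->(ps)
ON CREATE SET r.txStart = timestamp, r.vtStart = date($vtStart)
ON MATCH SET r.txEnd = timestamp
RETURN id(p) AS pId, id(ps) AS psId, r.txEnd AS txEnd;

// Transaction 2: sent only when txEnd came back non-null.
// Looks the nodes up by internal ID instead of by label and property.
MATCH (p) WHERE id(p) = $pId
MATCH (ps) WHERE id(ps) = $psId
MATCH (p)-[r:HAS_PERSON_STATE]->(ps)
WHERE r.txEnd = $txEnd
CREATE (p)-[:HAS_PERSON_STATE {txStart: $txEnd, vtStart: date($vtStart), vtEnd: r.vtEnd}]->(ps);
```

Prefixing either statement with EXPLAIN shows its plan without executing it, which is one way to check whether an Eager operator is still present.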

This did indeed speed everything up and the warning messages disappeared; however, they returned once the number of nodes grew to around 35 million.

Would it help to perform the last match-and-create operation in its own session and transaction? If not, is there anything else we can do? More RAM? We expect the total number of nodes to eventually exceed 200 million.

Note: some information has been masked for privacy reasons; however, the model remains the same.

Many thanks in advance!