Neosemantics import turtle triples is resource intensive

madsbuch · September 27, 2021, 7:41am

Hi there!

I am trying to import a turtle dataset of just under 700MB containing roughly 2M nodes.

I am doing it into a community edition of the Neo4J database started as follows in docker:

docker run \
    --name neo_realm_neo4j \
    -p7474:7474 -p7687:7687 \
    --add-host host.docker.internal:host-gateway \
    -d \
    --env NEO4JLABS_PLUGINS='["apoc", "n10s"]' \
    --env NEO4J_AUTH=neo4j/test \
    --env NEO4J_dbms_unmanaged__extension__classes="n10s.endpoint=/rdf" \
    --env NEO4J_dbms_memory_heap_initial__size=2G \
    --env NEO4J_dbms_memory_heap_max__size=55G \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/plugins:/plugins \
    neo4j:latest

with the following Cypher query

CALL n10s.rdf.import.fetch('{url}', 'Turtle')

(where URL is a docker accessible URL returning the turtle file)

As is evident the container is being given 55GB of RAM.

The import consumes all available RAM and slows down noticeably towards the end. After restarting the Docker container everything is persisted, but it only takes up 35GB of space.

Looking around, it seems like a 2M node dataset is rather small compared to what other people work with.

Are there any optimizations that would make this a bit more manageable?

dana_canzano · September 27, 2021, 12:11pm

@madsbuch
Not specific to neosemantics, but we generally do not recommend setting min/max heap to different values. From the documentation:

The heap memory size is determined by the parameters dbms.memory.heap.initial_size
 and dbms.memory.heap.max_size. It is recommended to set these two parameters
 to the same value to avoid unwanted full garbage collection pauses.

Also, we generally do not see customers set max heap above 31G.

How did you arrive at a min and max heap of 2G and 55G respectively?

Further, you have not defined dbms.memory.pagecache.size. This parameter determines how much RAM is allocated to caching the graph structure in memory. If it is not defined it falls back to a default, but we generally see customers define it explicitly.

The doc reference above provides some good details as to these parameters
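
As a rough sketch of what that could look like with the docker command from the question: initial and max heap are set to the same value, kept under 31G, and the page cache is set explicitly. The 16G and 8G figures are placeholders for illustration only, not a recommendation for this dataset. The neo4j:4.3 tag is also an assumption, chosen because the dbms.memory.* setting names discussed here are the 4.x names.

```shell
# Sketch only: 16G heap and 8G page cache are placeholder values, not tuned for this dataset.
# Initial and max heap are equal, and the page cache size is set explicitly.
# The image tag neo4j:4.3 is an assumption (4.x setting names).
docker run \
    --name neo_realm_neo4j \
    -p7474:7474 -p7687:7687 \
    -d \
    --env NEO4JLABS_PLUGINS='["apoc", "n10s"]' \
    --env NEO4J_AUTH=neo4j/test \
    --env NEO4J_dbms_memory_heap_initial__size=16G \
    --env NEO4J_dbms_memory_heap_max__size=16G \
    --env NEO4J_dbms_memory_pagecache_size=8G \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/plugins:/plugins \
    neo4j:4.3
```

In the environment variable names a single underscore stands for a dot and a double underscore for a literal underscore, so NEO4J_dbms_memory_heap_max__size maps to dbms.memory.heap.max_size.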
