Neosemantics import turtle triples is resource intensive

madsbuch
Node

Hi there!

I am trying to import a turtle dataset of just under 700MB containing roughly 2M nodes.

I am importing it into a Community Edition Neo4j database, started in Docker as follows:

docker run \
    --name neo_realm_neo4j \
    -p7474:7474 -p7687:7687 \
    --add-host host.docker.internal:host-gateway \
    -d \
    --env NEO4JLABS_PLUGINS='["apoc", "n10s"]' \
    --env NEO4J_AUTH=neo4j/test \
    --env NEO4J_dbms_unmanaged__extension__classes="n10s.endpoint=/rdf" \
    --env NEO4J_dbms_memory_heap_initial__size=2G \
    --env NEO4J_dbms_memory_heap_max__size=55G \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/plugins:/plugins \
    neo4j:latest

with the following Cypher query

CALL n10s.rdf.import.fetch('{url}', 'Turtle')

(where {url} is a Docker-accessible URL returning the Turtle file)

As is evident the container is being given 55GB of RAM.

The import uses all available RAM and slows down towards the end. After restarting the Docker container everything is persisted, but the data only takes up 35GB of space.

Looking around, it seems like a 2M node dataset is rather small compared to what other people work with.

Are there any optimizations that would make this more manageable?

1 REPLY

dana_canzano
Neo4j

@madsbuch
Not specific to neosemantics, but we generally do not recommend setting min/max heap to different values:

The heap memory size is determined by the parameters dbms.memory.heap.initial_size
and dbms.memory.heap.max_size. It is recommended to set these two parameters
to the same value to avoid unwanted full garbage collection pauses.

Also, we generally do not see customers setting max heap above 31G.

How did you arrive at a min and max heap of 2G and 55G respectively?

Further, you have not defined dbms.memory.pagecache.size. This parameter defines how much RAM is allocated to caching the graph structure in memory. If it is not defined, a default is used, but we generally see customers define it explicitly.

The doc reference above provides some good details on these parameters.
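
As a rough sketch of what that could look like for the docker command from the question: heap initial and max are set to the same value, kept below 31G, and the page cache is set explicitly. The HEAP_SIZE and PAGECACHE_SIZE values are placeholders I made up for illustration, not tuned recommendations, and this assumes a Neo4j version that still uses the dbms.memory.* setting names quoted above (newer images may use different names).

```shell
# Hypothetical sizes, chosen only for illustration; adjust to the RAM actually available.
# Heap initial == heap max, and below 31G, following the advice in this reply.
HEAP_SIZE=16G
# Page cache set explicitly instead of relying on the default.
PAGECACHE_SIZE=16G

docker run \
    --name neo_realm_neo4j \
    -p7474:7474 -p7687:7687 \
    --add-host host.docker.internal:host-gateway \
    -d \
    --env NEO4JLABS_PLUGINS='["apoc", "n10s"]' \
    --env NEO4J_AUTH=neo4j/test \
    --env NEO4J_dbms_unmanaged__extension__classes="n10s.endpoint=/rdf" \
    --env NEO4J_dbms_memory_heap_initial__size="$HEAP_SIZE" \
    --env NEO4J_dbms_memory_heap_max__size="$HEAP_SIZE" \
    --env NEO4J_dbms_memory_pagecache_size="$PAGECACHE_SIZE" \
    -v "$HOME/neo4j/data:/data" \
    -v "$HOME/neo4j/logs:/logs" \
    -v "$HOME/neo4j/import:/var/lib/neo4j/import" \
    -v "$HOME/neo4j/plugins:/plugins" \
    neo4j:latest
```

Heap plus page cache should leave headroom for the operating system and the rest of the JVM, so the two values together should stay well under the host's physical RAM.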