How to upload local data into Neo4j on EC2

My Neo4j graph is running on EC2 (in Docker).

Where do you want me to start?
1. What I wanted to achieve
Import local data from my desktop into the Neo4j EC2 instance without first having to move it into the import folder inside the Docker container. Other users (with admin rights) should also be able to upload their CSV files.

neo4j.conf

#*****************************************************************
# Neo4j configuration
#
# For more details and a complete list of settings, please see
# https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/
#*****************************************************************

# The name of the database to mount
#dbms.active_database=graph.db

# Paths of directories in the installation.
#dbms.directories.data=data
#dbms.directories.plugins=plugins
#dbms.directories.certificates=certificates
#dbms.directories.logs=logs
#dbms.directories.lib=lib
#dbms.directories.run=run

# This setting constrains all `LOAD CSV` import files to be under the `import` directory. Remove or comment it out to
# allow files to be loaded from anywhere in the filesystem; this introduces possible security problems. See the
# `LOAD CSV` section of the manual for details.
dbms.directories.import=import

# Whether requests to Neo4j are authenticated.
# To disable authentication, uncomment this line
#dbms.security.auth_enabled=false

# Enable this to be able to upgrade a store from an older version.
#dbms.allow_upgrade=true

# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size.
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m

# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 50% of RAM minus the max Java heap size.
#dbms.memory.pagecache.size=10g

#*****************************************************************
# Network connector configuration
#*****************************************************************

# With default configuration Neo4j only accepts local connections.
# To accept non-local connections, uncomment this line:
#dbms.connectors.default_listen_address=0.0.0.0

# You can also choose a specific network interface, and configure a non-default
# port for each connector, by setting their individual listen_address.

# The address at which this server can be reached by its clients. This may be the server's IP address or DNS name, or
# it may be the address of a reverse proxy which sits in front of the server. This setting may be overridden for
# individual connectors below.
#dbms.connectors.default_advertised_address=localhost

# You can also choose a specific advertised hostname or IP address, and
# configure an advertised port for each connector, by setting their
# individual advertised_address.

# Bolt connector
dbms.connector.bolt.enabled=true
#dbms.connector.bolt.tls_level=OPTIONAL
#dbms.connector.bolt.listen_address=:7687

# HTTP Connector. There can be zero or one HTTP connectors.
dbms.connector.http.enabled=true
#dbms.connector.http.listen_address=:7474

# HTTPS Connector. There can be zero or one HTTPS connectors.
dbms.connector.https.enabled=true
#dbms.connector.https.listen_address=:7473

# Number of Neo4j worker threads.
#dbms.threads.worker_count=

#*****************************************************************
# SSL system configuration
#*****************************************************************

# Names of the SSL policies to be used for the respective components.

# The legacy policy is a special policy which is not defined in
# the policy configuration section, but rather derives from
# dbms.directories.certificates and associated files
# (by default: neo4j.key and neo4j.cert). Its use will be deprecated.

# The policies to be used for connectors.
#
# N.B: Note that a connector must be configured to support/require
#      SSL/TLS for the policy to actually be utilized.
#
# see: dbms.connector.*.tls_level

#bolt.ssl_policy=legacy
#https.ssl_policy=legacy

#*****************************************************************
# SSL policy configuration
#*****************************************************************

# Each policy is configured under a separate namespace, e.g.
#    dbms.ssl.policy.<policyname>.*
#
# The example settings below are for a new policy named 'default'.

# The base directory for cryptographic objects. Each policy will by
# default look for its associated objects (keys, certificates, ...)
# under the base directory.
#
# Every such setting can be overridden using a full path to
# the respective object, but every policy will by default look
# for cryptographic objects in its base location.
#
# Mandatory setting

#dbms.ssl.policy.default.base_directory=certificates/default

# Allows the generation of a fresh private key and a self-signed
# certificate if none are found in the expected locations. It is
# recommended to turn this off again after keys have been generated.
#
# Keys should in general be generated and distributed offline
# by a trusted certificate authority (CA) and not by utilizing
# this mode.

#dbms.ssl.policy.default.allow_key_generation=false

# Enabling this makes it so that this policy ignores the contents
# of the trusted_dir and simply resorts to trusting everything.
#
# Use of this mode is discouraged. It would offer encryption but no security.

#dbms.ssl.policy.default.trust_all=false

# The private key for the default SSL policy. By default a file
# named private.key is expected under the base directory of the policy.
# It is mandatory that a key can be found or generated.

#dbms.ssl.policy.default.private_key=

# The private key for the default SSL policy. By default a file
# named public.crt is expected under the base directory of the policy.
# It is mandatory that a certificate can be found or generated.

#dbms.ssl.policy.default.public_certificate=

# The certificates of trusted parties. By default a directory named
# 'trusted' is expected under the base directory of the policy. It is
# mandatory to create the directory so that it exists, because it cannot
# be auto-created (for security purposes).
#
# To enforce client authentication client_auth must be set to 'require'!

#dbms.ssl.policy.default.trusted_dir=

# Client authentication setting. Values: none, optional, require
# The default is to require client authentication.
#
# Servers are always authenticated unless explicitly overridden
# using the trust_all setting. In a mutual authentication setup this
# should be kept at the default of require and trusted certificates
# must be installed in the trusted_dir.

#dbms.ssl.policy.default.client_auth=require

# It is possible to verify the hostname that the client uses
# to connect to the remote server. In order for this to work, the server public
# certificate must have a valid CN and/or matching Subject Alternative Names.

# Note that this is irrelevant on host side connections (sockets receiving
# connections).

# To enable hostname verification client side on nodes, set this to true.

#dbms.ssl.policy.default.verify_hostname=false

# A comma-separated list of allowed TLS versions.
# By default only TLSv1.2 is allowed.

#dbms.ssl.policy.default.tls_versions=

# A comma-separated list of allowed ciphers.
# The default ciphers are the defaults of the JVM platform.

#dbms.ssl.policy.default.ciphers=

#*****************************************************************
# Logging configuration
#*****************************************************************

# To enable HTTP logging, uncomment this line
#dbms.logs.http.enabled=true

# Number of HTTP logs to keep.
#dbms.logs.http.rotation.keep_number=5

# Size of each HTTP log that is kept.
#dbms.logs.http.rotation.size=20m

# To enable GC Logging, uncomment this line
#dbms.logs.gc.enabled=true

# GC Logging Options
# see http://docs.oracle.com/cd/E19957-01/819-0084-10/pt_tuningjava.html#wp57013 for more information.
#dbms.logs.gc.options=-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:+PrintTenuringDistribution

# For Java 9 and newer GC Logging Options
# see https://docs.oracle.com/javase/10/tools/java.htm#JSWOR-GUID-BE93ABDC-999C-4CB5-A88B-1994AAAC74D5
#dbms.logs.gc.options=-Xlog:gc*,safepoint,age*=trace

# Number of GC logs to keep.
#dbms.logs.gc.rotation.keep_number=5

# Size of each GC log that is kept.
#dbms.logs.gc.rotation.size=20m

# Log level for the debug log. One of DEBUG, INFO, WARN and ERROR. Be aware that logging at DEBUG level can be very verbose.
#dbms.logs.debug.level=INFO

# Size threshold for rotation of the debug log. If set to zero then no rotation will occur. Accepts a binary suffix "k",
# "m" or "g".
#dbms.logs.debug.rotation.size=20m

# Maximum number of history files for the internal log.
#dbms.logs.debug.rotation.keep_number=7

#*****************************************************************
# Miscellaneous configuration
#*****************************************************************

# Enable this to specify a parser other than the default one.
#cypher.default_language_version=3.0

# Determines if Cypher will allow using file URLs when loading data using
# `LOAD CSV`. Setting this value to `false` will cause Neo4j to fail `LOAD CSV`
# clauses that load data from the file system.
#dbms.security.allow_csv_import_from_file_urls=true


# Value of the Access-Control-Allow-Origin header sent over any HTTP or HTTPS
# connector. This defaults to '*', which allows broadest compatibility. Note
# that any URI provided here limits HTTP/HTTPS access to that URI only.
#dbms.security.http_access_control_allow_origin=*

# Value of the HTTP Strict-Transport-Security (HSTS) response header. This header
# tells browsers that a webpage should only be accessed using HTTPS instead of HTTP.
# It is attached to every HTTPS response. Setting is not set by default so
# 'Strict-Transport-Security' header is not sent. Value is expected to contain
# directives like 'max-age', 'includeSubDomains' and 'preload'.
#dbms.security.http_strict_transport_security=

# Retention policy for transaction logs needed to perform recovery and backups.

# Only allow read operations from this Neo4j instance. This mode still requires
# write access to the directory for lock purposes.
#dbms.read_only=false

# Comma separated list of JAX-RS packages containing JAX-RS resources, one
# package name for each mountpoint. The listed package names will be loaded
# under the mountpoints specified. Uncomment this line to mount the
# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
# neo4j-server-examples under /examples/unmanaged, resulting in a final URL of
# http://localhost:7474/examples/unmanaged/helloworld/{nodeId}
#dbms.unmanaged_extension_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged

# A comma separated list of procedures and user defined functions that are allowed
# full access to the database through unsupported/insecure internal APIs.
#dbms.security.procedures.unrestricted=my.extensions.example,my.procedures.*

# A comma separated list of procedures to be loaded by default.
# Leaving this unconfigured will load all procedures found.
#dbms.security.procedures.whitelist=apoc.coll.*,apoc.load.*

#********************************************************************
# JVM Parameters
#********************************************************************

# G1GC generally strikes a good balance between throughput and tail
# latency, without too much tuning.

# Have common exceptions keep producing stack traces, so they can be
# debugged regardless of how often logs are rotated.

# Make sure that `initmemory` is not only allocated, but committed to
# the process, before starting the database. This reduces memory
# fragmentation, increasing the effectiveness of transparent huge
# pages. It also reduces the possibility of seeing performance drop
# due to heap-growing GC events, where a decrease in available page
# cache leads to an increase in mean IO response time.
# Try reducing the heap memory, if this flag degrades performance.

# Trust that non-static final fields are really final.
# This allows more optimizations and improves overall performance.
# NOTE: Disable this if you use embedded mode, or have extensions or dependencies that may use reflection or
# serialization to change the value of final fields!

# Disable explicit garbage collection, which is occasionally invoked by the JDK itself.

# Remote JMX monitoring, uncomment and adjust the following lines as needed. Absolute paths to jmx.access and
# jmx.password files are required.
# Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
# the shipped configuration contains only a read only role called 'monitor' with password 'Neo4j'.
# For more details, see: http://download.oracle.com/javase/8/docs/technotes/guides/management/agent.html
# On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
# and have permissions set to 0600.
# For details on setting these file permissions on Windows see:
#     http://docs.oracle.com/javase/8/docs/technotes/guides/management/security-windows.html
#dbms.jvm.additional=-Dcom.sun.management.jmxremote.port=3637
#dbms.jvm.additional=-Dcom.sun.management.jmxremote.authenticate=true
#dbms.jvm.additional=-Dcom.sun.management.jmxremote.ssl=false
#dbms.jvm.additional=-Dcom.sun.management.jmxremote.password.file=/absolute/path/to/conf/jmx.password
#dbms.jvm.additional=-Dcom.sun.management.jmxremote.access.file=/absolute/path/to/conf/jmx.access

# Some systems cannot discover host name automatically, and need this line configured:
#dbms.jvm.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME

# Expand Diffie Hellman (DH) key size from default 1024 to 2048 for DH-RSA cipher suites used in server TLS handshakes.
# This is to protect the server from any potential passive eavesdropping.

# This mitigates a DDoS vector.

# This filter prevents deserialization of arbitrary objects via java object serialization, addressing potential vulnerabilities.
# By default this filter whitelists all neo4j classes, as well as classes from the hazelcast library and the java standard library.
# These defaults should only be modified by expert users!
# For more details (including filter syntax) see: https://openjdk.java.net/jeps/290
#dbms.jvm.additional=-Djdk.serialFilter=java.**;org.neo4j.**;com.neo4j.**;com.hazelcast.**;net.sf.ehcache.Element;com.sun.proxy.*;org.openjdk.jmh.**;!*

#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
#  using this configuration file has been installed as a service.
#  Please uninstall the service before modifying this section.  The
#  service can then be reinstalled.

# Name of the service
dbms.windows_service_name=neo4j

#********************************************************************
# Other Neo4j system properties
#********************************************************************

dbms.connector.https.listen_address=0.0.0.0:7473

dbms.connectors.default_listen_address=0.0.0.0

dbms.connector.http.listen_address=0.0.0.0:7474

dbms.connector.bolt.listen_address=0.0.0.0:7687

wrapper.java.additional=-Dneo4j.ext.udc.source=docker
dbms.tx_log.rotation.retention_policy=100M size
dbms.security.procedures.unrestricted=apoc.\*,algo.\*
dbms.security.allow_csv_import_from_file_urls=true
dbms.memory.pagecache.size=4G
dbms.memory.heap.initial.size=5g
dbms.jvm.additional=-Dunsupported.dbms.udc.source=docker
dbms.directories.logs=/logs
apoc.import.file.use_neo4j_config=true
apoc.import.file.enabled=true
apoc.export.file.enabled=true
HOME=/var/lib/neo4j
EDITION=community

2. What I did

I read that I simply need to comment out "dbms.directories.import=import", but even after editing it manually I cannot restart the graph itself.
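Rather than editing neo4j.conf inside the container, the official Neo4j Docker image lets you pass settings as environment variables at docker run time. A minimal sketch (the container name "graph" and image tag are assumptions; verify the generated conf inside the container, since whether an empty value is honored depends on the image's entrypoint):

```shell
# Sketch (official neo4j image assumed): the entrypoint maps env vars of the
# form NEO4J_<setting> onto neo4j.conf at startup (dots become single
# underscores, underscores in the setting name become double underscores).
# Setting dbms.directories.import to an empty value lifts the import-folder
# restriction for LOAD CSV -- only do this if you accept the security impact.
docker run --detach \
    --name graph \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_dbms_directories_import= \
    neo4j:3.5
```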

I tried:

service neo4j restart

It does not work:
[screenshot]

Strangely, when I query the status, I get the following message even though everything is working:
[screenshot]

Let me restart:
[screenshot]
Here is what is in the log file:

Then I looked for a solution:
https://neo4j.com/docs/operations-manual/3.5/installation/linux/tarball/#linux-open-files

Understood: 1024 open files is not enough, hence the recurring error message.
[screenshot]
I cannot get any further here because I cannot get systemctl installed.
[screenshot]
Correspondingly, the following command does not work either.

systemctl restart neo4j

If I follow the official Neo4j manual, even systemd is missing.

It is not there (within the Docker container):
[screenshot]

Questions:

  1. How do I get systemctl installed?
  2. How do I get systemd installed?
    Solution: Run this inside your Docker container:

apt-get update; apt-get install systemd

This must be added to the Docker image.

  3. Do I have the possibility to comment out environment variables in the docker run command?
  4. Is there anything else to consider here? "Import local data from my desktop into the Neo4j EC2 instance without the need to move it first into the import folder in the Docker container."

Best
Martin

When you run LOAD CSV, it can either read from a file hosted at an http(s):// resource, or, with the file:/// syntax, it expects the file to be in the import directory specified by dbms.directories.import; that directory is on the host where Neo4j is running. If Neo4j is running on an EC2 instance, using file:/// requires the CSV file to be on the EC2 instance itself. You cannot use LOAD CSV to read a file that is hosted on your local desktop.
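In practice this means the usual workflow is to copy the CSV up to the EC2 host, into the directory that is volume-mounted as the container's import folder, and then reference it with file:///. A sketch with hypothetical paths, hostname, and credentials:

```shell
# Copy the CSV from the desktop to the directory on the EC2 host that is
# volume-mounted as the container's /var/lib/neo4j/import
# (paths and hostname here are placeholders).
scp ~/Desktop/people.csv ec2-user@my-ec2-host:~/neo4j/container/import/

# Then, on the EC2 host, run the import through cypher-shell inside the
# container (container name "graph" and credentials are assumptions):
docker exec graph cypher-shell -u neo4j -p password \
  "LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
   CREATE (:Person {name: row.name});"
```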

Now as to your setup. What OS are you running on EC2?

If Neo4j is running as a Docker container, how was it installed/started? https://neo4j.com/docs/operations-manual/current/docker/introduction/ provides instructions; for example, it could be as simple as

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    --volume=$HOME/neo4j/logs:/logs \
    neo4j:3.5

though that starts a Neo4j Docker container, and the container will only run as long as the process/terminal that started it does not exit. If you want it to persist, modify the above to include --detach:

docker run \
    --detach \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    --volume=$HOME/neo4j/logs:/logs \
    neo4j:3.5

As to service neo4j restart: this command is not valid when Neo4j is installed as a Docker container.
Running

docker ps

which may report output similar to

$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                      NAMES
50a436974712        neo4j:3.5           "/sbin/tini -g -- /d…"   6 seconds ago       Up 5 seconds        0.0.0.0:7474->7474/tcp, 7473/tcp, 0.0.0.0:7687->7687/tcp   adoring_davinci

will report all the Docker containers running on the EC2 instance. To restart the Docker container, and thus Neo4j, one can run either

docker restart <containerID>

or

docker restart <name>

For example either

docker restart 50a436974712

or

docker restart adoring_davinci
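After a restart it is worth confirming that the container came back up and that Neo4j finished booting. A short sketch (the container name "graph" is an assumption):

```shell
# Confirm the container is running again
docker ps --filter name=graph

# Tail the container log; Neo4j prints a "Started." line once it is ready
docker logs --tail 20 graph
```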

As to

How do I get systemctl installed?
How do I get systemd installed?

I am not sure how either is involved in a Docker install/implementation.

As to: "Is there anything else to consider here?
    Import local data from my desktop into the Neo4j EC2 instance without the need to move it first into the import folder in the Docker container."

As noted in the opening paragraph, the CSV must either be available from an http/https resource or be a file on the same instance where Neo4j is running, in this case EC2.

Was there a specific need to install Neo4j inside a Docker container, as opposed to installing from a simple tar file or as a Debian package? It should work regardless, but Docker just seems to add complexity (related more to the Docker infrastructure itself than to Neo4j).

Understood, i.e. in my scenario (Neo4j on Docker) I cannot import CSV files with LOAD CSV from my local directories. I am only using Docker as an interim step. Eventually (by the end of the year) Neo4j will run on Kubernetes (on AWS). At the moment I just don't have the time to find everything I need.

Anyway... That is my EC2 OS:

This is my docker run command:

docker run -it --rm \
    --name graph \
    -p 7474:7474 -p 7687:7687 \
    -v $HOME/neo4j/container/conf:/tmp/conf \
    -v $HOME/neo4j/container/logs:/tmp/logs \
    -v $HOME/neo4j/container/backup:/var/lib/neo4j/backup \
    -v $HOME/neo4j/container/data:/var/lib/neo4j/data \
    -v $HOME/neo4j/container/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/container/plugins:/var/lib/neo4j/plugins \
    -e NEO4J_dbms_memory_heap_initial__size=6g \
    -e NEO4J_dbms_memory_heap_max__size=6g \
    -e NEO4J_dbms_memory_pagecache_size=6G \
    -e NEO4J_dbms_security_allow__csv__import__from__file__urls=true \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4J_dbms_security_procedures_unrestricted=apoc.\\\*,algo.\\\* \
    -e NEO4J_AUTH=neo4j/password \
    neo4j:3.5.12

I have a Neo4j instance running on Docker, and now I want to adjust something in the config without losing the existing nodes, relationships, etc., as well as the user settings.

Question 1: Is replacing --rm with --detach the only change I have to make (to achieve my goal)?

docker run -it -d \
    --name graph \
    -p 7474:7474 -p 7687:7687 \
    -v $HOME/neo4j/container/conf:/tmp/conf \
    -v $HOME/neo4j/container/logs:/tmp/logs \
    -v $HOME/neo4j/container/backup:/var/lib/neo4j/backup \
    -v $HOME/neo4j/container/data:/var/lib/neo4j/data \
    -v $HOME/neo4j/container/import:/var/lib/neo4j/import \
    -v $HOME/neo4j/container/plugins:/var/lib/neo4j/plugins \
    -e NEO4J_dbms_memory_heap_initial__size=6g \
    -e NEO4J_dbms_memory_heap_max__size=6g \
    -e NEO4J_dbms_memory_pagecache_size=6G \
    -e NEO4J_dbms_security_allow__csv__import__from__file__urls=true \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -e NEO4J_dbms_security_procedures_unrestricted=apoc.\\\*,algo.\\\* \
    -e NEO4J_AUTH=neo4j/password \
    neo4j:3.5.12

Question 2: Do you have any suggestions for improvement?

Question 3: Why Option A instead of B?
Option A
-v=$HOME/neo4j/data:/data \
Option B
-v $HOME/neo4j/data:/var/lib/neo4j/data \

As far as I know, the graph data is stored under /var/lib/neo4j/data and not under /data.
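Whether /data or /var/lib/neo4j/data is the right mount target can be checked from inside the running container. In some versions of the official image the two commonly point at the same place via a symlink, but it is better to verify than to assume (the container name "graph" is taken from the run command above):

```shell
# Inspect both candidate locations; a symlink between them would show up here
docker exec graph ls -l /data /var/lib/neo4j/data

# Locate the actual store directory inside the container
docker exec graph sh -c 'find / -name graph.db 2>/dev/null'
```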

Question 4: Do I understand your statement correctly: you wouldn't recommend changing the dbms.directories.import parameter in the config, because by default you can already load local files (on the EC2 instance) via file:/// or from the web via https://?

Question 5: Can I give users the possibility to upload local files with apoc.load.csv and apoc.load.xls?

I have already set the corresponding config parameters:

    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \

Note: That doesn't seem to work either. Does that mean I cannot upload any files at all in my setup?

Question 3: I see no real reason one is better than the other. But with reference to your comment of

The graph data is stored under /var/lib/neo4j/data and not under /data, as far as I know.

what location does find / -name graph.db, run on the EC2 instance, report for graph.db?

Question 4: correct.
Question 5: It appears you are trying to get apoc.load.csv to load a file at c:/Users/....... So Neo4j is running on an EC2 instance and you want it to read a file on your desktop? If so, this is not supported. For LOAD CSV to work, the file must either be available on the instance where Neo4j is installed or be available from an http/https resource.

Was there any specific reason to go with a Docker implementation of Neo4j?

Initially I started with Neo4j Desktop. We are currently testing (PoC) and wanted to quickly make Neo4j available for a handful of users. In the next step (making it production-ready) I would move the existing work to Kubernetes on AWS, and exactly for that it is necessary to run Neo4j in a container. Is there anything against trying to take advantage of Kubernetes?

No specific issue with running on Docker or Kubernetes; both should work. But in terms of ease of use, both add a level of complexity that a simple untar of the neo4j tar.gz or an apt-get install does not.