Hi,
I want to write python code in order to manage huge volumes of bibliographic data. Specifically i want to write & execute efficiently python code in order to check dataset quality, access huge volumes of bibliographic data in streaming json format and produce csv files from them. Then from the csv files i want to create quickly a Neo4j database that will be accessible through internet. What is more, i want to use for my execution runtimes GPU in order to speed up execution. What infrastructure can implement this kind of work load efficiently and efficiency? This kind of workload must be executed in a fully automated way.
Thanks in advance for your time!
That's a lot of questions:
- GPU - there is currently no special treatment of GPUs
- CSV you can use neo4j-admin import to quickly create databases in bulk from huge CSVs it also supports GZ compression
Python and data import.
The Python driver is not the fastest driver for high volumes, but it should be good enough if you have some patience.
Some people had more success from python using the http API in terms of throughput
they even created a separate library: Announcing neo4j-connector 1.0.0 (python 3.5+)
Otherwise for queries on data quality etc. I suggest the cypher query tuning course to make sure you have always fast queries:
Thank you very much. Your tips are very valuable!