I am reading the ebook "A Comprehensive Guide to Graph Algorithms in Neo4j". Following the book, I downloaded the Yelp data to experiment with some of the algorithms. However, I cannot import the data into my Neo4j server.
I followed the guides on GitHub (linked in the book), but ran into errors. Can anybody here help me, or does anyone have a Yelp graph database that they could upload somewhere?
Thanks for your reply.
There are several different errors. One of them is:
Traceback (most recent call last):
File "json_to_csv.py", line 51, in <module>
with open("dataset/businessLocations.json") as business_locations_json, \
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/businessLocations.json'
After extracting the dataset archive, there is no file named "businessLocations.json". I think this Python code is out of date. There is no such file on the Yelp website either.
Hi @mark.needham, this is the error I'm getting while running the JSON-to-CSV script:
Traceback (most recent call last):
File "json_to_csv.py", line 109, in
for category in item["categories"]:
TypeError: 'NoneType' object is not iterable
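The TypeError above means that `item["categories"]` was `None` for at least one business record. A minimal sketch of a guard follows. It is not the repository's actual code: `iter_categories` is a hypothetical helper, and the handling of a comma-separated string is an assumption about how newer releases of the dataset may store categories.

```python
def iter_categories(item):
    """Yield category names from one business record, tolerating missing values."""
    categories = item.get("categories")
    if categories is None:
        # The record has no categories, so there is nothing to yield.
        return
    if isinstance(categories, str):
        # Assumption: categories may arrive as one string, e.g. "Mexican, Burgers".
        categories = categories.split(",")
    for category in categories:
        name = category.strip()
        if name:
            yield name


if __name__ == "__main__":
    print(list(iter_categories({"categories": None})))
    print(list(iter_categories({"categories": "Mexican, Burgers"})))
    print(list(iter_categories({"categories": ["Bars", "Nightlife"]})))
```

In the script, the failing loop could then read `for category in iter_categories(item):`, assuming `item` is the parsed JSON object for one business.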
I tried installing the libraries from requirements.txt and got errors for pkg-resources and PyYAML:
Could not find a version that satisfies the requirement pkg-resources==0.0.0 (from -r requirements.txt (line 10)) (from versions: )
No matching distribution found for pkg-resources==0.0.0 (from -r requirements.txt (line 10))
and Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
Hi @mark.needham, thanks for your response! I am working with @shivanandiyer on a project to analyse the Yelp dataset with Neo4j. I initially tried using the following APOC procedure to load the JSON, but it ran for more than a day and I had to abort it:
CALL apoc.load.json("path") YIELD value AS user
MERGE (u:User {user_id: user.user_id})
SET u.name = user.name,
u.review_count = user.review_count,
u.average_stars = user.average_stars,
u.fans = user.fans
I later came across your GitHub repository to convert JSON to CSV and directly import the data file with relations (GitHub - mneedham/yelp-graph-algorithms). I followed the steps and got this error while running the import.sh script file:
Available resources:
Total machine memory: 16.00 GB
Free machine memory: 2.81 GB
Max heap memory : 3.56 GB
Processors: 8
Configured max memory: 11.20 GB
High-IO: true
Import starting 2019-05-11 00:24:07.432+1000
Estimated number of nodes: 2.83 M
Estimated number of node properties: 8.24 M
Estimated number of relationships: 1.82 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 58.61 GB
Estimated required memory usage: 1.03 GB
InteractiveReporterInteractions command list (end with ENTER):
c: Print more detailed information about current stage
i: Print more detailed information
(1/4) Node import 2019-05-11 00:24:07.487+1000
Estimated number of nodes: 2.83 M
Estimated disk space usage: 967.71 MB
Estimated required memory usage: 1.03 GB
.......... .......... .......... .......... .......... 5% ∆1s 822ms
.......... .......... .......... .......... .......... 10% ∆403ms
.......... .......... .......... .......... .......... 15% ∆403ms
.......... .......... .......... .......... .......... 20% ∆458ms
.......... .......... .......... .......... .......... 25% ∆1s 407ms
.......... .......... .......... .......... .......... 30% ∆1s 804ms
.......... .......... .......... ........-. .......... 35% ∆236ms
.......... .......... .......... .......... .......... 40% ∆0ms
.......... .......... .......... .......... .......... 45% ∆0ms
.......... .......... .......... .......... .......... 50% ∆605ms
.......... .......... .......... .......... .......... 55% ∆0ms
.......... .......... .......... .......... .......... 60% ∆202ms
.......... .......... .......... .......... .......... 65% ∆202ms
.......... .......... .......... .......... .......... 70% ∆0ms
.......... .......... .......... .......... .......... 75% ∆2s 410ms
.......... .......... .......... .......... .......... 80% ∆0ms
.......... .......... .......... .......... .......... 85% ∆1ms
.......... .......... .......... .......... .......... 90% ∆0ms
.......... .......... .......... .......... .......... 95% ∆0ms
.......... .......... .......... .......... .........Exception in thread "Thread-50" java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:155)
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:147)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:59)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
at org.neo4j.unsafe.impl.batchimport.input.BadCollector$NodesProblemReporter.exception(BadCollector.java:278)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collect(BadCollector.java:168)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:135)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:606)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:522)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:239)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
... 1 more
IMPORT FAILED in 13s 620ms.
Data statistics is not available.
Peak memory usage: 1.02 GB
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: https://neo4j.com/docs/operations-manual/3.5/tools/import/file-header-format/#import-tool-id-spaces
Caused by:Id '#NAME?' is defined more than once in group 'Review'
WARNING Import failed. The store files in /Users/abishekarunachalam/Downloads/NEO4J_HOME/data/databases/yelp.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
unexpected error: Id '#NAME?' is defined more than once in group 'Review'
I checked the review.csv file and found '#NAME?' repeated multiple times in the first column (screenshot attached).
Considering we are beginners, any guidance on what could have gone wrong, or on any other way to import the Yelp data into Neo4j efficiently, would be much appreciated. Thank you!
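One possible explanation, offered as an assumption rather than a diagnosis: '#NAME?' is the error text that spreadsheet applications display for a cell they fail to evaluate as a formula, which can happen to IDs that begin with '-' or '='. If review.csv was opened and re-saved in a spreadsheet, the original IDs may have been overwritten. The sketch below checks the raw file for repeated IDs without going through a spreadsheet. `duplicate_ids` is a hypothetical helper, and it assumes the ID sits in the first column under a header row.

```python
import csv
from collections import Counter


def duplicate_ids(csv_path, id_column=0, has_header=True):
    """Return {id: count} for every value that occurs more than once in id_column."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if row:  # ignore completely empty lines
                counts[row[id_column]] += 1
    return {value: n for value, n in counts.items() if n > 1}
```

Calling `duplicate_ids("review.csv")` returns an empty dict when every ID is unique. If '#NAME?' shows up as a key, the file on disk really contains that text, and regenerating the CSV from the original JSON would be the safer route.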
We managed to sort some of those issues out for now by loading the data using Cypher instead of Python.
How long did it take you to load the complete Yelp dataset? Loading business.json took me around 7 hours with the heap size configured to 12G and the page cache size to 6GB. I'm running Neo4j Desktop on my laptop (4 cores, 32GB).
This is what I ran. I'm wondering whether setting a batch size and parallel = true would have made a difference.
CALL apoc.load.json('file:///business.json')
YIELD value
WITH value
MERGE (b:Business {id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'], [])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{name:category})
MERGE (b)-[:IN_CATEGORY]->(c);
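On the batching question, here is a rough sketch of what sending the rows in batches could look like from Python. It is an illustration under assumptions, not a measured fix: `batched`, `read_rows`, and `load_businesses` are hypothetical helpers, `session` is assumed to be an object exposing `run(query, **params)` in the style of the Neo4j Python driver, and only `name` is copied as a property to keep the example short. A uniqueness constraint or index on the MERGE keys (`Business.id`, `Category.name`) is usually worth creating first, since MERGE otherwise has to scan every node with that label for each row.

```python
import json
from itertools import islice

# Cypher executed once per batch; labels and keys mirror the query in the post above.
BATCH_QUERY = """
UNWIND $rows AS row
MERGE (b:Business {id: row.business_id})
SET b.name = row.name
WITH b, row
UNWIND row.categories AS category
MERGE (c:Category {name: category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def read_rows(lines):
    """Parse one JSON object per line and keep only the fields the query uses."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        value = json.loads(line)
        categories = value.get("categories") or []
        if isinstance(categories, str):
            # Assumption: categories may be a comma-separated string.
            categories = [c.strip() for c in categories.split(",") if c.strip()]
        yield {
            "business_id": value["business_id"],
            "name": value.get("name"),
            "categories": categories,
        }


def load_businesses(session, lines, batch_size=1000):
    """Send rows in batches through `session` and return the number of rows sent."""
    total = 0
    for rows in batched(read_rows(lines), batch_size):
        session.run(BATCH_QUERY, rows=rows)
        total += len(rows)
    return total
```

With an open file handle this would be called as `load_businesses(session, handle)`. Each call to `session.run` then carries up to `batch_size` businesses instead of one, which is the same idea as the batch size setting mentioned above.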
Hey, I just came upon this thread and I'm also trying to import the Yelp data. How long did the import end up taking for you? So far everything appears to be working for me using just the APOC commands here: The Neo4j Graph Data Science Library Manual v2.5 - Neo4j Graph Data Science, but it is taking a while.