Import Yelp dataset

harvey_nguyen · February 14, 2019, 4:24am

Hi everybody,

I am reading the ebook "A Comprehensive Guide to Graph Algorithms in Neo4j". Following the book, I downloaded the Yelp data to experiment with some of the algorithms. However, I cannot import the data into my Neo4j server.
I followed the guide on GitHub (linked in the book), but I ran into errors. Can anybody here help me, or does anyone have a Yelp graph database that they could upload somewhere?

Thanks for your help,
Harvey Nguyen

lyonwj · February 14, 2019, 5:57am

Can you share the error messages?

There is also a Cypher based import script for the same data here: The Neo4j Graph Data Science Library Manual v2.2 - Neo4j Graph Data Science

dominicvivek06 · February 14, 2019, 6:10am

Please share the error message.

harvey_nguyen · February 14, 2019, 6:14am

Thanks for your reply.
There are several different errors. One of them is:

Traceback (most recent call last):
  File "json_to_csv.py", line 51, in <module>
    with open("dataset/businessLocations.json") as business_locations_json, \
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/businessLocations.json'

After extracting the dataset archive, there is no file named "businessLocations.json". I think this Python code is out of date. There is no such file on the Yelp website either.

I will try with your suggestion.
Thanks a lot.

mark.needham · February 14, 2019, 9:34am

Hey,

That file is generated by running this command:

python lat_long_expansion.py

Or did you try that already and it didn't work?

harvey_nguyen · February 14, 2019, 9:41am

@mark.needham really?
I followed the instructions on GitHub.

After extracting, I ran

python json_to_csv.py

So we need to change the order?

mark.needham · February 14, 2019, 10:06am

Yeh I must have those instructions in the wrong order
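
Put together, the order that works is to run lat_long_expansion.py first, so that dataset/businessLocations.json exists before json_to_csv.py tries to open it. A minimal Python sketch of a pre-flight check follows. The helper name missing_inputs is hypothetical and not part of the repository, and the only file name it assumes is the one from the traceback above.

```python
from pathlib import Path


def missing_inputs(dataset_dir, required_names):
    """Return the required file names that are absent from dataset_dir."""
    base = Path(dataset_dir)
    return [name for name in required_names if not (base / name).is_file()]


if __name__ == "__main__":
    # businessLocations.json is produced by lat_long_expansion.py,
    # so it must exist before json_to_csv.py is run.
    missing = missing_inputs("dataset", ["businessLocations.json"])
    if missing:
        print("Run `python lat_long_expansion.py` first; missing:", missing)
    else:
        print("Inputs present; `python json_to_csv.py` can be run next.")
```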

shivanandiyer · May 9, 2019, 10:32pm

Hi @mark.needham, I tried both the Python scripts and loading the JSON files with Cypher, and it gets stuck loading user.json. I'm loading it in Neo4j Desktop.

mark.needham · May 10, 2019, 7:16am

Is there an error message / more info that you can share?

shivanandiyer · May 10, 2019, 2:23pm

Hi @mark.needham, this is the error I'm getting while running json_to_csv.py:

Traceback (most recent call last):
  File "json_to_csv.py", line 109, in <module>
    for category in item["categories"]:
TypeError: 'NoneType' object is not iterable

Tried installing the libraries from requirements.txt and had errors with pkg-resources and PyYAML.

Could not find a version that satisfies the requirement pkg-resources==0.0.0 (from -r requirements.txt (line 10)) (from versions: )
No matching distribution found for pkg-resources==0.0.0 (from -r requirements.txt (line 10))

and Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
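
The pkg-resources==0.0.0 line is a known artifact that pip freeze emits on some Debian/Ubuntu systems. It does not correspond to an installable package, so dropping it from requirements.txt is usually safe. A minimal sketch, assuming the file is a plain list of name==version pins (the function name is hypothetical, and the versions in the sample are illustrative only):

```python
def drop_pkg_resources(requirements_text):
    """Remove the spurious 'pkg-resources==0.0.0' pin from requirements text."""
    kept = [
        line
        for line in requirements_text.splitlines()
        if line.strip().lower().replace("_", "-") != "pkg-resources==0.0.0"
    ]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Illustrative sample, not the repository's actual requirements.txt.
    sample = "PyYAML==3.13\npkg-resources==0.0.0\nrequests==2.21.0\n"
    print(drop_pkg_resources(sample), end="")
```

For the PyYAML error, pip's --ignore-installed option (for example, pip install --ignore-installed PyYAML) is a commonly used workaround, though whether it is appropriate depends on how PyYAML was installed on the system.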

Abishek · May 10, 2019, 2:31pm

Hi @mark.needham, Thanks for your response! I am working with @shivanandiyer on a project to analyse the Yelp dataset with Neo4j. I initially tried using the following apoc procedure to load the json and it took more than a day and I had to abort:

CALL apoc.load.json("path") YIELD value AS user
MERGE (u:User {user_id: user.user_id})
SET u.name = user.name,
u.review_count = user.review_count,
u.average_stars = user.average_stars,
u.fans = user.fans

I later came across your GitHub repository to convert JSON to CSV and directly import the data file with relations (GitHub - mneedham/yelp-graph-algorithms). I followed the steps and got this error while running the import.sh script file:

Available resources:
  Total machine memory: 16.00 GB
  Free machine memory: 2.81 GB
  Max heap memory : 3.56 GB
  Processors: 8
  Configured max memory: 11.20 GB
  High-IO: true

Import starting 2019-05-11 00:24:07.432+1000
  Estimated number of nodes: 2.83 M
  Estimated number of node properties: 8.24 M
  Estimated number of relationships: 1.82 G
  Estimated number of relationship properties: 0.00
  Estimated disk space usage: 58.61 GB
  Estimated required memory usage: 1.03 GB

InteractiveReporterInteractions command list (end with ENTER):
  c: Print more detailed information about current stage
  i: Print more detailed information

(1/4) Node import 2019-05-11 00:24:07.487+1000
  Estimated number of nodes: 2.83 M
  Estimated disk space usage: 967.71 MB
  Estimated required memory usage: 1.03 GB
.......... .......... .......... .......... ..........   5% ∆1s 822ms
.......... .......... .......... .......... ..........  10% ∆403ms
.......... .......... .......... .......... ..........  15% ∆403ms
.......... .......... .......... .......... ..........  20% ∆458ms
.......... .......... .......... .......... ..........  25% ∆1s 407ms
.......... .......... .......... .......... ..........  30% ∆1s 804ms
.......... .......... .......... ........-. ..........  35% ∆236ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... ..........  45% ∆0ms
.......... .......... .......... .......... ..........  50% ∆605ms
.......... .......... .......... .......... ..........  55% ∆0ms
.......... .......... .......... .......... ..........  60% ∆202ms
.......... .......... .......... .......... ..........  65% ∆202ms
.......... .......... .......... .......... ..........  70% ∆0ms
.......... .......... .......... .......... ..........  75% ∆2s 410ms
.......... .......... .......... .......... ..........  80% ∆0ms
.......... .......... .......... .......... ..........  85% ∆1ms
.......... .......... .......... .......... ..........  90% ∆0ms
.......... .......... .......... .......... ..........  95% ∆0ms
.......... .......... .......... .......... .........Exception in thread "Thread-50" java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:155)
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:147)
	at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:59)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector$NodesProblemReporter.exception(BadCollector.java:278)
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collect(BadCollector.java:168)
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:135)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:606)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:522)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:239)
	at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
	at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
	... 1 more

IMPORT FAILED in 13s 620ms.
Data statistics is not available.
Peak memory usage: 1.02 GB
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: https://neo4j.com/docs/operations-manual/3.5/tools/import/file-header-format/#import-tool-id-spaces
Caused by:Id '#NAME?' is defined more than once in group 'Review'

WARNING Import failed. The store files in /Users/abishekarunachalam/Downloads/NEO4J_HOME/data/databases/yelp.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
unexpected error: Id '#NAME?' is defined more than once in group 'Review'

I checked the review.csv file and found '#NAME?' repeated multiple times in the first column.

Considering we are beginners, any guidance on what could have gone wrong, or on any other way to import the Yelp data into Neo4j efficiently, would be much appreciated. Thank you!
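
#NAME? is the error value a spreadsheet such as Excel shows when it tries to evaluate a cell as a formula and fails. One plausible cause, stated here as an assumption rather than something the thread confirms, is that review.csv was opened and re-saved in a spreadsheet, so review IDs beginning with - or = were overwritten. A minimal Python sketch for counting the affected rows, assuming the ID is the first column (the function name and the sample rows are made up for illustration):

```python
import csv
import io
from collections import Counter


def duplicate_ids(rows, id_column=0):
    """Count IDs that occur more than once in an iterable of CSV rows."""
    counts = Counter(row[id_column] for row in rows if row)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # In-memory stand-in for review.csv; the column names are illustrative.
    sample = io.StringIO("review_id,stars\nabc,5\n#NAME?,4\n#NAME?,1\n")
    reader = csv.reader(sample)
    next(reader, None)  # skip the header row
    print(duplicate_ids(reader))
```

For the real file, the same function can be fed from open(path, newline="", encoding="utf-8"). If it reports '#NAME?', regenerating review.csv from the original JSON without opening it in a spreadsheet is probably the simplest fix.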

shivanandiyer · May 15, 2019, 2:55am

Hi @mark.needham ,

We managed to sort some of those issues out for now by loading the data with Cypher instead of Python.
How long did it take you to load the complete Yelp dataset? Loading business.json took me around 7 hours with the heap size set to 12 GB and the page cache size set to 6 GB. I'm running Neo4j Desktop on my laptop (4 cores, 32 GB RAM).

This is what I ran. Wondering if setting the batch size and parallel = true would have made some difference.
CALL apoc.load.json('file:///business.json')
YIELD value
WITH value
MERGE (b:Business {id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'], [])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{name:category})
MERGE (b)-[:IN_CATEGORY]->(c);
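
On the batch-size question: committing in batches rather than in one large transaction is the usual way to keep memory use bounded on large loads. The sketch below only shows the client-side half of that idea in Python, assuming the Yelp files contain one JSON object per line. The function name and batch size are illustrative, and how each batch is sent to Neo4j is deliberately left out.

```python
import json
from itertools import islice


def json_line_batches(lines, batch_size=1000):
    """Yield lists of parsed JSON objects, at most batch_size per list."""
    # Lazily parse non-blank lines so the whole file is never held in memory.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    sample = ['{"business_id": "a"}', '{"business_id": "b"}', '{"business_id": "c"}']
    for batch in json_line_batches(sample, batch_size=2):
        print(len(batch), [record["business_id"] for record in batch])
```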

wisundstrom · October 7, 2019, 6:42pm

Hey, I just came upon this thread and I'm also trying to import the Yelp data. How long did the import end up taking you? So far everything appears to be working for me just using the APOC commands here: The Neo4j Graph Data Science Library Manual v2.5 - Neo4j Graph Data Science, but it is taking a while.

aparna.b1207 · November 3, 2020, 12:38am

Hi @mark.needham. The following edits were required for me to run the Python files successfully:

Enclose the bulk of the code in 'lat_long_expansion.py' inside a function and call it under "if __name__ == '__main__': functionname()"
Add 'encoding="utf-8"' to all open(filename) statements
Add a guard such as "if item['categories'] is not None:" before iterating over item['categories']
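
The three edits above can be sketched together as follows. This is an illustration, not the repository's code: load_items and categories_of are hypothetical names, and the handling of comma-separated strings is an assumption about how categories may appear in some versions of the dataset.

```python
import json


def load_items(path):
    """Read one JSON object per line, decoding the file as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def categories_of(item):
    """Return the item's categories as a list, treating None as empty."""
    categories = item.get("categories")
    if categories is None:
        return []
    if isinstance(categories, str):
        # Assumption: some dataset versions store a comma-separated string.
        return [name.strip() for name in categories.split(",") if name.strip()]
    return list(categories)


def main():
    # Made-up records standing in for lines of business.json.
    sample = [{"categories": None}, {"categories": ["Food", "Bars"]}]
    for item in sample:
        print(categories_of(item))


if __name__ == "__main__":
    main()
```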

If there are any more updates, I will add them to this list.
