Import Yelp dataset

(Harvey Nguyen) #1

Hi everybody,

I am reading the ebook "A Comprehensive Guide to Graph Algorithms in Neo4j". Following the book, I downloaded the Yelp data to experiment with some of the algorithms. However, I cannot import the data into my Neo4j server.
I followed the guide on GitHub (linked in the book), but ran into errors. Can anybody here help me, or does anyone have a Yelp graph database that they could upload somewhere?

Thanks for your help,
Harvey Nguyen

(William Lyon) #2

Can you share the error messages?

There is also a Cypher based import script for the same data here: https://neo4j.com/docs/graph-algorithms/current/yelp-example/#yelp-import

(DKumar) #3

Please share the error message.

(Harvey Nguyen) #4

Thanks for your reply.
There are several different errors; one of them is:

Traceback (most recent call last):
  File "json_to_csv.py", line 51, in <module>
    with open("dataset/businessLocations.json") as business_locations_json, \
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/businessLocations.json'

After extracting the dataset archive, there is no file named "businessLocations.json". I think this Python code is out of date. There is no such file on the Yelp website either.

I will try with your suggestion.
Thanks a lot.

(Mark Needham) #5

Hey,

That file is generated by running this command:

python lat_long_expansion.py

Or did you try that already and it didn't work?

(Harvey Nguyen) #6

@mark.needham really?
I followed the instructions on GitHub.

After extracting, I ran

python json_to_csv.py

So we need to change the order?

(Mark Needham) #7

Yeah, I must have put those instructions in the wrong order.
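
The corrected order can be sketched as a small pre-flight check. This is only an illustration and not part of the repository: `steps_to_run` is a hypothetical helper, and it assumes the scripts are run from the repository root with the data under `dataset/`, as in the traceback earlier in this thread.

```python
import os

# Hypothetical helper, not part of the yelp-graph-algorithms repository.
# Path and script names are taken from the messages in this thread.
GENERATED_FILE = os.path.join("dataset", "businessLocations.json")


def steps_to_run(base_dir="."):
    """Return the commands to run, in order, from base_dir."""
    steps = []
    # json_to_csv.py opens businessLocations.json, which only exists
    # after lat_long_expansion.py has been run.
    if not os.path.exists(os.path.join(base_dir, GENERATED_FILE)):
        steps.append("python lat_long_expansion.py")
    steps.append("python json_to_csv.py")
    return steps


if __name__ == "__main__":
    for command in steps_to_run():
        print(command)
```

The function only reports what to run; it does not launch the scripts itself, so it is safe to call repeatedly.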

(Shivanandiyer) #9

Hi @mark.needham, I tried the Python scripts as well as loading the JSON files using Cypher, and it gets stuck loading user.json. I'm loading it in Neo4j Desktop.

(Mark Needham) #10

Is there an error message / more info that you can share?

(Shivanandiyer) #11

Hi @mark.needham, this is the error I'm getting while running json_to_csv.py:

Traceback (most recent call last):
  File "json_to_csv.py", line 109, in <module>
    for category in item["categories"]:
TypeError: 'NoneType' object is not iterable

I also tried installing the libraries from requirements.txt and got errors for pkg-resources and PyYAML:

Could not find a version that satisfies the requirement pkg-resources==0.0.0 (from -r requirements.txt (line 10)) (from versions: )
No matching distribution found for pkg-resources==0.0.0 (from -r requirements.txt (line 10))

and Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
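
The `TypeError` above suggests that some business records have `"categories": null`. Below is a minimal sketch of a guard. It assumes (this is not confirmed in the thread) that `categories` may be `null`, a list, or a comma-separated string depending on the dataset version. `parse_categories` is a hypothetical helper, not a function from `json_to_csv.py`.

```python
import json


def parse_categories(item):
    """Return the categories of one business record as a list of strings."""
    raw = item.get("categories")
    if raw is None:
        # Records without categories would otherwise raise
        # TypeError: 'NoneType' object is not iterable
        return []
    if isinstance(raw, str):
        # Assumed string format: "Golf, Active Life"
        return [part.strip() for part in raw.split(",") if part.strip()]
    return list(raw)


if __name__ == "__main__":
    line = '{"business_id": "abc", "categories": null}'
    print(parse_categories(json.loads(line)))
```

In the script, the loop would then iterate over `parse_categories(item)` instead of `item["categories"]`.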

(Abishek) #12

Hi @mark.needham, thanks for your response! I am working with @shivanandiyer on a project to analyse the Yelp dataset with Neo4j. I initially tried the following APOC procedure call to load the JSON; it ran for more than a day and I had to abort it:

CALL apoc.load.json("path") YIELD value AS user
MERGE (u:User {user_id: user.user_id})
SET u.name = user.name,
u.review_count = user.review_count,
u.average_stars = user.average_stars,
u.fans = user.fans

I later came across your GitHub repository to convert JSON to CSV and directly import the data file with relations (https://github.com/mneedham/yelp-graph-algorithms). I followed the steps and got this error while running the import.sh script file:

Available resources:
  Total machine memory: 16.00 GB
  Free machine memory: 2.81 GB
  Max heap memory : 3.56 GB
  Processors: 8
  Configured max memory: 11.20 GB
  High-IO: true

Import starting 2019-05-11 00:24:07.432+1000
  Estimated number of nodes: 2.83 M
  Estimated number of node properties: 8.24 M
  Estimated number of relationships: 1.82 G
  Estimated number of relationship properties: 0.00
  Estimated disk space usage: 58.61 GB
  Estimated required memory usage: 1.03 GB

InteractiveReporterInteractions command list (end with ENTER):
  c: Print more detailed information about current stage
  i: Print more detailed information

(1/4) Node import 2019-05-11 00:24:07.487+1000
  Estimated number of nodes: 2.83 M
  Estimated disk space usage: 967.71 MB
  Estimated required memory usage: 1.03 GB
.......... .......... .......... .......... ..........   5% ∆1s 822ms
.......... .......... .......... .......... ..........  10% ∆403ms
.......... .......... .......... .......... ..........  15% ∆403ms
.......... .......... .......... .......... ..........  20% ∆458ms
.......... .......... .......... .......... ..........  25% ∆1s 407ms
.......... .......... .......... .......... ..........  30% ∆1s 804ms
.......... .......... .......... ........-. ..........  35% ∆236ms
.......... .......... .......... .......... ..........  40% ∆0ms
.......... .......... .......... .......... ..........  45% ∆0ms
.......... .......... .......... .......... ..........  50% ∆605ms
.......... .......... .......... .......... ..........  55% ∆0ms
.......... .......... .......... .......... ..........  60% ∆202ms
.......... .......... .......... .......... ..........  65% ∆202ms
.......... .......... .......... .......... ..........  70% ∆0ms
.......... .......... .......... .......... ..........  75% ∆2s 410ms
.......... .......... .......... .......... ..........  80% ∆0ms
.......... .......... .......... .......... ..........  85% ∆1ms
.......... .......... .......... .......... ..........  90% ∆0ms
.......... .......... .......... .......... ..........  95% ∆0ms
.......... .......... .......... .......... .........Exception in thread "Thread-50" java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:155)
	at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:147)
	at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:59)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector$NodesProblemReporter.exception(BadCollector.java:278)
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collect(BadCollector.java:168)
	at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:135)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:606)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:522)
	at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:239)
	at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
	at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
	... 1 more

IMPORT FAILED in 13s 620ms.
Data statistics is not available.
Peak memory usage: 1.02 GB
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: https://neo4j.com/docs/operations-manual/3.5/tools/import/file-header-format/#import-tool-id-spaces
Caused by:Id '#NAME?' is defined more than once in group 'Review'

WARNING Import failed. The store files in /Users/abishekarunachalam/Downloads/NEO4J_HOME/data/databases/yelp.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
unexpected error: Id '#NAME?' is defined more than once in group 'Review'

I checked the review.csv file and found '#NAME?' repeated multiple times in the first column.

Considering we are beginners, any guidance on what could have gone wrong, or on any other way to import the Yelp data into Neo4j efficiently, would be much appreciated. Thank you!
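
One way to see how widespread the problem is before re-running the import is to count repeated IDs in the first column of the CSV. The sketch below is only illustrative: `duplicate_ids` is a hypothetical helper, and it assumes that the ID is in the first column and that the first row is a header. `#NAME?` is the text a spreadsheet such as Excel writes for a formula it cannot evaluate, so one possible cause (an assumption, not confirmed in this thread) is that the file was opened and saved in a spreadsheet, which can mangle IDs beginning with `-` or `=`.

```python
import csv
import io
from collections import Counter


def duplicate_ids(rows, id_column=0, has_header=True):
    """Return {id: count} for every id that occurs more than once."""
    rows = iter(rows)
    if has_header:
        next(rows, None)  # skip the header row
    counts = Counter(row[id_column] for row in rows if row)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Tiny made-up stand-in for review.csv. For a real file, use:
    #   with open("review.csv", newline="") as f:
    #       print(duplicate_ids(csv.reader(f)))
    sample = io.StringIO("id,stars\nr1,5\n#NAME?,4\n#NAME?,3\n")
    print(duplicate_ids(csv.reader(sample)))
```

If `#NAME?` turns up, regenerating the CSV from the original JSON without opening it in a spreadsheet is probably safer than repairing the values by hand.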

(Shivanandiyer) #13

Hi @mark.needham ,

We managed to sort out some of those issues for now by loading the data using Cypher instead of Python.
How long did it take you to load the complete Yelp dataset? Loading business.json took me around 7 hours with the heap size set to 12 GB and the page cache size set to 6 GB. I'm running Neo4j Desktop on my laptop (4 cores, 32 GB).

This is what I ran. I'm wondering whether setting a batch size and parallel = true would have made a difference.
CALL apoc.load.json('file:///business.json')
YIELD value
WITH value
MERGE (b:Business {id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'], [])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{name:category})
MERGE (b)-[:IN_CATEGORY]->(c);
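
On the batch-size question: a common pattern is to send rows in fixed-size batches, run each batch as one `UNWIND` query, and create a uniqueness constraint on the merge key first so that `MERGE` can use an index lookup. The sketch below only illustrates the batching; it is not a measured fix for the 7-hour load. `import_businesses` and `run_query` are hypothetical names: `run_query` stands for any callable taking a Cypher string and a parameter dict (for example, a thin wrapper around a driver session), and the input is assumed to hold one JSON object per line.

```python
import json
from itertools import islice

# Neo4j 3.5 syntax; run once before importing.
# Assumes the same Business.id property used in the query below.
CONSTRAINT = "CREATE CONSTRAINT ON (b:Business) ASSERT b.id IS UNIQUE"

MERGE_BATCH = """
UNWIND $rows AS row
MERGE (b:Business {id: row.business_id})
SET b.name = row.name
"""


def batches(iterable, size):
    """Yield lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_businesses(run_query, lines, batch_size=10000):
    """Send one UNWIND query per batch; return the number of rows sent."""
    total = 0
    records = (json.loads(line) for line in lines if line.strip())
    for chunk in batches(records, batch_size):
        run_query(MERGE_BATCH, {"rows": chunk})
        total += len(chunk)
    return total
```

Parallel execution is left out of the sketch on purpose: batches that merge the same Category nodes can contend for locks, so whether it helps would need to be tested on the actual data.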