Hello everyone! I am a new member of the NEO4J community. I am just getting started with graph databases and I decided to explore NEO4J, the leader in the market. I have a few doubts that I'd like to discuss with the community here and hopefully receive some helpful responses.
The scenario is this: I have an EC2 instance where I am running my Python code to generate an adjacency list that has 50,000 nodes and 100 million edges. I have this data sitting in a dataframe. I would like to store this data in a graph database (NEO4J). For that, I need to configure NEO4J on an EC2 instance such that I can access it from my current EC2 instance via the Python driver. My end goal is to have my graph database up and running on one EC2 instance and have my other EC2 instance create nodes and relationships via the Python driver, and ultimately to analyze the graph to answer questions like: what is the single-source shortest path from node 1, what are the max cliques, etc.
Can anyone kindly help me out with exactly what steps I need to take and which edition of NEO4J would suit my needs?
I would start by looking at the community edition to see if that suits your needs (it avoids the need for licensing). Neo4j 4.0.0 was released just the other day, so you could opt for that or just stick with 3.5 (whatever the latest version of that is). Personally, I'd say just start with 4.0 so you don't have to upgrade to it later.
Installing neo4j should be fairly simple, depending on what EC2 you go with. Are you currently using Ubuntu or some other one?
Depending on how much RAM and CPU you are using for your Python work, you could just add the neo4j service to your current EC2 instance. Otherwise, connecting to the database will involve communication between your two instances, with appropriate security of course! I would say go step by step: if you can, start with neo4j alongside the Python code to cut out the networking between EC2 instances.
Hopefully that gives you a start? Just reply back with any questions!
Everything you describe sounds very reasonable and Neo4j should be able to fit your needs.
The community edition is where I'd also recommend starting. Community has a limit of 34 billion nodes, so your graph will easily fit. If you're really curious to compare editions, here's the comparison.
If you haven't already found the documentation for the Python driver, here is that documentation. Because you mentioned a dataframe: if you're running not just Python but Apache Spark, there are libraries for working with Neo4j + Apache Spark. Also, because you mentioned that you're in AWS, I've been able to use AWS Glue (which we all know is Spark under the covers) to connect to Neo4j.
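To make the Python driver route a bit more concrete, here is a minimal sketch of pushing an edge list into Neo4j in batches with UNWIND. It is not taken from this thread: the label `Node`, the property `id`, the relationship type `CONNECTED`, and the helper names `batches` and `load_edges` are all made up for illustration, and it assumes the official `neo4j` Python package is installed. With 100 million edges you would also want an index or uniqueness constraint on the `id` property so that MERGE stays fast.

```python
from itertools import islice

# Cypher that creates both endpoints (if missing) and one relationship per row.
# Label, property, and relationship type are illustrative assumptions.
MERGE_EDGES = """
UNWIND $rows AS row
MERGE (a:Node {id: row.src})
MERGE (b:Node {id: row.dst})
MERGE (a)-[:CONNECTED]->(b)
"""


def batches(edges, size):
    """Yield lists of {'src', 'dst'} dicts, at most `size` per list."""
    it = iter(edges)
    while True:
        chunk = [{"src": s, "dst": d} for s, d in islice(it, size)]
        if not chunk:
            return
        yield chunk


def load_edges(uri, user, password, edges, batch_size=10_000):
    """Send edges to Neo4j, one auto-commit transaction per batch."""
    # Imported here so the batching helper works without the driver installed.
    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for rows in batches(edges, batch_size):
                # consume() waits for the statement to finish before moving on.
                session.run(MERGE_EDGES, rows=rows).consume()
    finally:
        driver.close()
```

Here `edges` can be any iterable of (source, target) pairs of plain Python values. With a dataframe, something like `df[["src", "dst"]].itertuples(index=False, name=None)` would work, where the column names are again hypothetical. The call would then look like `load_edges("bolt://<neo4j-host>:7687", "neo4j", "<password>", edges)`.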
As @jsmccrumb already mentioned, just like with any application running on EC2, you'll need to make sure you deploy your EC2 instances in an appropriate VPC and security groups so the two machines can talk to each other. To separate the networking from the programming, you can use Neo4j Desktop, which is very friendly to use: run a locally hosted graph database on your desktop, run your Python code there too, and verify that the code works correctly. Then, when you deploy to EC2, if there are any issues you know it's the AWS networking and not your code at fault.
Hello! Thank you for the response. I tried doing what you suggested, but for some reason I was facing a few issues setting up NEO4J on an EC2 instance by myself. What I have done instead is find an existing NEO4J-configured AMI on the Marketplace and launch it with an m4.large EC2 instance type. That was fairly easy. I have SSH'ed into my instance, but now I don't really have a clue how to move forward. Basically, all I want is to start neo4j on that instance and have my other EC2 instance communicate with this neo4j EC2 instance so that I am able to access neo4j via the Python driver. Also, if possible, I would like to have browser access to neo4j too. Please let me know all the steps that I would need to go through, including all the configs, opening ports for specific IPs, and getting my system running with access to neo4j from my Python script.
There is a good chance that the AMI will already have neo4j running when it starts up.
After you SSH in, you can try typing curl http://localhost:7474 in the terminal and see if that comes back with something. If so, your neo4j is live.
If both of your EC2 instances are in the same security group, you can set up a policy so that the Python EC2 box can access the neo4j EC2 on ports 7474, 7473, and 7687.
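If you would rather script that rule than click through the console, a rough sketch with boto3 could look like the following. The function names and the group IDs you pass in are placeholders of mine, and it assumes boto3 is installed and your AWS credentials are allowed to modify security groups. It opens only the three ports mentioned above, and only to members of the source security group.

```python
# Default Neo4j ports: 7474 (HTTP), 7473 (HTTPS), 7687 (Bolt).
NEO4J_PORTS = (7474, 7473, 7687)


def neo4j_ingress_rules(source_group_id, ports=NEO4J_PORTS):
    """Build one TCP ingress rule per port, limited to the source group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": source_group_id}],
        }
        for port in ports
    ]


def allow_python_box(neo4j_group_id, python_group_id, region):
    """Add the rules to the security group of the neo4j instance."""
    # Imported here so the rule builder can be used without boto3 installed.
    import boto3  # pip install boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=neo4j_group_id,
        IpPermissions=neo4j_ingress_rules(python_group_id),
    )
```

If both instances really are in the same security group, you would pass the same group ID for both arguments, which gives a self-referencing rule.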
To test if they can talk, get the public IP of the neo4j EC2 and SSH into the python EC2 and try curl http://[NEO4J_IP]:7474 and see if you get a similar response to the localhost curl from the neo4j EC2.
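Since you will be working from Python anyway, the same reachability check can be done from a script on the Python EC2. This is only a sketch using the standard library: the helper names `port_open` and `check_neo4j` are mine, and the host is whatever address you used in the curl test. It only confirms that a TCP connection can be opened on each port, not that neo4j itself is answering.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_neo4j(host, ports=(7474, 7473, 7687)):
    """Map each Neo4j port to whether it is reachable from this machine."""
    return {port: port_open(host, port) for port in ports}
```

Running `print(check_neo4j("<NEO4J_IP>"))` on the Python box shows which ports are reachable. If 7474 works locally on the neo4j EC2 but every port shows False from the other instance, the security group rules are the first thing to check.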