Sample datasets?

Are there sample datasets for Neo4j version 4 that are easy to load, besides the ubiquitous Movie and Northwind datasets? I have looked at GraphGist, but many of the datasets seem really tiny and the site is very hard to search. I am looking in particular for datasets that are easy to load.

In general, it seems that importing data into Neo4j is really complex, with many steps. In MongoDB, as well as in standard relational DBMSs, there are import and export tools that make this happen in a flash. Is that not the case with Neo4j?

Hi @mackellb ,

You're right that the GraphGists are mostly small examples. We're working on a dataset catalog which will combine openly available data along with import instructions.

There is a developer guide about data import which details the available approaches, which are:

  • LOAD CSV using Cypher
  • CALL apoc.load.json using Cypher along with the APOC library
  • Loading from DBMS using either an ETL tool or Kettle

While we're building the data catalog, what kinds of datasets would be of most interest to you? Is there a topic or source you'd like to request?

Best,
ABK

Hi,

Thanks for getting back to me. I am teaching a course on NoSQL databases. I need a couple of datasets that are fast and easy to load for student use. I have looked at the datasets on the sandbox, but the problem is that they come with guides. I need just the dataset without the guides so I have more flexibility for designing homework and tests. And again, it needs to be simple to load. Students tend to get tangled when there are a lot of steps before they can even start. With other systems, like MySQL or MongoDB, it is easy to just load a single file and then get going, but I don't see anything like that for Neo4j. I am hoping that people on this board might know of samples that are easy to bring in.

Hello,

We're missing a standard file format for graph databases which precisely describes an instance of a property graph model. Existing data formats are either drawing-oriented (Graphviz DOT, GraphML) or semantic web (RDF, JSON-LD).

For MongoDB, you can load JSON. For RDBMS, you can load CSV. For Neo4j, you have to import an external format and transform it into a graph. Typically LOAD CSV is used to load two files -- one for nodes, and the second for relationships.
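
As a rough sketch of that two-file convention, the Python below builds one CSV for nodes and one for relationships, and pairs each with a LOAD CSV statement that could read it. The file names, the Person label, the KNOWS relationship type, and the column names are all made up for illustration. It also assumes the files end up in Neo4j's import directory, which is where file:/// URLs resolve by default.

```python
import csv
import io

# Hypothetical sample data: two people and one relationship between them.
PEOPLE = [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Grace"}]
KNOWS = [{"from_id": "1", "to_id": "2"}]

# Cypher that would read each file. Label, type, and file names are assumptions.
LOAD_NODES = """
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name;
"""

LOAD_RELATIONSHIPS = """
LOAD CSV WITH HEADERS FROM 'file:///knows.csv' AS row
MATCH (a:Person {id: row.from_id})
MATCH (b:Person {id: row.to_id})
MERGE (a)-[:KNOWS]->(b);
"""


def to_csv(rows):
    """Render a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Save these strings as people.csv and knows.csv before running the Cypher.
people_csv = to_csv(PEOPLE)
knows_csv = to_csv(KNOWS)
```

The node file has to be loaded first, since the relationship statement only matches nodes that already exist.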

Consequently, the convention has been to either use a multi-statement Cypher query or to create a Browser guide which describes each code block to run.

For example, I've combined each of the load statements for the :play northwind example that is included with Neo4j into a single northwind.cypher multi-statement query, attached to this post.

Alternatively, you could create Browser guides specifically for your class. That's the approach taken for the Graph Academy courses, where each lesson has an accompanying Browser guide with Cypher queries and exercises. Northwind is a reasonable boilerplate for creating a new guide. See the detailed instructions for creating a custom Browser guide.

If you have some data in mind I'd be happy to help craft either the cypher import, or a Browser guide.

Best,
ABK

northwind.cypher.txt (1.3 KB)

Actually, for an RDBMS you load SQL files: either one big file that contains all the schema creation and data insert statements, or sometimes two. And for Stardog and some of the other triple stores, databases are loaded and exported via RDF files. Which leads to the question: why isn't it possible to have files that just contain the statements that create the nodes and relationships, specified in the proper order? Also, is there an import/export facility in Neo4j, and if so, what format does it use?

The guides are really lovely, but a lot of work for a lone professor who is just doing a 3 week unit on this system, plus it isn't what we need for testing setup.

Great point. I meant CSVs from the standpoint of an exchange format. The movie dataset you mention is an example which inserts using a sequence of Cypher statements.
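
To illustrate what such a sequence of Cypher statements could look like when it is generated rather than written by hand, here is a small Python sketch. Everything in it is hypothetical: the function names, the sample people, and the choice to look nodes up by an `id` property. It only handles string and numeric property values.

```python
def cypher_literal(value):
    """Render a Python str or number as a Cypher literal."""
    if isinstance(value, str):
        # Escape backslashes first, then single quotes.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return str(value)


def props(properties):
    """Render a dict as a Cypher property map, e.g. {name: 'Ada'}."""
    inner = ", ".join(
        f"{key}: {cypher_literal(val)}" for key, val in properties.items()
    )
    return "{" + inner + "}"


def to_cypher_script(nodes, relationships):
    """Emit node CREATE statements first, then one MATCH/CREATE per relationship."""
    lines = []
    for label, properties in nodes:
        lines.append(f"CREATE (:{label} {props(properties)});")
    for start_id, rel_type, end_id in relationships:
        start = props({"id": start_id})
        end = props({"id": end_id})
        lines.append(
            f"MATCH (a {start}), (b {end}) CREATE (a)-[:{rel_type}]->(b);"
        )
    return "\n".join(lines) + "\n"


# Made-up sample graph: two Person nodes and one KNOWS relationship.
nodes = [
    ("Person", {"id": 1, "name": "Ada"}),
    ("Person", {"id": 2, "name": "O'Hara"}),
]
relationships = [(1, "KNOWS", 2)]
script = to_cypher_script(nodes, relationships)
```

The resulting text is a plain multi-statement script, with nodes created before the relationships that refer to them. Matching on an unlabeled `id` property scans every node, so this shape only suits small teaching datasets.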

APOC (Neo4j's standard library of procedures and functions) can be used to export to a Cypher script, which can then be directly run on a separate database.

Because there is no canonical file format for graph databases, the choices are:

The dump file is the most efficient for moving large-ish datasets around compatible versions of Neo4j.

If you're familiar with RDF, perhaps you might load DBpedia using Neosemantics to prepare a dataset, then share the dump file with the students.

@jesus_barrasa has a fantastic collection of blog posts where he loads various RDF datasets into Neo4j.

Best,
ABK


Hi @mackellb

It sounds like you are looking for a specific kind of dataset.
If your students only need to learn the Cypher language quickly, on a graph database that you have created from a relational database, I think the quickest path would be:

For you:
1 - Get the Neo4j Desktop app
2 - Create an empty local Neo4j database with the Add button
3 - Start the RDBMS hosting your SQL database
4 - Use the ETL Tool app in Neo4j Desktop to fill your Neo4j database with your SQL data
5 - Export the .dump with the Neo4j Desktop DBMS buttons to create a backup
6 - Give the .dump to your students

For your students:
1 - Install Neo4j Desktop (if it's their personal computer)
2 - Open the Neo4j Desktop app
3 - Add the dump file you provided to their Neo4j Desktop
4 - Create a database from the dump file, using the three-dots button next to the file in Neo4j Desktop
5 - Start the database and enjoy it!

None of these steps requires knowledge of the import process with APOC, command lines, or Cypher clauses. It's all user friendly and can be done quickly, especially on the students' side.

I was wondering whether it's possible to load the dump file into an empty online Sandbox, @abk. I think an empty sandbox has no guides, and you would be able to share a link with your students without having to worry about their Neo4j Desktop installation. But you would have to be careful about the school's network.

If your needs are more complex, everything @abk said is the way to go.

By the way, I'm a bit jealous. I'm a teacher at heart and I would love to teach it.

I did not know that it is possible to import RDF. That may be a good way to go since I have plenty of RDF data.

Loading a dump file would be perfect - but are the example datasets available in that format? For example, the Twitter dataset?

Thanks! There is a lot of Neo4j information out there, but it is almost too overwhelming for my needs - I just need to get them through 3 weeks of Neo4j and I don't have time to go through huge numbers of tutorials and blogs since I have many other topics in this course plus other courses to deal with.


Use the Titanic dataset. You can split the file into different age groups, such as ages 60 to 90, 40 to 60, and anything under 40. Students can observe different patterns in each group, and the data is easy for them to understand.

Download it from this site. The data is clean.
https://public.opendatasoft.com/explore/dataset/titanic-passengers/export/
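
The age-group split could be sketched in Python roughly like this. The rows are made-up stand-ins for the downloaded file, and the `Age` column name, the comma delimiter, and the exact bucket boundaries are assumptions, so check them against the real export first.

```python
import csv
import io
from collections import defaultdict

# A few made-up rows standing in for the downloaded file. The real export
# may use different column names or a different delimiter.
SAMPLE = """Name,Age
Passenger A,72
Passenger B,45
Passenger C,8
Passenger D,
"""


def age_group(age_text):
    """Map an age string to a bucket; blank ages get their own bucket."""
    if not age_text.strip():
        return "unknown"
    age = float(age_text)
    if age < 40:
        return "under_40"
    if age < 60:
        return "40_to_59"
    return "60_and_over"


def split_by_age(csv_text, age_column="Age"):
    """Group CSV rows into age buckets keyed by bucket name."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[age_group(row[age_column] or "")].append(row)
    return dict(groups)


groups = split_by_age(SAMPLE)
```

Each bucket can then be written out as its own small CSV, so students can load and compare one group at a time.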
