Trying to generate a "filesystem" from a list of books and authors

I am experimenting with a P2P distributed filesystem hosted in Ceramic & IPFS. Pulling the data directly from the network is far too slow, so I am hoping to cache the structure in Neo4j and use it for queries.

Currently, I have an import of books, awards, and series from the Internet Speculative Fiction Database. I would like to generate a filesystem tree using the book data.

[Image: graph of the imported data]

There are a couple of different trees I would like to generate. The first is of the format /book/by/{author}/{title}/. I also want to keep track of when filesystem locations map to semantic ones, so that users can see all the file paths that lead to a particular book.

The Cypher query I have is:

MATCH (creators:Creators)-->(book:Book)
MERGE (:Root)-[:CHILD {name: "book"}]->(:Position)
  -[:CHILD {name: "by"}]->(:Position)
  -[:CHILD {name: creators.name}]->(pc:Position)
  -[:CHILD {name: book.title}]->(pb:Position)
MERGE (pc)-[:EQUALS]->(creators)
MERGE (pb)-[:EQUALS]->(book)

I've created indices on Book.title and Creators.name, but the query still takes ages (read: hours) and has never completed.
My query explanation looks really tall to me, but I don't know what to do about it.


Does anyone have feedback on whether my approach is on point?

Hi @dysbulic

The graph connected by CHILD is complex.
This answer may not be what you want, but you could try to make it simpler.

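// Index the path property; run this as its own statement first.
CREATE INDEX filesystem_dir FOR (n:Filesystem) ON (n.dir);
// Match every creator/book pair, whatever the relationship type.
MATCH (creators:Creators)-->(book:Book)
MERGE (creators)-[:EQUALS]->(:Filesystem {dir: "/book/by/" + creators.name + "/" + book.title + "/"})-[:EQUALS]->(book)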
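
With this flattened model, the "all file paths that lead to a particular book" requirement could be served by a query along these lines. This is only a sketch: it assumes the :Filesystem nodes were created by the MERGE above, and $title is a hypothetical parameter.

```cypher
// Sketch: list every stored path that points at a given book.
// Assumes :Filesystem nodes with a `dir` property exist as created above;
// $title is a hypothetical parameter supplied by the caller.
MATCH (fs:Filesystem)-[:EQUALS]->(book:Book {title: $title})
RETURN fs.dir AS path
ORDER BY path
```

Because the starting point is a :Book looked up by title, the existing index on Book.title can be used to find it.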

Some of your model doesn't quite look correct to me. A :Year node pointing to the category? You seem to be approaching this as if the model should reflect a DSL (domain-specific language), but that isn't the case, or at least it doesn't make sense to store the data like that. Although such relationships may read well in your model, this isn't a sound approach to modeling.

Instead, the :Award node could have a :Year node (unless the year is stored as a property) as well as a :Category. It makes more sense for the category to be attached to the award than to the year. The :Award node should also be the one attached to the :Book or :Movie node as the nominee; it doesn't make sense for the category node to be attached that way.

Similarly, your book-by-author-title model doesn't look correct. The nodes are the significant thing here, not the relationships, so the :Book should have the title and the :Creators node should have the name. The node model should not blindly follow the syntax of the URL. This is also one of the reasons your query is slow: no indexes are being used. You need indexed node properties to find starting places in the graph to traverse from; otherwise you end up scanning and filtering a larger dataset, which becomes less efficient as more data is added.
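
If the CHILD tree is kept anyway, here is a rough sketch of what anchoring on a starting place could look like. MERGE either matches its whole pattern or creates the whole pattern, so merging the full /book/by/{author}/{title} chain in one clause can create a fresh :Root path for every row. Splitting it into one hop per MERGE, each starting from an already-bound node, avoids that. The labels and relationship types are taken from the original query; treating :Root as a single shared node is an assumption.

```cypher
// Assumption: there should be exactly one :Root node; MERGE creates it once.
MERGE (root:Root)
// Build the fixed /book/by prefix one hop at a time from the bound root.
MERGE (root)-[:CHILD {name: "book"}]->(bookDir:Position)
MERGE (bookDir)-[:CHILD {name: "by"}]->(byDir:Position)
WITH byDir
MATCH (creators:Creators)-->(book:Book)
// MERGE fails on null property values, so skip incomplete rows.
WHERE creators.name IS NOT NULL AND book.title IS NOT NULL
// Each MERGE starts from a bound node, so only the missing hop is created.
MERGE (byDir)-[:CHILD {name: creators.name}]->(pc:Position)
MERGE (pc)-[:EQUALS]->(creators)
MERGE (pc)-[:CHILD {name: book.title}]->(pb:Position)
MERGE (pb)-[:EQUALS]->(book)
```

Note that each author-level MERGE still has to look through the existing CHILD relationships of the "by" position, so this may remain slow with a very large number of authors.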

For each award there is a set of years in which it was awarded. In each of those years there was a set of categories. For each of those categories there was a set of nominees.

The set of categories is not the same from year to year. I could also go (:Award)-->(:Category)-->(:Year) with only the years the category was offered being present. I don't see why one is more intuitive than the other.

I can do (:Award { name: "Hugo Award" })-[*]->(:Book) to get all the books for an award. What is the benefit of having a link directly between the award and the book?

As mentioned, it looks like you're trying to model this as a DSL.

I would instead have expected something more like (:Year)<--(:Award)-->(:Category) and (:Book)<-[:NOMINEE]-(:Award), since the award has a year and a category, and the book or movie is the nominee for an award. This also allows both :Year and :Category nodes to be reused for purposes beyond awards. For example, (:Book)-[:CLASSIFIED_AS]->(:Category) or (:Book)-[:WRITTEN_IN]->(:Year) or similar.
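
As a minimal sketch of how that shape could be queried: the relationship types :IN_YEAR and :IN_CATEGORY and the properties Year.value and Category.name are hypothetical names chosen for illustration, while :NOMINEE and Award.name come from the patterns already mentioned in this thread.

```cypher
// Sketch of querying the suggested shape.
// Hypothetical names: :IN_YEAR, :IN_CATEGORY, Year.value, Category.name.
MATCH (award:Award {name: "Hugo Award"})-[:NOMINEE]->(book:Book)
MATCH (award)-[:IN_YEAR]->(y:Year)
MATCH (award)-[:IN_CATEGORY]->(c:Category)
RETURN book.title AS title, y.value AS awardYear, c.name AS categoryName
ORDER BY awardYear, categoryName
```

Each hop is a single typed relationship from the award, so no variable-length [*] traversal is needed.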

In your model, we can't reuse :Year that way, since that model would look like:

(:Book)-[:WRITTEN_IN]->(:Year)-[:FOR]->(:Category)

Do you see the problem? (:Year)-[:FOR]->(:Category) is really in the context of the :Award, but that doesn't make sense when using these in a non-award context, so that's where the strangeness of this model comes in.

Ok, I've revamped my graph structure so that Categories and Years are acting as first-class objects rather than being contextualized by the path to get to them.

@andrew.bowman, I couldn't use your exact suggestions because a book can be nominated in different years for different awards, so I created a nominee "object" to hold the various fields together.

This is what my revised graph looks like:

It seems like a more straightforward structure to me. I'm still very much a neophyte, however.
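
One way such a nominee node might tie the pieces together, as a sketch only: the :Nominee label, the relationship types, and the property names below are assumed for illustration and are not taken from the actual graph.

```cypher
// Hypothetical names throughout: :Nominee, :NOMINEE_OF, :FOR_AWARD,
// :IN_YEAR, :IN_CATEGORY, Year.value, Category.name.
// $title is a hypothetical parameter supplied by the caller.
MATCH (book:Book {title: $title})<-[:NOMINEE_OF]-(n:Nominee)
MATCH (n)-[:FOR_AWARD]->(award:Award)
MATCH (n)-[:IN_YEAR]->(y:Year)
MATCH (n)-[:IN_CATEGORY]->(c:Category)
RETURN award.name AS awardName, y.value AS awardYear, c.name AS categoryName
```

Under those assumptions, a book nominated for several awards in different years would simply have several :Nominee nodes, one per nomination.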