Trying to generate a "filesystem" from a list of books and authors

I am experimenting with a P2P distributed filesystem hosted in Ceramic & IPFS. Pulling the data directly from the network is far too slow, so I am hoping to cache the structure in Neo4j and use it for queries.

Currently, I have an import of books, awards, and series from the Internet Speculative Fiction Database. I would like to generate a filesystem tree using the book data.

[Image: graph]

There are a couple of different trees I would like to generate. The first is of the format /book/by/{author}/{title}/. I also want to keep track of when filesystem locations map to semantic ones, so that users can see all file paths that lead to a particular book.

The Cypher query I have is:

MATCH (creators:Creators)-->(book:Book)
MERGE (:Root)-[:CHILD {name: "book"}]->(:Position)
  -[:CHILD {name: "by"}]->(:Position)
  -[:CHILD {name: creators.name}]->(pc:Position)
  -[:CHILD {name: book.title}]->(pb:Position)
MERGE (pc)-[:EQUALS]->(creators)
MERGE (pb)-[:EQUALS]->(book)

I've created indices on Book.title & Creators.name, but the query is still taking ages (read: hours) and hasn't completed.
My query explanation looks really tall to me, but I don't know what to do about it.


Does anyone have feedback on whether my approach is on point?

Hi @dysbulic

The graph connected by CHILD is complex.
This answer may not be what you want, but you could try to make it simpler.

CREATE INDEX filesystem_dir FOR (n:Filesystem) ON (n.dir);
// Assumes the relationship type is CREATED; use --> instead to match any type.
MATCH (creators:Creators)-[:CREATED]->(book:Book)
MERGE (creators)-[:EQUALS]->(:Filesystem {dir: "/book/by/" + creators.name + "/" + book.title + "/"})-[:EQUALS]->(book)
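
With that flattened model, listing the file paths that lead to a particular book could look roughly like the sketch below. The `$title` parameter is an assumption for illustration; the `:Filesystem` label, `dir` property, and `:EQUALS` relationship come from the query above.

```cypher
// Every stored path that points at the book with the given title.
MATCH (f:Filesystem)-[:EQUALS]->(b:Book {title: $title})
RETURN f.dir AS path
ORDER BY path;
```

Adding a second tree would then just mean merging more :Filesystem nodes with a different `dir` prefix pointing at the same :Book.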

Some of your model doesn't quite look correct to me. A :Year node pointing to the category? You seem to be approaching this as if the model should reflect a DSL (domain-specific language), but that isn't the case, or at least it doesn't make sense to store the data like that. Although having such relationships in your model may read well, it isn't a correct approach to modeling.

Instead, the :Award node may have a :Year node (unless the year is set as a property) as well as a :Category. It makes more sense for the category to be attached to the award than to the year. Also, the :Award node should be the one attached to the :Book or :Movie node as the nominee; it doesn't make sense for the category node to be attached like that.

Similarly, your book-by-author-title model doesn't look correct. The nodes are the significant thing here, not the relationships, so the :Book should have the title and the :Creator node should have the name. The node model should not blindly follow the syntax of the URL. This is also one of the reasons why your query is slow: no indexes are being used. You need indexed node properties to find starting places in the graph to traverse; otherwise you end up looking at and filtering over a larger dataset, and that becomes less efficient as more data is added.
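
One way to check whether an index is actually being used as a starting point is to look at the plan for a lookup on an indexed property. This is only a sketch: the `$name` parameter is made up for illustration, and it assumes the index on Creators.name mentioned in the question exists.

```cypher
// EXPLAIN shows the plan without running the query;
// PROFILE runs it and adds row counts per operator.
EXPLAIN
MATCH (creators:Creators {name: $name})-->(book:Book)
RETURN book.title;
```

If the index is picked up, the plan should begin with a NodeIndexSeek operator rather than a NodeByLabelScan. The :Position nodes in the original MERGE have no properties at all, so there is nothing for an index to seek on.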

For each award there are a set of years it was awarded. In each of those years there was a set of categories. For each of those categories there was a set of nominees.

The set of categories is not the same from year to year. I could also go (:Award)-->(:Category)-->(:Year) with only the years the category was offered being present. I don't see why one is more intuitive than the other.

I can do (:Award { name: "Hugo Award" })-[*]->(:Book) to get all the books for an award. ¿What is the benefit to having a link directly between the award and the book?

As mentioned, it looks like you're trying to model this as a DSL.

I would have instead expected something more like (:Year)<--(:Award)-->(:Category) and (:Book)<-[:NOMINEE]-(:Award), since the award has a year and a category, and the book or movie is the nominee for an award. This also allows reuse of both :Year and :Category nodes for purposes other than awards. For example, (:Book)-[:CLASSIFIED_AS]->(:Category) or (:Book)-[:WRITTEN_IN]->(:Year) or similar.

In your model, we can't reuse :Year that way, since that model would look like:

(:Book)-[:WRITTEN_IN]->(:Year)-[:FOR]->(:Category)

Do you see the problem? (:Year)-[:FOR]->(:Category) is really in the context of the :Award, but that doesn't make sense when using these in a non-award context, so that's where the strangeness of this model comes in.
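
To make the reuse point concrete, here is a rough sketch of two queries against the suggested shape. The `$awardName` parameter is an assumption for illustration, and the relationships from :Award to :Year and :Category are left untyped because no types were named for them.

```cypher
// Award context: nominees of one award, with the award's year and category.
MATCH (y:Year)<--(a:Award {name: $awardName})-->(c:Category)
MATCH (a)-[:NOMINEE]->(b:Book)
RETURN b.title AS title, y, c;

// Non-award context: the same :Year nodes reused to count books per year.
MATCH (b:Book)-[:WRITTEN_IN]->(y:Year)
RETURN y, count(b) AS booksWritten;
```

Neither query has to pass through a node that only makes sense in the other context.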


Ok, I've revamped my graph structure so that Categories and Years are acting as first-class objects rather than being contextualized by the path to get to them.

@andrew_bowman, I couldn't use your exact suggestions because a book can be nominated in different years for different awards, so I created a nominee "object" to hold the various fields together.

This is what my revised graph looks like:

It seems like a more straightforward structure to me. I'm still very much a neophyte, however.

That could work! Especially if :Award nodes represent the award in general and not a specific instance for a given year.

As an alternate model, if modeling it as a specific instance is desired, you can encode the nominee/winner info either in relationship properties of the :Award to the nominee/winner, or in the relationship types themselves.

For example:

(:Book {title:'The Catcher in the Rye'})-[:NOMINATED_FOR {won:true}]->(:Award {name:'The National Book Award', year:1952})-[:CATEGORY]->(:Category {name:'Fiction'})

Alternately, you could have :NOMINATED_FOR relationships from all nominated media for the award and a single :WON_AWARD relationship between the winning nominee and the award.
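
A minimal sketch of that second variant, assuming the :Book and :Award nodes already exist and are identified by the properties used in the example above. The parameters `$title`, `$awardName`, and `$year` are placeholders.

```cypher
// Record a nomination.
MATCH (b:Book {title: $title})
MATCH (a:Award {name: $awardName, year: $year})
MERGE (b)-[:NOMINATED_FOR]->(a);

// Run this only for the winning nominee.
MATCH (b:Book {title: $title})
MATCH (a:Award {name: $awardName, year: $year})
MERGE (b)-[:WON_AWARD]->(a);

// List all nominees for an award and flag the winner.
MATCH (b:Book)-[:NOMINATED_FOR]->(a:Award {name: $awardName, year: $year})
OPTIONAL MATCH (b)-[w:WON_AWARD]->(a)
RETURN b.title AS title, w IS NOT NULL AS won;
```

MERGE on a single relationship between two already-bound nodes is idempotent, so re-running the import should not duplicate nominations.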

A :Year node itself may not necessarily be something needed in your graph, but that depends on the questions you want to answer. If you are looking for a variety of things happening in the same year where you don't know ahead of time what they are (media, awards, editions) then a :Year node may be useful as the same :Year node may connect to multiple different nodes. If you will only be using year information in a prior-known context (as a filter or as a lookup or as returned data or similar) then modeling it as a property on a node makes more sense, and provides opportunities to create composite indexes for quickly looking up a node by its year as well as whatever defines it (such as a quick index lookup of a particular award for a particular year).
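
If the year is kept as a property, a composite index for that kind of lookup might look like the sketch below. The index name `award_name_year` is made up, and `$awardName` and `$year` are placeholder parameters; the `name` and `year` properties and the :CATEGORY relationship follow the example above.

```cypher
// Composite index over the two properties that identify an award instance.
CREATE INDEX award_name_year FOR (a:Award) ON (a.name, a.year);

// A lookup on both properties that the planner can serve from that index.
MATCH (a:Award {name: $awardName, year: $year})-[:CATEGORY]->(c:Category)
RETURN a, c;
```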


I've been working on a script to ingest data from the Internet Speculative Fiction Database. I'm able to create all the constituent parts of what would be a Nominee – an Award, Category, etc. – but I'm having a hell of a time creating a unique Nominee for each intersection of properties.

The query that I have currently is:

MATCH (w:Work {uuid: $workUUID})
MATCH (y:Year {uuid: $yearUUID})
MATCH (c:Category {uuid: $catUUID})
MATCH (a:Award {uuid: $awardUUID})
MATCH (x)
WHERE NOT (
  EXISTS((x)-[:IN]->(y))
  AND EXISTS((x)-[:IS]->(w))
  AND EXISTS((x)-[:FOR]->(a))
  AND EXISTS((x)-[:IN]->(c))
)
CREATE (n:Nominee${parseInt(level, 10) === 1 ? ':Winner' : ''})
CREATE (n)-[:IS]->(w)
CREATE (n)-[:IN]->(y)
CREATE (n)-[:IN]->(c)
CREATE (n)-[:FOR]->(a)
SET n.uuid = apoc.create.uuid()
SET n.place = $level
RETURN n

This particular version is creating hundreds of thousands of relationships. I've had others that linked everything to a single Nominee or only created a tenth of the entries it was supposed to.

¿Do you have any suggestions on how to create a unique Nominee for each combination of Award/Category/Year/Book that I generate?

It's not clear what x is supposed to be or drive. You're asking: find me all nodes in the graph that don't have these patterns. Since operations in Cypher execute per row, this is probably what is driving the creation of what end up being duplicate nodes and relationships.

If you're trying to check if such a :Nominee exists at the intersection of the given set of nodes and create it otherwise, then this won't work.

You need something more like:

...
WITH w, y, c, a
WHERE NOT EXISTS {
 MATCH (x)-[:FOR]->(a)
 WHERE EXISTS((x)-[:IN]->(y))
  AND EXISTS((x)-[:IS]->(w))
  AND EXISTS((x)-[:IN]->(c))
}
...
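
Put together with the rest of the original query, the whole thing might look something like the sketch below. This is an assumption about how the elided parts fit, not a tested import: it restricts x to :Nominee, folds the three checks into a single pattern, leaves out the conditional :Winner label, and needs Neo4j 4.0 or later for the existential subquery.

```cypher
MATCH (w:Work {uuid: $workUUID})
MATCH (y:Year {uuid: $yearUUID})
MATCH (c:Category {uuid: $catUUID})
MATCH (a:Award {uuid: $awardUUID})
WITH w, y, c, a
// Keep the row only if no nominee already sits at this exact intersection.
WHERE NOT EXISTS {
  MATCH (x:Nominee)-[:FOR]->(a),
        (x)-[:IN]->(y),
        (x)-[:IS]->(w),
        (x)-[:IN]->(c)
}
CREATE (n:Nominee {uuid: apoc.create.uuid(), place: $level})
CREATE (n)-[:IS]->(w)
CREATE (n)-[:IN]->(y)
CREATE (n)-[:IN]->(c)
CREATE (n)-[:FOR]->(a)
RETURN n
```

Because the four MATCH clauses each pin down one node by uuid, at most one row reaches the CREATE, so at most one :Nominee is created per call.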