Discover AuraDB Free — Week 26 Goodreads Books and Recommendations
This week, we’re exploring book recommendation data as a graph. And what better source for that than Goodreads?
If you’d rather watch the recording of our livestream, enjoy it here below. Otherwise, keep on reading.
https://medium.com/media/c30695237d8f609042d4f31200367649/hrefThe Data Set
I found a well-sized dataset on Kaggle that has the following files:
- books.csv (10k books and authors)
- tags.csv (230k tags)
- book_tags.csv (1M tags for books)
- ratings.csv (53k ratings)
- to_read.csv (53k to read intentions)
This should work within our free database limits on AuraDB, except for the tags.
Frankly, those tags are also a mess. So many have been used only a few times, so we can limit our data to those that have at least 100 uses, which leaves us with 2,081 tags.
Often I use xsv as a nice command line tool for CSV processing.
Here we join the data from the two tag-files together (by tag_id) and only keep rows where the count column has at least 3 digits (regexp).
xsv join tag_id tags.csv tag_id book_tags.csv | \
xsv search -s count "\d{3,}" | xsv count
xsv join tag_id tags.csv tag_id book_tags.csv | \
xsv search -s count "\d{3,}" | \
xsv select tag_id,goodreads_book_id,count \
> book_tags_reduced.csv
For the actual tags_reduced.csv file we do the same but then have to de-duplciate ourselves. We can use sort and uniq on a header-less tags file to de-duplicate the tags.
Create a Neo4j AuraDB Free Instance
Go to https://dev.neo4j.com/neo4j-aura to register or log into the service (you might need to verify your email address).
After clicking Create Database you can create a new Neo4j AuraDB Free instance. Select a Region close to you and give it a name, e.g. goodreads.
Choose the “blank database” option as we want to import our data ourselves.
On the Credentials popup, make sure to save the password somewhere safe. The default username is always neo4j.
Then wait 3 to 5 minutes for your instance to be created.
Afterwards you can connect (you’ll need the password) via:
- Import Button with Neo4j Data Importer
- Explore Button with Neo4j Bloom for visual exploration
- Query Button with Neo4j Browser
The connection URL: neo4j+s://xxx.databases.neo4j.io is available and you can copy it to your credentials, as you might need it later.
If you want to see examples of programmatically connecting to the database, go to the “Connect” tab of your instance and pick the language of your choice.
Data Modeling and Import
After our database is running, we can use the Import button to open Data Importer.
TLDR;
If you don’t want to do the steps manually you can click on the three dots … and load the model with data zip file from here.
There we add our 5 CSV files to the left side and start drawing our model. (Follow the video for the individual steps.)
- Create the nodes and relationships (drag from halo of the node)
- Name them and sometimes reverse the relationships
- Select a csv file for each element and map the columns from the file
- Rename / provide types for some fields (like ratings, year, count)
- Select an id field for nodes
The CSV files turn increasingly green as you progress.
Then hit the Run Import button to start the import, and provide the password that you hopefully saved!
Unfortunately, not all the data fits into our free database, but it’s only 4,000 users and 12,000 ratings that are left behind, so not too much to worry about.
As the authors are only provided in a comma-separated list in our data, we can run the following statement to post-process them and split them into individual nodes.
Post Processing to individualize Authors:
// find authors and books
MATCH (n:Author)-[:WROTE]->(b)
// split author name by comma
WITH b, n, split(n.author,', ') as names
// turn list of names in to rows of name
UNWIND names as name
// get-or-create an author with that name
MERGE (a:Author {name:name})
// if it's a new author node, then the previous one was a a combined author
WITH * WHERE n <> a
// get rid of the combined author and it's relationships
DETACH DELETE n
// create a new relationship to the book
MERGE (a)-[:WROTE]->(b);
// Added 5841 labels, created 5841 nodes, deleted 4664 nodes, set 5841 properties, deleted 10000 relationships, created 13209 relationships, completed after 1152 ms.
You see our free database is now pretty full — close to 100 percent of the node and relationship limits.
Initial Queries
Now that we have our data in the database click the Explore data in browser button, which takes you to our database UI.
Here we can run exploratory queries on the data.
The query language is Cypher, which is like SQL for graphs.
It’s centered around expressing visual patterns of your data (plus the regular filters, aggregation, pagination, etc.).
You can learn all about it in this short GraphAcademy course.
Overview
match (n) return labels(n) as label, count(*)
These are the counts that we have in the database now:
╒══════════╤══════════╕
│"label" │"count(*)"│
╞══════════╪══════════╡
│["Book"] │10000 │
├──────────┼──────────┤
│["Tag"] │2081 │
├──────────┼──────────┤
│["User"] │31071 │
├──────────┼──────────┤
│["Author"]│5841 │
└──────────┴──────────┘
Book With Data
MATCH (n:Book) RETURN n LIMIT 25
Prolific Authors
MATCH (n:Author)
RETURN n, size( (n)-[:WROTE]->()) as books
order by books desc limit 20
We knew that Terry Pratchett was really prolific, but I was blown away by Stephen King’s 97 books.
╒═════════════════╤═══════╕
│"n.name" │"books"│
╞═════════════════╪═══════╡
│"James Patterson"│98 │
├─────────────────┼───────┤
│"Stephen King" │97 │
├─────────────────┼───────┤
│"Nora Roberts" │65 │
├─────────────────┼───────┤
│"Dean Koontz" │64 │
├─────────────────┼───────┤
│"Terry Pratchett"│50 │
├─────────────────┼───────┤
│"Agatha Christie"│43 │
├─────────────────┼───────┤
│"J.D. Robb" │41 │
├─────────────────┼───────┤
│"Neil Gaiman" │41 │
├─────────────────┼───────┤
│"Meg Cabot" │38 │
├─────────────────┼───────┤
│"Janet Evanovich"│37 │
└─────────────────┴───────┘
Tags
MATCH (t:Tag {name:"sherlock-holmes"})<-[r:TAGGED]-(b:Book)<-[w:WROTE]-(a)
RETURN *
Related Tags
MATCH (t:Tag)<-[r:TAGGED]-(b:Book)-[:TAGGED]->(other:Tag)
WHERE t.name CONTAINS 'dystop' AND NOT other.name CONTAINS 'dystop'
RETURN other.name, count(*) as freq
ORDER BY freq desc SKIP 20 LIMIT 20
Here are tags related to dystopia— you can already see that it mixes structural (ownership, intent to read, type of book) with genre tags and that the nomenclature is all over the place.
╒════════════════╤══════╕
│"other.name" │"freq"│
╞════════════════╪══════╡
│"ebook" │263 │
├────────────────┼──────┤
│"adventure" │243 │
├────────────────┼──────┤
│"scifi" │241 │
├────────────────┼──────┤
│"audiobooks" │236 │
├────────────────┼──────┤
│"read-in-2014" │233 │
├────────────────┼──────┤
│"teen" │218 │
├────────────────┼──────┤
│"ebooks" │211 │
├────────────────┼──────┤
│"sci-fi-fantasy"│205 │
├────────────────┼──────┤
│"read-in-2015" │199 │
├────────────────┼──────┤
│"my-books" │189 │
└────────────────┴──────┘
We can also look at our users and how they rated books. To visualize it, we can put the rating on each relationship from the detail pane on the right.
A User’s Ratings
match (u:User {user_id:'314'})-[r:RATED]->(b) return *
Let’s compute some recommendations for our users 314.
Recommendation — Content Based by Author
First we start with content based recommendations, usually via genre or authors. As the tags are not well structured for genres, let’s go with the authors.
So we expand from the highly rated (>= 4) books from our user to their authors and which other books they’ve written.
Then we use those books’ average ratings to sort the results and exclude the books from recommendations that the user has already rated (no matter how) — i.e. read.
MATCH (u:User {user_id:'314'})-[r:RATED]->(b)<-[w1:WROTE]-(author)-[w2:WROTE]->(reco)
WHERE r.rating >= 4
AND NOT (u)-[:RATED]->(reco)
AND NOT reco.title contains 'Harry'
RETURN DISTINCT reco.title, author.name, reco.average_rating
ORDER BY reco.average_rating DESC SKIP 20 LIMIT 10
Here is the second page of recommendations, after the first one had a few too obvious ones.
One drawback is that there is no information about which books are the same (just in a different edition or language) or contained in which boxset. That’s why we excluded Harry books manually.
Recommendation — Collaborative Filtering by Peers
In a collaborative filtering or peer recommendation, you try to find the people who are most similar to yourself (have rated most closely) and then look at what else they rated favorably.
That is the “people also bought” recommendation you often see on all kinds of sites.
We can return the frequency a recommended book showed up in this peer group, and use that for sorting our data, or combine/multiply it with its average rating.
So our full recommendation query is only those three lines:
- Find peers via similar rating
- Find their their highly rated books
- Compute frequency of recommendations
MATCH (u:User {user_id:'314'})-[r:RATED]->(book)<-[r2:RATED]-(peer)-[r3:RATED]->(reco)
// exclude already read books
WHERE NOT (u)-[:RATED]->(reco)
// peers show similar rating behavior
AND abs(r.rating-r2.rating) <= 1
// highly rated books from peers
AND r3.rating >= 4
// count how frequently the recommend book shows up
WITH reco, count(*) as freq
// find the authors for our books
MATCH (reco)<-[:WROTE]-(author)
RETURN reco.title, freq, reco.average_rating, freq*reco.average_rating as score, collect(author.name) as authors
ORDER BY score DESC SKIP 10 LIMIT 10
Here is the result from page two, so we have enough to read for our user 314.
Conclusion
This was a really fun episode. We got a lot of runway out of the data.
Here are some ways on how to go further:
- You can export your own Goodreads Data and combine it with this dataset.
- There are many, larger datasets of Goodreads data on Kaggle or elsewhere — e.g. from our colleague Jennifer Reif.
- Classification (structural, generes, behavioral) and de-duplication of Tags.
- Connect similar books, like translations, boxsets, etc. to exclude them from recommendations.
- Develop multi-score recommendations, with weights.
- Built an api, web, or mobile app on top of this data — for example, using GraphQL or Spring Data.
Discover AuraDB Free — Week 26 Goodreads Books and Recommendations was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.