Discover Aura Free Week 11 โ Importing and Querying Kickstarter Projects
This week for our โDiscover Aura Freeโ series, we want to look at a Kickstarter dataset from Maven Analytics recommended by our colleague Jennifer Reif.
If you want to code along, sign up โ or in โ to https://dev.neo4j.com/aura, create your free database, and join us.
Neo4j Aura - Fully Managed Cloud Solution
Being avid Kickstarter backers ourselves (especially for board games and gadgets), that dataset was very compelling to us.
If youโd rather watch than read, here is the video of our live-stream:
We looked at the Kickstarter site overall โ and at a particular project โ to relate the data in the dataset with the actual site.
The dataset contains 375k Kickstarter projects from 2009 to 2017, including category, country, and pledge data.
Some questions we can ask after importing the datasets:
- What does the (sub)-category tree look like visually in terms of success?
- What are the most successful categories or countries?
- What are highest pledges of successful projects?
Dataset
The data is available as a zip download with 2 CSVs โ one with the data dictionary, the other with 375k rows of Kickstarter projects.
Columns:
- ID โ Internal kickstarter id
- Name โ Name of the project
- Category โ Project category
- Subcategory โ Project subcategory
- Country โ Country the project is from
- Launched โ Date the project was launched
- Deadline โ Deadline date for crowdfunding
- Goal โ Amount of money the creator needs to complete the project (USD)
- Pledged โ Amount of money pledged by the crowd (USD)
- Backers โ Number of backers (integer)
- State โ Current condition the project is in (as of 2018โ01โ02) (Successful, Canceled, Failed, Live, Suspended)
To make it easier to import, I would usually upload the CSV file somewhere publicly to access it easily, like GitHub, Pastebin, or S3.
This time, we wanted to demonstrate how you can use Google Spreadsheets for that.
After uploading the 100k chunk of the file as a new sheet, you can choose โFile โ Publish to the Webโ to publish a single sheet, e.g. as a CSV publicly.
Other tools and people can access just that data read-only, including our Neo4j instance.
Google Spreadsheets didnโt like importing the full 400k rows, so we split the file using xsv split -c 10000 split kickstarter.csv into smaller chunks (csvkit works too, or just head -10000 in any unix shell).
As the long URL is a bit unwieldy, we created a bit.ly shortlink for our needs: https://dev.neo4j.com/kickstarter
Data Model
The data model has some interesting tidbits, especially around modeling the state.
We could model the state just as a property like in the CSV, but as there are only a few relevant states and itโs an important attribute of a project to distinguish them, we can also add labels to a project for:
- Successful
- Failed
- Canceled
That helps us to visualize the state of a project quickly and also sub-select the relevant projects easily.
We extract the subcategory and category and the country as separate nodes.
We could extract the pledge information into a separate node โ e.g. for finance and security reasons โ but we kept it in the project for simplicity.
Import
After creating our blank AuraDB Free database, and opening Neo4j Browser we can get going.
Exploring the CSV
We can look at the first few rows of the data:
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
return row limit 5;
Which gives us object/hash/dict/map representations of each row with the headers as key and the row data as values.
All values are strings, so if we want to convert them we need to do that manually.
{
"Goal": "3854",
"Category": "Film & Video",
"Subcategory": "Webseries",
"State": "Failed",
"Pledged": "426",
"Deadline": "2016-08-12",
"Country": "Canada",
"Backers": "13",
"ID": "689678626",
"Launched": "2016-07-22 22:53:49",
"Name": "The Vanishing Garden"
}
We also check how many rows our dataset contains.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
return count(*);
That returns 75k rows, which is a bit much for our AuraDB free instance (50k nodes, 175k rels).
So we need to cut it down, but because we want to have projects of all kinds of status, we can sub-select a single year to use, and that gives us roughly 22k rows to work with.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
return count(*);
Setup
First, we create a bunch of constraints to ensure data uniqueness and speed up the import when looking up existing data:
create constraint on (p:Project) assert p.id is unique;
create constraint on (c:Category) assert c.name is unique;
create constraint on (c:SubCategory) assert c.name is unique;
create constraint on (c:Country) assert c.name is unique;
Importing Projects
We import the data incrementally, starting with the projects. We use MERGE here so that we can re-run the import without causing duplication.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
MERGE (p:Project {id:row.ID})
ON CREATE SET
p.name = row.Name,
p.launched = date(substring(row.Launched,0,10))
p.deadline = date(row.Deadline)
p.state = row.State;
Weโll add the pledge information later.
We see that it imported some 22k projects. By clicking on the (Project) pill in the sidebar we can quickly pull up a few projects in the visualization.
Now we want to turn the state information into labels. For just the three labels, we can run a single update statement each that adds a label of the right kind to the project node.
MATCH (p:Project) WHERE p.state = 'Successful' SET p:Successful;
MATCH (p:Project) WHERE p.state = 'Failed' SET p:Failed;
MATCH (p:Project) WHERE p.state = 'Canceled' SET p:Canceled;
So if we now query projects from the sidebar we see their success even visually. In the stream, we discuss styling the colors by clicking on the pills on top of the visualization and then choosing a different color.
Importing Categories
Importing the categories involves creating nodes for them and then connecting the subcategory to the project and the subcategory to the category. As we use MERGE here, it also ensures each relationship is created only once.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
match (p:Project {id:row.ID})
merge (sc:SubCategory {name:row.Subcategory})
merge (p)-[:IN_CATEGORY]->(sc)
merge (c:Category {name:row.Category})
merge (sc)-[:IN_CATEGORY]->(c)
return count(*);
Please note that shared names of subcategories in this approach will be merged together into a single one.
If you donโt want this, you have to create the subcategories in the context of a category, as shown below.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
match (p:Project {id:row.ID})
merge (c:Category {name:row.Category})
merge (sc:SubCategory {name:row.Subcategory})-[:IN_CATEGORY]->(c)
merge (p)-[:IN_CATEGORY]->(sc)
return count(*);
We can now run a bunch of queries to see which categories exist, how many there are, and to get a visual.
Querying Categories
MATCH (n:Category) RETURN n LIMIT 25;
MATCH (n:Category) RETURN count(*);
MATCH (n:SubCategory) RETURN count(*);
MATCH path = (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)
RETURN path;
Importing Countries:
Importing countries is pretty straightforward. Just create the node for a country if it doesnโt exist and connect it to the project.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
match (p:Project {id:row.ID})
merge (c:Country {name:row.Country})
merge (p)-[:IN_COUNTRY]->(c)
return count(*);
Importing Pledges and Backers
Originally we had discussed extracting the backing information into separate nodes, but given the questions and the time we had we left that for a future exercise.
So we just find our projects again based on their id and set the few extra properties โ remember to convert them to numeric values as needed.
load csv with headers from
"https://dev.neo4j.com/kickstarter" as row
with row where row.Launched starts with '2016'
match (p:Project {id:row.ID})
set p.pledge = toInteger(row.Pledged)
set p.goal = toInteger(row.Goal)
set p.backers = toInteger(row.Backers)
return count(*);
If we want to explore our imported data model visually, we can use apoc.meta.graph procedure.
Which shows both the base data model but also our extra labels.
call apoc.meta.graph();
Querying
Now with the data in our graph we can start visualizing and querying it and answer our question.
Most Backers of a Failed Project
MATCH (p:Project:Failed)
RETURN p.name, p.backers
ORDER BY p.backers DESC LIMIT 20
Sad for the wolves :(
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโ
โ"p.name" โ"p.backers"โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโก
โ"Save The Wolves!" โ3867 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"Voyage of Fortune's Star โ a 7th Sea cRPG" โ1287 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"K11 Bumper | iPhone 7 and 7 Plus" โ1032 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"{THE AND} Global Relationship Project" โ830 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"New55 COLOR 4x5 Peelapart Film" โ805 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"Russian Subway Dogs" โ778 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"Legrand Legacy โ An Homage to the JRPG Classics โ769 โ
โ(PC)" โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"Float The Boat - Save Mayflower II!" โ760 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโค
โ"Dallas Sucks: A Completely Biased Documentary" โ737 โ โ
...
Highest Average Pledge for Successful Projects
MATCH (n:Project:Successful) where n.backers > 0
RETURN n.name, toFloat(n.pledge) / n.backers as pledgeValue
ORDER BY pledgeValue desc limit 10;
Professional chefs cooking at your home, e-bikes, art-guitars, 3d-printersโฆ all expected but still pretty high pledges.
I think my highest was $750 for a laser cutter :)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโ
โ"n.name" โ"pledgeValue"โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโก
โ"Cooks at your home" โ6862 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"Project ArtGuitarยฎ features Palehorse" โ3900 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"Gocycle - the BEST folding electric bike โ2509 โ
โin the world!" โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"Futuristic Mechanical Tourbillon Watch Maโ2300 โ
โde Into Reality" โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"Tantrum Cycles, the Missing Link in full โ2079 โ
โsuspension bikes" โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"New Carbon SUV e-bike" โ2023 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโค
โ"RoVa4D Full Color Blender 3D Printer" โ1998 โ
Categories
Visual Exploration of Category Tree and Projects
MATCH path = (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)
RETURN path;
We can show the category tree and expand a few subcategories to see their projects (and their state) visually in one picture.
Top Categories and their State
MATCH (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)
<-[:IN_CATEGORY]-(p:Project)
RETURN c.name, p.state, count(*)
ORDER BY count(*) desc LIMIT 50;
Top Categories in Numbers of Projects
MATCH (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)
<-[:IN_CATEGORY]-(p:Project)
RETURN c.name, count(*)
ORDER BY count(*) desc LIMIT 10;
Kinda expected. Letโs look at the successful ones.
โโโโโโโโโโโโโโโโคโโโโโโโโโโโ
โ"c.name" โ"count(*)"โ
โโโโโโโโโโโโโโโโชโโโโโโโโโโโก
โ"Technology" โ2959 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Games" โ2889 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Film & Video"โ2673 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Design" โ2592 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Publishing" โ2353 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Music" โ2175 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Fashion" โ1630 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Food" โ1416 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Art" โ1355 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Comics" โ802 โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโ
Top Categories in Numbers of Successful Projects
MATCH (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)<-[:IN_CATEGORY]-(p:Project:Successful)
RETURN c.name, count(*)
ORDER BY count(*) desc LIMIT 10;
Not that many in tech or art, but many comics and design projects, and music also has an almost 50% success rate.
โโโโโโโโโโโโโโโโคโโโโโโโโโโโ
โ"c.name" โ"count(*)"โ
โโโโโโโโโโโโโโโโชโโโโโโโโโโโก
โ"Games" โ1108 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Design" โ1002 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Music" โ969 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Film & Video"โ939 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Publishing" โ859 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Technology" โ599 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Art" โ509 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Comics" โ465 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Fashion" โ426 โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Food" โ346 โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโ
Success Rate of Categories
MATCH (c:Category)<-[:IN_CATEGORY]-(n:SubCategory)<-[:IN_CATEGORY]-(p:Project)
WITH c.name as category, count(*) as total,
sum(case when p:Failed then 1 else 0 end) as failed,
sum(case when p:Successful then 1 else 0 end) as success
RETURN category, total, toFloat(failed)/total as fP, toFloat(success)/total as sP
order by sP desc;
We can calculate the totals and partials of failed and successful projects across the category tree and then sort by success rate.
Of course, this is skewed towards categories with fewer projects, as there is less competition and a few more successful projects have oversized impact.
โโโโโโโโโโโโโโโโคโโโโโโโโคโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโ
โ"category" โ"total"โ"fP" โ"sP" โ
โโโโโโโโโโโโโโโโชโโโโโโโโชโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโก
โ"Comics" โ802 โ0.3329177057356609 โ0.5798004987531172 โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Dance" โ195 โ0.40512820512820513โ0.5282051282051282 โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Theater" โ462 โ0.42207792207792205โ0.5021645021645021 โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Music" โ2175 โ0.4606896551724138 โ0.44551724137931037โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Design" โ2592 โ0.44753086419753085โ0.38657407407407407โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Games" โ2889 โ0.42990654205607476โ0.38352371062651436โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Art" โ1355 โ0.5328413284132841 โ0.3756457564575646 โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Publishing" โ2353 โ0.5316617084572886 โ0.36506587335316615โ
โโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Photography" โ529 โ0.5085066162570888 โ0.3648393194706994 โ
...
Querying Countries
Countries by Projects
MATCH (n:Country)
RETURN n.name,
size((n)<-[:IN_COUNTRY]-()) AS projects
ORDER BY projects DESC;
Surprised to see Canada and Australia so high up there, compared to their population, but the U.S. is leading by an order of magnitude.
โโโโโโโโโโโโโโโโโโคโโโโโโโโโโโ
โ"n.name" โ"projects"โ
โโโโโโโโโโโโโโโโโโชโโโโโโโโโโโก
โ"United States" โ14836 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"United Kingdom"โ2181 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Canada" โ1217 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Germany" โ652 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Australia" โ647 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Italy" โ454 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"France" โ436 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Spain" โ355 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Mexico" โ290 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Netherlands" โ240 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโโโโค
โ"Sweden" โ199 โ
Country Success Rate
MATCH (c:Country)<-[:IN_COUNTRY]-(p:Project)
WITH c.name as category, count(*) as total,
sum(case when p:Failed then 1 else 0 end) as failed,
sum(case when p:Successful then 1 else 0 end) as success
RETURN category, total, toFloat(failed)/total as fP,
toFloat(success)/total as sP
ORDER BY sP DESC;
But if you actually want to be successful, being in New Zealand, Singapore, Ireland, or Sweden helps :)
โโโโโโโโโโโโโโโโโโคโโโโโโโโคโโโโโโโโโโโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโ
โ"category" โ"total"โ"fP" โ"sP" โ
โโโโโโโโโโโโโโโโโโชโโโโโโโโชโโโโโโโโโโโโโโโโโโโโชโโโโโโโโโโโโโโโโโโโโก
โ"United Kingdom"โ2181 โ0.4667583677212288 โ0.41127922971114167โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"New Zealand" โ128 โ0.453125 โ0.3984375 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Singapore" โ140 โ0.5214285714285715 โ0.37857142857142856โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Ireland" โ95 โ0.5263157894736842 โ0.3684210526315789 โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"Sweden" โ199 โ0.46733668341708545โ0.36180904522613067โ
โโโโโโโโโโโโโโโโโโผโโโโโโโโผโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ"United States" โ14836 โ0.5184685899164195 โ0.35541925047182527โ
Conclusion
With the data in the AuraDB Free database you can now do a lot of things.
- Continue to explore and analyze the full dataset
- Add more data from other sources, e.g. the Kickstarter API to create a fuller knowledge graph
- Built a Kickstarter-like app, or an analytics dashboard, e.g. using GraphQL.
Have fun and happy coding! You can find the data and details from the past few weeks at the github repository for the stream.
Week 11 โ Importing and Querying Kickstarter Projects was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.