I'm glad that you enjoyed the talk.
This is a great question:
A "traditional" dataset would be a table of training data points - to achieve that you typically have to flatten the real data e.g. from a relational database or a graph like neo4j into a single table.
Imagine a hypothetical experiment using a social graph where I try to predict people's favourite music genre based on their friends (at least those friends whose favourite music genre we already know).
The flattened traditional dataset for ML would be a table like:
Target Person, Friend_1 Favourite Genre, Friend_2 Favourite Genre, Friend_3 Favourite Genre...
Not everyone has the same number of friends, so you have to come up with some way of inserting null values, ignoring extra friends, or dropping people who have too few friends.
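As a concrete illustration, here's a minimal sketch (in Python, using networkx and pandas, on a made-up toy graph) of what that flattening and padding might look like - just an illustration, not the exact setup from the talk:

```python
import networkx as nx
import pandas as pd

# Toy social graph (hypothetical data): nodes are people,
# "genre" is their favourite music genre where known.
G = nx.Graph()
G.add_nodes_from([
    ("alice", {"genre": "rock"}),
    ("bob",   {"genre": "jazz"}),
    ("carol", {"genre": "rock"}),
    ("dave",  {"genre": None}),   # unknown - this is what we want to predict
])
G.add_edges_from([("dave", "alice"), ("dave", "bob"),
                  ("alice", "bob"), ("alice", "carol")])

MAX_FRIENDS = 3  # arbitrary cut-off: extra friends get dropped, missing ones become nulls

rows = []
for person in G.nodes:
    genres = [G.nodes[f]["genre"] for f in G.neighbors(person)]
    genres = genres[:MAX_FRIENDS]                    # ignore "extra" friends
    genres += [None] * (MAX_FRIENDS - len(genres))   # pad with nulls
    rows.append({"person": person,
                 **{f"friend_{i+1}_genre": g for i, g in enumerate(genres)}})

df = pd.DataFrame(rows)
print(df)
```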
Lots of information is also lost.
For example, friends with whom you share many mutual friends might be much more influential than friends with whom you share few.
Many ML models treat each column in the dataset independently, so it's harder for the model to learn that all friends should be treated the same way.
You could try to overcome some of these problems by adding more features to your tabular dataset:
Target Person, Friend_1 Favourite Genre, Friend_1 count(shared friends), Friend_2 Favourite Genre, Friend_2 count(shared friends), Friend_3 Favourite Genre, Friend_3 count(shared friends)...
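Continuing the toy sketch above, those extra shared-friends columns could be computed with something like this (the column names and cut-off are still made up):

```python
rows = []
for person in G.nodes:
    row = {"person": person}
    for i, friend in enumerate(list(G.neighbors(person))[:MAX_FRIENDS]):
        shared = len(list(nx.common_neighbors(G, person, friend)))
        row[f"friend_{i+1}_genre"] = G.nodes[friend]["genre"]
        row[f"friend_{i+1}_shared_friends"] = shared
    rows.append(row)

df = pd.DataFrame(rows)  # columns missing for people with fewer friends end up as NaN
print(df)
```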
But which features should we add? Each column is still treated as independent, and we still have to populate null values somehow. This approach is a lot of work when we could just train the model on the graph itself, which contains all the information and lets the model learn to treat all entities of the same kind (friends) in the same way.
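To make the contrast concrete, here's a minimal sketch of the graph-native alternative, using PyTorch Geometric as one possible library (not necessarily what was used in the talk); the graph, features, and labels are all made up:

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Hypothetical toy graph: 4 people, friendships listed in both directions.
edge_index = torch.tensor([[0, 1, 0, 2, 0, 3, 1, 2],
                           [1, 0, 2, 0, 3, 0, 2, 1]], dtype=torch.long)
x = torch.eye(4)                                      # trivial one-hot node features
y = torch.tensor([0, 1, 0, 0])                        # favourite-genre class per person
train_mask = torch.tensor([True, True, True, False])  # person 3's genre is "unknown"

data = Data(x=x, edge_index=edge_index, y=y)

class GenreGCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Message passing: every neighbour is aggregated by the same learned function,
        # so all friends are treated the same way and there's nothing to pad.
        self.conv1 = GCNConv(4, 8)
        self.conv2 = GCNConv(8, 2)  # 2 genre classes in this toy example

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GenreGCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data)[train_mask], data.y[train_mask])
    loss.backward()
    optimizer.step()

print(model(data).argmax(dim=1))  # predicted genre for everyone, including the "unknown" person
```

The edge list replaces all of the Friend_N columns: the model sees the whole neighbourhood, however many friends each person has, with no padding and no arbitrary feature engineering.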