We are currently enriching our knowledge graph with NLP-derived metadata about the content on GOV.UK. See this blog post for context.
Our desired outcome is that each piece of content (a page on GOV.UK, which we label a Cid; roughly 500k pages) will have a relationship to each of the entities it contains or mentions in its text: Organisations, People, Dates, Services and so on.
We use a modified version of BERT (trained on our data; we call it GovNER) to identify entities within each page. The output is a CSV that looks like this (I've removed some entities and kept only one row for readability):
base_path,entities,updated_at,govner_version
/government/news/new-digital-resource-for-charity-trustees-launched,"[{""end"": 144, ""entity"": ""charity"", ""entity_type"": ""ORGANIZATION"", ""start"": 137}, {""end"": 2788, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 2778}, {""end"": 4155, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 4145}, {""end"": 4822, ""entity"": ""Commission"", ""entity_type"": ""ORGANIZATION"", ""start"": 4812}, {""end"": 1557, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 1548}, {""end"": 2175, ""entity"": ""email"", ""entity_type"": ""CONTACT"", ""start"": 2166}, {""end"": 2847, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 2838}, {""end"": 2979, ""entity"": ""questions"", ""entity_type"": ""CONTACT"", ""start"": 2970}]",2020-10-15 17:56:30.955360,0.1
The base_path is unique for each Cid, so we can use it to look up our Cids. We would then like to extract the data from the entities column, which is JSON. As you can see, there are a variety of entity_types, and these correspond to different node labels. We need to iterate through the entities, create a labelled node for each one if it doesn't already exist, and then create a relationship between the Cid and that entity. We would also like to store the location of each entity in the text (start, end) as properties on the edge (am I right in thinking that in 3.5.16 you can only store one property on an edge?).
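To make that concrete, here is a rough sketch of the pattern we have in mind for a single mention (the :Organisation label, the MENTIONS relationship type and the name property are placeholders; in practice the label would be driven by entity_type):

// Hypothetical sketch for one page and one ORGANIZATION mention.
// $base_path, $entity, $start and $end stand in for values taken
// from one element of the entities JSON.
MATCH (c:Cid {base_path: $base_path})
MERGE (o:Organisation {name: $entity})
CREATE (c)-[:MENTIONS {start: $start, end: $end}]->(o)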
Historically, we've used Python to get our data into a simple nodelist and edgelist format, with one CSV per node label and one CSV per relationship type. We'd normally do the data wrangling in Python and then run a simple LOAD CSV in Cypher to load each of the different node labels as the graph gets built. This results in quite a few intermediate CSVs. Is there a cleaner way to do this using just Cypher and/or APOC?
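For illustration, each per-label load we run today looks something like this (the file name and column are made up for the example):

// Illustrative only: one of the per-label node loads, run against a
// nodelist CSV produced by the Python wrangling step.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///organisation_nodes.csv' AS row
MERGE (:Organisation {name: row.name});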
We've not reviewed the best way to do this for about a year and are aware that things might have moved on with Cypher and APOC.
We are using Neo4j Community Edition 3.5.16.
