I wrote to Matt Jones, a database guru at the National Center for Ecological Analysis and Synthesis (NCEAS), introducing ISDL and asking for advice, specifically about compliance with Darwin Core protocols and ontology as we build knowledge graphs and analyses to share across the environmental sector. ISDL has much to learn from the NCEAS project State of Alaska's Salmon and People (SASAP).
Part of the challenge will be defining the standardization of the data. Darwin Core is a good starting point, but as I have been working through the transformation of the attached data, even the simple data sets feel like I'm fitting a square peg into a round hole. We still want to preserve the "natural" way of thinking about the data that Neo4j and a graph model encourage, so that the graph can show its true value. What that will probably mean is that we will have to define/extend a core model to fit more naturally into the domain of analytics. Working through the NUSED data with its metadata really pushes the Darwin Core mapping to its limits.
In the next day or two I will share the data graphs on Graph Commons to show what the graph containing the data looks like. Working through the NUSED data may take longer.
We should start thinking about working inside Neo4j itself soon and loading data into a fully queryable instance. We can then query through Neo4j Desktop.
The beauty of graphs. This is the hatchery release data for US and Canada, NPAFC Hatchery Statistics Species Data - Canada, United States, graph 2b1841ac-73d1-4603-9492-2f281f375171.
Be careful what you ask for
Bewilderingly fast development by unseen people with strange tools in unfamiliar territory. Seriously, who would ask for that? Me. Now I am pressing frantically on this button marked “Genius on Demand” but nothing is coming out of the machine (translation: struggling to keep up).
At this point, we need to define the components of a salmon graph: a set of ~15 "resources," with subtypes. Things like: (:Organization {label:"NMFS", name:"National Marine Fisheries Service", type:"federal government"})
If, and how, these nodes and fields within nodes are compliant with Darwin Core is to be determined. A specific ISDL example would probably not involve all resources.
Knowledge: knowing how things are connected. Thus, a constrained set of links between these nodes. We do not need n(n+1)/2 types of edge (node-link-node) for n nodes, nor links based on types within nodes. Some links are simple and widely used: (:Place)-[:In]-(:Place), (:Organization)-[:In]-(:Organization).
Some are not: (:Person)-[:HasOrganization {label:"employed by", job_description:"research biologist"}]-(:Organization)
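A minimal sketch of what "a constrained set of links" could look like in practice, written in Python so it can be tried without a running database. The whitelist contents, the function names, and the choice to render a directed arrow are all assumptions for illustration; only the In and HasOrganization link names and the three node labels come from the post above.

```python
import json

# Whitelist of (start label, link type, end label).
# Hypothetical: only these three triples are taken from the post.
ALLOWED_LINKS = {
    ("Place", "In", "Place"),
    ("Organization", "In", "Organization"),
    ("Person", "HasOrganization", "Organization"),
}


def max_link_types(n: int) -> int:
    """Unordered pairs of n node types, self-pairs included: n(n+1)/2."""
    return n * (n + 1) // 2


def is_allowed(start: str, link: str, end: str) -> bool:
    """True if this node-link-node triple is in the constrained set."""
    return (start, link, end) in ALLOWED_LINKS


def link_pattern(start: str, link: str, end: str, **props) -> str:
    """Render an allowed link as a Cypher-style pattern string.

    The arrow direction (start -> end) is an assumption; the post
    writes its patterns without a direction.
    """
    if not is_allowed(start, link, end):
        raise ValueError(f"link not in the constrained set: {start}-{link}-{end}")
    prop_text = ", ".join(f"{k}: {json.dumps(v)}" for k, v in props.items())
    rel = f":{link} {{{prop_text}}}" if props else f":{link}"
    return f"(:{start})-[{rel}]->(:{end})"


if __name__ == "__main__":
    # With ~15 resources, one link type per pair would already be a lot.
    print(max_link_types(15))
    print(link_pattern("Person", "HasOrganization", "Organization",
                       label="employed by",
                       job_description="research biologist"))
```

Keeping the whitelist as plain data means the same table could later drive validation in a loading pipeline, whatever form that pipeline takes.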
Likely to see a large number of optional fields, with defaults, in all nodes and links, in addition to the ID and date-stamps created by Neo4j.
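One way to picture "optional fields, with defaults" is a small Python dataclass per resource, rendered to a Cypher-style node pattern. This is only a sketch: the default value "unknown", the rule that unset fields are dropped, and the helper name node_pattern are assumptions, not part of any agreed schema. The Organization fields mirror the NMFS example given earlier.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Organization:
    label: str                   # required short label, e.g. "NMFS"
    name: Optional[str] = None   # optional; omitted from the node when unset
    type: str = "unknown"        # optional, with a hypothetical default


def node_pattern(node) -> str:
    """Render a dataclass instance as a Cypher-style node pattern string.

    Fields left as None are dropped; fields with defaults are kept.
    """
    props = {k: v for k, v in asdict(node).items() if v is not None}
    body = ", ".join(f"{k}: {json.dumps(v)}" for k, v in props.items())
    return f"(:{node.__class__.__name__} {{{body}}})"


if __name__ == "__main__":
    print(node_pattern(Organization(
        label="NMFS",
        name="National Marine Fisheries Service",
        type="federal government",
    )))
    # Only the required field supplied; the default fills in the rest.
    print(node_pattern(Organization(label="DFO")))
```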
Neo4j Sandbox - Needed Immediately
ISDL is happening. We have salmon datasets recast as graphs. A database schema is emerging. We are moving toward pipelines/workflows and analyses. ISDL needs a server and stack for Neo4j.
There is some affordance for this in Darwin Core with the RecordLevel class, but this class holds foreign keys or IDs for contributors etc. This can be expanded with relationships that point to more detailed information regarding the contributor or the organization. I'll add those relationships and nodes to the schema.
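A rough sketch of that expansion, in Python: a flat record carries record-level values such as institutionCode, and those values are pulled out into links that point at Organization nodes. institutionCode and rightsHolder are Darwin Core record-level terms; the link type names (HeldBy, RightsHeldBy), the function name, and the sample record are made up for illustration and are not the schema being proposed.

```python
# Record-level terms treated here as references to an organization.
# The term names are Darwin Core; the link types are hypothetical.
REFERENCE_TERMS = {
    "institutionCode": "HeldBy",
    "rightsHolder": "RightsHeldBy",
}


def expand_record(record: dict):
    """Split a flat record into its own properties plus organization links.

    Returns (properties, links), where each link is a tuple of
    (link type, properties of the Organization node it points to).
    """
    properties = {k: v for k, v in record.items() if k not in REFERENCE_TERMS}
    links = [
        (link_type, {"label": record[term]})
        for term, link_type in REFERENCE_TERMS.items()
        if record.get(term)  # skip terms that are missing or empty
    ]
    return properties, links


if __name__ == "__main__":
    # Made-up example record, not taken from any ISDL dataset.
    sample = {
        "occurrenceID": "example-1",
        "scientificName": "Oncorhynchus kisutch",
        "institutionCode": "NMFS",
    }
    props, links = expand_record(sample)
    print(props)
    print(links)
```

The Organization node at the far end of each link can then carry the fuller detail (name, type, and so on) that a bare code in the record cannot.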
Inspired by Mike Bostock's population pyramid, here are the global salmon hatchery release data plotted as an interactive pyramid. https://isdl-220916.appspot.com/release
Fretting about workflows that bounce between R and Neo4j.
Excited by Max De Marzi's talk at GraphConnect: decision trees and workflows WITHIN Neo4j. Big step forward, thanks!
"The code is the data" -- oh yeah, now we're getting somewhere.
Find the rest of the story at decision tree | Max De Marzi
So on.
Confirmed: ISDL Workshop, 2019-01-25, 0800-1600, Pacific Salmon Commission Boardroom, 600-1155 Robson Street, Vancouver, BC, V6E 1B5.
That was the last obstacle. Oh yeah, it is on!
Attendance by Neo4j confirmed. Attendance by neo4j experts confirmed. Attendance by DFO database managers confirmed.
Definitely an interesting location. The Pacific Salmon Commission (USA + Canada) runs perhaps the world's most data intensive -- and just plain intense -- fisheries management process. Consider: DNA samples from salmon catches are processed and returned as population composition in 36 to 48 hours, informing weekly decisions about which fisheries will open in which fishing areas.
This first ISDL workshop will consider graph-based solutions to problems of data integration, processing, and analysis expressed by salmon ecologists in a preceding two-day workshop, also sponsored by the International Year of the Salmon. "Designated survivors" of the preceding ecologists' workshop will attend, and may or may not survive this one.
Building toward a graph db schema for salmon data integration. Leveraging and extending the Darwin Core ontology, an international standard. Thank you, John Song.
How will salmon ecologists react to the need for elaborate standards for the salmon knowledge graph and attendant glossary: barriers worth removing, or barriers to participation?
But, ... Are you sure that my "1SW coho" is your "grilse"? They're both age 1.1 right? I mean age 2_1.
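To make the glossary problem concrete, here is a small Python sketch that converts between two common salmon age notations. It assumes the usual definitions: European notation "f.m" counts freshwater winters after emergence and marine winters; Gilbert-Rich notation "T_s" gives total age counted from egg deposition and the year of life in which the fish went to sea. Under those assumptions T = f + m + 1 and s = f + 1. The function names are mine.

```python
def european_to_gilbert_rich(age: str) -> str:
    """Convert European 'f.m' to Gilbert-Rich 'T_s' (assumed definitions)."""
    fw, sw = (int(x) for x in age.split("."))
    return f"{fw + sw + 1}_{fw + 1}"


def gilbert_rich_to_european(age: str) -> str:
    """Convert Gilbert-Rich 'T_s' back to European 'f.m'."""
    total, smolt_year = (int(x) for x in age.split("_"))
    return f"{smolt_year - 1}.{total - smolt_year}"


if __name__ == "__main__":
    print(european_to_gilbert_rich("1.1"))
    print(gilbert_rich_to_european("2_1"))
```

Under these definitions, age 1.1 comes out as 3_2, and 2_1 comes out as 0.1, so the two labels in the question above are not the same fish. That is exactly the kind of slip a shared glossary is meant to catch.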
You might imagine my excitement upon discovering that the Instituut voor Natuur- en Bosonderzoek in Brussels -- Institute for Nature and Forest Research -- records fish counts back to 1753! Alas, only 3 observations of Atlantic salmon: 2 in 2012, 1 in 2017. Which seems rather odd.
WorkFlow, finally!
An extensive search revealed several promising workflow packages, but they were either (A) hopeless as tools for persons other than professional computer programmers/sysadmins, or (B) not being maintained and, frankly, old-fashioned. Australia to the rescue: thank you, CSIRO, for Workspace.
An excellent GUI for building workflows, and, to my delight, nested workflows. "Nesting does make working with complex workflow arrangements significantly easier."
I have pounced on this like a Velociraptor on a tourist.
Thanks for keeping everyone in the community updated on this project's progress. This is awesome!!!! It's really cool to be able to see the process as it moves along.
Thanks for this. I am very interested in what happened with this project, whether it's still ongoing, and how I can help, @js1 @ScottAkenhead. I was reading through @js1's Google Doc where he mentions the Cal Academy of Sciences data begging to be a graph, and that is exactly what I'd like to do. I am new to graphs and Neo4j but very willing to do the work and learn. Any pointing in the right direction is appreciated.
This first update since 2018-12 is to (a) tell you this project has been steadily developing, (b) point to some presentations and papers about those developments, and (c) apologize for rudely ignoring the 3,600 viewers of this topic.
– The January 2019 workshop, International Salmon Data Laboratory, was part of the North Pacific Anadromous Fish Commission (NPAFC.org) program International Year of the Salmon. See NPAFC Technical Report #14, HERE, with presentation slides but not videos.
A review of progress 2017-2021 is in this 19 video "The Salmon Of Knowledge." I expected to deliver this in Hakodate Japan in May 2020, but actually delivered in my kitchen May 2021 because of expletive deleted COVID-19. See also the 1,500 word extended abstract
– Lastly, today we presented this project at the webinar Neo4j Connections - A Virtual Event: Accelerating Innovation with Graphs, Wed, Aug 25, 2021. If our video is posted (YouTube?) I will let you know.