International Salmon Data Laboratory


(Scott Akenhead) #21

A search and rescue team for salmon data
I wrote to Matt Jones, a database guru at the National Center for Ecological Analysis and Synthesis (NCEAS), introducing ISDL and asking for advice, specifically about compliance with Darwin Core protocols and ontology as we build knowledge graphs and analyses to share across the environmental sector. ISDL has much to learn from the NCEAS project State of Alaska's Salmon and People (SASAP).

(Js) #22

Part of the challenge will be defining the standardization of the data. Darwin Core is a good starting point, but as I have been working through the transformation of the attached data, even the simple data sets feel like I'm fitting a square peg into a round hole. We still want to preserve the "natural" way of thinking about the data that Neo4j and a graph model encourage, so that the graph delivers its true value. What that will probably mean is that we will have to define/extend a core model to fit more naturally into the domain of analytics. Working through the NUSED data with its metadata really pushes the Darwin Core mapping to its limits.

I will share the data graphs on Graph Commons in the next day or two, to show what the graph containing the data looks like. Working through the NUSED data may take longer.

We should start thinking about working inside Neo4j itself soon and load data into a fully queryable instance. We can then query through the Neo4j desktop.

(Js) #23

NPAFC catch data. You can search for "NPAFC" to get a list of all the available public graphs.

(Js) #24

The beauty of graphs. This is the hatchery release data for US and Canada, NPAFC Hatchery Statistics Species Data - Canada, United States, graph 2b1841ac-73d1-4603-9492-2f281f375171.

(Scott Akenhead) #25

Be careful what you ask for
Bewilderingly fast development by unseen people with strange tools in unfamiliar territory. Seriously, who would ask for that? Me. Now I am pressing frantically on this button marked “Genius on Demand” but nothing is coming out of the machine (translation: struggling to keep up).

At this point, we need to define the components of a salmon graph: a set of ~15 "resources," with subtypes. Things like:
(:Organization {label:"NMFS", name:"National Marine Fisheries Service", type:"federal government"})
If, and how, these nodes and fields within nodes are compliant with Darwin Core is to be determined. A specific ISDL example would probably not involve all resources.

Knowledge: knowing how things are connected. Thus, a constrained set of links between these nodes. We do not need n(n+1)/2 types of edge (node-link-node) for n nodes. Nor links based on types within nodes. Some links are simple and widely used,
(:Place)-[:In]-(:Place), (:Organization)-[:In]-(:Organization).
Some are not,
(:Person)-[:HasOrganization {label:"employed by", job_description:"research biologist"}]-(:Organization)
Likely to see a large number of optional fields, with defaults, in all nodes and links. In addition to ID and date-stamps created by neo4j.
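A minimal sketch, in Python, of the "constrained set of links" idea: a small whitelist of link types between resource labels, checked before a pattern is rendered as Cypher text. The labels and link names come from the examples above. The names `ALLOWED_LINKS`, `render_props`, `render_node`, `render_link`, and `max_link_types` are hypothetical helpers for illustration, not part of Neo4j or Darwin Core, and the arrow direction on links is an assumption (the patterns above are written without one).

```python
import json

# Hypothetical whitelist of (start label, link type, end label) triples.
# Only these combinations are accepted; everything else is rejected.
ALLOWED_LINKS = {
    ("Place", "In", "Place"),
    ("Organization", "In", "Organization"),
    ("Person", "HasOrganization", "Organization"),
}


def render_props(props):
    """Render a dict as a Cypher-style property map, or '' if empty."""
    if not props:
        return ""
    # json.dumps gives double-quoted, escaped string literals.
    body = ", ".join(f"{key}:{json.dumps(value)}" for key, value in props.items())
    return " {" + body + "}"


def render_node(label, props=None):
    """Render a node pattern such as (:Organization {label:"NMFS"})."""
    return f"(:{label}{render_props(props)})"


def render_link(start, link, end, props=None):
    """Render a link pattern, refusing any triple outside the whitelist."""
    if (start, link, end) not in ALLOWED_LINKS:
        raise ValueError(f"link not in schema: ({start})-[:{link}]->({end})")
    return f"(:{start})-[:{link}{render_props(props)}]->(:{end})"


def max_link_types(n):
    """The n(n+1)/2 count of label pairs that the whitelist avoids."""
    return n * (n + 1) // 2


print(render_node("Organization", {"label": "NMFS", "type": "federal government"}))
print(render_link("Person", "HasOrganization", "Organization",
                  {"label": "employed by", "job_description": "research biologist"}))
print(max_link_types(15), "possible pairs versus", len(ALLOWED_LINKS), "allowed links")
```

This only builds text for inspection. Anything sent to a real database would be better passed as query parameters than pasted into strings.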

(Scott Akenhead) #26

Neo4j Sandbox - Needed Immediately
ISDL is happening. We have salmon datasets recast as graphs. A database schema is emerging. We are moving toward pipelines/workflows and analyses. ISDL needs a server and stack for neo4j.

And now a word from our sponsors!

(Js) #27

There is some affordance for this in Darwin Core with the RecordLevel class, but this class holds foreign keys or IDs for contributors etc. This can be expanded with relationships that point to more detailed information regarding the contributor or the organization. I'll add those relationships and nodes to the schema.
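One way to picture that expansion, as a hedged Python sketch: a flat record carries an identifier in the Darwin Core record-level term `institutionID`, and the identifier is swapped for a link to a fuller Organization node. The lookup table, the placeholder values, the names `ORGANIZATIONS` and `expand_record`, and the reuse of the `HasOrganization` link are all assumptions for illustration, not the schema actually being built.

```python
# Hypothetical lookup of organization details, keyed by a made-up identifier.
ORGANIZATIONS = {
    "org:nmfs": {
        "label": "NMFS",
        "name": "National Marine Fisheries Service",
        "type": "federal government",
    },
}


def expand_record(record, organizations):
    """Split a flat record into nodes and links.

    The institutionID foreign key is removed from the record and replaced
    by an Organization node plus a link pointing at it.
    Returns (nodes, links): nodes are (label, properties) pairs and
    links are (start label, link type, end label) triples.
    """
    fields = dict(record)  # copy, so the caller's record is left untouched
    org_id = fields.pop("institutionID", None)
    nodes = [("Record", fields)]
    links = []
    if org_id is not None:
        details = organizations.get(org_id, {})
        nodes.append(("Organization", {"id": org_id, **details}))
        links.append(("Record", "HasOrganization", "Organization"))
    return nodes, links


# Placeholder record, not real data.
record = {"datasetName": "example hatchery releases", "institutionID": "org:nmfs"}
nodes, links = expand_record(record, ORGANIZATIONS)
print(nodes)
print(links)
```

A record without an `institutionID` comes back as a single node with no links, so the extra detail stays optional.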

(Js) #28

Regarding the Neo4j sandbox: I believe the default sandbox expires after 3 days. Will we have access to a sandbox with more longevity for the data?

(Js) #29

Inspired by Mike Bostock's population pyramid, global salmon hatchery release data plotted as an interactive pyramid.

(Scott Akenhead) #30

2015 paper: Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data ... "The future, as data will continue to grow in size and complexity, lies in online analysis and storage for collaborative work, with methods that allow high interactivity with data for analysis and visualization." They see graph databases as the way out of the box. The real message is how quickly the technology moved.

(Scott Akenhead) #31

Video from my ISDL talk at GraphConnect.
Abstract and slides.
Announcement: first IYS Salmon Data Workshop, Vancouver BC, 2019-01-23/24
Ecologists to contribute datasets and say what they need, technologists then figure out how.

(Scott Akenhead) #32

Fretting about workflows that bounce between R and Neo4j.
Excited by Max De Marzi's talk at GraphConnect: decision trees and workflows WITHIN Neo4j. Big step forward, thanks!
"The code is the data" -- oh yeah, now we're getting somewhere.
Find the rest of the story at

(Scott Akenhead) #33

So on.
Confirmed: ISDL Workshop, 2019-01-25, 0800-1600, Pacific Salmon Commission Boardroom, 600-1155 Robson Street, Vancouver, BC, V6E 1B5.
That was the last obstacle. Oh yeah, it is on!
Attendance by Neo4j confirmed. Attendance by neo4j experts confirmed. Attendance by DFO database managers confirmed.

Definitely an interesting location. The Pacific Salmon Commission (USA + Canada) runs perhaps the most data-intensive -- and just plain intense -- fisheries management process on the planet. Consider: DNA samples from salmon catches are processed and returned as population composition in 36 to 48 hours, informing weekly decisions about which fisheries will open in which fishing areas.

This first ISDL workshop will consider graph-based solutions to problems of data integration, processing, and analysis expressed by salmon ecologists in a preceding two-day workshop, also sponsored by the International Year of the Salmon. "Designated survivors" of the preceding ecologists' workshop will attend, and may or may not survive this one.

(Scott Akenhead) #34

Salmon data integration is critical:
“We collated smolt survival and smolt-to-adult (marine) survival data for all regions of the Pacific coast of North America excluding California to examine the forces shaping salmon returns. A total of 3,055 years of annual survival estimates were available for Chinook (Oncorhynchus tshawytscha) and steelhead (O. mykiss). This dataset provides a fundamentally different perspective on west coast salmon conservation problems from the previously accepted view.”

(Scott Akenhead) #35

Building toward a graph db schema for salmon data integration. Leveraging and extending the Darwin Core ontology, an international standard. Thank you, John Song.

How will salmon ecologists react to the need for elaborate standards for the salmon knowledge graph and attendant glossary: barriers worth removing, or barriers to participation?
But, ...
Are you sure that my "1SW coho" is your "grilse"? They're both age 1.1 right? I mean age 2_1.

(Scott Akenhead) #36

You might imagine my excitement upon discovering that the Instituut voor Natuur- en Bosonderzoek in Brussels -- Institute for Nature and Forest Research -- records fish counts back to 1753! Alas, only 3 observations of Atlantic salmon: 2 in 2012, 1 in 2017. Which seems rather odd.


(Scott Akenhead) #37

WorkFlow, finally!
An extensive search revealed several promising workflow packages, but they were either (A) hopeless as tools for persons other than professional computer programmers/sysadmins, or (B) not being maintained and, frankly, old-fashioned. Australia to the rescue: thank you, CSIRO, for WorkSpace.
An excellent GUI for building workflows, and, to my delight, nested workflows. "Nesting does make working with complex workflow arrangements significantly easier."
I have pounced on this like a Velociraptor on a tourist.

(Karin Wolok) #38

Thanks for keeping everyone in the community updated on this project. This is awesome!!!! It's really cool to be able to see the process as it moves along. :smile: