International Salmon Data Laboratory

Offerings of Data #4
4. NPAFC Statistics: Pacific Salmonid Catch and Hatchery Release Data
The North Pacific Anadromous Fish Commission (US, RU, JP, SK, CA) catch and hatchery release data are publicly available.
https://npafc.org/statistics/
format: .xlsx pivot tables (note hidden rows)
metadata: See that web page.
source: North Pacific Anadromous Fish Commission npafc.org
contact: Dr. James R. Irvine, Fisheries and Oceans Canada, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9T 6N7, Canada. James.Irvine@dfo-mpo.gc.ca

Great to see this discussion. We salmon folk need data experts to help us understand and interpret salmon data. These datasets are currently trivial in size, but that is partly because so much of the metadata is not included. How can all this important ancillary information be included? That is where software platforms like Neo4j show promise.

Great to see Dr. Jim Irvine join this discussion. Jim is a distinguished research scientist at the Pacific Biological Station, Fisheries and Oceans Canada. Please reference his paper, the source of Offerings of Data #3, https://doi.org/10.1002/mcf2.10023, as:
Ruggerone, G.T., and J.R. Irvine. 2018. Numbers and Biomass of Natural- and Hatchery-Origin Pink Salmon, Chum Salmon, and Sockeye Salmon in the North Pacific Ocean, 1925–2015. Marine and Coastal Fisheries: Dynamics, Management, and Ecosystem Science 10:152-168. DOI: 10.1002/mcf2.10023

Offerings of Data #5
Returns and Spawners for Sockeye, Pink, and Chum Salmon from British Columbia
We assembled productivity (recruits per spawner) estimates for BC sockeye, pink, and chum salmon. Annual estimates by brood year of spawner numbers, catch, and population and age composition are in a simple database. Time series were organized by species, Conservation Unit, Pacific Fisheries Management Area, or aggregates of these. Three categories of data quality, unique to each data type, were determined and used to rate the recruit-per-spawner data, both year by year and across all years, for each time series. Our exploration of temporal changes in both field methods and data quality will help analysts interpret the reliability of the data and their results.

https://open.canada.ca/data/en/dataset/3d659575-4125-44b4-8d8f-c050d6624758
format: .csv
metadata: In the report.
source: Fisheries and Oceans Canada
citation: Ogden, A.D., Irvine, J.R., English, K.K., Grant, S., Hyatt, K.D., Godbout, L., and Holt, C.A. 2015. Productivity (Recruits-per-Spawner) data for Sockeye, Pink, and Chum Salmon from British Columbia. Can. Tech. Rep. Fish. Aquat. Sci. 3130: vi + 57 p.
ftp://ftp.meds-sdmm.dfo-mpo.gc.ca/pub/openData/Recruits_Spawner/Canadian_Technical_Report_3130.pdf
contact: Dr. James R. Irvine, Fisheries and Oceans Canada, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9T 6N7, Canada. James.Irvine@dfo-mpo.gc.ca
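
For readers new to the metric, here is a minimal sketch of how recruits per spawner by brood year can be derived from returns by age. The function name, the input layout, and the numbers are illustrative assumptions, not the report's method or data. It assumes "age" means total age in years, so a fish of age 4 returning in 2004 belongs to brood year 2000.

```python
from collections import defaultdict

def recruits_per_spawner(spawners, returns_by_age):
    """Return {brood_year: recruits / spawners}.

    spawners:        {brood_year: number of spawners}
    returns_by_age:  {(return_year, age): fish returning (catch + escapement)}
    A fish of total age `age` returning in `return_year` is assigned to
    brood year `return_year - age`.
    """
    recruits = defaultdict(float)
    for (return_year, age), count in returns_by_age.items():
        recruits[return_year - age] += count

    # Only brood years with a positive spawner estimate and at least one
    # recorded return get a ratio.
    return {
        brood_year: recruits[brood_year] / n
        for brood_year, n in spawners.items()
        if n > 0 and brood_year in recruits
    }

# Made-up numbers, for illustration only.
spawners = {2000: 1000, 2001: 500}
returns_by_age = {
    (2004, 4): 1500,  # brood year 2000
    (2005, 5): 500,   # brood year 2000
    (2005, 4): 250,   # brood year 2001
}
rps = recruits_per_spawner(spawners, returns_by_age)
```

A real calculation would also have to flag brood years whose older age classes have not yet returned, since their recruit totals are incomplete.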

To begin leveraging Neo4j for analyzing the salmon datasets, a core data schema will have to be established. Graph Commons provides a tool for collaborating on the schema design. Using the Darwin Core model to drive the initial design would be a good starting point. Once the core schema is generated, creating publishing pipelines into the schema from disparate datasets would be the next natural course of action. The Neo4j schema is forgiving and extensible, so the data model can evolve with deeper insights into the available datasets.
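
A publishing pipeline of that kind could start as small as the sketch below. It is only an illustration: the source column names, the relationship types (OF_TAXON, AT_LOCATION), and the sample row are assumptions made up for the example. The property names (scientificName, country, year, individualCount) are Darwin Core terms. The code prepares parameter dictionaries only; running the statement against a Neo4j instance is left to whatever client is in use.

```python
import csv
import io

# Hypothetical mapping from one source file's column names to Darwin Core terms.
COLUMN_TO_DWC = {
    "species": "scientificName",
    "year": "year",
    "country": "country",
    "catch_count": "individualCount",
}

# Hypothetical Cypher statement; labels follow Darwin Core classes, while the
# relationship types (OF_TAXON, AT_LOCATION) are invented for this sketch.
LOAD_STATEMENT = """
MERGE (t:Taxon {scientificName: $scientificName})
MERGE (l:Location {country: $country})
CREATE (o:Occurrence {year: $year, individualCount: $individualCount})
CREATE (o)-[:OF_TAXON]->(t)
CREATE (o)-[:AT_LOCATION]->(l)
"""

def to_dwc_params(csv_text):
    """Translate each CSV row into a parameter dict keyed by Darwin Core terms."""
    params = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {
            COLUMN_TO_DWC[column]: value
            for column, value in row.items()
            if column in COLUMN_TO_DWC
        }
        # CSV values arrive as text; store counts and years as integers.
        record["year"] = int(record["year"])
        record["individualCount"] = int(record["individualCount"])
        params.append(record)
    return params

# Made-up sample row, for illustration only.
SAMPLE = "species,year,country,catch_count\nOncorhynchus nerka,2015,Canada,1200\n"
batch = to_dwc_params(SAMPLE)
```

Using MERGE for taxa and locations keeps those nodes unique across datasets, so each new source only adds occurrences and links.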

I've created a schema graph at graphcommons.com. If you search for "Salmon Data Laboratory", it will pop up the canvas as a "Work in Progress". It is a public graph. Although the tool is for visualizing real data, it can be used to visualize a data schema model.

There may be better public tools out there, but this seems to have all the features needed for collaborating on a public graph.

GraphCommons
A step forward. Thanks @js1
Alas, not visible today.
I will try to find collaborators re schema and the objects (lists of named fields) attached to nodes and links.
Maybe we should split the problem(s) into small and specific examples.

Example #1
Biologists and DB managers at the Pacific Biological Station are thinking about expanding a small fraction of the NuSEDS dataset (Offerings #1) by adding the raw sampling data behind each annual estimate, as a graph.

ISDL Documents
Our thinking, planning, and projects should be an ongoing, shared write-up in addition to this discussion. I started that HERE. Anyone with the link should add / edit. I don't like "shoulds" but, please, that would be helpful.

Example #2
What do you think of converting the Darwin Core to a graph? As a resource to:

  • standardize data and practices
  • tag nodes and links
  • link nodes by concepts

I will start defining the Darwin Core Simple Model in the Salmon Data Laboratory Graph and attach edges to show how the data may be related. I'll then ingest some of the attached data into a real graph and share that out to show how the data can be fit into the Darwin model. I may get some of the biological terms wrong to start, but it'll be a good learning experience and nothing that can't be remedied with a simple update. Up to this point, salmon for me was something on a menu, not something with Latin and Greek names. :slight_smile:

Darwin core classes have been added to the graph commons canvas. First stab at relating the different Darwin Core classes via edges.

International Year of Salmon (IYS) Workshop on Salmon Status and Trends
2019-01-21/22, Vancouver BC Canada

Problem:
In recent decades, the productivity of salmon has become increasingly uncertain with many populations experiencing extremes related to timing, abundance and size with serious social, economic and conservation impacts. Much has been learned in previous workshops but a lack of consistency in approaches to categorize biological status and trends, terminology to indicate status, requirements and standards for different types of data, spatial and temporal scales for comparison and aggregation, and ways of communicating findings significantly impedes the timeliness, efficiency and effectiveness of scientific investigations. Our data systems do not match our technological capacity and social/scientific inquiry needs. Many agencies have a commitment to open data but are challenged to achieve it given the significant costs associated with bringing historic data online. Increasing variability in a rapidly changing environment demands rapid access to integrated data for comparative and mechanistic studies of the distribution and productivity of salmon across life history stages and associated eco-regions.

Solution: (extracts thereof)
The primary goal of this workshop will be to identify a series of legacy datasets and standards associated with major categories of data. These datasets will be the focus of subsequent analytical workshops; additional workshops may concentrate on communication of scientific results. Ideally these separate but linked workshops will [cycle and grow]* over the course of IYS.
* but not yet funded.

Complete announcement HERE.

Thanks! Our first graph. See figure below. More to this than meets the eye. Extensive properties for each "resource" (type of node). That was a lot of work.
"Come Watson! The game is afoot."

Now I/we have to figure out how to create a lot of instances of each resource, so:

  1. Created Google Sheets corresponding to this schema, HERE.
  2. Wrote an R script to create (empty) data.frames for this schema, HERE

Darwin Core is for museum catalogues, not for organizing salmon datasets, practices, analyses, workflows, and data products, nor the surrounding people and projects. But:

  1. we do not need to specify every field in each resource.
  2. we must add new resources.
    Obviously: Person, Activity, Organization.
    Less so: WorkFlow, Model, DataSet, Practice (?)
  3. Helpful to future users if we predefine layers (ontologies, networks) such as Place-[In]-Place, Organization-[In]-Organization, Practice-[In]-Practice.
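
To make the idea of a predefined layer concrete, here is a small sketch of what a Place-[In]-Place layer buys: once containment links exist, every enclosing place can be found by walking the chain. The dictionary layout and function name are assumptions for illustration; in Neo4j the same walk would be a variable-length path query.

```python
# Hypothetical (child -> parent) pairs standing in for (:Place)-[:In]->(:Place) links.
PLACE_IN = {
    "Nanaimo": "Vancouver Island",
    "Vancouver Island": "British Columbia",
    "British Columbia": "Canada",
}

def containing_places(place, parent_of):
    """List every place that contains `place`, nearest first."""
    chain = []
    seen = {place}
    while place in parent_of:
        place = parent_of[place]
        if place in seen:  # guard against accidental cycles in the layer
            raise ValueError(f"cycle detected at {place!r}")
        seen.add(place)
        chain.append(place)
    return chain

ancestors = containing_places("Nanaimo", PLACE_IN)
```

The same function would serve an Organization-[In]-Organization or Practice-[In]-Practice layer unchanged, which is the argument for defining these layers once and reusing them.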

I conclude we are going to have to build our own Salmon Ontology. It will not be small.
This needs to be a group grope, but turning into a standards committee is a trap. We need "good enough to proceed." The Nazi Librarian ("no data for you!") will have her/his/its day.
Another Google Sheet for that. Stand by!

A search and rescue team for salmon data
https://www.nceas.ucsb.edu/news/a-search-and-rescue-team-for-salmon-data
I wrote to Matt Jones, a database guru at the National Center for Ecological Analysis and Synthesis (NCEAS), introducing ISDL and asking for advice, specifically about compliance with Darwin Core protocols and ontology as we build knowledge graphs and analyses to share across the environmental sector. ISDL has much to learn from the NCEAS project State of Alaska's Salmon and People (SASAP).

Part of the challenge will be defining the standardization of the data. Darwin Core is a good starting point, but as I have been working through the transformation of the attached data, even the simple data sets feel like fitting a square peg into a round hole. We still want to preserve the "natural" way of thinking about the data that Neo4j and a graph model encourage, so that the graph can deliver its true value. That will probably mean we have to define or extend a core model that fits more naturally into the domain of analytics. Working through the NuSEDS data with its metadata really pushes the Darwin Core mapping to its limits.

I will share the data graphs on Graph Commons in the next day or two to show what the graph containing the data looks like. Working through the NuSEDS data may take longer.

We should start thinking about working inside Neo4j itself soon and load data into a fully queryable instance. We can then query through the Neo4j desktop.

NPAFC catch data. You can search for "NPAFC" to get a list of all the available public graphs.

The beauty of graphs. This is the hatchery release data for US and Canada, NPAFC Hatchery Statistics Species Data - Canada, United States, graph 2b1841ac-73d1-4603-9492-2f281f375171.

Be careful what you ask for
Bewilderingly fast development by unseen people with strange tools in unfamiliar territory. Seriously, who would ask for that? Me. Now I am pressing frantically on this button marked “Genius on Demand” but nothing is coming out of the machine (translation: struggling to keep up).

At this point, we need to define the components of a salmon graph: a set of ~15 "resources," with subtypes. Things like:
(:Organization {label:"NMFS", name:"National Marine Fisheries Service", type:"federal government"})
If, and how, these nodes and fields within nodes are compliant with Darwin Core is to be determined. A specific ISDL example would probably not involve all resources.

Knowledge: knowing how things are connected. Thus, a constrained set of links between these nodes. We do not need n(n+1)/2 types of edge (node-link-node) for n nodes. Nor links based on types within nodes. Some links are simple and widely used,
(:Place)-[:In]-(:Place), (:Organization)-[:In]-(:Organization).
Some are not,
(:Person)-[:HasOrganization {label:"employed by", job_description:"research biologist"}]-(:Organization)
We are likely to see a large number of optional fields, with defaults, in all nodes and links. This is in addition to the ID and date-stamps created in Neo4j.
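
One way to picture a constrained set of links is as a whitelist of (node type, link type, node type) triples that every proposed edge is checked against. The sketch below is an assumption-laden illustration: the three triples are the ones mentioned in this post, while the function names and the whitelist itself are hypothetical and not an agreed ISDL schema.

```python
# Hypothetical whitelist of permitted (source type, link type, target type) triples.
ALLOWED_LINKS = {
    ("Place", "In", "Place"),
    ("Organization", "In", "Organization"),
    ("Person", "HasOrganization", "Organization"),
}

def max_link_types(n):
    """Count unordered pairs of node types, self-pairs included: n(n+1)/2."""
    return n * (n + 1) // 2

def is_allowed(source_type, link_type, target_type):
    """True if the schema permits this kind of edge."""
    return (source_type, link_type, target_type) in ALLOWED_LINKS

# With about 15 resources, an unconstrained schema could need this many edge types.
upper_bound = max_link_types(15)

ok = is_allowed("Person", "HasOrganization", "Organization")
rejected = is_allowed("Place", "In", "Person")
```

With 15 resources the upper bound is 120 edge types, so even a whitelist of a few dozen triples is a substantial simplification.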

Neo4j Sandbox - Needed Immediately
ISDL is happening. We have salmon datasets recast as graphs. A database schema is emerging. We are moving toward pipelines/workflows and analyses. ISDL needs a server and stack for Neo4j.

And now a word from our sponsors!

There is some affordance for this in Darwin Core with the RecordLevel class, but this class holds foreign keys or IDs for contributors etc. This can be expanded with relationships that point to more detailed information regarding the contributor or the organization. I'll add those relationships and nodes to the schema.

As for the Neo4j sandbox: I believe the default sandbox expires after 3 days. Will we have access to a sandbox with more longevity for the data?

Inspired by Mike Bostock's population pyramid, global salmon hatchery release data plotted as an interactive pyramid. https://isdl-220916.appspot.com/release