International Salmon Data Laboratory

graphs4good

(Scott Akenhead) #1

The International Year of the Salmon involves a dozen countries with one overarching goal: salmon that are resilient to climate change. Survive surprise. Salmon uniquely integrate terrestrial and marine effects of global warming, so they are the canary in the coal mine.
To meet this challenge, salmon biologists need new tools for data assembly, analysis, and visualization; tools that are powerful but safe, effective but easy-to-use, and rewarding. That's our challenge. This project intends to:
(a) develop tools and workflows for radically better information flow and knowledge management—new best practices;
(b) mobilize and analyze data, so we actually know what we should know from the data and metadata we have collected;
(c) deliver irresistible examples, ones that are easy and safe to use, of workflows from field data to decision-support products. Without effectively communicating the knowledge obtained, what's the point?
(d) help salmon biologists leapfrog to 2022 technology by 2022.

We will do this by leveraging the power of Neo4j and new tools built on Neo4j. Your skills, and the tools you are helping to develop, can have impacts throughout the environmental sciences. "The revolution begins with salmon" -- because focus enables excellence, and excellence leads to adoption -- but it will quickly spread via cogent and rewarding examples. Biologists have begun to share examples of datasets for this work. About 20 people at GraphConnect offered expert assistance. Neo4j, the company, is behind this (thank you, Jeff Morris!). We are connecting right here, right now.
On your mark, get set... Bang!

"... People don't know what they want until you show it to them." - Steve Jobs, 1997.


(Scott Akenhead) #2

Woo hoo! Discovered that Jeff Altman (graph fan, Denver, CO) has PepperSlice, a data-to-decisions workflow tool. That addresses a big chunk of the underlying problem broached by ISDL: evolving best appropriate practices.
There you go, immediate value from this collaboration portal.


(Karin Wolok) #3

That makes us soooo happy! :smiley:


(Jeff) #4

Hey Scott - Thanks for the shout out. Sign up for a free PepperSlice account at https://pepperslice.com/


(Scott Akenhead) #5

ISDL Architecture
@tom.geudens asked me the following questions (I abbreviated) ... fortunately, he explained that "I have no clue" was an acceptable answer. Phew!
(answers revised 2018-09-28)

(1) size and growth prediction for the data?
The largest datasets are tables of about 10,000 rows of 10 variables. Size is not the problem. We will have at most 20 of these in 2018, perhaps ~200 by 2022. Some are smaller but with more variables, and need to be linked to preceding data. The problems are assembly (access), standardization (an ontology is required), cleaning (synonyms, classification of practices, etc.), and finally integration. I will post links to data as it becomes available (tumbling in as I write).
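The cleaning step mentioned above, mapping synonym spellings to canonical names, is mechanical once a synonym table exists. A minimal sketch, assuming a hypothetical synonym table (the real ontology does not exist yet, so these canonical names and variants are illustrative only):

```python
import csv
import io

# Hypothetical synonym table: illustrative only, not from any ISDL ontology.
SPECIES_SYNONYMS = {
    "oncorhynchus nerka": "Sockeye",
    "sockeye": "Sockeye",
    "red salmon": "Sockeye",
    "oncorhynchus gorbuscha": "Pink",
    "pink": "Pink",
    "humpback salmon": "Pink",
}

def standardize_species(rows, column="SPECIES"):
    """Replace synonym spellings in one column with canonical names;
    collect anything unrecognized for manual review."""
    unknown = set()
    for row in rows:
        key = row[column].strip().lower()
        if key in SPECIES_SYNONYMS:
            row[column] = SPECIES_SYNONYMS[key]
        else:
            unknown.add(row[column])  # flag for manual review
    return rows, unknown

# Example with an in-memory CSV standing in for a real data file.
raw = "SPECIES,COUNT\nred salmon,120\nPink,45\nCoho,10\n"
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned, unknown = standardize_species(rows)
```

The point of returning the `unknown` set is that cleaning should never silently drop or guess: unrecognized values go back to a biologist.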

(2) Will ISDL be public facing? If so, in what way: application, Neo4j browser connection,?
Yes. I want many people to see the examples. Maybe webpages via Structr, or something simpler to start. It should be possible for people to look closely (Bloom, Browser) without being able to break anything.

(3) Will some persons have different access to the database (i.e., Bloom)? Is that a separate design requirement?
Inevitable. Developers have more permissions than known users (login), who have more permissions than the public.

(4) How do you see security?
Lax, actually. Users (login/pw) will generally be previously known, and we simply boot malfeasants. The data is of value to a very limited community. Providers can have a copyright (attribution is critical).
I want people to steal the examples! To re-apply them, they will, of course, need SW licenses. I think 200 users would be a howling success. 2,000 means we fell through a wormhole.

(5) Query load ?
Scant, certainly to start. In the happy event that I am wrong, use will ramp up slowly and we will react.

(6) How do you envision uploading of data? Small transactional size datapoints, gigantic bulk sets of datapoints, or in between?
The data will arrive as .csv tables. ISDL is about tools. Other, bewildering, people want to maintain a jumble of jumbo datasets.

(7) Should data contributions be verified first?
Yes, but no. I cannot imagine a user stuffing illicit or broken data into an existing knowledge graph as a covert act. Sweet, naive, child of summer that I am. However, this alerts me to the need for backups.

(8) How big will the daily updates/contributions be?
I am so enjoying this answer. Frequency: 0.1/day; update size: microscopic; contribution size: typically < 1 MB.

Small can be nasty. Shrews, OK? Try picking one up, you'll see. We don't need to wrestle Grizzlies to develop tools, workflows, and products.


(Scott Akenhead) #6

Offerings of Data #1
Q. What does your data look like? A. Tremendously diverse. I will post examples as they become available or are passed to me for use within the ISDL project. If the data is proprietary (acknowledgement required for all products and uses), it will be released after suitable declarations of honour and goodwill.
Oops! "new users can only put 5 links in a post" ... I have to break this up.

  1. New Salmon Escapement Database System
    NuSEDS is the Pacific Region’s centralized Oracle database that holds adult salmon escapement data. About 10,000 salmon spawning sites in DFO Pacific Region have been observed 0 to 10 times per year for 6 species for nearly 100 years (poorly before 1948). This is a rich, valuable, complicated, and largely unexplored dataset. Extensive metadata are required and provided. Data is aggregated within year; the raw data is largely on paper. Caveat analysta.
    http://www.pac.dfo-mpo.gc.ca/od-ds/science/sed-des/NUSEDS_20180416.zip

format: .csv, 135.6 MB (24.4 MB zipped). 392,790 rows of 64 variables (sparse).
metadata: https://open.canada.ca/data/en/dataset/c48669a3-045b-400d-b730-48aafe8c5ee6
source: Canadian Open Government Portal https://open.canada.ca
contact: Bruce Baxter Fisheries and Oceans Canada, Stock Assessment Division, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9R 5K6, Canada. bruce.baxter@dfo-mpo.gc.ca
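A 64-variable table described as sparse is worth profiling before any modelling. A minimal sketch of a streaming column-fill scan (the file path and column names below are placeholders, not the actual NuSEDS schema):

```python
import csv
from collections import Counter

def column_fill_rates(path, limit=None):
    """Scan a wide, sparse CSV and report the fraction of non-empty
    values per column, without loading the whole file into memory.
    `limit` caps the number of rows scanned (handy for a ~400k-row file)."""
    filled = Counter()
    total = 0
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            total += 1
            for name, value in row.items():
                if value not in (None, ""):
                    filled[name] += 1
    return {name: filled[name] / total for name in filled} if total else {}
```

Columns with fill rates near zero are candidates to park as metadata rather than carry into a graph model.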


(Scott Akenhead) #7

Offerings of data #2

  2. Pacific Region Commercial Salmon Fishery In-season Catch Estimates.
    The spatial pattern of fishing (8 areas), by species (6), gear (3), and year (2004-2017).
    https://www.pac.dfo-mpo.gc.ca/od-ds/science/species-especes/salmon-saumon/ise-ecs.zip

format: .csv in 3 files (by gear).
metadata: https://open.canada.ca/data/en/dataset/7ac5fe02-308d-4fff-b805-80194f8ddeb4
source: Canadian Open Government Portal https://open.canada.ca
contact: Bruce A. Patten, Head, Fishery and Assessment Data Section, Science Branch, Fisheries and Oceans Canada, Government of Canada. Bruce.Patten@dfo-mpo.gc.ca


(Scott Akenhead) #8

Offerings of Data #3
3. North Pacific Salmon Abundance and Biomass Data
From Dr. Jim Irvine, Pacific Biological Station: "Scott, thanks for distributing these data. ... these data are relatively simple, on the other hand, if you ever wondered how many salmon there are in the Pacific Ocean, and how this has changed since 1925, this is where to find out. There are lots of tables but the graphs in our paper will give you a better feel for the data." PAPER.
"I think the [ISDL] folks would be interested in seeing differences between Alaska and the southern states in terms of salmon numbers and biomass. The various time series, if spatially linked to maps, would be relatively easy to show. The NPAFC data [see following post] have the advantage of including catches for all salmon species by area. The Ruggerone and Irvine data are only 3 species but have the advantage of separating hatchery from wild fish as well as numbers vs adult biomass vs adult plus immature biomass, by area."
This paper was featured in a magazine article with Tableau figures that allow readers to play with the data.

LINK TO DATA Please acknowledge the scientists in any use of these data.
format: .xlsx, 21 tables (multiple per sheet), 64 rows of 15 variables.
metadata: extensive, as sheets in this .xlsx file.
source: reviewed science publication, see link above
contacts: Dr. Gregory T. Ruggerone, Natural Resources Consultants Inc., Suite 404, 4039 21st Avenue West, Seattle, WA, 98199, USA. GRuggerone@nrccorp.com
Dr. James R. Irvine, Fisheries and Oceans Canada, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9T 6N7, Canada. James.Irvine@dfo-mpo.gc.ca


(Scott Akenhead) #9

Within hours of saying the largest tables had 10,000 rows, I posted a dataset with nearly 400 thousand rows. Sigh. :roll_eyes:


(Scott Akenhead) #10

Offerings of Data #4
4. NPAFC Statistics: Pacific Salmonid Catch and Hatchery Release Data
The North Pacific Anadromous Fish Commission (US, RU, JP, SK, CA) catch and hatchery release data are publicly available.
https://npafc.org/statistics/
format: .xlsx pivot tables (note hidden rows)
metadata: See that web page.
source: North Pacific Anadromous Fish Commission npafc.org
contact: Dr. James R. Irvine, Fisheries and Oceans Canada, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9T 6N7, Canada. James.Irvine@dfo-mpo.gc.ca


(James Irvine) #11

Great to see this discussion. We salmon folk need help from data experts to understand and interpret salmon data. These datasets are currently trivial in terms of size, but that is partly because of all the metadata that are not included. How can all this important ancillary information be included? That is where software platforms like Neo4j have promise.


(Scott Akenhead) #12

Great to see Dr. Jim Irvine join this discussion. Jim is a distinguished research scientist at the Pacific Biological Station, Fisheries and Oceans Canada. Please reference his paper, the source of Offerings of Data #3, https://doi.org/10.1002/mcf2.10023, as:
Ruggerone, G.T., and J.R. Irvine. 2018. Numbers and Biomass of Natural- and Hatchery-Origin Pink Salmon, Chum Salmon, and Sockeye Salmon in the North Pacific Ocean, 1925–2015. Marine and Coastal Fisheries: Dynamics, Management, and Ecosystem Science 10:152-168. DOI: 10.1002/mcf2.10023


(Scott Akenhead) #13

Offerings of Data #5
Returns and Spawners for Sockeye, Pink, and Chum Salmon from British Columbia
We assembled productivity (recruits per spawner) estimates for BC sockeye, pink, and chum salmon. Annual estimates by brood year of spawner numbers, catch, and population and age composition are in a simple database. Time series were organized by species, Conservation Unit, Pacific Fisheries Management Area, or aggregates of these. Three categories of data quality, unique to each data type, were determined and used to rate recruit-per-spawner data annually and across all years for each time series. Our exploration of temporal changes in both field methods and data quality will assist analysts in interpreting the reliability of the data and their results.
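The recruits-per-spawner index described above is a simple ratio by brood year. A minimal sketch with made-up numbers (real estimates come from the linked dataset):

```python
# Hypothetical brood-year records; spawner and recruit counts are invented
# for illustration, not taken from the Ogden et al. dataset.
brood_years = {
    2010: {"spawners": 12000, "recruits": 30000},
    2011: {"spawners": 15000, "recruits": 9000},
}

def productivity(records):
    """Recruits per spawner (R/S) by brood year. R/S > 1 means the brood
    year more than replaced itself, before accounting for fishing."""
    return {year: r["recruits"] / r["spawners"] for year, r in records.items()}

rs = productivity(brood_years)
```

In practice each ratio would also carry the data-quality rating the report describes, so analysts can weight or exclude poorly rated years.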

https://open.canada.ca/data/en/dataset/3d659575-4125-44b4-8d8f-c050d6624758
format: .csv
metadata: In the report.
source: Fisheries and Oceans Canada
citation: Ogden, A.D., Irvine, J.R., English, K.K., Grant, S., Hyatt, K.D., Godbout, L., and Holt, C.A. 2015. Productivity (Recruits-per-Spawner) data for Sockeye, Pink, and Chum Salmon from British Columbia. Can. Tech. Rep. Fish. Aquat. Sci. 3130: vi+57p.
ftp://ftp.meds-sdmm.dfo-mpo.gc.ca/pub/openData/Recruits_Spawner/Canadian_Technical_Report_3130.pdf
contact: Dr. James R. Irvine, Fisheries and Oceans Canada, Pacific Biological Station, 3190 Hammond Bay Road, Nanaimo, BC, V9T 6N7, Canada. James.Irvine@dfo-mpo.gc.ca


(Js) #14

To begin leveraging Neo4j for analyzing the salmon datasets, a core data schema will have to be established. Graph Commons provides a tool for collaborating on the schema design. Using the Darwin Core model to drive the initial design would be a good starting point. Once the core schema is generated, creating publishing pipelines into the schema from disparate datasets would be the next natural course of action. The Neo4j schema is forgiving and extensible, so the data model can evolve with deeper insights into the available datasets.
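The publishing-pipeline idea above can be sketched as a small translator from flat CSV rows into parameterized Cypher statements targeting Darwin Core classes. Event, Location, and Occurrence are real Darwin Core classes; the input field names and the relationship names (RECORDED_AT, DURING) are assumptions for illustration, not part of any agreed ISDL schema:

```python
import csv
import io

def csv_to_cypher(rows):
    """Turn flat occurrence rows into (statement, parameters) pairs that a
    Neo4j driver could execute. MERGE keeps the load idempotent: re-running
    the pipeline does not duplicate nodes."""
    statements = []
    for row in rows:
        statements.append((
            "MERGE (l:Location {locationID: $loc}) "
            "MERGE (e:Event {eventDate: $date}) "
            "MERGE (o:Occurrence {occurrenceID: $occ}) "
            "MERGE (o)-[:RECORDED_AT]->(l) "
            "MERGE (o)-[:DURING]->(e)",
            {"loc": row["site"], "date": row["date"], "occ": row["id"]},
        ))
    return statements

# In-memory CSV standing in for one of the offered datasets.
raw = "id,site,date\nOCC-1,Fraser-01,1998-09-12\n"
stmts = csv_to_cypher(csv.DictReader(io.StringIO(raw)))
```

Keeping the Cypher parameterized (rather than string-formatted) is the idiomatic Neo4j pattern and lets one pipeline serve many disparate source files.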


(Js) #15

I've created a schema graph at graphcommons.com. If you search for "Salmon Data Laboratory", it will pop up the canvas as a "Work in Progress". It is a public graph. Although the tool is for visualizing real data, it can be used to visualize a data schema model.

There may be better public tools out there, but this seems to have all the features needed for collaborating on a public graph.


(Scott Akenhead) #16

GraphCommons
A step forward. Thanks @js1
Alas, not visible today.
I will try to find collaborators re the schema and the objects (lists of named fields) attached to nodes and links.
Maybe we should split the problem(s) into small and specific examples.

Example #1
Biologists and DB managers at the Pacific Biological Station are thinking about expanding a small fraction of the NuSEDS dataset (Offerings #1) by adding the raw sampling data behind each annual estimate. Via a graph.

ISDL Documents
Our thinking, planning, and projects should be an ongoing and shared write-up in addition to this discussion. I started that HERE. Anyone with the link should add / edit. I don't like "shoulds" but, please, that would be helpful.

Example #2
What do you think of converting the Darwin Core to a graph? As a resource to:

  • standardize data and practices
  • tag nodes and links
  • link nodes by concepts

(Js) #17

I will start defining the Darwin Core Simple Model in the Salmon Data Laboratory Graph and attach edges to show how the data may be related. I'll then ingest some of the attached data into a real graph and share that out to show how the data can fit into the Darwin model. I may get some of the biological terms wrong to start, but it'll be a good learning experience and nothing that can't be remedied with a simple update. Up to this point, salmon for me was something on a menu; not something with Latin and Greek names. :slight_smile:


(Js) #18

Darwin core classes have been added to the graph commons canvas. First stab at relating the different Darwin Core classes via edges.


(Scott Akenhead) #19

International Year of Salmon (IYS) Workshop on Salmon Status and Trends
2019-01-21/22, Vancouver BC Canada

Problem:
In recent decades, the productivity of salmon has become increasingly uncertain with many populations experiencing extremes related to timing, abundance and size with serious social, economic and conservation impacts. Much has been learned in previous workshops but a lack of consistency in approaches to categorize biological status and trends, terminology to indicate status, requirements and standards for different types of data, spatial and temporal scales for comparison and aggregation, and ways of communicating findings significantly impedes the timeliness, efficiency and effectiveness of scientific investigations. Our data systems do not match our technological capacity and social/scientific inquiry needs. Many agencies have a commitment to open data but are challenged to achieve it given the significant costs associated with bringing historic data online. Increasing variability in a rapidly changing environment demands rapid access to integrated data for comparative and mechanistic studies of the distribution and productivity of salmon across life history stages and associated eco-regions.

Solution: (extracts thereof)
The primary goal of this workshop will be to identify a series of legacy datasets and standards associated with major categories of data. These datasets will be the focus of subsequent analytical workshops; additional workshops may concentrate on communication of scientific results. Ideally these separate but linked workshops will [cycle and grow]* over the course of IYS.
* but not yet funded.

Complete announcement HERE.


(Scott Akenhead) #20

Thanks! Our first graph. See figure below. More to this than meets the eye. Extensive properties for each "resource" (type of node). That was a lot of work.
"Come Watson! The game is afoot."

Now I /we have to figure out how to create a lot of instances of each resource, so:

  1. Created Google Sheets corresponding to this schema, HERE.
  2. Wrote an R script to create (empty) data.frames for this schema, HERE

Darwin Core is for museum catalogues, not for organizing salmon datasets, practices, analyses, workflows, and data products, plus the surrounding people and projects. But:

  1. we do not need to specify every field in each resource.
  2. we must add new resources.
    Obviously: Person, Activity, Organization.
    Less so: WorkFlow, Model, DataSet, Practice (?)
  3. Helpful to future users if we predefine layers (ontologies, networks) such as Place-[In]-Place, Organization-[In]-Organization, Practice-[In]-Practice.
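The "empty tables per resource" idea from step 2 above can be sketched in a few lines, here in Python rather than R. Resource names echo the list above, but the field names are guesses at the schema, not the actual Google Sheets columns:

```python
# Hypothetical resource schema; fields are illustrative, not the real sheets.
# The parentID fields support the self-referential layers mentioned above
# (Place-[In]-Place, Organization-[In]-Organization).
SCHEMA = {
    "Person":       ["personID", "name", "organizationID"],
    "Organization": ["organizationID", "name", "parentID"],
    "Place":        ["placeID", "name", "parentID"],
    "DataSet":      ["dataSetID", "title", "contactID", "url"],
}

def empty_tables(schema):
    """One empty list-of-rows table per resource; a row is later appended
    as a dict keyed by that resource's fields."""
    return {resource: [] for resource in schema}

tables = empty_tables(SCHEMA)
tables["Place"].append({"placeID": "P1", "name": "Fraser River", "parentID": None})
```

Holding the schema as data (rather than hard-coding each table) means adding a new resource like WorkFlow or Model is a one-line change.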

I conclude we are going to have to build our own Salmon Ontology. It will not be small.
This needs to be a group grope, but turning into a standards committee is a trap. We need "good enough to proceed." The Nazi Librarian ("no data for you!") will have her/his/its day.
Another Google Sheet for that. Stand by!