International Salmon Data Laboratory


(Scott Akenhead) #21

A search and rescue team for salmon data
I wrote to Matt Jones, a database guru at the National Center for Ecological Analysis and Synthesis (NCEAS), introducing ISDL and asking for advice, specifically about compliance with Darwin Core protocols and ontology as we build knowledge graphs and analyses to share across the environmental sector. ISDL has much to learn from the NCEAS project State of Alaska's Salmon and People (SASAP).

(Js) #22

Part of the challenge will be defining the standardization of the data. Darwin Core is a good starting point, but as I have been working through the transformation of the attached data, even the simple data sets feel like I'm fitting a square peg into a round hole. We still want to preserve the "natural" way of thinking about the data that Neo4j and a graph model encourage, so that the graph delivers its true value. What that will probably mean is that we will have to define or extend a core model to fit more naturally into the domain of analytics. Working through the NUSED data with its metadata really pushes the Darwin Core mapping to its limits.
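To make the "core plus extension" idea concrete, here is a minimal Python sketch. The Darwin Core term names (scientificName, year, country, individualCount) are real terms; the source column names and the sample row are invented for illustration and are not taken from the NPAFC or NUSED files. Columns with no Darwin Core match are kept as extension fields rather than forced into a term.

```python
# Hypothetical mapping from source column names (invented here)
# to real Darwin Core term names.
COLUMN_TO_DWC = {
    "species": "scientificName",
    "year": "year",
    "country": "country",
    "n_fish": "individualCount",
}


def to_darwin_core(row):
    """Split one record into Darwin Core fields and leftover extension fields."""
    core, extension = {}, {}
    for column, value in row.items():
        term = COLUMN_TO_DWC.get(column)
        if term is None:
            # No Darwin Core match: keep the column under its original name.
            extension[column] = value
        else:
            core[term] = value
    return core, extension


# Made-up sample row, for illustration only.
row = {
    "species": "Oncorhynchus nerka",
    "year": 2017,
    "country": "Canada",
    "n_fish": 1200,
    "gear": "seine",
}
core, extension = to_darwin_core(row)
```

Whatever ends up in the extension dictionary is a candidate for the extended core model.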

In the next day or two I will share the data graphs on Graph Commons to show what the graph containing the data looks like. Working through the NUSED data may take longer.

We should soon start working inside Neo4j itself and load the data into a fully queryable instance. We can then query it through Neo4j Desktop.
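As a rough sketch of what loading could look like, the Python helper below builds a parameterized Cypher MERGE statement for one node without needing a running database. The function name and the choice of "label" as the matching key are assumptions for illustration. Cypher does not accept node labels as query parameters, so the label is checked and interpolated, while the property values travel as parameters.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_node_statement(label, key, props):
    """Return (query, params) for an idempotent node load.

    Hypothetical helper: `label` and `key` are interpolated into the
    query text, so both must be plain identifiers.
    """
    for name in (label, key):
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    # MERGE matches on the key property; SET += adds the remaining fields.
    query = f"MERGE (n:{label} {{{key}: $key_value}}) SET n += $props"
    return query, {"key_value": props[key], "props": props}


query, params = merge_node_statement(
    "Organization",
    "label",
    {
        "label": "NMFS",
        "name": "National Marine Fisheries Service",
        "type": "federal government",
    },
)
```

The resulting query and parameters could then be passed to whichever Neo4j client we settle on.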

(Js) #23

NPAFC catch data. You can search for "NPAFC" to get a list of all the available public graphs.

(Js) #24

The beauty of graphs. This is the hatchery release data for US and Canada, NPAFC Hatchery Statistics Species Data - Canada, United States, graph 2b1841ac-73d1-4603-9492-2f281f375171.

(Scott Akenhead) #25

Be careful what you ask for
Bewilderingly fast development by unseen people with strange tools in unfamiliar territory. Seriously, who would ask for that? Me. Now I am pressing frantically on this button marked “Genius on Demand” but nothing is coming out of the machine (translation: struggling to keep up).

At this point, we need to define the components of a salmon graph: a set of ~15 "resources," with subtypes. Things like:
(:Organization {label:"NMFS", name:"National Marine Fisheries Service", type:"federal government"})
If, and how, these nodes and fields within nodes are compliant with Darwin Core is to be determined. A specific ISDL example would probably not involve all resources.

Knowledge: knowing how things are connected. Thus, a constrained set of links between these nodes. We do not need n(n+1)/2 types of edge (node-link-node) for n nodes. Nor links based on types within nodes. Some links are simple and widely used,
(:Place)-[:In]-(:Place), (:Organization)-[:In]-(:Organization).
Some are not,
(:Person)-[:HasOrganization {label:"employed by", job_description:"research biologist"}]-(:Organization)
We are likely to see a large number of optional fields, with defaults, in all nodes and links, in addition to the ID and date-stamps created by Neo4j.
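Here is a small Python sketch of the "constrained set of links" idea, independent of any database. The whitelist holds only the three link patterns mentioned above; the class names, the example person, and the in-memory ID counter and timestamp (stand-ins for the ID and date stamp expected from the database) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Whitelist of (start label, link type, end label) triples from the post.
ALLOWED_LINKS = {
    ("Place", "In", "Place"),
    ("Organization", "In", "Organization"),
    ("Person", "HasOrganization", "Organization"),
}

_ids = count(1)  # stand-in for database-assigned IDs


@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)  # optional fields
    id: int = field(default_factory=lambda: next(_ids))
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Link:
    start: Node
    type: str
    end: Node
    props: dict = field(default_factory=dict)  # optional fields

    def __post_init__(self):
        # Reject any link whose pattern is not in the schema.
        triple = (self.start.label, self.type, self.end.label)
        if triple not in ALLOWED_LINKS:
            raise ValueError(f"link not in schema: {triple}")


def max_link_types(n):
    """The n(n+1)/2 bound: one link type per unordered pair of node types."""
    return n * (n + 1) // 2


nmfs = Node("Organization", {
    "label": "NMFS",
    "name": "National Marine Fisheries Service",
    "type": "federal government",
})
person = Node("Person", {"name": "example biologist"})  # invented example
job = Link(person, "HasOrganization", nmfs, {
    "label": "employed by",
    "job_description": "research biologist",
})
```

For 15 resource types, max_link_types(15) gives 120, which is the count the whitelist is meant to avoid.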

(Scott Akenhead) #26

Neo4j Sandbox - Needed Immediately
ISDL is happening. We have salmon datasets recast as graphs. A database schema is emerging. We are moving toward pipelines/workflows and analyses. ISDL needs a server and stack for Neo4j.

And now a word from our sponsors!

(Js) #27

There is some affordance for this in Darwin Core with the RecordLevel class, but this class holds foreign keys or IDs for contributors etc. This can be expanded with relationships that point to more detailed information regarding the contributor or the organization. I'll add those relationships and nodes to the schema.

(Js) #28

As for the Neo4j sandbox, I believe the default sandbox expires after 3 days. Will we have access to a sandbox with more longevity for the data?

(Js) #29

Inspired by Mike Bostock's population pyramid, global salmon hatchery release data plotted as an interactive pyramid.

(Scott Akenhead) #30

2015 paper: Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data ... "The future, as data will continue to grow in size and complexity, lies in online analysis and storage for collaborative work, with methods that allow high interactivity with data for analysis and visualization." They see graph databases as the way out of the box. The real message is how quickly the technology moved.

(Scott Akenhead) #31

Video from my ISDL talk at GraphConnect.
Abstract and slides.
Announcement: first IYS Salmon Data Workshop, Vancouver BC, 2019-01-23/24
Ecologists to contribute datasets and say what they need, technologists then figure out how.