20 changes: 20 additions & 0 deletions CODE_OF_CONDUCT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
---
title: "Contributor Code of Conduct"
---

This Code of Conduct outlines our expectations for all participants in our community, as well as the consequences for unacceptable behavior.
As contributors and maintainers of this workshop, we pledge to follow the [Carpentries Code of Conduct](https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html),
and we are committed to providing a welcoming, safe, and respectful environment for everyone, regardless of background, identity, or experience.

We expect all participants to:
* **Be respectful**. Treat others with courtesy and consideration.
* **Be inclusive**. Welcome diverse perspectives and experiences.
* **Be collaborative**. Support each other and work together toward shared goals.
* **Be honest**. Communicate clearly, act with integrity, and respect confidentiality where applicable.
* **Be responsible**. Take accountability for your actions and contribute positively to the community.

Instances of abusive, harassing, or otherwise unacceptable behavior
may be reported by following our [reporting guidelines](https://docs.carpentries.org/topic_folders/policies/incident-reporting.html).

**Acknowledgement**
By participating in this workshop, you agree to uphold this Code of Conduct and help ensure a positive and respectful environment for everyone.
13 changes: 5 additions & 8 deletions episodes/01-introduction.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,7 @@
## Darwin Core - A global community of data sharing and integration
Darwin Core is a data standard to mobilize and share biodiversity data. Over the years, the Darwin Core standard has
expanded to enable exchange and sharing of diverse types of biological observations from citizen scientists, ecological
monitoring, eDNA, animal telemetry, taxonomic treatments, and many others. Darwin Core is applicable to any observation
of an organism (scientific name, OTU, or other methods of defining a species) at a particular place and time. In Darwin
Core this is an `occurrence`. To learn more about the foundations of Darwin Core read
[Wieczorek et al. 2012](https://doi.org/10.1371/journal.pone.0029715).
monitoring, eDNA, animal telemetry, taxonomic treatments, and many others.

### Demonstrated Use of Darwin Core
The power of Darwin Core is most evident in the data aggregators that harvest data using that standard.
Expand All @@ -44,7 +41,7 @@
extensions you use), an Ecological Metadata Language (EML) XML file, and a meta.xml file that describes what's in the
zipped folder.

![Darwin Core Archive](/episodes/fig/DwC-Archive.png)
![Darwin Core Archive](/fig/DwC-Archive.png)

*Image credit: Elizabeth Lawrence*

::::::::::::::::::::::::::::::::::::: challenge
Expand Down Expand Up @@ -82,7 +79,7 @@

#### :pushpin: Tip

If your raw column headers are Darwin Core terms verbatim then you can skip this step! Next time you plan data
If your column headers are Darwin Core terms verbatim then you can skip this step! Next time you plan data
collection use the standard DwC term headers!

::::::::::::
Expand All @@ -108,7 +105,7 @@
3. [`minimumDepthInMeters`](https://dwc.tdwg.org/terms/#dwc:minimumDepthInMeters) and [`maximumDepthInMeters`](https://dwc.tdwg.org/terms/#dwc:maximumDepthInMeters)
4. [`vernacularName`](https://dwc.tdwg.org/terms/#dwc:vernacularName)
5. [`organismQuantity`](https://dwc.tdwg.org/terms/#dwc:organismQuantity) and [`organismQuantityType`](https://dwc.tdwg.org/terms/#dwc:organismQuantityType)
6. This one is tricky- it's two terms combined and will need to be split. [`indvidualCount`](https://dwc.tdwg.org/terms/#dwc:individualCount) and [`sex`](https://dwc.tdwg.org/terms/#dwc:sex)
6. This one is tricky- it's two terms combined and will need to be split. [`individualCount`](https://dwc.tdwg.org/terms/#dwc:individualCount) and [`sex`](https://dwc.tdwg.org/terms/#dwc:sex)

::::::::::::::::::::::::

Expand Down Expand Up @@ -137,7 +134,7 @@
| Darwin Core Term | Definition | Comment | Example |
|------------------|------------------------------------|---------------------------------------|-----------------|
| [`occurrenceID`](https://dwc.tdwg.org/terms/#dwc:occurrenceID) | An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. | To construct a globally unique identifier for each occurrence you can usually concatenate station + date + scientific name (or something similar) but you'll need to check this is unique for each row in your data. It is preferred to use the fields that are least likely to change in the future for this. For ways to check the uniqueness of your occurrenceIDs see the [QA / QC]({{ page.root }}/06-qa-qc/index.html) section of the workshop. | Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula |
| [`basisOfRecord`](https://dwc.tdwg.org/terms/#dwc:basisOfRecord) | The specific nature of the data record. | Pick from these controlled vocabulary terms: [HumanObservation](http://rs.tdwg.org/dwc/terms/HumanObservation), [MachineObservation](http://rs.tdwg.org/dwc/terms/MachineObservation), [MaterialSample](http://rs.tdwg.org/dwc/terms/MaterialSample), [PreservedSpecimen](http://rs.tdwg.org/dwc/terms/PreservedSpecimen), [LivingSpecimen](http://rs.tdwg.org/dwc/terms/LivingSpecimen), [FossilSpecimen](http://rs.tdwg.org/dwc/terms/FossilSpecimen), [MaterialEntity](http://rs.tdwg.org/dwc/terms/MaterialEntity), [Event](http://rs.tdwg.org/dwc/terms/Event), [Taxon](http://rs.tdwg.org/dwc/terms/Taxon), [Occurrence](http://rs.tdwg.org/dwc/terms/Occurrence), [MaterialCitation](http://rs.tdwg.org/dwc/terms/MaterialCitation) | HumanObservation |

| [`scientificName`](https://dwc.tdwg.org/terms/#dwc:scientificName) | The full scientific name, with authorship and date information if known. When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the `identificationQualifier` term. | Note that cf., aff., etc. need to be parsed out to the `identificationQualifier` term. For a more thorough review of `identificationQualifier` see [this paper](https://doi.org/10.3389/fmars.2021.620702). | Atractosteus spatula |
| [`scientificNameID`](https://dwc.tdwg.org/terms/#dwc:scientificNameID) | An identifier for the nomenclatural (not taxonomic) details of a scientific name. | Must be a WoRMS LSID for sharing to OBIS. Note that the numbers at the end are the AphiaID from WoRMS. | urn:lsid:marinespecies.org:taxname:218214 |
| [`eventDate`](https://dwc.tdwg.org/terms/#dwc:eventDate) | The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context. | Must follow [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601). See more information on dates in the [Data Cleaning]({{ page.root }}/03-data-cleaning/index.html) section of the workshop. | 2009-02-20T08:40Z |
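
The `occurrenceID` advice in the table above (concatenate stable fields, then confirm the result is unique per row) can be sketched in plain Python. This is a minimal illustration, not the workshop's own method: the record keys `station`, `eventDate`, and `scientificName`, the sample values, and the helper `build_occurrence_id` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical records; the keys and values are assumptions for illustration only.
records = [
    {"station": "95", "eventDate": "1997-01-09T14:35:00Z", "scientificName": "Atractosteus spatula"},
    {"station": "95", "eventDate": "1997-01-09T14:35:00Z", "scientificName": "Lepisosteus oculatus"},
    {"station": "96", "eventDate": "1997-01-10T09:05:00Z", "scientificName": "Atractosteus spatula"},
]


def build_occurrence_id(record):
    """Concatenate station, date, and scientific name into one identifier."""
    name = record["scientificName"].replace(" ", "_")
    return f"Station_{record['station']}_Date_{record['eventDate']}_{name}"


occurrence_ids = [build_occurrence_id(record) for record in records]

# Any identifier that appears more than once is not unique and needs another field added.
duplicates = [oid for oid, count in Counter(occurrence_ids).items() if count > 1]

print(occurrence_ids[0])
print("duplicates:", duplicates)
```

If `duplicates` is not empty, the chosen combination of fields does not distinguish every row, and another stable field would need to be added to the identifier.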
Expand Down Expand Up @@ -190,4 +187,4 @@
- Implementing Darwin Core makes data FAIR-er and means becoming part of a community of people working together to
understand species no matter where they work or are based.

:::::::::::::::::::::::
:::::::::::::::::::::::
33 changes: 15 additions & 18 deletions episodes/03-data-cleaning.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,17 +36,15 @@ examples use the [pandas package for Python](https://pandas.pydata.org/) and the
those are not the only options for dealing with these conversions but simply the ones we use more frequently in our
experiences.


## Getting your dates in order
Dates can be surprisingly tricky because people record them in many different ways. For our purposes we must follow
[ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) which means using a four digit year, two digit month, and two digit
day with dashes as separators (i.e. `YYYY-MM-DD`). You can also record time in ISO 8601 but make sure to include the time
zone which can also get tricky if your data take place across time zones and throughout the year where daylight savings
time may or may not be in effect (and start and end times of daylight savings vary across years). There are packages in
R and Python that can help you with these vagaries. Finally, it is possible to record time intervals in ISO 8601 using a
slash (e.g. `2022-01-02/2022-01-12`). Examine the dates in your data to determine what format they are following and what
amendments need to be made to ensure they are following ISO 8601. Below are some examples and solutions in Python and R
for them.
slash (e.g. `2022-01-02/2022-01-12`). Examine the dates in your data to determine what amendments need to be made to
ensure they are following ISO 8601. Below are some examples and solutions in Python and R for them.

ISO 8601 dates can represent moments in time at different resolutions, as well as time intervals, which use "/" as a separator. Date and time are separated by "T". Timestamps can have a time zone indicator at the end. If not, then they are assumed to be local time. When a time is UTC, the letter "Z" is added at the end (e.g. 2009-02-20T08:40Z, which is the equivalent of 2009-02-20T08:40+00:00).
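
As a small sketch of the notation described above, the snippet below uses only Python's standard `datetime` module to produce a UTC timestamp, its equivalent "Z" form, a timestamp with a non-zero offset, and an interval. The specific dates and the variable names are illustrative assumptions.

```python
from datetime import date, datetime, timedelta, timezone

# A moment recorded in UTC.
utc_time = datetime(2009, 2, 20, 8, 40, tzinfo=timezone.utc)

# isoformat() writes the UTC offset explicitly as +00:00.
iso_offset = utc_time.isoformat(timespec="minutes")

# "Z" is an equivalent shorthand for the +00:00 offset.
iso_z = iso_offset.replace("+00:00", "Z")

# A moment recorded in a time zone six hours behind UTC.
local_time = datetime(1963, 3, 8, 14, 7, tzinfo=timezone(timedelta(hours=-6)))
iso_local = local_time.isoformat(timespec="minutes")

# An interval joins a start and an end with "/".
interval = f"{date(2022, 1, 2).isoformat()}/{date(2022, 1, 12).isoformat()}"

print(iso_offset)
print(iso_z)
print(iso_local)
print(interval)
```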

Expand All @@ -62,20 +60,20 @@ your package of choice to translate the dates.

| Darwin Core Term | Description | Example |
|------------------|-------------|-----------|
| [eventDate](https://dwc.tdwg.org/list/#dwc_eventDate) | The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context. | `1963-03-08T14:07-0600` (8 Mar 1963 at 2:07pm in the time zone six hours earlier than UTC).<br/>`2009-02-20T08:40Z` (20 February 2009 8:40am UTC).<br/>`2018-08-29T15:19` (3:19pm local time on 29 August 2018).<br/>`1809-02-12` (some time during 12 February 1809).<br/>`1906-06` (some time in June 1906).<br/>`1971` (some time in the year 1971).<br/>`2007-03-01T13:00:00Z/2008-05-11T15:30:00Z` (some time during the interval between 1 March 2007 1pm UTC and 11 May 2008 3:30pm UTC).<br/>`1900/1909` (some time during the interval between the beginning of the year 1900 and the end of the year 1909).<br/>`2007-11-13/15` (some time in the interval between 13 November 2007 and 15 November 2007). |
| [eventDate](https://dwc.tdwg.org/list/#dwc_eventDate) | The date-time or interval during which an Event occurred, or a taxon was recorded or observed. Not suitable for a time in a geological context. | `1963-03-08T14:07-0600` (8 Mar 1963 at 2:07pm in the time zone six hours earlier than UTC).<br/>`2009-02-20T08:40Z` (20 February 2009 8:40am UTC).<br/>`2018-08-29T15:19` (3:19pm local time on 29 August 2018).<br/>`1809-02-12` (some time during 12 February 1809).<br/>`2007-03-01T13:00:00Z/2008-05-11T15:30:00Z` (some time during the interval between 1 March 2007 1pm UTC and 11 May 2008 3:30pm UTC). |

::::::::::::::::::::::::::::::::: challenge
### Examples

Below are a few examples in R and Python for converting commonly represented dates to ISO-8601.
Below are a few examples in R and Python for converting commonly represented dates to ISO 8601.

::::::::::::::::: solution

::::::::::::::::: tab

### Python

When dealing with dates using pandas in Python it is best to create a Series as your time column with the appropriate
When dealing with dates using pandas in Python it is best to create a Series of your time column with the appropriate
datatype. Then, when writing your file(s) using [.to_csv()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)
you can specify the format which your date will be written in using the `date_format` parameter.
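
A minimal sketch of that workflow is shown here, assuming a hypothetical two-row table whose `eventDate` strings are in month/day/year order and are known to be in UTC (so a literal `Z` is written in the output). The column names and sample values are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data; we assume these times were recorded in UTC.
raw = pd.DataFrame(
    {
        "eventDate": ["01/31/2021 17:00", "02/15/2021 09:30"],
        "scientificName": ["Atractosteus spatula", "Atractosteus spatula"],
    }
)

# Parse the strings into a datetime Series using an explicit format.
raw["eventDate"] = pd.to_datetime(raw["eventDate"], format="%m/%d/%Y %H:%M")

# Write to an in-memory buffer; date_format controls how datetimes are serialized.
buffer = io.StringIO()
raw.to_csv(buffer, index=False, date_format="%Y-%m-%dT%H:%MZ")
csv_text = buffer.getvalue()

print(csv_text)
```

Keeping the column as a datetime type until the file is written means the ISO 8601 formatting happens in one place rather than through repeated string manipulation.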

Expand Down Expand Up @@ -175,7 +173,7 @@ function to read various date formats. The process can be applied to entire colu
### R

When dealing with dates using R, there are a few base functions that are useful to wrangle your dates in the correct format. An R package that is useful is [lubridate](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf), which is part of the `tidyverse`. It is recommended to bookmark this [lubridate cheatsheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf).
The examples below show how to use the `lubridate` package and format your data to the ISO-8601 standard.
The examples below show how to use the `lubridate` package and format your data to the ISO 8601 standard.

1. `01/31/2021 17:00 GMT`

Expand Down Expand Up @@ -307,9 +305,9 @@ into ISO 8601.
OBIS uses the [World Register of Marine Species (WoRMS)](https://www.marinespecies.org/) as the taxonomic backbone for
its system. GBIF uses the [Catalogue of Life](https://www.catalogueoflife.org/). Since WoRMS contributes to the Catalogue of
Life and WoRMS is a requirement for OBIS we will teach you how to do your taxonomic lookups using WoRMS. The key Darwin
Core terms that we need from WoRMS are `scientificNameID` also known as the WoRMS LSID which looks something like this
`"urn:lsid:marinespecies.org:taxname:105838"` and `kingdom` but you can grab the other parts of the taxonomic hierarchy if
you want as well as such as `taxonRank`.
Core terms that we need from WoRMS are `scientificNameID`, also known as the WoRMS LSID, which looks something like this
`"urn:lsid:marinespecies.org:taxname:105838"`, and `kingdom`. You can also grab other parts of the taxonomic hierarchy,
such as `taxonRank`.
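
Because the trailing number in a WoRMS LSID is the AphiaID, converting between the two is a matter of string handling. The sketch below is an illustration only: the helper names `aphia_id_to_lsid` and `lsid_to_aphia_id` are hypothetical, and it does not contact WoRMS or check that the ID exists.

```python
import re

LSID_PREFIX = "urn:lsid:marinespecies.org:taxname:"
LSID_PATTERN = re.compile(r"^urn:lsid:marinespecies\.org:taxname:(\d+)$")


def aphia_id_to_lsid(aphia_id):
    """Build a scientificNameID string from a numeric AphiaID."""
    return f"{LSID_PREFIX}{int(aphia_id)}"


def lsid_to_aphia_id(lsid):
    """Extract the numeric AphiaID from a WoRMS LSID, or raise ValueError."""
    match = LSID_PATTERN.match(lsid.strip())
    if match is None:
        raise ValueError(f"Not a WoRMS LSID: {lsid!r}")
    return int(match.group(1))


lsid = aphia_id_to_lsid(105838)
print(lsid)
print(lsid_to_aphia_id(lsid))
```

A pattern check like this can catch malformed `scientificNameID` values early, but it cannot tell whether the AphiaID refers to the intended taxon.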

There are two ways to grab the taxonomic information necessary. First, you can use the [WoRMS Taxon Match Tool](https://www.marinespecies.org/aphia.php?p=match).
The tool accepts lists of scientific names (each unique name as a separate row in a .txt, .csv, or .xlsx file) up to
Expand All @@ -318,7 +316,6 @@ the service is included in the challenge box below. A more detailed step-by-step
using the WoRMS Taxon Match Tool for the [MBON Pole to Pole](https://marinebon.org/p2p/) can be found [here](https://marinebon.github.io/p2p/protocols/WoRMS_quality_check.pdf). Additionally, OBIS has a three-part [video series](https://www.youtube.com/watch?v=jJ8nlMlg-cY) on YouTube about using the tool.



The other way to get the taxonomic information you need is to use [worrms](https://cran.r-project.org/web/packages/worrms/worrms.pdf)
(yes there are two **r**'s in the package name) or [pyworms](https://github.com/iobis/pyworms).

Expand Down Expand Up @@ -464,12 +461,12 @@ track which values are latitude and which are longitude.

| Darwin Core Term | Description | Example |
|------------------|-------------|----------------|
| [decimalLatitude](https://dwc.tdwg.org/list/#dwc_decimalLatitude) | The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. | `-41.0983423` |
| [decimalLongitude](https://dwc.tdwg.org/list/#dwc_decimalLongitude) | The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive. | `-121.1761111` |
| [geodeticDatum](https://dwc.tdwg.org/list/#dwc_geodeticDatum) | The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude as based. | `WGS84` |
| [decimalLatitude](https://dwc.tdwg.org/list/#dwc_decimalLatitude) | The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. | `-41.0983423` |
| [decimalLongitude](https://dwc.tdwg.org/list/#dwc_decimalLongitude) | The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. | `-121.1761111` |
| [geodeticDatum](https://dwc.tdwg.org/list/#dwc_geodeticDatum) | The ellipsoid, geodetic datum, or coordinate reference system (CRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude are based. | `WGS84` |
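
As a rough sketch of what getting coordinates into decimal degrees involves, the snippet below converts degrees, minutes, and seconds to a signed decimal value and checks that a pair falls within the legal ranges (-90 to 90 for latitude, -180 to 180 for longitude). The function names and sample values are assumptions for illustration, and the sketch assumes the hemisphere is given as a single letter (N, S, E, or W).

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal degrees.
    if hemisphere.upper() in ("S", "W"):
        decimal = -decimal
    return decimal


def is_valid_coordinate(latitude, longitude):
    """Check that a latitude/longitude pair lies within the legal ranges."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180


latitude = dms_to_decimal(41, 30, 0, "S")
longitude = dms_to_decimal(121, 15, 0, "W")

print(latitude, longitude)
print(is_valid_coordinate(latitude, longitude))
```

A range check will not detect swapped latitude and longitude values when both happen to fall between -90 and 90, so plotting a sample of points on a map remains a useful extra check.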

![Coordinate precision](https://imgs.xkcd.com/comics/coordinate_precision.png)

![Coordinate precision](https://imgs.xkcd.com/comics/coordinate_precision.png)
*Image credit: [xkcd](https://xkcd.com/)*

::::::::::::::::::::::::::::::::: challenge
Expand Down Expand Up @@ -639,8 +636,8 @@ Below are a few examples in R and Python to convert some common coordinate pairs
::::::::::::: keypoints

- When doing conversions it's best to break out your data into its component pieces.
- Dates are messy to deal with. Some packages have easy solutions, otherwise use regular expressions to align date strings to ISO 8601.
- Dates are messy to deal with. Some packages provide easy solutions; otherwise, use regular expressions to align date strings to ISO 8601.
- WoRMS LSIDs are a requirement for OBIS.
- Latitude and longitudes are like dates, they can be messy to deal with. Take a similar approach.
- Latitudes and longitudes are like dates: they can be messy to deal with, so take a similar approach. They must be in decimal degrees.

:::::::::::::::::::::::