[Review]: Workflows with Snakemake #17

Open
4 of 5 tasks
gperu opened this issue Dec 16, 2022 · 36 comments
Labels
3/reviewer(s)-assigned Reviewers have been assigned; review in progress

Comments

@gperu

gperu commented Dec 16, 2022

Lesson Title

Snakemake for Bioinformatics

Lesson Repository URL

https://github.com/carpentries-incubator/snakemake-novice-bioinformatics

Lesson Website URL

https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/

Lesson Description

Researchers needing to implement data analysis workflows face a number of common challenges, including the need to organise tasks, make effective use of compute resources, handle any errors in processing, and document and share their methods. The Snakemake workflow system provides effective solutions to these problems. By the end of the course, you will be confident in using Snakemake to run real workflows in your day-to-day research.

Snakemake workflows are described by special scripts that define steps in the workflow as rules, and these are then used by Snakemake to construct and execute a sequence of shell commands to yield the desired output. Re-calculation of existing results is avoided where possible, so you can add or update input data, then efficiently generate an updated result. Workflows can be seamlessly scaled to server, cluster, grid and cloud environments without the need to modify the workflow definition.

This course is primarily intended for researchers who need to automate data analysis tasks for biological research involving next-generation sequence data, for example RNA-seq analysis, variant calling, ChIP-Seq, bacterial genome assembly, etc. However, Snakemake has many uses beyond this and the course does not assume any specialist biological knowledge. The language used to write Snakemake workflows is Python-based, but no prior knowledge of Python is required or assumed either. We do require that attendees are familiar with using the Linux command line (pipes, redirects, variables, …).

Author Usernames

@

Zenodo DOI

No response

Differences From Existing Lessons

@tbooth

Confirmation of Lesson Requirements

JOSE Submission Requirements

Potential Reviewers

No response

@tobyhodges
Member

Thank you for submitting this lesson for review, @gperu.

My capacity for managing lesson reviews is quite limited at the moment and I will not be able to handle reviews of all of your submitted lessons simultaneously. If you have a preference for which lesson(s) you would like us to prioritise for review, please let me know and I will do my best to focus on that/those first.

@tobyhodges tobyhodges added the 1/editor-checks Editor is conducting initial checks on the lesson before seeking reviewers label Feb 3, 2023
@tobyhodges
Member

Thanks for submitting this lesson, @tbooth. It appears very well put together and the topic is a good fit for The Carpentries Lab.

I'll be acting as Editor on this submission, and I have completed the editorial checks that form the first step of the review process. I have requested only two changes before I will assign reviewers. You can see my comments below.

To ensure that the review process runs as smoothly as possible, please make sure you are subscribed to receive notifications from this thread. On the right sidebar of this page you should see a section headed Notifications, with a Customize link. You can click on that and make sure that you have the Subscribed option selected, to receive all notifications from the thread.

You can add a badge to display the status of this review in the README of your lesson repository with the following Markdown:

[![The Carpentries Lab Review Status](http://badges.carpentries-lab.org/17_status.svg)](https://github.com/carpentries-lab/reviews/issues/17)

Editor Checklist - Snakemake for Bioinformatics

Accessibility

  • All figures are also described in image alternative text or elsewhere in the lesson body.

Alternative text descriptions are present for almost all figures in the lesson. Please replace the alternative text for the flowchart representation in Constructing a whole new workflow with a meaningful description.

  • The lesson uses appropriate heading levels:
    • h2 is used for sections within a page.
    • no “jumps” are present between heading levels e.g. h2->h4.
    • no page contains more than one h1 element i.e. none of the source files include first-level headings.
    • The contrast ratio of text in all figures is at least 4.5:1.

A note about headings: by my reading, the two h3 level headings in Conda integration could probably be moved up to h2 level, and the first h2 heading ("Conda and Snakemake") removed, because it is redundant with the title of the episode in the context of the lesson. This is only a suggestion - I won't require that you make the change before I look for reviewers.

Content

  • The lesson teaches data and/or computational skills that could promote efficient, open, and reproducible research.
  • All exercises have solutions.
  • Opportunities for formative assessments are included and distributed throughout the lesson sufficiently to track learner progress. (We aim for at least one formative assessment every 10-15 minutes.)
  • Any data sets used in the lesson are published under a permissive open license i.e. CC0 or equivalent.

I could download the data via the link on the Setup page, but I did not find a link to the FigShare record for that download, so I cannot check the license terms. Please share a link to the FigShare record for the dataset here, and add it to the lesson somewhere too e.g. on the Setup page and/or where you describe the data in the first episode.

Design

  • Learning objectives are defined for the lesson and every episode.
  • The target audience of the lesson is identified specifically and in sufficient detail.

Repository

The lesson repository includes:

  • a CC-BY or CC0 license.
  • a CODE_OF_CONDUCT.md file that links to The Carpentries Code of Conduct.
  • a list of lesson maintainers.
  • tabs to display Issues and Pull Requests for the project.

I recommend updating the README to remove the initial setup checklist from the early days of lesson development.

Structure

  • Estimated times are included in every episode for teaching and completing exercises.
  • Episode lengths are appropriate for the management of cognitive load throughout the lesson.

Most episodes are estimated to take more than an hour, which is definitely at the longer end of what I would consider appropriate. However, I think this is a reflection of the long time slots that are allocated to the completion of exercises, implying that cognitive load is being managed by providing learners with plenty of time to apply the skills and concepts they are learning. This facilitates transfer from working to long-term memory, so I am satisfied with the composition of the lesson. Nevertheless, reviewers may find places in the lesson where they can suggest episodes be broken into smaller chunks.

Supporting information

The lesson includes:

  • a list of required prior skills and/or knowledge.
  • setup and installation instructions.
  • a glossary of key terms or links out to definitions in an external glossary e.g. Glosario.

@tbooth

tbooth commented Feb 3, 2023

Hi @tobyhodges, thanks very much for this. I'd made a start on the JOSE submission stuff, which I know is not a prerequisite for acceptance to the lab but is still important to do.

I'll get working on the tasks you listed ASAP. I also have some thoughts on people who could be approached as reviewers, if that would help.

@tobyhodges
Member

I have a few potential reviewers in mind but would be delighted to get additional suggestions. If you know their GitHub handle(s), please provide them here, but without tagging them i.e. leave the '@' off the start of the handle.

@tbooth

tbooth commented Feb 15, 2023

The issues noted above, and some others that I noticed, have been addressed. Let me know if I missed anything or
you need anything else fixed.

Regarding the alternative text for all the figures I went through and tried to make all of the descriptions
reasonable for those who cannot see the pictures and might for example be using a screen reader.

The reviewers I have in mind are:

descostesn - Nicolas Descostes, Head of Bioinformatics, EMBL Rome

cokelaer or ddesvillechabrol - Authors of the Sequanix GUI for Snakemake, at Institut Pasteur

@tobyhodges tobyhodges added 2/seeking-reviewers Editor is looking for reviewers to assign to this lesson and removed 1/editor-checks Editor is conducting initial checks on the lesson before seeking reviewers labels Feb 16, 2023
@tobyhodges
Member

@tkphd & @jdblischak thank you for volunteering to review this Snakemake for Bioinformatics lesson for The Carpentries Lab.

When you are ready, please post your reviews as replies in this thread. If you have any questions for me during the review, please ask. You can read more about the lesson review process in our Reviewer Guide, where you will also find the checklist for Reviewers.

@tobyhodges tobyhodges added 3/reviewer(s)-assigned Reviewers have been assigned; review in progress and removed 2/seeking-reviewers Editor is looking for reviewers to assign to this lesson labels Jun 15, 2023
@tobyhodges
Member

Checking in here with @tkphd and @jdblischak. Please reach out if you have any questions or need any assistance with your reviews of this Snakemake lesson, or if your capacity for this has changed and you need to step away from your role as a reviewer.

@jdblischak

@tobyhodges sorry for the delay. I started reviewing the lesson last week but then got sidetracked before I could complete it. I'll return to it this week

@tkphd

tkphd commented Aug 2, 2023 via email

@jdblischak

jdblischak commented Aug 3, 2023

I've gone through the lesson material. Congrats to @gperu, @tbooth, and colleagues for creating this thorough introduction to Snakemake!

To get started, below is the reviewer checklist:

Reviewer Checklist

Accessibility

  • The alternative text of all figures is accurate and sufficiently detailed.
    • Large and/or complex figures may not be described completely in the alt text of the image and instead be described elsewhere in the main body of the episode.
  • The lesson content does not make extensive use of colloquialisms, region- or culture-specific references, or idioms.
  • The lesson content does not make extensive use of contractions (“can’t” instead of “cannot”, “we’ve” instead of “we have”, etc).

I confirmed that the main figures had alt text. They could probably use more though if the goal is to convey the same information to someone using a screen reader. For example, the information contained in the image at the beginning of Episode 7 summarizes the input files (eg 18 in total due to paired reads, 3 conditions, and 3 reps) and the various steps. The steps are well described in the text above, but not the input files.

Content

  • The lesson content:
    • conforms to The Carpentries Code of Conduct.
    • meets the objectives defined by the authors.
    • is appropriate for the target audience identified for the lesson.
    • is accurate.
    • is descriptive and easy to understand.
    • is appropriately structured to manage cognitive load.
    • does not use dismissive language.
  • Tools used in the lesson are open source or, where tools used are closed source/proprietary, there is a good reason for this e.g. no open source alternatives are available or widely-used in the lesson domain.
  • Any example data sets used in the lesson are accessible, well-described, available under a CC0 license, and representative of data typically encountered in the domain.
  • The lesson does not make use of superfluous data sets, e.g. increasing cognitive load for learners by introducing a new data set instead of reusing another that is already present in the lesson.
  • The example tasks and narrative of the lesson are appropriate and realistic.
  • The solutions to all exercises are accurate and sufficiently explained.
  • The lesson includes exercises in a variety of formats.
  • Exercise tasks and formats are appropriate for the expected experience level of the target audience.
  • All lesson and episode objectives are assessed by exercises or another opportunity for formative assessment.
  • Exercises are designed with diagnostic power.

Design

  • Learning objectives for the lesson and its episodes are clear, descriptive, and measurable. They focus on the skills being taught and not the functions/tools e.g. “filter the rows of a data frame based on the contents of one or more columns,” rather than “use the filter function on a data frame.”
  • The target audience identified for the lesson is specific and realistic.

Supporting information

  • The list of required prior skills and/or knowledge is complete and accurate.
  • The setup and installation instructions are complete, accurate, and easy to follow.
  • No key terms are missing from the lesson glossary or are not linked to definitions in an external glossary e.g. Glosario.

I think the Setup would benefit from a few improvements, especially if self-learners are going to follow the instructions alone.

For installing the data, the wget command is provided explicitly, but then users are left to remember the tar flags on their own. Best to remove this early barrier and provide the explicit commands to prepare the data. Something like below:

wget --content-disposition https://ndownloader.figshare.com/files/35058796
tar xJf data-for-snakemake-novice-bioinformatics.tar.xz
ls -R data/

In lesson 10 on conda integration, it states:

We’ll not talk about installing Conda, since it is already set up on the systems we are using.

But you provide a link to Miniconda in the setup instructions. I recommend replacing the above text with a link back to the Setup instructions.

Also, I wasn't able to run conda env update --file conda_env.yaml with the recommended setting of channel_priority: strict. I had to temporarily disable it in my .condarc. As Snakemake strongly encourages users to set strict channel priority when using --use-conda, this could potentially trip up false beginners who have already started using Snakemake. Another suggestion is to use conda list --explicit > conda_env.frozen.yaml to bypass the conda solver altogether and allow users to immediately install the exact packages that you used (though this would only work on Linux).

And one last note on the Glossary question. The number of terms that could potentially be added is enormous, but given the prerequisite knowledge, I think it is adequate. One term I would suggest to add is "wildcard". While this is also used in the shell, wildcards are so central to Snakemake that it seems worth defining it.

General

  • the readability of the lesson.

Very readable

  • any key concepts or skills relevant to the lesson topic/domain that are
    missing from the lesson.

All the basics are covered

  • how it compares to any other learning resources that you are aware of on the
    same/similar topics.

I'm aware of another Snakemake-based lesson called HPC Workflow Management with Snakemake, but I haven't gone through it, so I don't know how much overlap there is.

  • its utility as a resource both for an Instructor teaching the lesson at a
    workshop and for a self-directed learner following the lesson alone.

I think it would be tougher for a self-directed learner, especially the open-ended episode 11, which asks learners to write a pipeline from scratch. But I think there is sufficient material in the other episodes that a self-learner would still benefit from going through it.

I'll provide more snakemake-specific comments in a follow-up post

@tobyhodges
Member

Thank you very much for your review and detailed feedback @jdblischak 🙌

I'm aware of another Snakemake-based lesson called HPC Workflow Management with Snakemake, but I haven't gone through it, so I don't know how much overlap there is.

Your co-reviewer, Trevor @tkphd, is one of the main authors of that hpc-workflows lesson, which is still in the early stages of design and development. I am sure they will be able to comment more about the similarities and differences between the lessons, their target audiences and main objectives.

@jdblischak

Some minor comments related specifically to Snakemake:

03 - Chaining rules

Note that {sample} changed in the rule kallisto_quant

When introducing the rule kallisto_quant, may want to mention that the wildcard {sample} is different from the above rules. For example, now it's ref1 instead of ref1_1. This note could be added to the list under "There are many things to note here:"
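To illustrate why `{sample}` takes a different value in this rule, here is a minimal plain-Python sketch of how a wildcard pattern captures part of a filename. It is not Snakemake's own matching code: the `wildcard_value` helper is hypothetical, and the two patterns are assumptions chosen to resemble a per-file rule and a per-pair rule like kallisto_quant.

```python
import re


def wildcard_value(pattern, filename, name="sample"):
    """Return what {name} captures when pattern is matched against filename.

    Rough stand-in for wildcard matching: the wildcard becomes a regex
    group that matches one or more characters.
    """
    placeholder = "{" + name + "}"
    prefix, suffix = pattern.split(placeholder)
    regex = re.escape(prefix) + "(.+)" + re.escape(suffix)
    match = re.fullmatch(regex, filename)
    return match.group(1) if match else None


# Assumed pattern for a rule that handles one read file at a time
per_file = wildcard_value("trimmed/{sample}.fq", "trimmed/ref1_1.fq")

# Assumed pattern for a rule where the read-pair suffix is literal text
per_pair = wildcard_value("trimmed/{sample}_1.fq", "trimmed/ref1_1.fq")

print(per_file)  # ref1_1
print(per_pair)  # ref1
```

The same file yields two different wildcard values because the literal `_1` in the second pattern is no longer part of what `{sample}` has to match.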

The kallisto manual has been updated

So the file names created by kallisto are not quite the same as we saw in the manual (note - the manual may have been fixed at the point you are doing this course, but it was true back when the course was written!). Change the rule definition in the Snakefile to use the correct names, then you should have everything working.

This was fun to see as a motivating example. I too once ran into the exact problem, and I know the manual has been updated, since I was the one that sent the PR! 😄 (pachterlab/kallisto#109)

Mention the log field

You have the learners add a log file that is included as an output file. This is fine for getting started, but I'd recommend at least mentioning that Snakemake has a dedicated rule field log to support log files. When you run snakemake --lint, it will report a missing log directive if the log field is missing, even if you are logging via a dedicated file listed in output.

Indentation is inconsistent

The indentation changes between 2 and 4 spaces. A good example of this is ep03.Snakefile. The first 2 rules use 2 spaces for indentation and the next 2 use 4 spaces. The Setup instructions configure the editors to use 4 spaces for a tab, so I think it makes sense to standardize on 4 spaces throughout the code chunks and example files. Personally, I would just run snakefmt on everything.

04 - How Snakemake plans what jobs to run

Cross-reference discussion of --touch

In the box "Removing files to trigger reprocessing", I'd recommend linking to your section on --touch since this is a natural point that a learner will think about how to achieve the opposite (ie avoid reprocessing)

05 - Processing lists of inputs

Great example: symlinks to fix the inconsistent filenames

Nothing to fix here. Just wanted to note that I loved the reality and pragmatism of this approach to renaming the files. Too often tutorials are overly simplified and don't prepare learners for the messiness of real life (eg when your collaborator sends you hundreds of inconsistently named files)

Simplify example of glob_wildcards() with multiple wildcards

You give an example of using glob_wildcards() with two wildcards, but I believe you are overly cautious in your advice.

If there are two wildcards in the glob pattern, dealing with the result becomes a little more tricky. Unless you’re a Python programmer you probably don’t want to start writing code like this, and for most cases in Snakemake there is no need to.

I have found the power to parse multiple wildcards out of well-structured filenames to be very useful. For example, you could obtain things like the batch number, flow cell ID, etc. And I think your example code that uses **DOUBLE_MATCH._asdict() is unnecessarily complex. I am hardly a Python expert, and I think that ** has something to do with unpacking an arbitrary number of arguments, but this code can be written like below, which I find simpler and more readable:

from snakemake.io import glob_wildcards, expand

DOUBLE_MATCH = glob_wildcards("reads/{condition}_{samplenum}_1.fq")
SAMPLES = expand(
    "{condition}_{samplenum}",
    zip,
    condition=DOUBLE_MATCH.condition,
    samplenum=DOUBLE_MATCH.samplenum,
)
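The `zip` argument matters here because `expand()` otherwise combines the wildcard values as a full cross product. The plain-Python sketch below mimics the two behaviours with `itertools.product` and the built-in `zip`, without importing Snakemake. The condition and sample-number lists are made-up values standing in for what `glob_wildcards()` might return.

```python
from itertools import product

# Made-up values: one entry per matched file, as glob_wildcards() would give
conditions = ["ref", "ref", "etoh60"]
samplenums = ["1", "2", "1"]

# Cross product: every condition paired with every samplenum
all_combinations = [f"{c}_{n}" for c, n in product(conditions, samplenums)]

# Zip: values paired position by position, one result per matched file
paired = [f"{c}_{n}" for c, n in zip(conditions, samplenums)]

print(len(all_combinations))  # 9
print(paired)                 # ['ref_1', 'ref_2', 'etoh60_1']
```

The cross product contains duplicates and names such as `etoh60_2` that correspond to no file in this made-up listing, whereas the zipped version reproduces exactly the matched samples.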

06 - Handling awkward programs

Directories are automatically created for normal rules

In your example that uses directory() for the output, you demonstrate that you need to manually run mkdir. I think it's important to tell the learners that manually running mkdir isn't needed for a standard rule. My old Snakefiles used to be cluttered with mkdir and os.mkdir(), but this is unnecessary because Snakemake automatically creates the subdirectories for you. Below is an example to demonstrate this behavior:

touch subdir/test.txt
## touch: cannot touch 'subdir/test.txt': No such file or directory

cat > Snakefile <<EOF
rule create_file_in_subdir:
    output: "subdir/test.txt"
    shell: "touch {output}"
EOF

snakemake -j1

ls subdir/
## test.txt

10 - Conda integration

Show what an example env file looks like

When demonstrating the conda integration, you first create an env and then export it via conda env export. But you never show what this file looks like. In practice I often write these from scratch. This allows them to be simpler. In other words, when you run conda env export, it's difficult to know what are the top-level requirements versus the many dependencies.

A simple env file to install cutadapt could look like the following:

name: new-env
channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - cutadapt

13 - Robust quoting in Snakefiles

Note the option of putting your code in a script

When I start running into complex quoting issues, that's usually the time when I switch to simply putting my code in a standalone script. As you note in the lesson, trying to get quotations and brackets to pass through so many different levels of processing can be unbelievably frustrating. I think that moving the code to a script should be listed as a potential solution.

For a simple script with only one or a few inputs, the script can use $1, etc.

    shell: "bash code/myscript.sh {input.first} {input.second} > {output}"

For a more complex situation that wants to access more information from Snakemake, you can recommend the script field, which passes in the Snakemake variables as arrays.
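As a rough sketch of the script approach for a Python script: when a rule names a file in its script field, Snakemake makes an object called `snakemake` available inside that script, with attributes such as `snakemake.input` and `snakemake.output`. The file name `code/count_reads.py`, the record-counting task, and the stand-in object used when the script runs outside Snakemake are all assumptions made for illustration.

```python
# code/count_reads.py (hypothetical file name)
# A rule would reference it with:  script: "code/count_reads.py"
import tempfile
from pathlib import Path
from types import SimpleNamespace

try:
    snakemake  # provided by Snakemake when run through the script field
except NameError:
    # Stand-in so this sketch also runs on its own, outside Snakemake.
    tmp = Path(tempfile.mkdtemp())
    (tmp / "ref1_1.fq").write_text("@read1\nACGT\n+\nIIII\n")
    snakemake = SimpleNamespace(
        input=[str(tmp / "ref1_1.fq")],
        output=[str(tmp / "ref1_1.count")],
    )


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


# No quoting layers to worry about: paths arrive as ordinary Python strings
n_records = sum(count_fastq_records(p) for p in snakemake.input)
Path(snakemake.output[0]).write_text(f"{n_records}\n")
```

Because the paths never pass through a shell command string, none of the quoting and bracket-escaping concerns from this episode apply inside the script.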

@jdblischak

Now for some feedback on higher-level ideas. I don't expect you to overhaul your lesson based on my comments. But I think that it would be worth adding a few boxes to alert learners when they will likely see something different in other Snakemake tutorials.

Putting output before input

You have the learners write the output field before the input field. And your motivation is that it is natural to work backwards when writing a Snakefile, eg:

Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

This logic of working backwards from the desired output is why we’re putting the output lines first in all our rules - to remind us that these are what Snakemake looks at first!

I am not a fan of this approach for two main reasons:

  1. Pretty much any other Snakefile they encounter or tutorial they read will list input before output. As a concrete example, see the official Snakemake tutorial. Having them write their Snakefiles differently from everyone else adds unnecessary cognitive load.
  2. While it's true that Snakemake works backwards just like Make does, and it's important for learners to understand this mental model, I don't think it is necessary for a Snakemake user to design their pipeline backwards. I always develop my Snakemake pipelines one rule at a time, in the forward order. While I have a vague sense of my final result, there are too many unknowns along the way. Inevitably I'll run into something frustrating like mismatched chromosomes between my sequencing files and the reference files, and have to add a rule to fix this. In other words, I've never been able to follow your first step to "Define rules for all the processing steps". And even your lesson goes in the forward order, starting with trimming and counting before then adding rules for indexing and mapping.

So like I said above, I don't think you need to change your lesson. But I would recommend adding some boxes, eg:

  • box: We recommend listing output before input to remind yourself how Snakemake processes the rules, but note that this is our personal preference. Most other Snakefiles you see will list input first
  • box: You can also build your pipeline one step at a time in the forward direction. Just make sure to always keep in mind that Snakemake processes the rules backwards

Snakemake 7.8

Snakemake 7.8 introduced a few big changes. Your tutorial is based on version 5.32.2. You have a box in the episode "How Snakemake plans what jobs to run" that notes these changes in behavior. However, given that Snakemake 7.8 was released in May 2022, at some point it will make more sense for the episode to describe the 7.8+ behavior, and then only mention the <7.8 behavior in a box.

Another cool Snakemake feature that I really like is the ability to freeze the conda environments used in each rule. This would be worth mentioning in the lesson on conda integration

Another option for formatting multi-line shell commands

I know you recommend using r""" to make the quoting more robust, but personally I find this more readable:

    shell:
        """
        fastqc -o . {input}
        mv {wildcards.sample}_fastqc.html {output.html}
        mv {wildcards.sample}_fastqc.zip  {output.zip}
        """

compared to your recommendation:

    shell:
       r"""fastqc -o . {input}
           mv {wildcards.sample}_fastqc.html {output.html}
           mv {wildcards.sample}_fastqc.zip  {output.zip}
        """

Especially since this example doesn't require robust quoting.
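For commands that do need robust quoting, the effect of the `r` prefix can be seen in plain Python, since the shell string in a Snakefile is first read as a Python string literal. A small sketch, using a made-up command that is not taken from the lesson:

```python
# Made-up command: the intent is for printf to receive a literal backslash-t
plain = """
    printf 'a\tb'
"""
raw = r"""
    printf 'a\tb'
"""

print("\\t" in plain)  # False: Python already turned \t into a tab character
print("\t" in plain)   # True
print("\\t" in raw)    # True: the backslash survives for the shell to see
```

When a command contains no backslashes, as in the fastqc example, the two forms produce identical strings, so the choice between them comes down to readability.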

@tbooth

tbooth commented Aug 11, 2023

@jdblischak thanks so much for all the detailed comments and the PRs. I'll start to go through these points in detail, hopefully next week. Anything that needs more specific discussion I'll break out into a separate issue.

@tkphd

tkphd commented Sep 11, 2023

Hi @tbooth, apologies for the delay, but I am working through the lesson. Overall, it looks great! Here's what I have for Setup.

Homepage

The lesson homepage lists the following prerequisites:

  • Familiarity with the Bash command shell, including concepts like pipes,
    variables and loops.
  • Knowledge of bioinformatics fundamentals like the FASTQ file format and
    short read mapping, in order to understand the example workflow.

I suggest rephrasing to state the actual prerequisites, and separately
enumerate background that would be helpful but is not strictly necessary.

Prerequisites:

  • Some background navigating the Unix filesystem and editing files through the
    command line, as taught in Shell Novice or similar.

Optional background:

  • To help understand the shell-based workflow this lesson takes as its starting
    point, familiarity with Unix shell pipes, loops, and variables is encouraged.
  • Some familiarity with FASTQ files and short read mapping will help learners
    who seek a deep understanding of the example workflow.

Setup

Suggest breaking this up into (at least) 3 sections: Software, Data, and
Editor.

The instructor may need to show this page at the beginning of the lesson for
those who did not already work through the setup. Recommend printing links in
their entirety rather than using anchor tags, e.g.,
Download and unpack the sample dataset tarball from <https://ndownloader.figshare.com/files/35058796>

Software

Conda is a common and useful tool, but it is simply invoked, not
introduced. Explain what it is (a package manager with virtual environment
isolation), how it helps (simplifies dependency management), and how to use it.

  1. The instructions as written appear to update an existing environment, not
    create a new one.
  2. The environment is named "snakemake_dash". Why?
  3. The conda_env.yaml file contains a whole lot of specific packages.
    Consider filtering this to specify just those packages you would install
    manually: snakemake, fastqc, kallisto, etc. Let conda fill in the full
    dependency graph.

Data

  1. Specify where to download (home directory? Desktop?) and how to extract this
    file. Tarballs are unfamiliar to most Windows users. The linked file is
    also an xz-compressed Tar archive, which may require extra packages on
    some Linux distributions.
  2. The provided wget command results in "403: Forbidden" on a current Debian
    system. With the updated URL, this worked: wget https://figshare.com/ndownloader/files/35058796 -O data.tar.xz
  3. The contents of this file are nested two directories deep: the top-level
    "data" folder is extraneous.
  4. Package this slice of a dataset with a README explaining its provenance and
    intended usage, with citations and attribution to the original authors.
    (Aspire to FAIR principles.)
  5. It is unclear whether CC BY-SA applies to a pure dataset, which is not
    typically eligible for copyright protection: this is not a creative work.
    Was the source dataset released under a license agreement?

Editor

  1. Two editors are mentioned here, but no editor is invoked in the lesson material.
    Throughout, when showing changes to a file, preface it with the command the
    instructor should use to launch the editor.
  2. Provide installation instructions or suggest a framework (like
    gitforwindows) that provides an editor.
  3. "Setup" is meant to be run by the learner hours or days ahead of the
    workshop. Any alias they set will be lost by the time they need it.
    Recommend editing ~/.nanorc to set appropriate flags instead, or editing
    ~/.bashrc to retain the alias, and revisiting this at the beginning of the
    lesson to make sure everyone has a consistent editing environment.

@tobyhodges
Member

Thanks for providing feedback on the setup instructions, @tkphd. Are you able to give an estimate of when you will be able to review the remainder of the lesson?

@tbooth

tbooth commented Oct 20, 2023

Thanks to both reviewers for the thorough comments so far, which have clearly taken a lot of time to put together. I've already made a draft response to most of the points, fixed many things, and triaged the more complex ones into individual issues. However I'll wait for @tkphd to submit the rest of the review before making a full response.

@tobyhodges
Member

Pinging @tkphd, to see if they have had any time to complete the review?

Trevor, I'm grateful to you for volunteering to review and I know from personal experience how one's capacity to take such things on can change at short notice. We would still love to get your perspective on the lesson, but if you need to step back from the review please let me know so that I can take this off your plate and look for somebody else.

@tkphd

tkphd commented Dec 1, 2023

Hi @tobyhodges and @tbooth, I'm still working through the lesson material. It looks good, but my curiosity/interest in Snakemake means I'm going very slowly.

While I'm learning, it would be helpful to resolve issues highlighted by the make lesson-check command:

$ make lesson-check
_extras/discuss.md:4:FIXME
./LICENSE.md:92: Unknown or missing blockquote type None
./LICENSE.md:107: Unknown or missing blockquote type None
./LICENSE.md:115: Unknown or missing blockquote type None
./LICENSE.md:120: Unknown or missing blockquote type None
./README.md:12: Unknown or missing blockquote type None
./_config.yml: configuration carpentry value incubator is not in ('swc', 'dc', 'lc', 'cp')
./_episodes/01-introduction.md:175: Unknown or missing code block type language
./_episodes/01-introduction.md:195: Unknown or missing code block type language
./_episodes/02-placeholders.md:32: Unknown or missing code block type language
./_episodes/02-placeholders.md:52: Unknown or missing code block type language
./_episodes/02-placeholders.md:157: Unknown or missing code block type language
./_episodes/02-placeholders.md:193: Unknown or missing code block type None
./_episodes/02-placeholders.md:242: Unknown or missing code block type language
./_episodes/03-chaining_rules.md:26: Unknown or missing code block type language
./_episodes/03-chaining_rules.md:51: Unknown or missing code block type language
./_episodes/03-chaining_rules.md:145: Unknown or missing code block type None
./_episodes/03-chaining_rules.md:169: Unknown or missing code block type language
./_episodes/03-chaining_rules.md:259: Unknown or missing code block type None
./_episodes/03-chaining_rules.md:306: Unknown or missing code block type language
./_episodes/03-chaining_rules.md:326: Unknown or missing code block type None
./_episodes/04-the_dag.md:26: Unknown or missing code block type None
./_episodes/04-the_dag.md:131: Unknown or missing code block type None
./_episodes/04-the_dag.md:159: Unknown or missing code block type None
./_episodes/04-the_dag.md:173: Unknown or missing code block type None
./_episodes/04-the_dag.md:200: Unknown or missing code block type None
./_episodes/04-the_dag.md:264: Unknown or missing code block type None
./_episodes/05-expansion.md:62: Unknown or missing code block type language
./_episodes/05-expansion.md:82: Unknown or missing code block type language
./_episodes/05-expansion.md:140: Unknown or missing code block type language
./_episodes/05-expansion.md:158: Unknown or missing code block type language
./_episodes/05-expansion.md:187: Unknown or missing code block type None
./_episodes/05-expansion.md:193: Unknown or missing code block type None
./_episodes/05-expansion.md:214: Unknown or missing code block type None
./_episodes/05-expansion.md:227: Unknown or missing code block type None
./_episodes/05-expansion.md:259: Unknown or missing code block type language
./_episodes/05-expansion.md:293: Unknown or missing code block type None
./_episodes/05-expansion.md:321: Unknown or missing code block type None
./_episodes/05-expansion.md:337: Unknown or missing code block type None
./_episodes/05-expansion.md:355: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:53: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:66: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:81: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:110: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:124: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:167: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:187: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:205: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:216: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:238: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:267: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:286: Unknown or missing code block type None
./_episodes/06-awkward_programs.md:320: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:67: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:81: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:114: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:136: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:167: Unknown or missing blockquote type None
./_episodes/07-finished-workflow.md:181: Unknown or missing code block type None
./_episodes/07-finished-workflow.md:229: Unknown or missing code block type None
./_episodes/08-configuring.md:28: Unknown or missing code block type None
./_episodes/08-configuring.md:41: Unknown or missing code block type None
./_episodes/08-configuring.md:64: Unknown or missing code block type None
./_episodes/08-configuring.md:76: Unknown or missing code block type None
./_episodes/08-configuring.md:100: Unknown or missing code block type None
./_episodes/08-configuring.md:110: Unknown or missing code block type None
./_episodes/08-configuring.md:137: Unknown or missing code block type None
./_episodes/08-configuring.md:152: Unknown or missing code block type None
./_episodes/08-configuring.md:160: Unknown or missing code block type None
./_episodes/08-configuring.md:178: Unknown or missing code block type None
./_episodes/08-configuring.md:186: Unknown or missing code block type None
./_episodes/08-configuring.md:216: Unknown or missing code block type None
./_episodes/08-configuring.md:225: Unknown or missing code block type None
./_episodes/08-configuring.md:232: Unknown or missing code block type None
./_episodes/09-performance.md:55: Unknown or missing code block type None
./_episodes/09-performance.md:61: Unknown or missing code block type None
./_episodes/09-performance.md:67: Unknown or missing code block type None
./_episodes/09-performance.md:82: Unknown or missing code block type None
./_episodes/09-performance.md:128: Unknown or missing code block type None
./_episodes/09-performance.md:181: Unknown or missing code block type None
./_episodes/10-conda_integration.md:78: Unknown or missing code block type None
./_episodes/10-conda_integration.md:103: Unknown or missing code block type None
./_episodes/10-conda_integration.md:111: Unknown or missing code block type None
./_episodes/10-conda_integration.md:120: Unknown or missing code block type None
./_episodes/10-conda_integration.md:142: Unknown or missing code block type None
./_episodes/10-conda_integration.md:154: Unknown or missing code block type None
./_episodes/10-conda_integration.md:165: Unknown or missing code block type None
./_episodes/10-conda_integration.md:226: Unknown or missing code block type None
./_episodes/10-conda_integration.md:236: Unknown or missing code block type None
./_episodes/10-conda_integration.md:243: Unknown or missing code block type None
./_episodes/10-conda_integration.md:250: Unknown or missing code block type None
./_episodes/10-conda_integration.md:266: Unknown or missing code block type None
./_episodes/11-assembly_challenge.md:81: Unknown or missing code block type None
./_episodes/11-assembly_challenge.md:88: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:31: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:46: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:57: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:69: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:77: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:104: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:111: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:127: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:157: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:174: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:192: Unknown or missing code block type None
./_episodes/12-cleaning_up.md:240: Unknown or missing code block type None
./_episodes/13-quoting.md:112: Unknown or missing code block type None
./_episodes/13-quoting.md:128: Unknown or missing code block type None
./_episodes/13-quoting.md:140: Unknown or missing code block type None
./_episodes/13-quoting.md:154: Unknown or missing code block type None
./_episodes/13-quoting.md:164: Unknown or missing code block type None
./_episodes/13-quoting.md:177: Unknown or missing code block type None
./_extras/guide.md:23: Unknown or missing blockquote type None
./setup.md:139: Unknown or missing code block type None
make: *** [Makefile:135: lesson-check] Error 1

@tbooth

tbooth commented Mar 5, 2024

Over the last couple of weeks I have moved the lesson from the old format to the new RMarkdown format. I have also reviewed the episodes in light of recent changes to Snakemake and tested/updated everything for Snakemake 8.5 (the latest version). I believe I've also resolved many if not most of the things mentioned by the two reviewers. Possibly I have introduced an error or two, but I'm teaching the whole course in just a couple of weeks so that should flush out any obvious errata.

I have further time I can spend on this during March, after which a funding deadline expires, so is there any chance we could get this review wrapped up in the next month? @tkphd if you have further comments to add then could you please add them as soon as you have them, and I'll make changes accordingly. If you want to see a stable version of the lessons without my recent changes then the legacy/gh_pages branch in GitHub holds the old version.

@tbooth

tbooth commented Mar 5, 2024

It seems the auto-build of the lesson pages from my latest push to GitHub has not worked.

https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/actions/runs/8160552892/job/22307376337

@tobyhodges can you see why this might be? The local sandpaper::serve() from my laptop works fine.

@tbooth

tbooth commented Mar 12, 2024

I had to fiddle with the .github/workflows/sandpaper-main.yaml file, but the lessons on https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/ are now correctly built from the latest edits.

@tobyhodges
Member

Great to see the lesson transitioned to use the new infrastructure, @tbooth.

It seems the auto-build of the lesson pages from my latest push to GitHub has not worked.

This was a bug exposed by the release of R v4.3 at the end of last month, which was fixed in a recent release of pegboard. BTW I have now enabled the automated pull requests to update workflows on your repository.

@tobyhodges
Member

Hi folks, in private discussion @tkphd and I agreed that they will step back from this lesson review.

I have begun looking for another reviewer, and I hope we can get some momentum going here again shortly. @tbooth thank you for your patience, please let me know what questions and/or concerns you have.

@tbooth

tbooth commented Jun 21, 2024

Thanks, Toby. The good news is I still have funding to work on this and maintain and improve the lesson, and I'm still teaching the material myself at least a couple of times a year. It would be great to be able to say "I wrote a Carpentries Lesson" and I've had a few people ask why it's still in limbo. The comments from both reviewers were really useful though, so I don't mind having a new reviewer (or indeed feedback and issue reports from anyone), but I really hope they can get on to it quickly.

@tobyhodges
Member

@cmeesters thank you for volunteering to review a lesson for The Carpentries Lab. Please can you confirm that you are happy to review this Snakemake for Bioinformatics lesson?

You can read more about the lesson review process in our Reviewer Guide.

@cmeesters

Please can you confirm that you are happy to review this Snakemake for Bioinformatics lesson?

Yes.

@tbooth

tbooth commented Jun 25, 2024

Thanks for stepping in @cmeesters. I don't know if you plan to send all the review comments together or in stages, but if you flag up anything here that relates to technical issues or specific fixes I'll open an issue on the issue tracker.

@cmeesters

Hi,

I am sorry it took me so long: I was pretty busy and am now on holiday. I can only provide feedback for two more days, then I will be offline for two weeks.

Anyway:

Review

General Notes

Thank you for your contribution of this new teaching material for reproducible data analysis with Snakemake! I sincerely hope not to discourage you by requiring a little more work to make this material more impactful and reliable.

Chapter Notes

Setup

The Setup is mostly well described. As Snakemake recommends mamba, mamba should be mentioned, too.

Minor issues:

  • the directory for conda/mamba/micromamba does not need to be created; deviating from the standard setup is not necessary.
  • retrieving the conda_env.yaml is described with "get the file". Perhaps a wget command should be included and described for newbies (like exercised below)?
  • the environment file is frighteningly complex: too many dependencies are defined explicitly - conda (and its implementations) are designed to resolve dependencies automatically. Please restrict yourself to the necessary applications.
  • fastx is totally outdated; tools like cutadapt are meanwhile good replacements. It is ok to use fastx for didactics, though, as participants can view all steps for quality processing in detail. A note on the state of the art should be given regardless.
  • versions are pinned for Snakemake and other dependencies, rather than defining minimum versions >=. The way it is, becoming outdated is only a question of time.
  • new users should be introduced into creating separate environments, a conda environment can be created with the required files in one command
  • "See this link for details about this dataset and the redistribution licence." contains the link https://figshare.com/articles/dataset/data-for-snakemake-novice-bioinformatics_tar_xz/19733338/1. It leads to a description on summary level, but also a "sorry, we can't preview this file" - which is slightly irritating. Is figshare a good place for non-figure data?
  • when introducing gedit, it is recommended to start the editor in the background. I noticed, myself, that you have to describe what the ampersand (&) actually does to newbies.
  • likewise with Nano, a number of rather "cryptic" command line flags are given and not explained.
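
To illustrate the point about the ampersand, here is a minimal sketch of what backgrounding does. It uses sleep as a stand-in for a graphical editor (in the lesson the command would be something like gedit Snakefile &); the variable names are only for illustration.

```shell
# Stand-in for a GUI editor such as gedit: any long-running command works.
sleep 1 &                 # the trailing & starts the command in the background
editor_pid=$!             # $! holds the process ID of the latest background job

# The prompt returns immediately, so further commands can be typed here.
echo "shell is free again while PID ${editor_pid} keeps running"

wait "${editor_pid}"      # block until the background job has finished
status=$?
echo "background job exited with status ${status}"
```

Without the ampersand, the terminal would stay blocked until the editor window is closed.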

Chapter 1 - Running commands with Snakemake

I really like the data set selection and description in the chapter! However, the description of Snakemake, reproducible computing and the whole fuss is all too brief as a motivation. Some hints: Snakemake is part of NumFOCUS, has over a million downloads, and an awful lot of citations.

Some minor issues:

  • when starting to investigate the data, participants are expected to look into a directory. This will not work without a cd snakemake_data first.
  • instead of using a subshell and arithmetic in bash, one could run grep -c @ reads/ref1_1.f, too.
  • the option -F is not explained, not asked for (like -p) and in the first run not necessary - only later -F gets explained
  • reading files with input redirection (<) is rather advanced and rarely seen (except for FASTX, of course).
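
As a rough comparison of the two counting approaches mentioned above, the sketch below builds a tiny made-up FASTQ file (two reads, invented for this example) and counts its reads both ways. One caveat it is meant to show: '@' is also a valid quality character in the common Phred+33 encoding, so a plain grep for '@' can overcount.

```shell
# Build a tiny two-read FASTQ file in a temporary directory (made-up data).
tmpdir=$(mktemp -d)
fastq="${tmpdir}/example_1.fq"
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' '@III' > "${fastq}"

# Subshell plus arithmetic: a FASTQ record is four lines long.
# The < feeds the file to wc on standard input, so wc prints only the number.
by_lines=$(( $(wc -l < "${fastq}") / 4 ))

# grep -c counts matching lines; here one quality line also contains '@'.
by_grep=$(grep -c '@' "${fastq}")

echo "wc-based count:   ${by_lines}"
echo "grep-based count: ${by_grep}"
rm -r "${tmpdir}"
```

On this toy file the line-based count gives 2 and the grep-based count gives 3, so the shorter command is only safe when the quality strings are known not to contain '@'.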

Chapter 2 - Placeholder and Wildcards

Major Issue:

  • here it does not pay off to start with a counting reads rule: there is no motivation to do so. It is not necessary and there is no scientific connection to the DAG. So, I consider this a - non-severe - breach of didactics 101.

Minor issues:

  • -F is used constantly.

Chapter 3 - Chaining Rules

The illustration of the processing steps is well done. I like the box about error creation and handling very much, this is well explained.

Major Issues:

  • putting output before input is syntactically correct, but violates the de-facto standard we use for workflows. It should not be introduced as good practice.
  • Kallisto performs (wording according to docs) a "pseudoalignment" - it is not a classical aligner and should not be mentioned as such.
  • we get to named in- and output without explaining the background or that in- and output are usually lists.
  • it is not explained what an index actually is. This is of algorithmic importance, albeit not for this course. However, introducing it without giving the background should be avoided.
  • the log directive is crucial. Yet there is no rule using it, no task implementing it, and no lookup performed by the participants.
  • log is its own directive, not a part of the output directive. The shown solution does work, but separating logs and outputs is a jolly good idea - mixing it up will lead to quirky workflows.
  • you should offer a solution for every task.
  • please verify that your Snakefiles always work! I had some minor issues.
  • to use the log directives, you can use cmd -i {input} -o {output} &> {log} for ordinary commands.
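
The redirection in the last point can be tried outside Snakemake. The following is a minimal sketch in which a shell variable stands in for the {log} placeholder and noisy_cmd is a made-up command. It uses the portable form > file 2>&1; in Bash, cmd &> file (and cmd >& file) is shorthand for the same thing.

```shell
tmpdir=$(mktemp -d)
log="${tmpdir}/example.log"   # stands in for Snakemake's {log} placeholder

# A made-up command that writes one line to stdout and one to stderr.
noisy_cmd() {
  echo "progress message"
  echo "warning message" >&2
}

# Send stdout to the log file, then point stderr at the same destination.
# In Bash this can be shortened to: noisy_cmd &> "${log}"
noisy_cmd > "${log}" 2>&1

log_lines=$(( $(wc -l < "${log}") ))
echo "lines captured in the log: ${log_lines}"
rm -r "${tmpdir}"
```

Both messages end up in the log file, which is what keeps the terminal output readable when many jobs run at once.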

Minor Issues:

  • the brewing solution does not relate well to the task setting. Perhaps you should ask participants to plan an ordinary kitchen task as if for a computer? Being trivial is fine, here.
  • "If you know about the Kallisto software..." make no such assumption. Explain rather than presume.

Chapter 4 - The DAG

The DAG is nicely introduced. Also, the -F flag is finally introduced. I am still not sure whether it is a good idea to risk the puzzlement at first.

Major Issue:

  • the solution does not work - particularly, in the light of what I described, consider this deviation:
# Kallisto quantification of one sample
rule kallisto_quant:
    output:
        h5   = "kallisto.{sample}/abundance.h5",
        tsv  = "kallisto.{sample}/abundance.tsv",
        json = "kallisto.{sample}/run_info.json",
    input:
        index = "transcriptome/Saccharomyces_cerevisiae.R64-1-1.kallisto_index",
        fq1   = "trimmed/{sample}_1.fq",
        fq2   = "trimmed/{sample}_2.fq",
    shell:
        "kallisto quant -i {input.index} -o kallisto.{wildcards.sample} {input.fq1} {input.fq2}"

rule kallisto_index:
    output:
        idx = "transcriptome/{strain}.kallisto_index",
    log:
        log = "log/{strain}.kallisto_log",
    input:
        fasta = "transcriptome/{strain}.cdna.all.fa"
    shell:
        "kallisto index -i {output.idx} {input.fasta} >& {log}"

Idea:

  • I usually prefer using dot - and demonstrate all possible workflow representations. This way, participants know how to display and copy&paste "their" DAGs for presentation purposes.

Chapter 5 - Processing lists of Inputs

It's a very nice chapter. Personally, as a teacher, I would very much prefer to split this into tiny tasks and have a solution for every sub-task. But that is a matter of taste and perhaps not so easy to implement.

Chapter 6 - Awkward programs

In principle, very nice - it's an introduction for newbies and you cannot cover everything. However, we recommend using Snakemake wrappers to handle awkward programs to increase the stability of workflows wherever possible. There is not even an outlook to wrappers. This is - IMO - a major flaw.

Chapter 7 - Finishing

Again, a nice description. The solution, however, does not work, and stating in the file # "rule all_counts" has been removed to reduce clutter might seem like a good idea, but a "solution" for reference should be well tested and complete up to the description in the chapter.

Chapter 8 - Configuration

A pretty good chapter. Yet, now there cannot be a one-file solution. You ask participants to download the solution and split the content into different files manually. This is pretty error-prone. Do not do that: You will realize upon teaching, that participants will stumble and comprehension comes from seeing. You need to require, that the configuration file gets into a separate folder, and you need to provide individual solutions for this.

Chapter 9 - Optimizing the Workflow

Good start. There are, however, a few major issues:

  • The link https://snakemake.readthedocs.io/en/stable/executing/cluster.html is a) outdated and b) does not exist any more. Instead, I would prefer a pointer to the plugin catalogue (https://snakemake.github.io/snakemake-plugin-catalog/). If using the SLURM batch system, you can point to the SLURM executor plugin.
  • describe the difference between global and per-rule requirements. A global requirement might be snakemake, the executor plugin, the file system plugin (as a remedy to I/O contention and perhaps some remote file plugin for data management).

Chapter 10 - Conda integration

Rather well described. However:

  • describe --software-deployment-method / --sdm and the selection of the conda implementation
  • describe alternatives (module environments on clusters and containerized software) and how to use them
  • describe the conda directive as a semi-must.
  • usually the r channel is important for bioinformatics users, too.
  • useful additions to the condarc file include, but are not limited to: ssl_verify: false to prevent useless warnings if people have to use a proxy server. And always_yes: true to prevent confirming every install.

Chapter 11 - Designing a new workflow

This Chapter needs a major revision:

  • the assembly part comes out of the blue and is unrelated to everything before. If you want it, you need additional material, describing the background. Best put it into a separate chapter (or several), then.
  • genome assembly is an intricate challenge; recommending a relatively outdated tool like velvet is dangerous, as there are numerous follow-up implementations tailored for various genome types.
  • the design phase is ok, but does not mention the template from the Snakemake workflow catalogue. However, standardizing and contributing(!) a workflow has an enormous impact on the deployment and portability of workflows. And thereby on the whole ecosystem of Snakemake. Not mentioning the catalogue and how to contribute to it is a major flaw.
  • for the whole community it would be better, if people do not re-invent the wheel (e.g. new workflows for existing solutions), but were able to contribute to existing workflows and fix issues. This, however, requires a bit more documentation in Snakemake. A basic intro to git (pull, fork, commit, create PRs) might be helpful - and beyond the scope of this intro. Yet, perhaps a pointer to the catalogue and snakedeploy might be a good idea after all.
  • the separation of workflow and data is not taught (unless overlooked by me). Please introduce the --directory flag and the recommendation to separate workflow and data, which enables new users to apply the workflow onto several different datasets.
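
A minimal sketch of the workflow/data separation described in the last point, under assumed names: the directory layout (workflow, dataset_A, dataset_B) is hypothetical, and the snakemake commands are only printed, not executed, so the sketch runs without Snakemake installed. The flags used (--cores, --snakefile, --directory) are standard Snakemake options.

```shell
# Hypothetical layout: one workflow directory, two independent datasets.
base=$(mktemp -d)
mkdir -p "${base}/workflow" "${base}/dataset_A/reads" "${base}/dataset_B/reads"
touch "${base}/workflow/Snakefile"

# Same Snakefile, different working directories: relative input and output
# paths in the rules are resolved against the directory given by --directory.
runs=0
for dataset in dataset_A dataset_B; do
  # echo prints the command instead of running it.
  echo snakemake --cores 1 \
       --snakefile "${base}/workflow/Snakefile" \
       --directory "${base}/${dataset}"
  runs=$((runs + 1))
done

echo "commands prepared: ${runs}"
rm -r "${base}"
```

Keeping the Snakefile outside the data directories means a new dataset only needs a new --directory value, not a copy of the workflow.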

Chapter 12 -- Cleaning up

Very nice!

Issues:

  • the temp() and protected() functions are nowhere to be found in the solution.
  • I found it a nice idea to let participants implement temp() in crucial rules and do a du -sh * before and after a new run. ;-) No must, of course.
  • the shadow rules are useful, indeed. However, there is only a dry example, no real use case.

Chapter 13 - Quoting

This is rather a Shell/Bash issue and should be addressed in other courses - or as a mini intro to this one.

Review Guideline Checklist

Accessibility

  • The alternative text of all figures is accurate and sufficiently detailed *.
    • Large and/or complex figures may not be described completely in the alt text of the image and instead be described elsewhere in the main body of the episode.
  • The lesson content does not make extensive use of colloquialisms, region- or culture-specific references, or idioms.
  • The lesson content does not make extensive use of contractions (“can’t” instead of “cannot”, “we’ve” instead of “we have”, etc)

Content

  • The lesson content:
    • conforms to [The Carpentries Code of Conduct][code-of-conduct].
    • meets the objectives defined by the authors.
    • is appropriate for the target audience identified for the lesson.
    • is accurate.
    • is descriptive and easy to understand.
    • is appropriately structured to manage cognitive load.
    • does not use dismissive language.
  • Tools used in the lesson are open source or, where tools used are closed source/proprietary, there is a good reason for this e.g. no open source alternatives are available or widely-used in the lesson domain.
  • Any example data sets used in the lesson are accessible, well-described, available under a CC0 license, and representative of data typically encountered in the domain.
  • The lesson does not make use of superfluous data sets, e.g. increasing cognitive load for learners by introducing a new data set instead of reusing another that is already present in the lesson.
  • The example tasks and narrative of the lesson are appropriate and realistic.
  • The solutions to all exercises are accurate and sufficiently explained.
  • The lesson includes exercises in a variety of formats.
  • Exercise tasks and formats are appropriate for the expected experience level of the target audience.
  • All lesson and episode objectives are assessed by exercises or another opportunity for formative assessment.
  • Exercises are designed with diagnostic power.

Design

  • Learning objectives for the lesson and its episodes are clear, descriptive, and measurable. They focus on the skills being taught and not the functions/tools e.g. “filter the rows of a data frame based on the contents of one or more columns,” rather than “use the filter function on a data frame.”
  • The target audience identified for the lesson is specific and realistic.

Supporting information

  • The list of required prior skills and/or knowledge is complete and accurate.
  • The setup and installation instructions are complete, accurate, and easy to follow.
  • No key terms are missing from the lesson glossary or are not linked to definitions in an external glossary e.g. [Glosario][glosario].

Missing tick marks might seem a bit harsh. As outlined, a few things ought to be added to be up to date. Addressing these will place all tick marks automatically.

Particularly:

  • the separation of data and code, see above (not necessarily from the start)
  • the mention of wrappers
  • directory layouts
  • the run and script directives are missing, too.
  • mention the plugin ecosystem established with version 8 of Snakemake.
  • stress the co-existence of configuration files, Snakemake profiles (again: not mentioned!) and workflow profiles (not really mentioned). Show how these can shorten the command line significantly and avoid redundant settings.

I will be happy to give more feedback after the 2nd week of August and understand that with change from v7 to v8 of Snakemake there have been major changes, no breaks of workflows, but certainly a steeper learning curve with more things to digest. So, I am happy that people refrain from the MOOC idea and endeavour teaching in person!

@jdblischak

usually the r channel is important for bioinformatics users, too.

Please don't mention the r channel. It is a constant source of confusion. r is part of defaults, which is incompatible with conda-forge. bioconda is designed to be compatible with conda-forge, so any R packages not installed from bioconda should be installed from conda-forge.

In other words, I think the current section on Channel configuration and conda-forge is correct and does not need to be updated.

@cmeesters

yes, you are right - never tried fiddling with the rc file to check this. Something learned - thank you.

@tbooth

tbooth commented Jul 17, 2024

Thanks @cmeesters for all the comments. I'll split these down into issues and deal with the easy ones first, as soon as I can.

There's a common theme with this review and the partial review from @tkphd that you want the material to be more geared to "newbies", removing the pre-requisite for familiarity with shell scripting and introducing/explaining concepts like using an editor, task backgrounding ("&") and redirection ("<") within the course. This is a fundamental change. I think anyone trying to make use of Snakemake has to have some experience in scripting or programming, beyond just the basics of the interactive shell. I'd be happy to add some callouts to remind people about these points of syntax, but I really think that trying to make the course newbie-ready is a bad idea. Anyone trying to learn Snakemake without the foundational knowledge of Bash syntax is going to be taking on too much at once, and trying to pretend otherwise is doing them a disservice.

@tbooth

tbooth commented Jul 17, 2024

@tobyhodges, I'm hoping you can clear this up for us. Are you happy with the lesson to go into the Carpentries Lab as something geared towards people already familiar with shell scripting, or do I need to make it suitable for Linux newbies?

If it needs to be the latter then I really have a problem. From inception, this thing was written for experienced Bash users who are new to Snakemake (and not necessarily familiar with Python). The name snakemake-novice-bioinformatics is maybe unfortunate in this regard but that name was never my choice.

Many of the comments from @cmeesters make perfect sense if this were to be a course for Linux newbies, but I've never thought that was the remit and it's not what I set out to write. For now I'll look to make explicit references back to where some concepts are introduced in https://swcarpentry.github.io/shell-novice, since this will make it explicit that I am building on the foundational material.

@cmeesters

@tbooth

Many of the comments from @cmeesters make perfect sense if this were to be a course for Linux newbies, ...

I really do not want to interfere with you conceptually. Please focus on the Snakemake-oriented remarks, then.

@tobyhodges
Member

Thanks all for your comments so far, particularly to @cmeesters for the detailed review and @tbooth for seeking clarification so swiftly. It seems that you are reaching a resolution already, but I will pitch in with my point of view as an editor for the Lab.

Lessons in The Carpentries Lab do not need to be aimed at any particular audience, either in terms of domain expertise or level of prior knowledge. What is important is that the target audience of a lesson is defined, that this audience description is realistic, and that the design and content of the lesson is appropriate to the stated audience.

So my answer to @tbooth's question

Are you happy with the lesson to go into the Carpentries Lab as something geared towards people already familiar with shell scripting, or do I need to make it suitable for Linux newbies?

is that you absolutely do not need to change the audience of the lesson, but you may need to tweak content to better fit your stated audience and/or tweak your stated audience to fit the content (meeting somewhere in the middle may be the most practical solution).

For example, the list of prerequisites for the lesson states that learners are expected to arrive with

Familiarity with the Bash command shell, including concepts like pipes, variables, loops and scripts.

Linking to the SWC Shell lesson is great, but will give people the impression that they are ready for your lesson if they know about the stuff in that one. So please watch out for anything used in the Snakemake lesson that is not covered in SWC Shell, e.g. you mentioned background execution with & above, and I do not think that is introduced in Software Carpentry. The alternative is to try to explicitly list out the concepts you expect people to arrive knowing about (we try to do this in the Data Carpentry Image Processing with Python lesson): this can feel like overkill but in my experience proves to be a really useful resource e.g. to share with learners in advance of the workshop while they still have time to learn about one or two things they may have missed up to that point.

@cmeesters

cmeesters commented Jul 18, 2024

Two considerations:

  1. Regarding the skill set needed for this course:

I think we basically can consider two types of participants: those who are familiar with the shell, and those who are not. (There is more to consider, but we shall put everything else aside for this little thought experiment).

People who followed, for example, a bioinformatics master's course certainly need only small reminders for certain techniques, if at all. Yet there are more people who want to analyse data. I, for example, am frequently confronted with biologists who never saw a command line before. Now, we can point them to other courses. When I teach a certain practical course, I put working with the command line and logging in to remote systems before the actual data analysis part.

However, usually people are working under pressure. They just started a thesis and need data to be analysed. In this scenario there is little time to digest new content, like a predecessor bash-course, thoroughly.

This is why I, regardless of specific contents like background processes, recommend keeping the newbie in mind. I am sorry, if that came out wrong. All in all, a little(!) redundancy does not hurt.

  2. Gaps in the material (wrappers, directory structure, separation of code and data, snakedeploy, temp() not being in the solution, ...) are of far greater importance. The course is conceptually excellent. I know that tweaking it requires quite some time and probably is frustrating (it would be to me after all this time). Yet, I think it will have a tremendous positive impact.
