Merge pull request datacarpentry#19 from amyehodge/gh-pages
reviewed and updated challenge exercises for shell-genomics episodes
tracykteal authored Jul 19, 2017
2 parents 2d4d4d7 + 23b6b44 commit a996e9f
Showing 4 changed files with 827 additions and 43 deletions.
35 changes: 29 additions & 6 deletions _episodes/02-the-filesystem.md
@@ -87,8 +87,7 @@ directory. `..` means go back up a level.
* * * *
**Exercise**

Now we're going to try a hunt. Find a hidden directory in dc_sample_data list its contents
and file the text file in there. What is the name of the file?
Now we're going to try a hunt. Find the hidden directory in dc_sample_data, list its contents, and identify the name of the text file in that directory.

Hint: hidden files and folders in Unix start with '.', for example .my_hidden_directory
* * * *
@@ -111,7 +110,7 @@ Then enter the command:
ls dc_sample_data

This will list the contents of the `dc_sample_data` directory without
you having to navigate there.
your having to navigate there.

The `cd` command works in a similar way.

@@ -123,11 +122,11 @@ Try entering:
and you will jump directly to `untrimmed_fastq` without having to go through
the intermediate directory.

****
* * * *
**Exercise**

List the 'SRR097977.fastq' file from your home directory without changing directories
****
List the contents of the directory containing the 'SRR097977.fastq' file. Do this from your home directory without leaving that directory.
* * * *

## Full vs. Relative Paths

@@ -180,6 +179,30 @@ Over time, it will become easier for you to keep a mental note of the
structure of the directories that you are using and how to quickly
navigate amongst them.

***
## Relative Path Resolution

Using the filesystem diagram below, if `pwd` displays `/Users/thing`,
what will `ls ../backup` display?

1. `../backup: No such file or directory`
2. `2012-12-01 2013-01-08 2013-01-27`
3. `2012-12-01/ 2013-01-08/ 2013-01-27/`
4. `original pnas_final pnas_sub`

![File System for Challenge Questions](../fig/filesystem-challenge.svg)

> ## Solution
> 1. No: there *is* a directory `backup` in `/Users`.
> 2. No: this is the content of `Users/thing/backup`,
> but with `..` we asked for one level further up.
> 3. No: see previous explanation.
> Also, we did not specify `-F` to display `/` at the end of the directory names.
> 4. Yes: `../backup` refers to `/Users/backup`.
{: .solution}
{: .challenge}
***
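
The reasoning in the solution can be checked on a scratch copy of the relevant part of the tree. This is only a sketch: the temporary root and the choice to create every entry as a directory are assumptions made for illustration, since the diagram itself is not reproduced here.

```shell
# Hypothetical scratch tree holding only the two 'backup' directories from the question.
root=$(mktemp -d)
mkdir -p "$root/Users/thing/backup/2012-12-01" \
         "$root/Users/thing/backup/2013-01-08" \
         "$root/Users/thing/backup/2013-01-27"
mkdir -p "$root/Users/backup/original" \
         "$root/Users/backup/pnas_final" \
         "$root/Users/backup/pnas_sub"

# Stand in the equivalent of /Users/thing, then resolve the relative path.
cd "$root/Users/thing"
listing=$(ls ../backup)   # '..' is Users here, so this lists Users/backup
echo "$listing"
```

Because no `-F` is passed, `ls` prints the names without a trailing `/`, which is the point made in answer 3.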

***
**Exercise**

19 changes: 10 additions & 9 deletions _episodes/03-working-with-files.md
@@ -60,14 +60,15 @@ The `*` is expanded to include any file that ends with `.fastq`.
****
**Exercise**

Do each of the following using a single `ls` command without
navigating to a different directory.
Do each of the following tasks from your current directory using a single `ls` command.

1. List all of the files in `/bin` that start with the letter 'c'
2. List all of the files in `/bin` that contain the letter 'a'
3. List all of the files in `/bin` that end with the letter 'o'

BONUS: List all of the files in '/bin' that contain the letter 'a' or 'c'
BONUS: List all of the files in '/bin' that contain the letter 'a' or the letter 'c'

HINT: This requires a Unix wildcard that we haven't talked about yet. Try searching the internet for information about Unix wildcards to find what you need to solve the bonus problem.

****
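
To see how the shell expands `*` before `ls` ever runs, it can help to try it in a scratch directory first. The file names below are made up for illustration and are not part of the lesson data.

```shell
# Scratch directory with hypothetical file names.
demo=$(mktemp -d)
cd "$demo"
touch SRR097977.fastq SRR098026.fastq notes.txt

fastq_files=$(ls *.fastq)   # matches every name that ends with .fastq
echo "$fastq_files"

srr_files=$(ls SRR*)        # the same idea with the * at the end of the pattern
echo "$srr_files"
```

In both cases `notes.txt` is left out, because the shell only hands `ls` the names that match the pattern.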

@@ -106,8 +107,8 @@ then you could repeat command #260 by simply entering:
****
**Exercise**

1. Find the line number in your history for the last exercise (listing
files in /bin) and reissue that command.
Find the line number in your history for the command that listed all the
files in /bin.

****

@@ -174,7 +175,7 @@ works its way forward. Note, if you are at the end of the file and search
for the word "cat", `less` will not find it. You need to go to the
beginning of the file and search.

For instance, let's search for the sequence `GTGCGGGCAATTAACAGGGGTTCAC` in our file.
For instance, let's search the file we have open for the sequence `GTGCGGGCAATTAACAGGGGTTCAC`.
You can see that we go right to that sequence and can see
what it looks like.

@@ -284,9 +285,9 @@ Enter the following command:

Do the following:

1. Create a backup of your fastq files.
2. Create a backup directory.
3. Copy a backup of your files there.
1. Create a backup of your SRR097977.fastq file in the directory containing the original file.
2. Move the backup copy to the backup directory.
3. Rename the backup copy of your file.

* * * *

53 changes: 25 additions & 28 deletions _episodes/04-redirection.md
@@ -19,15 +19,12 @@ search within files without even opening them, using `grep`. Grep is a command-l
utility for searching plain-text data sets for lines matching a string or regular expression.
Let's give it a try!

Suppose we want to see how many reads in our file have really bad, with 10 consecutive Ns.
Let's search for the string NNNNNNNNNN in file.
Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive Ns.
Let's search for the string NNNNNNNNNN in the SRR098026 file.

grep NNNNNNNNNN SRR098026.fastq

We get back a lot of lines. What is we want to see the whole fastq record for each of these read.
We can use the '-B' argument for grep to return the matched line plus one before (-B 1) and two
lines after (-A 2). Since each record is four lines and the last second is the sequence, this should
give the whole record.
We get back a lot of lines. What we want to see is the whole fastq record for each of these reads. The fastq record consists of one line before the sequence information as well as two lines after. We can use the '-B' argument for grep to return the matched line plus one before: '-B 1'. With the '-A' argument, we can have grep list the two lines after also: '-A 2'.

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq
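
A small made-up file shows what the context flags do. The two records below are invented for illustration and are not taken from SRR098026.fastq.

```shell
# Two hypothetical FASTQ records; only the second contains a run of ten Ns.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGTACGTACGT' '+read1' 'IIIIIIIIIIIIIIII' \
  '@read2' 'ACNNNNNNNNNNGTAC' '+read2' '#!!!!!!!!!!!!!!!' > "$fq"

# -B1 adds the header line before the match;
# -A2 adds the '+' line and the quality line after it.
record=$(grep -B1 -A2 NNNNNNNNNN "$fq")
echo "$record"
```

Only the four lines of the second record come back. When several separated records match, grep also prints a `--` line between the groups.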

@@ -38,15 +35,15 @@ for example:
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

****
* * * *
**Exercise**

1) Search for the sequence GNATNACCACTTCC in SRR098026.fastq.
In addition to finding the sequence, have your search also return
the name of the sequence.
In addition to identifying the line containing the sequence, have your search also return
the line containing the name of the sequence (tip: the name of the sequence is listed after the '@' sign).

2) Search for that sequence in both fastq files.
****
2) Search for the sequence AAGTT in both fastq files. Get the lines containing the name of the sequence and the sequence itself.
* * * *

## Redirection

@@ -111,25 +108,23 @@ efficiently. If you want to be proficient at using the shell, you must
learn to become proficient with the pipe and redirection operators:
`|`, `>`, `>>`.



Finally, let's use the new tools in our kit and a few new ones to examine our SRA metadata file.

cd
cd dc_sample_data/
cd dc_sample_data/sra_metadata

Let's ask a few questions about the data
Let's ask a few questions about the data.

1) How many of the read libraries are paired end?
#### How many of the read libraries are paired end?

First, what are the column headers?
We know this information is somewhere in our SraRunTable.txt file; we just need to find it. First, let's look at the column headers.

head -n 1 SraRunTable.txt
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s LoadDate_s MBases_l MBytes_l ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s

That's only the first line but it is a lot to take in. 'cut' is a program that will extract columns in tab-delimited
files. It is a very good command to know. Lets look at just the first four columns in the header using the '|' readirect
and 'cut'
files. It is a very good command to know. Let's look at just the first four columns in the header using the '|' redirect
and 'cut'.

head -n 1 SraRunTable.txt | cut -f1-4
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s
@@ -154,7 +149,7 @@ for just PAIRED and count the number of hits.
cut -f3 SraRunTable.txt | grep PAIRED | wc -l
2
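
The same pipeline can be tried on a tiny stand-in table to see each stage at work. The table below is made up (it is not SraRunTable.txt) and only assumes that the layout sits in the third tab-delimited column.

```shell
# Hypothetical tab-delimited table with the library layout in column 3.
table=$(mktemp)
printf 'BioSample_s\tInsertSize_l\tLibraryLayout_s\n' > "$table"
printf 'S1\t0\tSINGLE\nS2\t500\tPAIRED\nS3\t0\tSINGLE\nS4\t300\tPAIRED\n' >> "$table"

# Keep column 3, keep only the lines containing PAIRED, then count them.
# 'tr -d' strips the padding some versions of wc put in front of the number.
paired=$(cut -f3 "$table" | grep PAIRED | wc -l | tr -d ' ')
echo "$paired"
```

Each `|` hands the output of one command to the next, so no intermediate files are needed.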

2) How many of each class of library layout are there?
#### How many of each class of library layout are there?

We can use some new tools, 'sort' and 'uniq', to extract more information. For example, cut the third column, remove the
header and sort the values. The '-v' option for grep means return all lines that DO NOT match.
@@ -168,14 +163,16 @@ count the different categories.
2 PAIRED
35 SINGLE
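
Here is a sketch of the cut, grep -v, sort and uniq -c chain on a made-up table (again, not the real SraRunTable.txt), which makes it easier to see why the sort step is needed.

```shell
# Hypothetical table: a header plus three rows, layout in column 3.
table=$(mktemp)
printf 'BioSample_s\tInsertSize_l\tLibraryLayout_s\n' > "$table"
printf 'S1\t0\tSINGLE\nS2\t500\tPAIRED\nS3\t0\tSINGLE\n' >> "$table"

# Drop the header with grep -v, sort so identical values sit next to each other,
# then let uniq -c count each run of identical lines.
counts=$(cut -f3 "$table" | grep -v LibraryLayout_s | sort | uniq -c)
echo "$counts"
```

'uniq' only collapses adjacent duplicates, so without 'sort' the two SINGLE rows here would be counted separately.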

3) Sort the metadata file by PAIRED/SINGLE and save to a new file
We can use if '-k' option for sort to specify which column to sort on. Note that this does something
#### Can we sort the file by PAIRED/SINGLE and save it to a new file?

We can use the '-k' option for sort to specify which column to sort on. Note that this does something
similar to cut's '-f'.

sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt
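
One detail worth knowing: '-k3' on its own makes the sort key run from the third field to the end of the line, and by default fields are split on whitespace rather than on tabs only. The sketch below, on a made-up table, restricts the key to column 3 and sets the delimiter explicitly. The file and variable names are assumptions for illustration.

```shell
# Hypothetical three-row table without a header, layout in column 3.
table=$(mktemp)
printf 'S1\t0\tSINGLE\nS2\t500\tPAIRED\nS3\t0\tSINGLE\n' > "$table"

tab=$(printf '\t')      # a literal tab character to use as the field separator
sorted=$(mktemp)
sort -t "$tab" -k3,3 "$table" > "$sorted"   # -k3,3 means: start and stop at field 3

first_layout=$(head -n 1 "$sorted" | cut -f3)
echo "$first_layout"
```

On a file that has a header line, the header is sorted along with the data rows, so it may not stay on the first line of the new file.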

#### Can we extract only paired end records into a new file?

4) Extract only paired end records into a new file
Do we know PAIRED only occurs in column 4? WE know there are only two in the file, so let's check.
Do we know PAIRED only occurs in column 3? We know there are only two in the file, so let's check.

grep PAIRED SraRunTable.txt | wc -l
2
@@ -185,12 +182,12 @@ OK, we are good to go.
grep PAIRED SraRunTable.txt > SraRunTable_only_paired_end.txt


****
* * * *
**Final Exercise**

1) How many sample load dates are there?

2) How many samples were loaded on each date
2) How many samples were loaded on each date?

3) Filter subsets into new files bases on load date
****
3) Filter subsets into new files based on load date.
* * * *