Commit 1952788

Added runs_per_week.sh and improved compute_production_stats.py in order to plot per-month yield. Updated docs to make Phil happy.
1 parent 4edb6a8 commit 1952788

3 files changed: +180 -33 lines changed

README.md

Lines changed: 34 additions & 19 deletions
@@ -23,36 +23,51 @@ Examples:
2323
- `python compute_undet_index_stats.py --config couch_db.yaml --mode most_undet --instrument-type HiSeqX`
2424

2525

26+
### compute_undet_index_stats.py
27+
Used to fetch stats about undetermined indexes.
28+
This script queries statusdb x_flowcell_db and fetches information about runs.
29+
The following operations are supported:
2630

27-
28-
### DupRateTrends_from_charon.py
29-
Used to fetch stats from charon about duplication rate trends and number of sequenced human genomes
31+
- check_undet_index: given a specific index, checks all FCs and prints every FC and lane where the index appears as undetermined
32+
- most_undet: outputs a summary of undetermined indexes, printing the 20 most frequently occurring indexes for each instrument type
33+
- single_sample_lanes: prints stats about HiSeqX lanes run with a single sample in it
34+
- workset_undet: prints for each workset the FC, lanes and samples where the specified index has been found in undet. For each sample the plate position is printed.
35+
- fetch_pooled_projects: returns pooled projects, that is projects that have been run in a pool.
3036

3137
#### Usage
32-
Example: `DupRateTrends_from_charon.py`
38+
Examples:
39+
40+
- Compute for each workset the FCs that contain a lane where index CTTGTAAT appears in undet at least 0.5M times:
41+
- `python compute_undet_index_stats.py --config couch_db.yaml --index CTTGTAAT --mode workset_undet --min_occurences 500000`
42+
- Compute a list of the most frequently occurring undetermined indexes for HiSeqX runs:
43+
- `python compute_undet_index_stats.py --config couch_db.yaml --mode most_undet --instrument-type HiSeqX`
44+
3345

34-
```
35-
Usage: DupRateTrends_from_charon.py
3646

37-
Options:
38-
-h, --help show this help message and exit
39-
-t TOKEN, --token TOKEN
40-
Charon API Token. Will be read from the env variable
41-
CHARON_API_TOKEN if not provided
42-
-u URL, --url URL Charon base url. Will be read from the env variable
43-
CHARON_BASE_URL if not provided
44-
```
47+
48+
49+
### runs_per_week.sh
50+
When run on Irma, prints three columns:
51+
52+
- first column is the week number
53+
- second column is the number of HiSeqX runs in that week
54+
- third column is the number of HiSeq2500 runs in that week
55+
56+
#### Usage
57+
Example: `runs_per_week.sh`
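
Review note: since the commit message mentions plotting, here is a minimal Python sketch of how the three-column output described above could be read back in. The function name `parse_runs_per_week` and the sample numbers are hypothetical and only illustrate the format (week number, HiSeqX runs, HiSeq2500 runs); they are not part of this commit.

```python
def parse_runs_per_week(text):
    """Parse 'week hiseqx_runs hiseq2500_runs' lines into a list of tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        week, x_runs, non_x_runs = (int(f) for f in fields)
        rows.append((week, x_runs, non_x_runs))
    return rows


# Made-up sample in the documented three-column format
sample_output = """1 3 5
2 4 0
3 2 7
"""
rows = parse_runs_per_week(sample_output)
total_x = sum(r[1] for r in rows)
total_non_x = sum(r[2] for r in rows)
print(rows)
print(total_x, total_non_x)
```

The resulting tuples can be handed to any plotting tool; the sketch deliberately stops before that step.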
58+
4559

4660

4761
### compute_production_stats.py
48-
This scripts queries statusdb x_flowcelldb and project database and fetches informations about what organism have been sequenced. More in detail:
62+
This script queries the statusdb x_flowcelldb and project databases and fetches information useful for plotting trends and aggregated data. It can be run in three modes:
63+
64+
- production-stats: for each instrument type it prints number of FCs, number of lanes, etc. It then prints a summary of all stats
65+
- instrument-usage: for each instrument type and year it prints different run set-ups and samples run with that set-up
66+
- year-stats: cumulative data production by month
4967

50-
- reports total number of lanes sequenced per year
51-
- reports total number of Human lanes and of Non-Human lanes sequenced (divided per instrument)
52-
- other stats...
5368

5469
##### Usage
55-
Example: `compute_production_stats.py --config couchdb.yaml`
70+
Example: `compute_production_stats.py --config couchdb.yaml --mode year-stats`
5671
```
5772
Usage: compute_production_stats.py --config couchdb.yam
5873

compute_production_stats.py

Lines changed: 92 additions & 14 deletions
@@ -235,10 +235,8 @@ def instrument_usage():
235235
}
236236
flowcell_db = couch["x_flowcells"]
237237
project_sequenced = {}
238-
FC = 0
238+
instrument_runs_per_week = {}
239239
for fc_doc in flowcell_db:
240-
if FC > 50:
241-
continue
242240
if 'RunInfo' not in flowcell_db[fc_doc]:
243241
continue
244242
instrument = flowcell_db[fc_doc]["RunInfo"]['Instrument']
@@ -285,6 +283,13 @@ def instrument_usage():
285283
month = int(flowcell_db[fc_doc]['RunInfo']['Date'][2:4])
286284
day = int(flowcell_db[fc_doc]['RunInfo']['Date'][4:6])
287285
date_seq = datetime(year , month , day )
286+
if instrument not in instrument_runs_per_week:
287+
instrument_runs_per_week[instrument] = {}
288+
date_entry = "{}_{}".format(year, month) # key is year plus month; the week-number variant (date_seq.isocalendar()[1]) is disabled
289+
if date_entry not in instrument_runs_per_week[instrument]:
290+
instrument_runs_per_week[instrument][date_entry] = 1
291+
else:
292+
instrument_runs_per_week[instrument][date_entry] += 1
288293
for lane in projects_in_lanes:
289294
for project in projects_in_lanes[lane]:
290295
projects[project]['lanes'] += 1
@@ -293,11 +298,8 @@ def instrument_usage():
293298
projects[project]['date'] = date_seq
294299
else:
295300
projects[project]['date'] = date_seq
296-
#FC += 1
297301
flowcell_db = couch["flowcells"]
298302
for fc_doc in flowcell_db:
299-
if FC > 100:
300-
continue
301303
if 'RunInfo' not in flowcell_db[fc_doc]:
302304
continue
303305
if 'Date' not in flowcell_db[fc_doc]['RunInfo']:
@@ -307,17 +309,15 @@ def instrument_usage():
307309
print "run {} too old".format(flowcell_db[fc_doc]['RunInfo']['Id'])
308310
continue
309311
if 'Instrument' not in flowcell_db[fc_doc]["RunInfo"]:
310-
import pdb
311-
pdb.set_trace()
312+
print "ERROR: Instrument not found in RunInfo: how is this possible?"
313+
sys.exit(1)
312314

313315
instrument = flowcell_db[fc_doc]["RunInfo"]['Instrument']
314316
if 'illumina' not in flowcell_db[fc_doc]:
315317
print "Not illumina field found in doc {}".format(fc_doc)
316318
continue
317319
if 'Demultiplex_Stats' not in flowcell_db[fc_doc]['illumina']:
318320
print "Not Demultiplex_Stats field found in doc {}".format(fc_doc)
319-
import pdb
320-
pdb.set_trace()
321321
continue
322322
if 'Barcode_lane_statistics' not in flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']:
323323
print "Not Barcode_lane_statistics field found in doc {}".format(fc_doc)
@@ -359,6 +359,14 @@ def instrument_usage():
359359
month = int(flowcell_db[fc_doc]['RunInfo']['Date'][2:4])
360360
day = int(flowcell_db[fc_doc]['RunInfo']['Date'][4:6])
361361
date_seq = datetime(year , month , day )
362+
if instrument not in instrument_runs_per_week:
363+
instrument_runs_per_week[instrument] = {}
364+
date_entry = "{}_{}".format(year, month) # key is year plus month; the week-number variant (date_seq.isocalendar()[1]) is disabled
365+
if date_entry not in instrument_runs_per_week[instrument]:
366+
instrument_runs_per_week[instrument][date_entry] = 1
367+
else:
368+
instrument_runs_per_week[instrument][date_entry] += 1
369+
362370
for lane in projects_in_lanes:
363371
for project in projects_in_lanes[lane]:
364372
projects[project]['lanes'] += 1
@@ -367,9 +375,23 @@ def instrument_usage():
367375
projects[project]['date'] = date_seq
368376
else:
369377
projects[project]['date'] = date_seq
370-
#FC += 1
371-
372378

379+
years = [2013, 2014, 2015, 2016, 2017]
380+
sys.stdout.write('date,')
381+
for instrument in sorted(instrument_runs_per_week):
382+
sys.stdout.write('{},'.format(instrument))
383+
sys.stdout.write('\n')
384+
for year in years:
385+
for month in xrange(1,13):
386+
date_to_search = "{}_{}".format(year, month)
387+
sys.stdout.write('{},'.format(date_to_search))
388+
for instrument in sorted(instrument_runs_per_week):
389+
if date_to_search in instrument_runs_per_week[instrument]:
390+
sys.stdout.write('{},'.format(instrument_runs_per_week[instrument][date_to_search]))
391+
else:
392+
sys.stdout.write('0,')
393+
sys.stdout.write('\n')
394+
sys.stdout.write('\n')
373395

374396
years = [2013, 2014, 2015, 2016, 2017]
375397
sequencers_year_setup = {}
@@ -409,6 +431,58 @@ def instrument_usage():
409431
sys.stdout.write('\n')
410432

411433

434+
def year_bp_production():
435+
couch = setupServer(CONFIG)
436+
db_names = ['flowcells', 'x_flowcells']
437+
flowcells = {}
438+
production_stats = {}
439+
for db_name in db_names:
440+
flowcell_db = couch[db_name]
441+
for fc_doc in flowcell_db:
442+
if 'RunInfo' not in flowcell_db[fc_doc]:
443+
continue
444+
if 'Flowcell' not in flowcell_db[fc_doc]['RunInfo']:
445+
continue
446+
fc_name = flowcell_db[fc_doc]['RunInfo']['Flowcell']
447+
if fc_name in flowcells:
448+
continue
449+
else:
450+
flowcells[fc_name] = 0
451+
year = int(flowcell_db[fc_doc]['RunInfo']['Date'][0:2])
452+
month = int(flowcell_db[fc_doc]['RunInfo']['Date'][2:4])
453+
if year < 12:
454+
continue
455+
yield_MBases = 0
456+
if 'illumina' not in flowcell_db[fc_doc]:
457+
continue
458+
if 'Demultiplex_Stats' not in flowcell_db[fc_doc]['illumina']:
459+
continue
460+
if db_name == "x_flowcells":
461+
if 'Flowcell_stats' not in flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']:
462+
continue
463+
if 'Yield (MBases)' not in flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']['Flowcell_stats']:
464+
continue
465+
yield_MBases = int(flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']['Flowcell_stats']['Yield (MBases)'].replace(',', ''))
466+
else:
467+
if 'Barcode_lane_statistics' not in flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']:
468+
continue
469+
for sample in flowcell_db[fc_doc]['illumina']['Demultiplex_Stats']['Barcode_lane_statistics']:
470+
yield_MBases += int(sample['Yield (Mbases)'].replace(',', ''))
471+
472+
if year not in production_stats:
473+
production_stats[year] = month_production = [0]*12
474+
production_stats[year][month-1] += yield_MBases
475+
sys.stdout.write(',')
476+
for year in sorted(production_stats):
477+
sys.stdout.write('{},'.format(year))
478+
sys.stdout.write('\n')
479+
for month in range(0,12,1):
480+
sys.stdout.write('{},'.format(month+1))
481+
for year in sorted(production_stats):
482+
sys.stdout.write('{},'.format(production_stats[year][month]))
483+
sys.stdout.write('\n')
484+
sys.stdout.write('\n')
485+
412486

413487

414488
def main(args):
@@ -422,7 +496,10 @@ def main(args):
422496

423497
if args.mode == 'instrument-usage':
424498
instrument_usage()
425-
499+
500+
if args.mode == 'year-stats':
501+
year_bp_production()
502+
426503

427504

428505

@@ -434,9 +511,10 @@ def main(args):
434511
parser = argparse.ArgumentParser("""This script queries statusdb x_flowcelldb and project database and fetches information about what organisms have been sequenced. It can be run in the following modes:
435512
- production-stats: for each instrument type it prints number of FCs, number of lanes, etc. It then prints a summary of all stats
436513
- instrument-usage: for each instrument type and year it prints different run set-ups and samples run with that set-up
514+
- year-stats: cumulative data production by month
437515
""")
438516
parser.add_argument('--config', help="configuration file", type=str, required=True)
439-
parser.add_argument('--mode', help="define what action needs to be executed", type=str, required=True, choices=('production-stats', 'instrument-usage'))
517+
parser.add_argument('--mode', help="define what action needs to be executed", type=str, required=True, choices=('production-stats', 'instrument-usage', 'year-stats'))
440518

441519
args = parser.parse_args()
442520
main(args)
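
Review note: the aggregation performed by `year_bp_production()` can be hard to follow in diff form, so here is a self-contained sketch of the same idea. It assumes plain dictionaries in place of the CouchDB documents; the helper names and the sample records are hypothetical. It mirrors three things from the diff: the `YYMMDD` date slicing, stripping thousands separators from the yield strings, and counting each flowcell only once. Note that the printed table holds per-month totals, not a running cumulative sum.

```python
from collections import defaultdict


def parse_run_date(date_str):
    """Split a 'YYMMDD' RunInfo date into (two-digit year, month)."""
    return int(date_str[0:2]), int(date_str[2:4])


def parse_yield(value):
    """Convert a yield string such as '1,234' into an integer."""
    return int(value.replace(',', ''))


def monthly_yield(runs):
    """Sum yield (MBases) per year and month, counting each flowcell once."""
    seen = set()
    stats = defaultdict(lambda: [0] * 12)
    for run in runs:
        if run['flowcell'] in seen:
            continue  # same flowcell present in both databases
        seen.add(run['flowcell'])
        year, month = parse_run_date(run['date'])
        stats[year][month - 1] += parse_yield(run['yield'])
    return stats


# Made-up records for illustration only
runs = [
    {'flowcell': 'FC1', 'date': '160105', 'yield': '1,000'},
    {'flowcell': 'FC2', 'date': '160120', 'yield': '2,500'},
    {'flowcell': 'FC1', 'date': '160105', 'yield': '1,000'},  # duplicate
    {'flowcell': 'FC3', 'date': '170301', 'yield': '400'},
]
stats = monthly_yield(runs)
# One row per month, one column per year, as in the script's CSV output
for month in range(12):
    row = [str(month + 1)] + [str(stats[y][month]) for y in sorted(stats)]
    print(','.join(row))
```

Using `defaultdict` removes the explicit "year not in production_stats" check; whether that is preferable in the script itself is a style choice.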

runs_per_week.sh

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
function runs_per_week()
2+
{
3+
local fc_root_folders=/proj/ngi2016003/*/
4+
local week=$1 year=$2
5+
local week_num_of_Jan_1 week_day_of_Jan_1
6+
local first_Mon
7+
local date_fmt="+%y %m %d"
8+
local date_fmt_year="+%y"
9+
local date_fmt_month="+%m"
10+
local date_fmt_day="+%d"
11+
12+
week_num_of_Jan_1=$(date -d $year-01-01 +%W)
13+
week_day_of_Jan_1=$(date -d $year-01-01 +%u)
14+
15+
if ((week_num_of_Jan_1)); then
16+
first_Mon=$year-01-01
17+
else
18+
first_Mon=$year-01-$((01 + (7 - week_day_of_Jan_1 + 1) ))
19+
fi
20+
21+
22+
YEAR=${year: -2}
23+
MONTH_START=$(date -d "$first_Mon +$((week - 1)) week" "$date_fmt_month")
24+
MONTH_END=$(date -d "$first_Mon +$((week - 1)) week + 6 day" "$date_fmt_month")
25+
DAY_START=$(date -d "$first_Mon +$((week - 1)) week" "$date_fmt_day")
26+
DAY_END=$(date -d "$first_Mon +$((week - 1)) week + 6 day" "$date_fmt_day")
27+
28+
mon=$(date -d "$first_Mon +$((week - 1)) week" "$date_fmt")
29+
sun=$(date -d "$first_Mon +$((week - 1)) week + 6 day" "$date_fmt")
30+
DAYS=()
31+
if [ $MONTH_START -ne $MONTH_END ];
32+
then
33+
DAYS=($(seq -f "$YEAR$MONTH_START%02g" $DAY_START 1 31) $(seq -f "$YEAR$MONTH_END%02g" 1 1 $DAY_END))
34+
else
35+
DAYS=($(seq -f "$YEAR$MONTH_START%02g" $DAY_START 1 $DAY_END))
36+
fi
37+
38+
RUNS_PER_WEEK_X=0
39+
RUNS_PER_WEEK_nonX=0
40+
for DAY in "${DAYS[@]}" ; do
41+
DAY_RUNS_X=`ls -d $fc_root_folders/$DAY*_ST* 2> /dev/null | wc -l`
42+
RUNS_PER_WEEK_X=`expr $DAY_RUNS_X + $RUNS_PER_WEEK_X`
43+
DAY_RUNS_nonX=`ls -d $fc_root_folders/$DAY* 2> /dev/null | grep -v ST | grep -v 000000 | wc -l`
44+
RUNS_PER_WEEK_nonX=`expr $DAY_RUNS_nonX + $RUNS_PER_WEEK_nonX`
45+
done
46+
echo $week $RUNS_PER_WEEK_X $RUNS_PER_WEEK_nonX
47+
48+
}
49+
50+
CURRENT_YEAR=`date +"%Y"`
51+
CURRENT_WEEK=`date +"%V"`
52+
for WEEK in `seq 1 1 $CURRENT_WEEK`; do
53+
runs_per_week $WEEK $CURRENT_YEAR
54+
done
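
Review note: the date arithmetic in `runs_per_week()` (find the first Monday of the year, step forward `week - 1` weeks, list the seven `YYMMDD` prefixes) can be sketched in Python as follows. The function names are hypothetical and this is not a drop-in replacement for the shell script, only an illustration of the logic under the assumption that week 1 starts on the first Monday of the year.

```python
from datetime import date, timedelta


def first_monday(year):
    """Return the first Monday on or after January 1 of the given year."""
    jan1 = date(year, 1, 1)
    # weekday() is 0 for Monday, so this adds between 0 and 6 days
    return jan1 + timedelta(days=(7 - jan1.weekday()) % 7)


def week_day_prefixes(week, year):
    """Return the seven 'YYMMDD' prefixes for the given 1-based week."""
    monday = first_monday(year) + timedelta(weeks=week - 1)
    return [(monday + timedelta(days=i)).strftime('%y%m%d') for i in range(7)]


prefixes = week_day_prefixes(5, 2017)
print(prefixes)
```

Two observations follow from this. A week that spans two months needs no special case here, whereas the shell version enumerates days up to 31 for the first month and relies on `ls` finding nothing for dates that do not exist. Also, the driver loop bounds the week count with `date +%V` (ISO week number), while the function counts weeks from the first Monday of the year; these two numbering schemes can differ by one in years where January 1 falls on a Tuesday, Wednesday or Thursday.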
