---
title: "Phenotype Testing"
author: "Eugene Gardner"
date: "03 December 2020"
output: 
  html_document:
    toc: true
    toc_depth: 2
---

# 1. Startup and Introduction

This document processes UK Biobank (UKBB) phenotype data and compares those phenotypes, both with each other and with the trait of interest, fertility. If using data produced by this repo, please cite [our manuscript](https://www.biorxiv.org/content/10.1101/2020.05.26.116111v2).

**Big Note**: The first part of this document involves running scripts to generate text files required for downstream analysis. _PLEASE_ start there and make sure all scripts ran successfully. All scripts for this section are available in the folder `./scripts/` and _will not_ run as part of this document. There are two additional resources that you need to download: the OMIM morbid map (which requires registration) and DDG2P. See the section [on disease genes](#disease_genes) below.

**Big Note**: You also need to have access to UKBiobank, but this script is agnostic to the UKBiobank application number. You should be able to download a bulk phenotype file and, if it contains the correct phenotypes as referred to in the manuscript and in the file `rawdata/phenofiles/fields_to_extract.txt`, you should be able to reproduce our data and figures.

**Big Note**: Due to legacy variable naming, the terms **FI** and **CHOD** are synonymous in this document. FI stands for "first incidence", the internal UKBB name for the term we use in the manuscript: Complete Health Outcomes Data (i.e. CHOD).

You can view a compiled html version of this document with all code run either within this repository at `compiled_htmls/PhenotypeTesting.html` or on [github](https://htmlpreview.github.io/?https://github.com/eugenegardner/UKBBFertility/blob/master/compiled_html/PhenotypeTesting.html).

## 1A. Libraries

```{r setup}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE ## Warnings turned off to squelch unnecessary ggplot noise in knitted document. Have checked all for accuracy.
)

## Quietly Load Libraries
load.package <- function(name) {
  suppressMessages(suppressWarnings(library(name, quietly = T, warn.conflicts = F, character.only = T)))
}

load.package("biomaRt") ## Get gene lists we need
load.package("readxl") ## Read Supplemental Excel tables
load.package("data.table") ## Better than data.frame
load.package("patchwork") ## Arranging ggplots
load.package("broom") ## Makes getting covars out of lm much tidier
load.package("meta") ## For doing meta analysis
load.package("mratios") ## Need this to calculate 95% CIs for ratios of two means
load.package("svglite") ## Need to create main text figures properly (ggsave doesn't like anything with an alpha, and .SVG is easier to edit in Illustrator)
load.package("tidyverse") ## Takes care of ggplot, tidyr, dplyr, and stringr
load.package("rcompanion") ## For getting incremental pseudo-R2
load.package("lubridate") ## For dealing with participant birthdays
```

## 1B. Themes

Set themes for internal figures just for testing purposes:

```{r Base Themes}
theme <- theme(panel.background=element_rect(fill="white"),line=element_line(size=1,colour="black",lineend="round"),axis.line=element_line(size=1),text=element_text(size=16,face="bold",colour="black"),axis.text=element_text(colour="black"),axis.ticks=element_line(size=1,colour="black"),axis.ticks.length=unit(.1,"cm"),strip.background=element_rect(fill="white"),axis.text.x=element_text(angle=45,hjust=1),legend.position="none",panel.grid.major=element_line(colour="grey",size=0.5),legend.key=element_blank())

## Default theme w/legend
theme.legend <- theme + theme(legend.position="right")

del.line <- "#4F7942"
del.fill <- "#77DD77"
dup.line <- "#0000FF"
dup.fill <- "#AEC6CF"
```

Theme for main text/supplemental figures (making font size smaller):

```{r Figure Themes}
theme.figures <- theme(panel.background=element_rect(fill="white"),line=element_line(size=1,colour="black",lineend="round"),axis.line=element_line(size=1),text=element_text(size=9,face="bold",colour="black"),axis.text=element_text(colour="black"),axis.ticks=element_line(size=1,colour="black"),axis.ticks.length=unit(.1,"cm"),strip.background=element_rect(fill="white"),axis.text.x=element_text(angle=45,hjust=1),legend.position="none",panel.grid.major=element_line(colour="grey",size=0.5),legend.key=element_blank())

## Default theme w/legend
theme.figures.legend <- theme.figures + theme(legend.position="right")

## Colour Scheme:
male.col <- "#38BCA0"
female.col <- "#7B06F8"

sex.colours <- c(male.col, female.col)
names(sex.colours) <- c("Male","Female")
sex.colours.fill <- scale_fill_manual(name = "Sex",values=sex.colours,guide=guide_legend(reverse=F))
sex.colours.colour <- scale_colour_manual(name = "Sex",values=sex.colours, guide=guide_legend(reverse=F))
sex.colours.fill.rev <- scale_fill_manual(name = "Sex",values=sex.colours,guide=guide_legend(reverse=T))
sex.colours.colour.rev <- scale_colour_manual(name = "Sex",values=sex.colours, guide=guide_legend(reverse=T))

alt.colours <- c("#75485E","#CB904D","#DFCC74")
```

## 1C. Prepping Storage Directories

This just unpacks the tarball of provided data resources at `rawdata.tar.gz`.

```{bash Prepare Storage}

tar -zxf rawdata.tar.gz

```

# 2. Generating Required Text Files

Example code for downloading and initially processing the UKBB phenotype file. None of the code in this section will actually be run.

## 2A. Creating Master Phenotype File:

This code chunk is not evaluated here but is provided for replication purposes. It assumes that the user has already gained access to, and [downloaded](http://biobank.ndph.ox.ac.uk/showcase/), the relevant phenotype fields and has acquired the encoded phenotype file (like `ukb00000.enc`). The tools used below are also available via the [UKBiobank data showcase website](http://biobank.ndph.ox.ac.uk/showcase/download.cgi). More information on downloading can be found [here](https://biobank.ctsu.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.1.pdf).

```{bash Get UKBB Phenotype File, eval = F}

## First step involves downloading and decoding individual phenotype data. Keyvalue is the key provided via email when you apply for bulk download. This will create a decoded file ukb00000.enc_ukb
ukbunpack ukb00000.enc <keyvalue>

## Next, convert the file to a tab-delimited format:
ukbconv ukb00000.enc_ukb txt

## Create a data dictionary (so we know where phenotypes are in the file!)
ukbconv ukb00000.enc_ukb docs

## This should result in 2 required files for further processing:
# ukb00000.txt
# ukb00000.html
```

**Note**: the following processing expects the "ukb00000.*" data to be in the unpacked `rawdata/phenofiles/` directory!

This code chunk extracts phenotypes of relevance from the master phenotype file that is downloaded and processed above. It is run with the script: `./scripts/extract_phenotypes.pl`.

This script is slower than it should be due to an issue with the text encoding of the UKBB-created TSV file on macOS. If running this code on a UNIX system, I would suggest switching the `exec` call at the bottom of the script to use `cut` for additional speed; the script should then execute in ~30s rather than ~5 mins.

```{bash Extract Phenotypes, eval = F}

## First do most fields:
./scripts/extract_phenotypes.pl rawdata/phenofiles/fields_to_extract.txt rawdata/phenofiles/ukbb_phenotypes.txt

## Have to extract first incidence data separately as it requires additional processing:
## P.S. Make sure you run the above command first!!!
./scripts/extract_phenotypes.pl rawdata/phenofiles/fi_fields_to_extract.txt rawdata/phenofiles/fi_phenotypes.txt
./scripts/process_first_incidence.pl

```
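
The speed-up suggested above (swapping the Perl `exec` call for `cut`) could look roughly like the sketch below. It assumes that the header of the tab-delimited UKBB file uses the same `field-instance.array` names seen elsewhere in this document (e.g. `31-0.0`). The helper `build.cut.command()` and the toy header are hypothetical and are not part of `extract_phenotypes.pl`.

```r
## Hypothetical helper: build a `cut` command that keeps only the wanted columns.
build.cut.command <- function(header, wanted, infile, outfile) {
  idx <- which(header %in% wanted) # 1-based column positions, as `cut -f` expects
  if (length(idx) == 0) stop("None of the requested fields are in the header")
  sprintf("cut -f %s %s > %s",
          paste(idx, collapse = ","), shQuote(infile), shQuote(outfile))
}

## Toy header standing in for the first line of ukb00000.txt (assumed layout)
header <- c("eid", "31-0.0", "34-0.0", "52-0.0", "189-0.0")
cmd <- build.cut.command(header, c("eid", "31-0.0", "189-0.0"),
                         "rawdata/phenofiles/ukb00000.txt",
                         "rawdata/phenofiles/ukbb_phenotypes.txt")
print(cmd)
```

On a UNIX-like machine the resulting string could then be handed to `system()`.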

## 2B. Setting Unrelated Individuals:

Again, this is just an example of how relatedness information is acquired from UKBB; the code does not actually run. It proceeds in three basic steps:

1. Download the relatedness file using the [ukbgene](http://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=664) tool.
2. Place the file in `./rawdata/phenofiles/`.
3. Run the provided script, `./scripts/get_relateds.R`, to generate a list of individuals to filter.

```{bash Process Relatedness, eval = F}

## 1. Run ukbgene rel. This will download a file like: ukbXXXXX_rel_sYYYYYY.dat, where X represents your application ID, and Y represents the file version
ukbgene rel

## 2. Get related individuals to filter (change the name of the .dat to your specific file):
./scripts/get_relateds.R ukbXXXXX_rel_sYYYYYY.dat > rawdata/phenofiles/relateds.out
## Format the output file to be readable by R
perl -ne 'if ($_ =~ /\"(\d{7})\"/) {print "$1\n";}' rawdata/phenofiles/relateds.out > rawdata/phenofiles/relateds.txt
```
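
For readers less familiar with Perl, the one-liner above keeps the first quoted 7-digit ID found on each line. A minimal base-R sketch of the same idea follows; the function name `extract.eids()` and the example lines are made up for illustration and do not reflect the real contents of `relateds.out`.

```r
## Hypothetical helper mirroring the Perl one-liner:
## return the first quoted 7-digit ID on each line, skipping lines without one.
extract.eids <- function(lines) {
  hits <- regmatches(lines, regexpr('"[0-9]{7}"', lines)) # first match per line only
  gsub('"', "", hits, fixed = TRUE)                       # strip the surrounding quotes
}

example.lines <- c('[1] "1234567"', 'no id on this line', '"7654321" "1111111"')
eids <- extract.eids(example.lines)
print(eids)
```

Like the Perl version, this ignores any second ID on the same line.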

# 3. Phenotype Data

Read in the master phenotype file and the related-individuals file that were created in [the previous section](#2._generating_required_text_files):

```{r Load Master Phenotype File}

UKBB.raw.phenotypes <- fread("rawdata/phenofiles/ukbb_phenotypes.txt")
UKBB.raw.phenotypes[,eid:=as.character(eid)]

```

```{r Load Relatedness File}

related.individuals <- fread("rawdata/phenofiles/relateds.txt", header = F)
setnames(related.individuals,"V1","eid")
related.individuals[,eid:=as.character(eid)]

```

## 3A. Generic Phenotypes

Grab generic phenotypes (age, sex, ancestry, European ancestry status) and add them to the main `UKBB.phenotype.data` table.

This also filters out individuals that are not broadly European.

```{r Process Generic Phenotypes}

## Data table of all UKBB population data:
## Column names for the 40 genetic PCs (field 22009, array indices 1-40):
PCAs <- paste("22009-0", 1:40, sep=".")
fields <- c("eid","22006-0.0","31-0.0","21022-0.0","34-0.0","52-0.0",PCAs)
UKBB.phenotype.data <- UKBB.raw.phenotypes[,..fields]

setnames(UKBB.phenotype.data,c(fields),c("eid","white.british.ancestry","sexPulse","agePulse","birth.year","birth.month",paste0("PC",seq(1,40))))

## Remove all non-white British (determined direct by UKBB and taken from the pheno file)
UKBB.phenotype.data <- UKBB.phenotype.data[!is.na(white.british.ancestry)]
paste0("Number of Broadly Euro Indiv: ", table(UKBB.raw.phenotypes[,`22006-0.0`]))

## Remove all related individuals:
UKBB.phenotype.data <- UKBB.phenotype.data[!eid %in% related.individuals[,eid]]

## Change sex to be 1 = male, 2 = female instead of the UKBB coding (0 = female, 1 = male).
## This is a distinction made for an old project to keep it consistent.
## Note: individuals with mismatched genetic/reported sex are not removed in this chunk.
UKBB.phenotype.data[,sexPulse:=ifelse(sexPulse == 0, 2, 1)]

## Add agePulse.squared covar:
UKBB.phenotype.data[,agePulse.squared:=agePulse^2]
paste0("Number of Individuals after filtering: ",nrow(UKBB.phenotype.data))

## Curate birthdays:
## Have to set everybody's birthday as the 15th since UKBB doesn't want to give specific days out. Should be a reasonable approximation as it's the closest possible day for all individuals...
UKBB.phenotype.data[,birthday:=paste(birth.year,sprintf("%02d",birth.month),"15",sep="-")]
UKBB.phenotype.data[,birthday:=as.Date(birthday, format = "%Y-%m-%d")]
UKBB.phenotype.data[,birth.year.cut:=cut(birth.year, breaks = seq(1930,1970,by=5))]

UKBB.phenotype.data[,birth.month:=NULL]

## Add Fields for Excluding Specific Birth Years:
for (i in c(1940,1950,1960)) {

  age.remove <- UKBB.phenotype.data[birth.year < i | birth.year >= (i + 10),eid]
  col <- paste0("is.age.",i)
  UKBB.phenotype.data[,eval(col):=if_else(eid %in% age.remove, 1, 0)]

}

rm(PCAs,fields,i)
```
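
The `is.age.*` columns created above are exclusion flags, which is easy to misread from the name alone: a value of 1 marks someone born *outside* the decade. The toy sketch below (base R only, with made-up birth years) illustrates the same logic for the 1940 flag.

```r
## Made-up birth years for illustration
birth.year <- c(1938, 1940, 1949, 1950, 1965)
decade <- 1940

## 1 = born outside [1940, 1950), i.e. flagged for removal; 0 = born within the decade
is.age.1940 <- ifelse(birth.year < decade | birth.year >= decade + 10, 1, 0)

print(data.frame(birth.year, is.age.1940))
```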

## 3B. Recent Ancestry

Accounting for recent ancestry by using IBD segments calculated in [this](https://www.nature.com/articles/s41467-020-19588-x) manuscript. The sparse matrix of IBD sharing is available [here](https://link.tbd). When downloaded, place this file in `rawdata/phenofiles/IBD_GRM_t10.grm.gz`, and the associated IDs file in `rawdata/phenofiles/IBD_GRM_t10.grm.cat.id`.

I then use eigenvalue decomposition on the sparse matrix (similar to [this](https://elifesciences.org/articles/61548) manuscript) to generate 100 PCs from that data. I have provided the script to process the matrix at:

`./scripts/ibd_pca.py`

This script is written in python3 and requires the scipy and pandas modules to be installed. It is merely provided as an example and is not intended to be run as-is. It should take ~45 minutes to generate PCs, which should be moved to:

`./rawdata/phenofiles/IBD_PCs.txt`
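
To give a feel for what the eigenvalue decomposition step does, here is a small base-R sketch on a dense toy matrix. It is not a translation of `ibd_pca.py`. The 6x6 sharing matrix is invented, and scaling each eigenvector by the square root of its eigenvalue is one common convention that I am assuming here. A matrix the size of the real IBD GRM would need a sparse, truncated solver instead of `eigen()`.

```r
## Toy symmetric "sharing" matrix: two groups of three individuals each
n <- 6
K <- matrix(0.01, n, n)  # weak sharing between groups
K[1:3, 1:3] <- 0.2       # stronger sharing within group 1
K[4:6, 4:6] <- 0.2       # stronger sharing within group 2
diag(K) <- 1

## Eigen decomposition; eigen() returns eigenvalues in decreasing order
decomp <- eigen(K, symmetric = TRUE)

## Keep the top PCs, scaling each eigenvector by sqrt(eigenvalue) (assumed convention)
n.pcs <- 2
pcs <- decomp$vectors[, 1:n.pcs] %*% diag(sqrt(decomp$values[1:n.pcs]))
colnames(pcs) <- paste0("rare.PC", 1:n.pcs)
print(round(pcs, 3))
```

In this toy example the second PC takes opposite signs in the two groups, which is the kind of recent shared ancestry structure the PCs are meant to capture.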

Now we load those PCs and add them to `UKBB.phenotype.data`:

```{r Recent Ancestry}

recent.ancestry.PCs <- fread("rawdata/phenofiles/IBD_PCs.txt", header = F)
setnames(recent.ancestry.PCs,names(recent.ancestry.PCs),c("eid",paste0("rare.PC",seq(1,100))))
recent.ancestry.PCs[,eid:=as.character(eid)]

for (pc in paste0("rare.PC",seq(1,100))) {
  
  new.val <- paste0("scaled.",pc)
  recent.ancestry.PCs[,eval(new.val):=scale(get(pc))]
  
}

cols <- c("eid",paste0("scaled.rare.PC",seq(1,100)))
UKBB.phenotype.data <- merge(UKBB.phenotype.data,recent.ancestry.PCs[,..cols],by="eid", all.x = T)

rm(recent.ancestry.PCs)
```

## 3C. Fertility

```{r Fertility pt 1, fig.height=5, fig.width=4}

fertility.metrics <- UKBB.raw.phenotypes[,c("eid",
                                            "2405-0.0", ## Children Fathered
                                            "2734-0.0", ## Live Births
                                            "6141-0.0", ## Individuals in household
                                            "709-0.0", ## Number of individuals in household
                                            "2129-0.0", ## Answered Sex Questions
                                            "2159-0.0", ## Same sex behaviour
                                            "2139-0.0", ## Age at first sexual intercourse
                                            "31-0.0" ## Sex for other stuff
                                            )]

## Replace all NAs with a double value to make filtering easier
fertility.metrics[is.na(fertility.metrics)] <- -9

setnames(fertility.metrics,names(fertility.metrics),c("eid",
                                                      "children.fathered",
                                                      "live.births",
                                                      "in.household",
                                                      "number.in.household",
                                                      "answered.sex",
                                                      "same.sex",
                                                      "age.first.intercourse",
                                                      "sexPulse"))

## Just for making plots
fertility.metrics[,sexPulse:=ifelse(sexPulse == 0, 2, 1)]
fertility.metrics[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
                   
# Number of live births:
fertility.metrics[,live.births:=if_else(live.births > 7, -9, live.births)]
fertility.metrics[,live.births:=if_else(live.births < 0, -9, live.births)]
                    
ggplot(fertility.metrics[live.births>=0],aes(live.births,..density..)) +
  geom_histogram(binwidth = 1,colour="black",fill=female.col) +
  xlab("Live Births") +
  scale_y_continuous(name = "Proportion of Females", limits=c(0,0.5), labels = paste0(c(0,10,20,30,40,50),"%")) +
  theme

# Children fathered
fertility.metrics[,children.fathered:=if_else(children.fathered > 7, -9, children.fathered)]
fertility.metrics[,children.fathered:=if_else(children.fathered < 0, -9, children.fathered)]

ggplot(fertility.metrics[children.fathered>=0],aes(children.fathered,..density..)) +
  geom_histogram(binwidth = 1,colour="black",fill=male.col) +
  xlab("Children Fathered") +
  scale_y_continuous(name = "Proportion of Males", limits=c(0,0.5), labels = paste0(c(0,10,20,30,40,50),"%")) +
  theme

## Who is in the household is stored in an array of up to 5 values (so it can store up to 5 possible relationships)
## It is a follow-up question to "how many individuals are in your household?" and will be NA if they did not answer
## To keep this simple, only the first response is used: the follow-up data is so small that it isn't going to matter, and enumerating all possible combinations takes too long
fertility.metrics[,partner.in.house:=if_else(number.in.household < 0,-9, ## Did not answer, did not know, did not want to answer
                                               if_else(number.in.household==1,0, ## individuals living by themselves
                                                       if_else(number.in.household > 1 & in.household == 1,1,0)))] ## Check if partner present
fertility.metrics[,lives.alone:=if_else(number.in.household < 0,-9, ## Did not answer, did not know, did not want to answer
                                                      if_else(number.in.household==1,0,1))] ## NB: as coded, 0 = single-person household, 1 = lives with others

ggplot(fertility.metrics,aes(as.factor(partner.in.house),group=sexPulse,fill=sexPulse)) +
  stat_count(position="identity",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_x_discrete(name="Has Partner In Home?",labels=c("Did Not Answ.","False","True")) +
  sex.colours.fill +
  theme.legend

## Same sex sexual behaviour
fertility.metrics[,same.sex:=if_else(answered.sex == 1,
                                     if_else(same.sex == 0,0,
                                             if_else(same.sex == 1,1,-9)),
                                     -9)]

ggplot(fertility.metrics,aes(as.factor(same.sex),group=sexPulse,fill=sexPulse)) +
  stat_count(position="identity",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_discrete(name = "Engaged in Same Sex\nIntercourse", labels = c("NA","False","True")) + 
  scale_y_continuous(name = "# of Indiv.") +
  sex.colours.fill +
  theme.legend

```
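
The nested `if_else()` that derives `partner.in.house` can be hard to follow, so the sketch below restates it as a small base-R function applied to made-up inputs. As in the chunk above, -9 is the missing-value sentinel and a first household-relationship response of 1 is taken to mean a partner is present. The function name `code.partner()` is hypothetical.

```r
## Hypothetical restatement of the partner.in.house coding
code.partner <- function(number.in.household, in.household) {
  ifelse(number.in.household < 0, -9,     # did not answer / unknown
         ifelse(number.in.household == 1, 0, # single-person household
                ifelse(number.in.household > 1 & in.household == 1, 1, 0)))
}

## Made-up examples: missing, alone, with partner, with non-partner, with partner
partner <- code.partner(number.in.household = c(-3, 1, 2, 4, 3),
                        in.household        = c(-9, -9, 1, 2, 1))
print(partner)
```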

Separate block so plots don't get stretched out weird.

```{r Fertility pt 2, fig.height=5, fig.width=10}

## Age at first sexual intercourse
# This just modifies the special codes so they spread out on the plot
fertility.metrics[,plot:=ifelse(age.first.intercourse < 0, age.first.intercourse * 5, age.first.intercourse)]
                                          
ggplot(fertility.metrics, aes(plot,..density..,group = sexPulse, fill = sexPulse)) + 
  geom_histogram(binwidth=1, position = position_dodge()) + 
  scale_x_continuous(name = "Age at First Intercourse", breaks = c(-15,-10,-5,0,10,20,30,40,50,60), labels = c("Prefer not to answer","Never had sex", "Do not know", 0, 10, 20, 30, 40, 50, 60)) + 
  scale_y_continuous("% Individuals") + 
  sex.colours.fill + 
  theme.legend    

## Has had sexual intercourse -  which is derived from age at first sexual intercourse data
fertility.metrics[,had.sex:=if_else(answered.sex != 1, -9,
                                    if_else(age.first.intercourse == -3 | age.first.intercourse == -1, -9,
                                            if_else(age.first.intercourse == -2, 0, 1)))]

table(fertility.metrics[,c("sexPulse","had.sex")])
prop.table(table(fertility.metrics[,c("sexPulse","had.sex")]),margin = 1) * 100

# Check for immaculate conceptions...
# V1 is children
table(fertility.metrics[sexPulse == "Male",list(if_else(children.fathered>0,1,0),had.sex)][,c("V1","had.sex")])
table(fertility.metrics[sexPulse == "Female",list(if_else(live.births>0,1,0),had.sex)][,c("V1","had.sex")])

## Now convert all the -9s back to NA
fertility.metrics[,num.children:=if_else(sexPulse=="Male",children.fathered,live.births)]
fertility.metrics <- fertility.metrics[,c("eid","children.fathered","live.births","num.children","partner.in.house","lives.alone","same.sex","had.sex")]
fertility.metrics[,children.fathered:=if_else(children.fathered==-9,as.numeric(NA),children.fathered)]
fertility.metrics[,live.births:=if_else(live.births==-9,as.numeric(NA),live.births)]
fertility.metrics[,partner.in.house:=if_else(partner.in.house==-9,as.numeric(NA),partner.in.house)]
fertility.metrics[,lives.alone:=if_else(lives.alone==-9,as.numeric(NA),lives.alone)]
fertility.metrics[,same.sex:=if_else(same.sex==-9,as.numeric(NA),same.sex)]
fertility.metrics[,had.sex:=if_else(had.sex==-9,as.numeric(NA),had.sex)]
fertility.metrics[,num.children:=if_else(num.children==-9,as.numeric(NA),num.children)]

UKBB.phenotype.data <- merge(UKBB.phenotype.data,fertility.metrics,by="eid")
rm(fertility.metrics)
```
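
The pattern used above, a -9 sentinel while filtering followed by conversion back to `NA`, can be written once as a helper. The following is a minimal base-R sketch on a toy data frame. `sentinel.to.na()` is a hypothetical name and is not used elsewhere in this document.

```r
## Hypothetical helper: replace a sentinel value with NA, leaving existing NAs alone
sentinel.to.na <- function(x, sentinel = -9) {
  x[!is.na(x) & x == sentinel] <- NA
  x
}

## Toy data using the same -9 convention as the chunk above
toy <- data.frame(live.births = c(2, -9, 0), had.sex = c(1, 1, -9))
toy[] <- lapply(toy, sentinel.to.na) # apply column by column, keeping the data frame
print(toy)
```

Because each column is tested against its own values, this avoids the copy-and-paste risk of checking one column while overwriting another.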

## 3D. Fluid Intelligence

This section looks at UKBB "Fluid Intelligence". 

The field of relevance for us is 20016-0.0. There is no array data (only 1 test was done, but 3 instances - we're using instance 0).

```{r Fluid Intelligence}

## Grab fluid intel from the raw phenotypes:
fluid.intel.table <- UKBB.raw.phenotypes[,c("eid","20016-0.0")]
fluid.intel.table[,fluid.intel:=if_else(is.na(`20016-0.0`),as.integer(NA),`20016-0.0`)]
ggplot(merge(UKBB.phenotype.data,fluid.intel.table,by="eid"),aes(fluid.intel,group=as.factor(sexPulse),fill=as.factor(sexPulse))) +
  geom_histogram(binwidth = 1, position="identity", alpha = 0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_continuous(name="Fluid Intel Score") +
  scale_y_continuous(name = "# of Individuals") +
  theme.legend

## Normalize fluid intelligence
fluid.intel.table[,fluid.intel:=(fluid.intel-mean(fluid.intel,na.rm=T))/sd(fluid.intel,na.rm=T)]
ggplot(merge(UKBB.phenotype.data,fluid.intel.table,by="eid"),aes(fluid.intel,group=as.factor(sexPulse),fill=as.factor(sexPulse))) +
  geom_histogram(binwidth = 1, position="identity", alpha = 0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_continuous(name="Normalized Fluid Intel Score") +
  scale_y_continuous(name = "# of Individuals") +
  theme.legend

## Merge with remaining phenotypes
UKBB.phenotype.data <- merge(UKBB.phenotype.data,fluid.intel.table[,c("eid","fluid.intel")],by="eid",all.x=T)

rm(fluid.intel.table)
```
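
The normalisation above is a standard z-score that ignores missing values. The short sketch below applies it to made-up scores to show that missing entries stay missing while the rest end up with mean 0 and standard deviation 1.

```r
## Made-up fluid intelligence scores, including one missing value
fluid.intel <- c(3, 6, 7, NA, 9)

## z-score: subtract the mean and divide by the standard deviation, ignoring NAs
z <- (fluid.intel - mean(fluid.intel, na.rm = TRUE)) / sd(fluid.intel, na.rm = TRUE)
print(round(z, 3))
```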

## 3E. Educational Attainment

Analyzing fields for Educational Attainment.

Field 6138 (instances 0/1/2) contains the educational attainment measures. There are a maximum of 5 values for each instance (to accommodate multiple levels of qualifications). We use a binary for did/did not complete a college degree, which is highly correlated with years of education according to [this](https://academic.oup.com/sf/article-abstract/92/1/109/2235872?redirectedFrom=fulltext) article (I cannot actually read it as it is behind a paywall, but it was cited in [this](https://static-content.springer.com/esm/art%3A10.1038%2Fnature17671/MediaObjects/41586_2016_BFnature17671_MOESM48_ESM.pdf) study as an explanation for why they don't care about years of schooling vs. college education).

```{r Educational Attainment, fig.height=3, fig.width=6}

educational.attainment <- UKBB.raw.phenotypes[,c("eid","6138-0.0")]

## only have to check the first array column as that is the only one that is ever == 1 (i.e. college education)
educational.attainment[,test.1:=if_else(is.na(`6138-0.0`),as.integer(NA),
                                        if_else(`6138-0.0` == -3,as.integer(NA),
                                                if_else(`6138-0.0` == 1,1L,0L)))]

## This is for just testing in center data
educational.attainment[,completed.college:=test.1]
educational.attainment <- educational.attainment[,c("eid","completed.college")]

ggplot(merge(UKBB.phenotype.data,educational.attainment,by="eid"),aes(as.factor(completed.college),group=as.factor(sexPulse),fill=as.factor(sexPulse))) +
  stat_count(position="identity",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_discrete(name="Completed College?",labels=c("False","True","NA")) +
  theme.legend

## Merge with remaining phenotypes
UKBB.phenotype.data <- merge(UKBB.phenotype.data,educational.attainment,by="eid",all.x=T)
rm(educational.attainment)
```
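
As a compact restatement of the coding above: missing values and the special code -3 become `NA`, a first response of 1 counts as a completed college degree, and every other response is coded 0. The base-R sketch below applies that rule to made-up responses. `code.college()` is a hypothetical name.

```r
## Hypothetical restatement of the completed.college coding
code.college <- function(qualification) {
  ifelse(is.na(qualification) | qualification == -3, NA_integer_, # missing or special code -3
         ifelse(qualification == 1, 1L, 0L))                       # 1 = college degree
}

## Made-up first responses to field 6138
college <- code.college(c(1, 2, -7, -3, NA))
print(college)
```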

## 3F. Household Income

```{r Income data}

income.data <- UKBB.raw.phenotypes[,c("eid","738-0.0")]

income.data[,household.income:=if_else(`738-0.0`>0,`738-0.0`,as.integer(NA))]

income.data <- income.data[,c("eid","household.income")]

ggplot(merge(UKBB.phenotype.data,income.data,by="eid"),aes(household.income,group=as.factor(sexPulse),fill=as.factor(sexPulse))) + 
  geom_histogram(binwidth=1,position=position_dodge(),colour="black") + 
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_continuous(name="Income Bracket") +
  ylab("# of Individuals") +
  theme.legend

## Merge with remaining phenotypes
UKBB.phenotype.data <- merge(UKBB.phenotype.data,income.data,by="eid",all.x=T)
rm(income.data)
```

## 3G. Townsend Deprivation Index

```{r TDI}

tdi.data <- UKBB.raw.phenotypes[,c("eid","189-0.0")]

setnames(tdi.data,"189-0.0","townsend.index")
tdi.data <- tdi.data[!is.na(townsend.index)]

ggplot(merge(UKBB.phenotype.data,tdi.data,by="eid"),aes(townsend.index,..density..,group=as.factor(sexPulse),fill=as.factor(sexPulse))) +
  geom_histogram(position="identity",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_continuous(name="Townsend Dep. Index") +
  scale_y_continuous(name="Proportion of Individuals") +
  theme.legend

UKBB.phenotype.data <- merge(UKBB.phenotype.data,tdi.data,by="eid",all.x=T)
rm(tdi.data)

```

## 3G. Email

```{r email data}

email.data <- UKBB.raw.phenotypes[,c("eid","20005-0.0")]

## As far as I can tell this is a purely binary value:
email.data[,has.email:=if_else(!is.na(`20005-0.0`),1,0)]

email.data <- email.data[,c("eid","has.email")]

ggplot(merge(UKBB.phenotype.data,email.data,by="eid"),aes(as.factor(has.email),group=as.factor(sexPulse),fill=as.factor(sexPulse))) +
  stat_count(position="identity",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_fill_manual(values=c(male.col,female.col),guide=guide_legend(title="Sex"),labels=c("Male","Female")) +
  scale_x_discrete(name="Has Email?",labels=c("False","True")) +
  theme.legend

## Merge with remaining phenotypes
UKBB.phenotype.data <- merge(UKBB.phenotype.data,email.data,by="eid",all.x=T)
rm(email.data)
```

## 3H. Mental Health Phenotypes

### ICD 10 Coding

#### Hospital Episode Statistics

This code pulls in all Hospital Episode Statistics (HES) ICD-10 codes for all individuals. These codes are used in the section(s) below to test the model:

$$ \text{has.children} \sim s_{het[i,v]} + \text{has.icd.code} + \text{age} + \text{age}^2 + \text{PC}_1 + \dots + \text{PC}_{10} $$
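
As a rough illustration of how the model above might be specified in R, the sketch below fits it to simulated data. Everything here is an assumption made for illustration: the column names (`s.het`, `has.icd.code`, `age`, `PC1` to `PC10`), the simulated values, and the choice of a logistic regression via `glm()` for the binary outcome. The actual tests appear in later sections.

```r
set.seed(42)
n <- 500

## Simulated stand-ins for the real covariates (all values are made up)
pcs <- matrix(rnorm(n * 10), ncol = 10, dimnames = list(NULL, paste0("PC", 1:10)))
toy <- data.frame(
  has.children = rbinom(n, 1, 0.8),
  s.het        = rexp(n, rate = 20),
  has.icd.code = rbinom(n, 1, 0.05),
  age          = sample(40:70, n, replace = TRUE),
  pcs
)

## Build the formula programmatically so the PC terms do not have to be typed out
model.formula <- reformulate(
  c("s.het", "has.icd.code", "age", "I(age^2)", paste0("PC", 1:10)),
  response = "has.children"
)

## Logistic regression (assumed model family for a binary outcome)
fit <- glm(model.formula, data = toy, family = binomial)
print(coef(summary(fit))[c("s.het", "has.icd.code"), ])
```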

```{r All ICD codings}

## Load most HES ICD-10 Codes except cancer
cols <- names(UKBB.raw.phenotypes)[grep("eid|41202|41204",names(UKBB.raw.phenotypes))]
hes.data.long <- data.table(pivot_longer(UKBB.raw.phenotypes[,..cols],-eid,values_to="icd.code",values_drop_na = T))
hes.data.long <- hes.data.long[,c("eid","icd.code")]

## Remove duplicate primary/secondary codes:
hes.data.long <- unique(hes.data.long)

## Remove any cancer codes that we will get from cancer-specific icd.data
hes.data.long <- hes.data.long[grepl("C",icd.code) == F & grepl("D[0-4]", icd.code, perl = T) == F & grepl("O0", icd.code, perl = T) == F]

## Load Cancer codings:
cols <- names(UKBB.raw.phenotypes)[grep("eid|40006",names(UKBB.raw.phenotypes))]
cancer.data.long <- data.table(pivot_longer(UKBB.raw.phenotypes[,..cols],-eid,values_to="icd.code",values_drop_na = T))
cancer.data.long <- cancer.data.long[,c("eid","icd.code")]

## Mash together regular ICD10 and Cancer codes:
hes.data.long <- rbind(hes.data.long, cancer.data.long)

## Generate shorter string to match:
hes.data.long[,icd.category:=substr(icd.code,1,3)]

rm(cancer.data.long)
```
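
The filtering and truncation steps in the chunk above are easy to check on a handful of codes. The base-R sketch below uses made-up ICD-10 codes and applies the same three patterns (`C`, `D[0-4]` and `O0`) before shortening each remaining code to its three-character category.

```r
## Made-up ICD-10 codes in the UKBB style (no decimal point)
codes <- c("C509", "D051", "D500", "O021", "F200", "I10")

## Same patterns as above: drop codes containing "C", "D0"-"D4", or "O0"
keep <- !grepl("C", codes) & !grepl("D[0-4]", codes) & !grepl("O0", codes)
kept <- codes[keep]

## Three-character ICD-10 category, e.g. F200 -> F20
icd.category <- substr(kept, 1, 3)
print(data.frame(icd.code = kept, icd.category))
```

Note that `D500` survives because only `D0` to `D4` are excluded, and that the patterns are not anchored, so they match anywhere in the code string.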

#### Complete Health Outcomes Data (CHOD) ICD Codings

We have access to the complete health outcomes ICD-10 codings, which we hope will represent a more accurate depiction of the conditions an individual has, as well as provide a way of only testing individuals with conditions prior to child-bearing age. We extract fields the same way as before, and then create a data dictionary to link each code to its relevant UKBB field. This incorporates all UKBB fields from 130000-132605. We also incorporate field 42040 here to exclude individuals who don't have GP records. This cuts our sample size in half, but is better than having differential ascertainment in my opinion.

**Note**: Remember that we generated the `fi_phenotypes.txt` and `valid_fi_indvs.txt` files above in section 2A.

```{r read and process}

raw.CHOD <- fread("rawdata/phenofiles/fi_phenotypes.txt")
raw.CHOD[,eid:=as.character(eid)]

valid.CHOD.indvs <- fread("rawdata/phenofiles/valid_fi_indvs.txt")
valid.CHOD.indvs[,eid:=as.character(eid)]

## It's yo birthday!
processed.CHOD <- merge(raw.CHOD,UKBB.phenotype.data[,c("eid","birthday")],by="eid")
processed.CHOD[,date:=as.Date(date, format = "%Y-%m-%d")]

## Set a flag in the phenotype data for people I should include when doing GP-only analyses:
UKBB.phenotype.data[,has.gp.data:=if_else(eid %in% valid.CHOD.indvs[!is.na(num.gp.codes),eid], 1, 0)]

## Get ~age of incidence while taking into account special codes:
processed.CHOD[,age.at.incidence:=if_else(date == as.Date("2037-07-07"), -1, ## This is an error code for incidence in the future and is presumably an error.
                                                if_else(date == as.Date("1901-01-01"), -1, ## This is an error code for incidence before birth (doesn't appear to be any...?)
                                                        if_else(date == as.Date("1902-02-02"), 0, ## This is congenital conditions
                                                                if_else(date == as.Date("1903-03-03"), 0.5, ## This is for neonatal conditions 
                                                                        time_length(difftime(date, birthday), "years")))))] ## This is for all other cases.

```
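
The special-date handling above can be tried out on a few hand-picked dates. The sketch below is a base-R approximation: it uses a fixed 365.25-day year as a simple stand-in for `time_length()`, and `age.at.incidence()` is a hypothetical function name. The special dates and their replacement values are copied from the chunk above.

```r
## Hypothetical helper: approximate age at first incidence with special-date handling
age.at.incidence <- function(date, birthday) {
  date <- as.Date(date)
  birthday <- as.Date(birthday)
  days <- as.numeric(difftime(date, birthday, units = "days"))
  ifelse(date == as.Date("2037-07-07") | date == as.Date("1901-01-01"), -1, # error codes
         ifelse(date == as.Date("1902-02-02"), 0,                            # congenital
                ifelse(date == as.Date("1903-03-03"), 0.5,                   # neonatal
                       days / 365.25)))                                      # all other cases
}

ages <- age.at.incidence(
  date     = c("2037-07-07", "1902-02-02", "1903-03-03", "1980-06-15"),
  birthday = rep("1950-06-15", 4)
)
print(round(ages, 2))
```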

This is just to generate equivalent data from CHOD data as is generated for HES and MHQ data.

```{r translate to table format}

## This is very ugly but was the easiest way for me to tabulate it from another datasource to ensure rough consistency with HES/MHQ data
codes <- c("F20","F23","F25",
           "F84",
           "F30","F31",
           "F32","F33",
           "F50",
           "F40",
           "F42",
           "F41",
           "F60","F61",
           "F90",
           "N46",
           "F70","F71","F72","F78","F79","F80","F81","F82","F89")

condition <- c("scizo","scizo","scizo",
               "asd",
               "bipolar","bipolar",
               "depression","depression",
               "eating_disorders",
               "phobia",
               "ocd",
               "gen_anxiety",
               "gen_personality","gen_personality",
               "add",
               "infertility",
               "developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder","developmental_disorder")

condition.key <- data.table(code=codes, table.condition=condition)
condition.key <- unique(condition.key)

raw.CHOD <- merge(raw.CHOD, condition.key, by = "code", all.x = T)
condition.table <- data.table(table(raw.CHOD[,c("eid","table.condition")]))
condition.table[,N:=if_else(N == 0, 0, 1)]
condition.table <- data.table(pivot_wider(condition.table, id_cols = eid, names_from=table.condition, values_from = N))
setnames(condition.table,names(condition.table)[-1],paste("fi",names(condition.table)[-1],sep="."))

## This just makes sure every individual we have CHOD data for is in our final table
condition.table <- merge(valid.CHOD.indvs[,c("eid")],condition.table,by="eid",all.x=T)

condition.table[is.na(condition.table)] <- 0

UKBB.phenotype.data <- merge(UKBB.phenotype.data,condition.table[,c("eid","fi.scizo","fi.bipolar","fi.asd","fi.add","fi.developmental_disorder")],by="eid",all.x=T)

rm(raw.CHOD)
```
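
The tabulation above can be sketched on toy data. This is only a minimal illustration in base R (so it runs without `data.table` or `tidyr`); the `toy.*` objects and their contents are made up, and only the condition labels are taken from the key above.

```r
## Toy long-format records: one row per (individual, ICD-10 code). Hypothetical data.
toy.records <- data.frame(eid  = c("1", "1", "2", "3", "3"),
                          code = c("F20", "F23", "F84", "F31", "Z99"),
                          stringsAsFactors = FALSE)

## Small excerpt of a code -> condition key (same labels as the chunk above)
toy.key <- data.frame(code            = c("F20", "F23", "F84", "F31"),
                      table.condition = c("scizo", "scizo", "asd", "bipolar"),
                      stringsAsFactors = FALSE)

## Left join keeps records whose code is not in the key (condition becomes NA)
toy.merged <- merge(toy.records, toy.key, by = "code", all.x = TRUE)

## Cross-tabulate individuals by condition, then collapse counts to 0/1
toy.wide <- as.data.frame.matrix(table(toy.merged$eid, toy.merged$table.condition))
toy.wide[] <- lapply(toy.wide, function(n) as.integer(n > 0))
names(toy.wide) <- paste("fi", names(toy.wide), sep = ".")
toy.wide$eid <- rownames(toy.wide)
toy.wide
```

Individual 1 has two schizophrenia-spectrum codes but still gets a single `1` in `fi.scizo`, which is the point of collapsing counts to presence/absence.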

#### Specific HES ICD-10 Codings

The ICD-10 codes that I have used and their equivalencies to the MHQ section are listed in the Perl script run below (`scripts/process_icd.pl`). For the traits covered in [Power et al.](https://jamanetwork.com/journals/jamapsychiatry/article-abstract/1390257) I have stuck with their exact codes to enable replication; for the others I have searched for equivalents using various articles on those particular subjects.

This set of data takes a round trip through Perl, as processing the actual phenotypes is much easier there than in R. I first print out a text file of the raw phenotypes I need, process it with Perl, and then read the result back in and add it to the final phenotype table.

```{r Print ICD10 Data}

cols.to.print <- c("eid",names(UKBB.raw.phenotypes)[grep("41202",names(UKBB.raw.phenotypes))],names(UKBB.raw.phenotypes)[grep("41204",names(UKBB.raw.phenotypes))])
icd.data <- UKBB.raw.phenotypes[,..cols.to.print]
write.table(icd.data,file="rawdata/phenofiles/ICD10.data.txt",sep="\t",row.names=F,col.names=F,quote=F)
rm(icd.data)

```

```{bash Process ICD10 Data}

./scripts/process_icd.pl

```

```{r process ICD10}

hes.data <- fread("rawdata/phenofiles/ICD10.data.processed.txt")
hes.data[,eid:=as.character(eid)]

setnames(hes.data, names(hes.data), c("eid",paste("hes",names(hes.data)[2:length(names(hes.data))],sep=".")))
UKBB.phenotype.data <- merge(UKBB.phenotype.data,hes.data[,c("eid","hes.scizo","hes.bipolar","hes.asd","hes.add","hes.developmental_disorder")],by="eid",all.x=T)
```
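
For readers who do not want to open the Perl script, here is a rough R sketch of the kind of collapsing it performs. The script itself is not reproduced in this document, so the matching rule (first three characters of each diagnosis) and the `has.category` helper are my assumptions; the categories are borrowed from the CHOD key above.

```r
## Hypothetical wide table: one row per individual, several ICD-10 diagnosis columns
toy.icd <- data.frame(eid = c("1", "2", "3"),
                      d1  = c("F200", "I10", "F840"),
                      d2  = c(NA, "F319", "N46"),
                      stringsAsFactors = FALSE)

## Assumed rule: a diagnosis counts if its first three characters match a category
has.category <- function(icd.table, categories) {
  m <- as.matrix(icd.table[setdiff(names(icd.table), "eid")])
  hits <- matrix(substr(m, 1, 3) %in% categories, nrow = nrow(m))
  as.integer(rowSums(hits) > 0)
}

toy.hes <- data.frame(eid         = toy.icd$eid,
                      hes.scizo   = has.category(toy.icd, c("F20", "F23", "F25")),
                      hes.bipolar = has.category(toy.icd, c("F30", "F31")),
                      hes.asd     = has.category(toy.icd, "F84"))
toy.hes
```

Missing diagnosis slots (`NA`) never match a category, so individuals with few diagnoses are simply coded 0.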

### Mental Health Questionnaire

This [recent paper](https://doi.org/10.1192/bjo.2019.100) documents the mental health questionnaire that was sent out to a subset of UKBB participants. ~160k responded on a number of measures. On the UKBB Data Showcase, there is a [document](http://biobank.ndph.ox.ac.uk/showcase/showcase/docs/mental_health_online.pdf) describing which questions were asked and where to find them in the showcase.

This section attempts to replicate the general codings of [this](https://jamanetwork.com/journals/jamapsychiatry/article-abstract/1390257) study, which looked at all Swedish individuals born from 1950-1970 but did not have any genetic data. The equivalent UK Biobank fields and their codings within the MH Questionnaire are:

* 20406 (alcohol), 20503 (prescription meds), 20456 (rec. drugs) - Substance Addiction. Field 20401 indicates a 'YES' answer to ever being addicted to a substance or behaviour, which is a 'gate' question for these three fields.
* 20544: Numbers that follow are the code for that disorder
    + Social Anxiety/Phobia (1)
    + Psychotic disorders (2 & 3) - This is also encoded in ICD10 code(s) F20-29. Could get a bigger N by using ICD10 codes?
    + Personality disorders (4)
    + Any other disabling phobia (5)
    + Panic Attacks (6)
    + OCD (7)
    + Bipolar disorder, manic depressive disorder, etc. (10)
    + Depression (11)
    + Eating disorders (12, 13, 16)
    + ASD (14)
    + General Anxiety Disorder (15)
    + Agoraphobia (17)
    + ADD/ADHD (18)
    + Did not answer one or both sections (-818, -819)

Processing of the MH traits works similarly to ICD10 above: I print a file, process it with Perl, and then read it back in:

```{r Print MHQ data}

cols.to.print <- c("eid","20401-0.0","20406-0.0","20456-0.0","20503-0.0",names(UKBB.raw.phenotypes)[grep("20544",names(UKBB.raw.phenotypes))])
mhq.data <- UKBB.raw.phenotypes[,..cols.to.print]
write.table(mhq.data,file="rawdata/phenofiles/mhq.data.txt",sep="\t",row.names=F,col.names=F,quote=F)
rm(mhq.data)

```

```{bash Process MH data}

./scripts/process_mhq.pl

```

```{r Process MHQ}

mhq.data <- fread("rawdata/phenofiles/mhq.data.processed.txt")
mhq.data[,eid:=as.character(eid)]

setnames(mhq.data, names(mhq.data), c("eid",paste("mhq",names(mhq.data)[2:length(names(mhq.data))],sep=".")))

## Convert answered MHQ to binary:
mhq.data[,mhq.answered_mhq:=if_else(is.na(mhq.answered_mhq),0,1)]
UKBB.phenotype.data <- merge(UKBB.phenotype.data,mhq.data[,c("eid","mhq.scizo","mhq.bipolar","mhq.asd","mhq.add","mhq.answered_mhq")],by="eid",all.x=T)
```
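
As a rough illustration of how the multi-answer field 20544 can be turned into per-trait binaries, here is a base R sketch on made-up answers. The real work is done by `process_mhq.pl`, which is not shown here; in particular, mapping codes 2 and 3 to `mhq.scizo` is my assumption based on the list above.

```r
## Hypothetical answers to field 20544 (up to three selections per person here)
toy.mhq <- data.frame(eid = c("1", "2", "3", "4"),
                      a0  = c(14, 10, -818, 2),
                      a1  = c(18, NA, NA, 3),
                      a2  = c(NA, NA, NA, 11),
                      stringsAsFactors = FALSE)

## Codes for the traits carried forward, taken from the list above
trait.codes <- list(scizo = c(2, 3), bipolar = 10, asd = 14, add = 18)

answers <- as.matrix(toy.mhq[, c("a0", "a1", "a2")])
toy.traits <- data.frame(eid = toy.mhq$eid, stringsAsFactors = FALSE)
for (trait in names(trait.codes)) {
  ## TRUE wherever any answer slot holds one of this trait's codes
  hits <- matrix(answers %in% trait.codes[[trait]], nrow = nrow(answers))
  toy.traits[[paste("mhq", trait, sep = ".")]] <- as.integer(rowSums(hits) > 0)
}
toy.traits
```

Individual 3 answered -818, so they are coded 0 for every trait in this sketch; whether such non-answers should instead become `NA` is a design decision left to the real script.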

### Comparing MH Data Sources

```{r compare MHQ and ICD10, fig.height=8, fig.width=10}

## Totals MHQ:
totals.mhq <- mhq.data[,lapply(.SD, sum,na.rm=T),.SDcols=names(mhq.data)[2:length(names(mhq.data))]]
totals.mhq <- data.table(pivot_longer(totals.mhq,cols=names(totals.mhq),names_sep="\\.",names_to = c(".value","condition")))

## Totals HES:
totals.hes <- hes.data[,lapply(.SD, sum,na.rm=T),.SDcols=names(hes.data)[2:length(names(hes.data))]]
totals.hes <- data.table(pivot_longer(totals.hes,cols=names(totals.hes),names_sep="\\.",names_to = c(".value","condition")))

## Totals CHOD + HES:
totals.fi.hes <- condition.table[,lapply(.SD, sum,na.rm=T),.SDcols=names(condition.table)[2:length(names(condition.table))]]
totals.fi.hes <- data.table(pivot_longer(totals.fi.hes,cols=names(totals.fi.hes),names_sep="\\.",names_to = c(".value","condition")))

## Totals GP only:
totals.fi.gp <- condition.table[eid %in% valid.CHOD.indvs[!is.na(num.gp.codes),eid],lapply(.SD, sum,na.rm=T),.SDcols=names(condition.table)[2:length(names(condition.table))]]
totals.fi.gp <- data.table(pivot_longer(totals.fi.gp,cols=names(totals.fi.gp),names_sep="\\.",names_to = c(".value","condition")))

## Population totals from Sweden
totals.sweden <- data.table(condition = c("add","asd","bipolar","developmental_disorder","scizo"), sweden = c(NA, 2947, 14439, NA, 18890))

n.mhq <- totals.mhq[condition=="answered_mhq",mhq]
n.hes <- nrow(hes.data)
n.fi.hes <- nrow(condition.table)
n.fi.gp <- nrow(condition.table[eid %in% valid.CHOD.indvs[!is.na(num.gp.codes),eid]])
n.sweden <- 2356598

totals.mhq[,prop:=(mhq/n.mhq)*100]
totals.hes[,prop:=(hes/n.hes)*100]
totals.fi.hes[,prop:=(fi/n.fi.hes)*100]
totals.fi.gp[,prop:=(fi/n.fi.gp)*100]
totals.sweden[,prop:=(sweden/n.sweden)*100]

setnames(totals.mhq,c("mhq","prop"),c("value.mhq","prop.mhq"))
setnames(totals.hes,c("hes","prop"),c("value.hes","prop.hes"))
setnames(totals.fi.hes, c("fi","prop"),c("value.fi.hes","prop.fi.hes"))
setnames(totals.fi.gp, c("fi","prop"),c("value.fi.gp","prop.fi.gp"))
setnames(totals.sweden, c("sweden","prop"),c("value.sweden","prop.sweden"))

prop.icd10.coding <- merge(totals.mhq, totals.hes,by="condition",all.y = T)
prop.icd10.coding <- merge(prop.icd10.coding, totals.fi.hes, by = "condition", all.x = T)
prop.icd10.coding <- merge(prop.icd10.coding, totals.fi.gp, by = "condition", all.x = T)
prop.icd10.coding <- merge(prop.icd10.coding, totals.sweden, by = "condition", all.x = T)

cond <- c("add","asd","bipolar","developmental_disorder","scizo")

plot.corr <- function(col.x, col.y, show.x, show.y) {
  
  theme.mod <- theme
  if (show.x == F) {
    theme.mod <- theme.mod + theme(axis.text.x = element_blank())
  }
  if (show.y == F) {
    theme.mod <- theme.mod + theme(axis.text.y = element_blank())
  }
  
  ggplot(prop.icd10.coding[condition %in% cond],aes(get(col.x),get(col.y))) + 
    geom_point() + 
    scale_x_log10(name = "", limits=c(1e-3,1e0)) + 
    scale_y_log10(name = "", limits=c(1e-3,1e0)) + 
    geom_text(aes(label=condition),size = 5,nudge_x = -0.1, hjust = 1) + 
    geom_abline(linetype = 2, colour = "red") + 
    theme.mod
  
}

blank.text <- grid::textGrob("")

wrap_elements(blank.text) + grid::textGrob("MHQ Prop.") + grid::textGrob("HES Prop.") + grid::textGrob("CHOD w/HES Prop.") + 
  grid::textGrob("HES Prop.", rot = 90) + plot.corr("prop.mhq", "prop.hes", F, T) + plot_spacer() + plot_spacer() +
  grid::textGrob("CHOD w/HES\nProp.", rot = 90) + plot.corr("prop.mhq", "prop.fi.hes", F, T) + plot.corr("prop.hes", "prop.fi.hes", F, F) + plot_spacer() +
  grid::textGrob("Sweden Prop.", rot = 90) + plot.corr("prop.mhq", "prop.sweden", T, T) + plot.corr("prop.hes", "prop.sweden", T, F) + plot.corr("prop.fi.hes", "prop.sweden", T, F) +
  plot_layout(ncol = 4, nrow = 4, widths = c(0.15,1,1,1), heights = c(0.1,1,1,1))

format.prop.icd10.coding <- prop.icd10.coding[condition %in% cond]

format.prop.icd10.coding[,mhq:=paste0(sprintf("%0.3f",prop.mhq),"%(", value.mhq, ")")]
format.prop.icd10.coding[,hes:=paste0(sprintf("%0.3f",prop.hes),"%(", value.hes, ")")]
format.prop.icd10.coding[,fi.hes:=paste0(sprintf("%0.3f",prop.fi.hes),"%(", value.fi.hes, ")")]
format.prop.icd10.coding[,fi.gp:=paste0(sprintf("%0.3f",prop.fi.gp),"%(", value.fi.gp, ")")]
format.prop.icd10.coding[,sweden:=paste0(sprintf("%0.3f",prop.sweden),"%(", value.sweden, ")")]

format.prop.icd10.coding[,c("condition","mhq","hes","fi.hes","fi.gp","sweden")]

rm(condition.table)
```
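
As a worked example of the proportion and label formatting used for the table above, here is the same arithmetic on the Swedish totals alone, since those are the only counts hard-coded in this document. The `sweden` data frame is just a local stand-in for `totals.sweden`.

```r
## Swedish case counts and cohort size, copied from the chunk above
sweden <- data.frame(condition = c("asd", "bipolar", "scizo"),
                     value     = c(2947, 14439, 18890),
                     stringsAsFactors = FALSE)
n.sweden <- 2356598

## Percentage of the cohort, then "percent%(count)" labels as in the final table
sweden$prop  <- (sweden$value / n.sweden) * 100
sweden$label <- paste0(sprintf("%0.3f", sweden$prop), "%(", sweden$value, ")")
sweden
```

Note that `prop` is already a percentage, which is why the plot axes above run from 1e-3 to 1e0 rather than over proportions between 0 and 1.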

### Adding a Covariate for Any Mental Health Disorder or Infertility Code

```{r MHD Covariate}

## Calculate the 'MHT' binary:
has.mht <- unique(c(UKBB.phenotype.data[mhq.scizo == 1 | mhq.bipolar == 1 | mhq.asd == 1 | mhq.add == 1 | hes.scizo == 1 | hes.bipolar == 1 | hes.asd == 1 | hes.add == 1 | fi.scizo == 1 | fi.bipolar == 1 | fi.asd  ==  1 | fi.add == 1 | hes.developmental_disorder == 1 | fi.developmental_disorder == 1,eid]))

UKBB.phenotype.data[,mht.binary:=if_else(eid %in% has.mht, 1, 0)]

## Calculate Infertility Codes
ukbb.has.male.infertility <- unique(c(hes.data.long[icd.category == "N46", eid],processed.CHOD[grepl("N46",code), eid]))

UKBB.phenotype.data[,has.male.infertility:=if_else(eid %in% ukbb.has.male.infertility, 1, 0)]

## Any Infertility Code (Male or Female)
UKBB.phenotype.data[,fi.fert:=if_else(sexPulse == 1,
                                      if_else(eid %in% processed.CHOD[grepl("N46",code),eid], 1, 0),
                                      if_else(eid %in% processed.CHOD[grepl("N97",code),eid], 1, 0))]
```
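
The "any source says yes" logic can be sketched for a single trait as follows. This is a toy base R version, not the code used above; the point is that individuals with missing data in every source end up coded 0, just as everyone outside `has.mht` is coded 0 in the chunk above.

```r
## Hypothetical flags for one trait from three sources (NA = no data from that source)
toy.pheno <- data.frame(eid     = c("1", "2", "3", "4"),
                        mhq.asd = c(1, NA, 0, NA),
                        hes.asd = c(0, 1, 0, NA),
                        fi.asd  = c(NA, 0, 0, NA),
                        stringsAsFactors = FALSE)

flag.cols <- c("mhq.asd", "hes.asd", "fi.asd")

## Count sources reporting a 1, ignoring missing values, then collapse to 0/1
toy.pheno$any.asd <- as.integer(rowSums(toy.pheno[, flag.cols] == 1, na.rm = TRUE) > 0)
toy.pheno
```

Individual 4 has no data anywhere and is still coded 0, so the covariate cannot distinguish "no diagnosis" from "no information".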

## 3I. Neutral Phenotypes

The purpose of these phenotypes is to provide a test against s~het~ using phenotypes that we do not expect to have a (strong) genetic component, to make sure there is no bias in our ascertainment:

* Fresh Fruit Intake (Field 1309)
* Handedness (Field 1707)
* Hair colour (Field 1747)

```{r neutral phenotypes}

neutral.phenos <- UKBB.raw.phenotypes[,c("eid","1309-0.0","1707-0.0","1747-0.0")]

neutral.phenos[,fresh.fruit:=ifelse(is.na(`1309-0.0`),NA,
                                     ifelse(`1309-0.0` < 0,NA,`1309-0.0`))]

neutral.phenos[,handedness:=ifelse(`1707-0.0` == "NaN" | is.na(`1707-0.0`), NA,
                                    ifelse(`1707-0.0`==1,0,
                                            ifelse(`1707-0.0`==2,1,NA)))]

neutral.phenos[,is.blonde:=if_else(is.na(`1747-0.0`) | `1747-0.0` < 0, NaN,
                                   if_else(`1747-0.0` == 1, 1, 0))]

UKBB.phenotype.data <- merge(UKBB.phenotype.data,neutral.phenos[,c("eid","fresh.fruit","handedness","is.blonde")],by="eid",all.x=T)

rm(UKBB.raw.phenotypes)
```
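
The nested `ifelse` recodes above can also be written as a lookup table, which I find easier to audit. A minimal sketch for handedness, assuming (as the chunk above does) that 1 and 2 are the only codes to keep; `recode.handedness` is a hypothetical helper, not part of this repository.

```r
## Lookup from raw code to analysis value; anything not listed becomes NA
handedness.map <- c("1" = 0, "2" = 1)

recode.handedness <- function(x) {
  unname(handedness.map[as.character(x)])
}

## Toy input: two kept codes, one dropped code, one negative code, one missing
recoded <- recode.handedness(c(1, 2, 3, -3, NA))
recoded
```

Any code that is not in the lookup, including negative "no answer" codes and missing values, falls through to `NA` without needing its own branch.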

# 4. Assembling Sequencing/Array Data

## 4A. Curating Gene Lists

We use several gene lists as part of this project:

1. pLI information from the [gnomAD project](https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz).
2. s~het~ from [Weghorn et al.](https://doi.org/10.1093/molbev/msz092). The reference file is included in this repository (`rawdata/genelists/shet.weghorn.txt`).
    + We also, for comparative purposes, use the old s~het~ from [Cassa et al.](https://www.nature.com/articles/ng.3831) which is included in this repository as well (`rawdata/genelists/shet.cassa.txt`)
3. ENSEMBL-downloaded resources from [BioMart](https://www.ensembl.org/biomart/martview/0511514c231557b5d24ace4e8f7862e0).
4. Disease genes from ClinVar, DDG2P, and OMIM (see that section for links).
5. Male Infertility Genes from [this paper](https://academic.oup.com/humrep/article/34/5/932/5377831).

The purpose of the following scripts is to generate these lists if they are not available. This process is also duplicated when processing CNV data and annotating SNVs/InDels, as that has to be done as part of separate scripts. See both the section on [Variant Data](#4._variant_data) in this document, and the separate RMarkdown documents `CNVCalling_Filtering.Rmd` and `SNVCalling_Filtering.Rmd`, respectively, for more information.

**Note**: These scripts assume you have `curl` installed on your system, which _should_ be true if you are using macOS. Please change the scripts below if this is not the case.

### Download Resources from BioMart

```{r Generate biomart resources}

## Hg19
ensembl <- useMart("ensembl", host="http://grch37.ensembl.org", dataset = "hsapiens_gene_ensembl")
hg19.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','ensembl_transcript_id'),mart = ensembl))
hg19.table <- hg19.table[!grep("_",chromosome_name)]
write.table(hg19.table,"rawdata/genelists/hg19.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")

## Hg38
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
hg38.table <- data.table(getBM(attributes = c('ensembl_gene_id','chromosome_name','start_position','end_position','hgnc_id','hgnc_symbol','strand'),mart = ensembl))
hg38.table[,hgnc_id:=str_remove(hgnc_id,"HGNC:"),by=1:nrow(hg38.table)]
hg38.table <- hg38.table[!grep("CHR_",chromosome_name)]
write.table(hg38.table,"rawdata/genelists/hg38.genes.txt",col.names=F,row.names=F,quote=F,sep="\t")

rm(hg19.table,hg38.table,ensembl)
```
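
The chromosome-name filters above drop genes that sit on patches and alternate contigs rather than on the primary assembly. A small base R sketch of the hg19-style filter, with made-up gene IDs and contig names:

```r
## Hypothetical BioMart-style rows: two primary chromosomes, two patch/alt contigs
toy.genes <- data.frame(ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
                        chromosome_name = c("1", "X", "HG1_PATCH", "CHR_HSCHR1_1_CTG3"),
                        stringsAsFactors = FALSE)

## Keep rows whose chromosome name has no underscore
keep <- !grepl("_", toy.genes$chromosome_name)
toy.primary <- toy.genes[keep, ]
toy.primary
```

I use `grepl` here because it returns one logical value per row, which can be negated safely in a plain data frame.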

### s~het~ Gene Lists

```{bash Generate sHET gene lists}

perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[7]\n";' rawdata/genelists/shet.weghorn.txt > rawdata/genelists/shet.processed.weghorn.txt
perl -ane 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[1]\n";' rawdata/genelists/shet.cassa.txt > rawdata/genelists/shet.processed.cassa.txt

## sHET gene lists (have to attach ENSG):
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.weghorn.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.hgnc.txt
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -col1 5 -file2 rawdata/genelists/shet.processed.cassa.txt -r | perl -ane 'chomp $_; print "$F[2]\t$F[0]\t$F[1]\t$F[6]\n";' > rawdata/genelists/shet.cassa.hgnc.txt
```

### Hg19 Gene Lists

```{bash Generate hg19 Gene Lists}

## Download gnomAD scores:
curl -o rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz https://storage.googleapis.com/gnomad-public/release/2.1.1/constraint/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz

## Rename your files gnomAD........
mv rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.bgz rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz
gunzip -f rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt.gz

## Create a reference file of just ENSG and pLI, while removing genes w/o a pLI score:
perl -ane 'chomp $_; @F = split("\t", $_); if ($F[20] ne "NA") {print "$F[63]\t$F[20]\n";}' rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt > rawdata/genelists/hg19.all_genes_with_pli.txt

## Add additional info from biomart that we acquired:
# pLI file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ne 'chomp $_;  @F = split("\t", $_); print "$F[0]\t$F[3]\t$F[4]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[1]\n";' > rawdata/genelists/hg19.all_genes_with_pli.2.txt
mv rawdata/genelists/hg19.all_genes_with_pli.2.txt rawdata/genelists/hg19.all_genes_with_pli.txt

# sHET file:
scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/shet.hgnc.txt -r | perl -ne 'chomp $_;  @F = split("\t", $_); print "$F[0]\t$F[5]\t$F[6]\t$F[7]\t$F[8]\t$F[9]\t$F[10]\t$F[2]\n";' > rawdata/genelists/hg19.all_genes_with_shet.txt
```
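
For orientation, the pLI step above amounts to "drop genes without a score, then attach coordinates". Here is a hedged R equivalent on toy data. I am assuming that the gnomAD columns are named `pLI` and `gene_id`, and that `matcher.pl` behaves like an inner join on the gene ID; neither is shown in this document.

```r
## Toy stand-in for the gnomAD constraint table (one gene has no pLI)
toy.gnomad <- data.frame(gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
                         pLI     = c(0.99, NA, 0.02),
                         stringsAsFactors = FALSE)

## Toy stand-in for the BioMart table written earlier
toy.biomart <- data.frame(ensembl_gene_id = c("ENSG_A", "ENSG_C", "ENSG_D"),
                          chromosome_name = c("1", "2", "X"),
                          hgnc_symbol     = c("GENEA", "GENEC", "GENED"),
                          stringsAsFactors = FALSE)

## Remove genes without a pLI score, then keep only genes present in both tables
toy.pli <- toy.gnomad[!is.na(toy.gnomad$pLI), c("gene_id", "pLI")]
toy.annotated <- merge(toy.biomart, toy.pli, by.x = "ensembl_gene_id", by.y = "gene_id")
toy.annotated
```

Genes that are in only one of the two tables (`ENSG_B` without a score, `ENSG_D` without a gnomAD entry) are silently dropped, which is worth remembering when gene counts do not add up downstream.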

### Hg38 Gene Lists

```{bash Generate hg38 Gene lists}
# Try and match genes to Hg19 based on HGNC ID
# Generate a list of hg19 genes with HGNC IDs:
perl -ane 'chomp $_; if ($F[4] ne "NA" && $F[4] ne "") {print "$F[4]\t$F[0]\t$F[5]\n";}' rawdata/genelists/hg19.genes.txt | sort | uniq > rawdata/genelists/hg19.trans.txt

scripts/matcher.pl -file1 rawdata/genelists/hg19.trans.txt -file2 rawdata/genelists/hg38.genes.txt -col2 4 -r | perl -ane 'chomp $_; print "$F[0]\t$F[1]\t$F[2]\t$F[3]\t$F[4]\t$F[5]\t$F[8]\n";' > rawdata/genelists/hg38.hgnc.matched.txt

# Ask which genes have a pLI score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_pli.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_pli.txt

# Ask which genes have a sHET score:
scripts/matcher.pl -file1 rawdata/genelists/hg38.hgnc.matched.txt -col1 6 -file2 rawdata/genelists/hg19.all_genes_with_shet.txt -r | perl -ane 'chomp $_; @F = split("\t", $_); print "$F[8]\t$F[9]\t$F[10]\t$F[11]\t$F[12]\t$F[13]\t$F[0]\t$F[7]\n";' > rawdata/genelists/hg38.all_genes_with_shet.txt

# There is a fairly large caveat here, which is that I label the genes with their Hg19 ENSG ID so that I can be consistent in my R code below!!! This doesn't impact too many genes, as they mostly have the same IDs (~2-300)
# This gets a translatable list to hg19 ENSG###:
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_pli.txt > rawdata/genelists/hg38_to_hg19_ENSG.txt
perl -ne 'chomp $_; @F = split("\t", $_); print "$F[0]\t$F[6]\n";' rawdata/genelists/hg38.all_genes_with_shet.txt >> rawdata/genelists/hg38_to_hg19_ENSG.txt
sort rawdata/genelists/hg38_to_hg19_ENSG.txt | uniq > rawdata/genelists/hg38_to_hg19_ENSG.2.txt
mv rawdata/genelists/hg38_to_hg19_ENSG.2.txt rawdata/genelists/hg38_to_hg19_ENSG.txt 
```
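
The build-to-build translation above boils down to joining the two gene tables on HGNC ID and keeping both ENSG columns. A toy R sketch, with made-up IDs and assuming an inner join, shows how the genes whose ENSG ID differs between builds can be pulled out:

```r
## Hypothetical gene tables for each build, keyed by HGNC ID
toy.hg19 <- data.frame(hg19.GENE = c("ENSG_OLD1", "ENSG_SAME2"),
                       hgnc_id   = c("100", "200"),
                       stringsAsFactors = FALSE)
toy.hg38 <- data.frame(hg38.GENE = c("ENSG_NEW1", "ENSG_SAME2", "ENSG_ONLY38"),
                       hgnc_id   = c("100", "200", "300"),
                       stringsAsFactors = FALSE)

## Genes present in both builds, with both identifiers side by side
toy.translate <- merge(toy.hg38, toy.hg19, by = "hgnc_id")

## Genes where labelling by the hg19 ID actually changes something
toy.changed <- toy.translate[toy.translate$hg38.GENE != toy.translate$hg19.GENE, ]
toy.changed
```

A gene that only exists in hg38 (`ENSG_ONLY38` here) has no hg19 label and drops out of the translation table entirely.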

### Disease Genes

This section of the document is not evaluated, as we need to acquire disease gene resources from three locations. The actual acquisition of these files is trivial, but I did not attempt to make it reproducible because the links could break. If this section needs to be reproduced, download the resources from the _rough_ locations below and run the code in this section. Otherwise, move to the next section, where the file produced by this section has already been generated. Locations to place the necessary files, and the file dates used in the manuscript, are listed below:

1. ClinVar - https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2019/clinvar_20191003.vcf.gz
    +  `rawdata/genelists/clinvar.vcf.gz`
    +  date: October 3, 2019
2. DDG2P - http://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
    +  `rawdata/genelists/DDG2P.csv`
    +  date: November 12, 2019
3. OMIM - https://www.omim.org/downloads
    +  `rawdata/genelists/morbidmap.txt`
    +  date: October 8, 2019

**Note**: To download OMIM morbid map you will need to register and place a copy of this file at: `rawdata/genelists/morbidmap.txt`

```{r Process Disease Genes, eval = F}
## Clinvar
clinvarVCFfiltered <- read_tsv("rawdata/genelists/clinvar.vcf.gz", comment = '#', col_names = F, col_types = 'cccccccccccccc') %>% 
  mutate(PHEN = str_remove(str_extract(X8, 'CLNDN=[^;]*;'), "CLNDN=")) %>% 
  mutate(PHEN = str_remove(PHEN, "not_provided")) %>% 
  mutate(PHEN = str_remove(PHEN, "\\|")) %>%
  mutate(PHEN = str_remove(PHEN, ";")) %>%
  filter(PHEN != '') %>% 
  mutate(GENE = str_remove(str_extract(X8, 'GENEINFO=[^;]*:'), "GENEINFO=")) %>% 
  mutate(CLNSIG = str_remove(str_extract(X8, 'CLNSIG=[^;]*;'), "CLNSIG=")) %>% 
  filter(CLNSIG %in% c("Pathogenic/Likely_pathogenic;", "Pathogenic;", "Likely_pathogenic;")) %>% 
  mutate(CLNSIG = str_remove(CLNSIG, ";")) %>% 
  mutate(ID = X3) %>% 
  select(c(X1,X2,ID, PHEN, GENE, CLNSIG)) %>% 
  drop_na(PHEN, GENE) 
clinvarGenes <- clinvarVCFfiltered %>% 
  select(GENE) %>% 
  distinct() %>% 
  transform(GENE = strsplit(GENE, "\\|")) %>%
  unnest(GENE) %>% 
  mutate(GENE = gsub(":.*", "", GENE))
clinvarGenes <- data.table(clinvarGenes)

## DDG2P
ddg2p <- fread("rawdata/genelists/DDG2P.csv") %>% 
  rename(hgnc_id = `hgnc id`)
develGenes <- ddg2p %>% 
  filter(`DDD category` %in% c("probable","confirmed","both DD and IF")) %>% 
  distinct(`hgnc_id`)
develGenes <- data.table(develGenes)

## OMIM
omimGenes <- read_tsv("rawdata/genelists/morbidmap.txt", skip = 4, comment = '#', col_types = 'ccic',
                      col_names = c("Pheno", "Gene", "MIM_Number", "Cyto_Location")) %>%
  select(Pheno, Gene) %>%
  drop_na() %>%
  mutate(Gene = gsub(",.*", "", Gene)) %>%
  mutate(omim_nondisease = ifelse(grepl("\\[", Pheno), T,F)) %>%
  mutate(omim_complex = ifelse(grepl("\\{", Pheno), T,F)) %>%
  mutate(omim_provisional = ifelse(grepl("\\?", Pheno), T,F))
omimGenes <- omimGenes %>% 
  filter(!omim_complex & !omim_nondisease)
omimGenes <- data.table(omimGenes)

hgnc.to.ENSG <- fread("rawdata/genelists/hg19.coordinates.txt",fill=T)
setnames(hgnc.to.ENSG,names(hgnc.to.ENSG),c("hg19.GENE","chr","start","end","hgnc_id","hgnc.symbol"))

clinvarGenes <- unique(merge(clinvarGenes,hgnc.to.ENSG,by.x="GENE",by.y="hgnc.symbol"))
develGenes <- unique(merge(develGenes,hgnc.to.ENSG,by="hgnc_id"))
omimGenes <- unique(merge(omimGenes[,c("Gene")],hgnc.to.ENSG,by.x="Gene",by.y="hgnc.symbol"))

diseaseGenes <- bind_rows(clinvarGenes[,c("hg19.GENE")],develGenes[,c("hg19.GENE")],omimGenes[,c("hg19.GENE")])
diseaseGenes <- unique(diseaseGenes)

write.table(diseaseGenes, "rawdata/genelists/diseaseGenes.txt",col.names = F, row.names = F, quote = F, sep ="\t")

rm(ddg2p, develGenes, omimGenes, clinvarVCFfiltered, clinvarGenes, hgnc.to.ENSG, diseaseGenes)
```
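
To make the ClinVar parsing easier to follow, here is a base R sketch that pulls the same fields out of a single, made-up INFO string. `get.field` is a hypothetical helper. One detail worth noting is that `|` is the regex alternation operator, so it has to be matched literally (here with `fixed = TRUE`).

```r
## Made-up INFO column from one VCF record
info <- "ALLELEID=12345;CLNDN=not_provided|Hypothetical_syndrome;CLNSIG=Pathogenic;GENEINFO=GENEA:1|GENEB:2;ORIGIN=1"

## Return the value of one KEY=value pair, or NA if the key is absent
get.field <- function(info, key) {
  m <- regmatches(info, regexpr(paste0(key, "=[^;]*"), info))
  if (length(m) == 0) NA_character_ else sub(paste0("^", key, "="), "", m)
}

## Phenotype: drop the "not_provided" placeholder and the separator left behind
phen <- get.field(info, "CLNDN")
phen <- sub("not_provided", "", phen, fixed = TRUE)
phen <- sub("|", "", phen, fixed = TRUE)

## Genes: split "SYMBOL:ID|SYMBOL:ID" and keep the symbols only
genes <- sub(":.*", "", strsplit(get.field(info, "GENEINFO"), "|", fixed = TRUE)[[1]])

clnsig <- get.field(info, "CLNSIG")
```

A record can list more than one gene, which is why the gene field is split before the symbols are merged against the coordinate table.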

### Male Infertility Genes

This is a list of genes confirmed as playing a role in male infertility from [Manon Oud and Joris Veltman](https://academic.oup.com/humrep/article/34/5/932/5377831). We download and process the data directly from their supplement (Tables S3-S6):

**Note**: I have no idea if the link below will work for everyone. If it doesn't, download "Supplementary Table 3-6" manually from [here](https://academic.oup.com/humrep/article/34/5/932/5377831#supplementary-data) and set the name of the file to `rawdata/genelists/male_infertility_genes.xlsx`

```{bash Get Male Infertility Genes, eval = F}

curl -L -o rawdata/genelists/male_infertility_genes.xlsx "https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/humrep/34/5/10.1093_humrep_dez022/2/dez022_supplementary_tables_siii_-_svi_final.xlsx?Expires=1610104943&Signature=aNNfVC~2fDVGJtkTwMipconDNKJ-wBpIAuirlY9Vlb10rdiEhGSlRiV2w01eiRwPdg~c7N6j~5Vvle-XbWzPs8hrNnDJkTZSzTJSzAY8rJJcGsvQrrZrmakP87O9iIEuKvsBqCyhLU04osczMLaWVRL7oSX~MmBiPNOgWIOUs7nQodIhFAGf9gyscadyUJ9q3yL5ptybEOcd2VbkeDNuHgRCWuhbE1KB7LQStWbRp6gxXR6JMf8qSHWxknccaxBTdhQ3Pk0bLYL3dxOjr2PYikPlogfO98JF2HgakoJmumx-zJ1wh~4RuB9tJykz3or6FMfOqJLSOiFPF9XVEUJ0zQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA"

```

```{r Process Male Infertility Gene XLSX}

male.genes.xlsx <- data.table(read_xlsx("rawdata/genelists/male_infertility_genes.xlsx",sheet = "Supplementary Table SIV",skip=1))
male.genes.xlsx <- male.genes.xlsx[4:nrow(male.genes.xlsx)]
male.genes.xlsx <- male.genes.xlsx[,c("HGNC gene name","Inheritance pattern in human","Conclusion")]
male.genes.xlsx <- male.genes.xlsx[`HGNC gene name` != ""]
male.genes.xlsx <- male.genes.xlsx[Conclusion!="No evidence"]
male.genes.xlsx <- male.genes.xlsx[`Inheritance pattern in human`!= "XL" & `Inheritance pattern in human` != "YL"]

## Need to rename some genes as they used Hg38 gene IDs:
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CATSPERE","C1orf101",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP43","WDR96",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP44","WDR52",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="CFAP69","C7orf63",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF4","DYX1C1",`HGNC gene name`)]
male.genes.xlsx[,`HGNC gene name`:=if_else(`HGNC gene name`=="DNAAF5","HEATR2",`HGNC gene name`)]

write.table(male.genes.xlsx,"rawdata/genelists/male_infertility_genes.txt",row.names=F,col.names=F,sep="\t",quote=F)

rm(male.genes.xlsx)
```
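
The six renames above could equally be driven by one lookup vector, which makes it easier to add a symbol later. A minimal base R sketch; the mapping is copied from the chunk above, while `to.hg19.symbol` and the test symbol `GENE_X` are hypothetical.

```r
## Current symbol -> symbol used in the hg19 annotation (from the chunk above)
old.symbols <- c(CATSPERE = "C1orf101", CFAP43 = "WDR96", CFAP44 = "WDR52",
                 CFAP69 = "C7orf63", DNAAF4 = "DYX1C1", DNAAF5 = "HEATR2")

## Replace symbols found in the map and leave all others untouched
to.hg19.symbol <- function(symbols) {
  renamed <- old.symbols[symbols]
  unname(ifelse(is.na(renamed), symbols, renamed))
}

converted <- to.hg19.symbol(c("CFAP43", "GENE_X", "DNAAF5"))
converted
```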

```{bash Annotate Male Infertility Genes}

scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/male_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/male_infertility_genes.annotated.txt

```

```{bash Annotate Mouse Infertility Genes}

scripts/matcher.pl -file1 rawdata/genelists/hg19.genes.txt -file2 rawdata/genelists/mouse_infertility_genes.txt -col1 5 -r | perl -ne 'chomp $_; @F = split("\t", $_); splice(@F, 4, 9); print join("\t", @F) . "\n";' > rawdata/genelists/mouse_infertility_genes.annotated.txt

```

## 4B. Load Gene Lists For Phenotype Testing

### Generic Lists

This code block just loads our generic lists that we created above, namely:

1. Our hg19 -> hg38 conversion table
2. Table of all sHET values
3. Table of all pLI values

```{r Load Generic Lists}

## Load gene translation file
gene.translate <- fread("rawdata/genelists/hg38_to_hg19_ENSG.txt", header=F)
setnames(gene.translate,names(gene.translate),c("hg38.GENE","hg19.GENE"))

## Load sHET genes
shet.genes <- fread("rawdata/genelists/shet.hgnc.txt")
setnames(shet.genes,names(shet.genes),c("hg19.GENE","GENE","sHET.val","HGNC.ID"))
shet.genes[,deciles:=cut(sHET.val,breaks=quantile(sHET.val,seq(0,1,by=0.1)),include.lowest = T)]
shet.genes[,sHET.val.binary:=cut(sHET.val,breaks=c(0,0.15,1),labels = c("lt_015","gt_015"),right = F)]

print(paste0("Total number of genes with sHET value                       : ", nrow(shet.genes)))
print(paste0("Total number of genes with sHET value in both hg19 and hg38 : ", nrow(merge(shet.genes,gene.translate))))

## Load Cassa sHET genes
shet.genes.cassa <- fread("rawdata/genelists/shet.cassa.hgnc.txt")
## Remember, we had to annotate sHET with hg38 gene IDs
setnames(shet.genes.cassa,names(shet.genes.cassa),c("hg19.GENE","GENE","sHET.val","HGNC.ID"))

## Load pLI genes:
pli.genes <- fread("rawdata/genelists/hg19.all_genes_with_pli.txt")
pli.genes <- pli.genes[,c("V1","V6","V8","V5")]
setnames(pli.genes, c("hg19.GENE","GENE","pLI.val","HGNC.ID"))
pli.genes[,pLI.val.binary:=cut(pLI.val,breaks=c(0,0.9,1),labels = c("lt_09","gt_09"),right = F,include.lowest = T)] ## include.lowest keeps pLI == 1 in "gt_09" when right = F

## Generate lists for phenotype testing:
gene.lists <- list()
gene.lists["highPLI"] = list(pli.genes[pLI.val >= 0.9,hg19.GENE])
gene.lists["highsHET"] = list(shet.genes[sHET.val >= 0.15,hg19.GENE])
```
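
A quick note on how `cut()` treats values sitting exactly on a break, since the binary labels above depend on it. With `right = FALSE` the bins are closed on the left, so a score equal to the threshold lands in the upper bin, and a score equal to the top break is `NA` unless `include.lowest = TRUE` is set. The values below are a toy illustration:

```r
vals <- c(0, 0.5, 0.9, 1)

## Bins are [0, 0.9) and [0.9, 1): a value of exactly 1 falls outside both
open.top <- cut(vals, breaks = c(0, 0.9, 1), labels = c("lt_09", "gt_09"), right = FALSE)

## With include.lowest = TRUE the last bin becomes [0.9, 1]
closed.top <- cut(vals, breaks = c(0, 0.9, 1), labels = c("lt_09", "gt_09"),
                  right = FALSE, include.lowest = TRUE)

data.frame(vals, open.top, closed.top)
```

This only matters if a score can equal the top break exactly, but it is cheap to check with `sum(is.na(...))` on the binary column after loading.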

### Disease Genes

```{r Load Disease Genes}

## Get a list of genes to exclude which are known OMIM/DD/ClinVar genes
disease.genes <- fread("rawdata/genelists/diseaseGenes.txt",header=F)
setnames(disease.genes,"V1","hg19.GENE")
print(paste0("Number of disease genes: ", nrow(disease.genes)))
```

### Male Infertility Genes

```{r Load Male Infertility Genes}
male.infertility.genes <- fread("rawdata/genelists/male_infertility_genes.annotated.txt",header=F)
setnames(male.infertility.genes,names(male.infertility.genes),c("GENE","inheritance","evidence","hg19.GENE"))

print(paste0("Number of male infertility genes: ", nrow(male.infertility.genes)))
```

### Mouse Infertility Genes

As this list was manually curated, we have simply provided the final list of 728 genes included in the manuscript at `./rawdata/genelists/mouse_infertility_genes.annotated.txt`

```{r Load Mouse Infertility Genes}

mouse.infertility.genes <- fread("rawdata/genelists/mouse_infertility_genes.annotated.txt",header=F)
setnames(mouse.infertility.genes,names(mouse.infertility.genes),c("GENE","hg19.GENE","chr","start"))

print(paste0("Number of mouse infertility genes: ", nrow(mouse.infertility.genes)))

```

## 4C. Gene Expression Data

This code just quickly downloads GTEx expression data (v7) from the [GTEx Consortium](https://gtexportal.org/home/datasets) and loads it into R so we can use it later.

```{bash Download GTEx}

curl -o rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz
gunzip -f rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct.gz

```

```{r Load Gene Expression}
## GTEX expression data:
expression <- fread("rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct")
expression[,c("gene","vers"):=tstrsplit(gene_id,".",fixed=T),by=1:nrow(expression)]
setnames(expression,"gene","hg19.GENE")

## Only retain genes with an sHET score:
expression <- merge(expression, shet.genes[,"hg19.GENE"], by = "hg19.GENE")

## GTEx testis expression
expression.testis <- expression[,c("hg19.GENE","Testis")]
```
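
GTEx gene IDs carry a version suffix, which is why the ID is split on `.` before merging with the s~het~ table. The same step in base R, shown on illustrative IDs (the version numbers here are made up for the example):

```r
## Versioned IDs as they appear in the gene_id column
gene.ids <- c("ENSG00000000003.10", "ENSG00000000005.5")

## Strip a trailing ".<digits>" so the IDs match the unversioned hg19.GENE column
toy.expression <- data.frame(gene_id   = gene.ids,
                             hg19.GENE = sub("\\.[0-9]+$", "", gene.ids),
                             stringsAsFactors = FALSE)
toy.expression
```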

### Making Gene Lists For Calculating sHET Burden

This block generates "expressed gene" lists for the purposes of creating tissue-specific s~het~ burdens. These files are used when running the code for processing raw SNV and CNV data. See `SNVCalling_Filtering.Rmd` and `CNVCalling_Filtering.Rmd` for more details on what this means.

```{r make gene lists}

## Generate highly expressed gene lists for all tissues:
tissues.for.regression <- c()
for (t in names(expression)[4:56]) {
  
  t.file <- str_remove_all(t, " ")
  t.file <- str_replace_all(t.file,"-","_")
  t.file <- str_replace_all(t.file,"\\(\\S+\\)","")
  t.file <- str_to_lower(t.file)
  file <- paste0("rawdata/genelists/tissues/",t.file,"High.txt")
  fwrite(expression[get(t) > 0.5, "hg19.GENE"], file, quote = F, row.names = F, col.names = F, sep = "\t")
  tissues.for.regression <- c(tissues.for.regression,paste0("product_sHET_no_",t.file,"High"))
  
}

tissues.for.regression <- c("product_sHET",tissues.for.regression)
```
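
Because the file names and covariate names are derived from the GTEx column headers, it helps to see what the string clean-up produces. This base R sketch mirrors the steps in the loop above; `tissue.to.stub` is a hypothetical helper and the two tissue names are just examples of the header style.

```r
## Same transformations as the loop above: drop spaces, "-" to "_",
## remove a parenthesised suffix, lower-case
tissue.to.stub <- function(tissue) {
  stub <- gsub(" ", "", tissue, fixed = TRUE)
  stub <- gsub("-", "_", stub, fixed = TRUE)
  stub <- gsub("\\(\\S+\\)", "", stub, perl = TRUE)
  tolower(stub)
}

stubs <- tissue.to.stub(c("Testis", "Brain - Anterior cingulate cortex (BA24)"))

## Output file and regression covariate names built from each stub
files      <- paste0("rawdata/genelists/tissues/", stubs, "High.txt")
covariates <- paste0("product_sHET_no_", stubs, "High")
```

Spaces are removed before the hyphen is replaced, so "Brain - Anterior" collapses to a single underscore rather than three characters.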

## 4D. Variant Data

### SNV Data

#### Downloading Requisite Data From UKBB

This section is still to be written, but make sure to obtain two files:

1. Variant calls themselves: `rawdata/snvresources/counts.ukbb_wes.200k.txt`
2. Field of 'has wes' from our derived data. I have already created a fake version at `rawdata/snvresources/has_exome.txt`

This data is not going to be available until final publication of the manuscript. This is due to guidelines from UK Biobank that do not allow returned data fields for studies not yet through peer review. We cannot make these calls available as part of this repository due to patient/subject protection.

```{bash Get SNVs, eval = F}


```

#### Annotating Rare Variants in UKBB

SNV and InDel annotation and sHET score calculation is handled by the document `SNVCalling_Filtering.Rmd` within this repository. Please see that document for details on how to perform SNV QC and annotation.

#### Loading SNV data

This code block directly loads the SNV data produced by `SNVCalling_Filtering.Rmd`.

```{r Load SNV Data 200k, fig.height=8, fig.width=12}

UKBB.counts.200k <- readRDS("rawdata/snvresources/counts.rds")
UKBB.genes.200k <- readRDS("rawdata/snvresources/genes.rds")
snv.counts.200k <- readRDS("rawdata/snvresources/200k_counts.rds")

UKBB.counts.200k[,source:="200k"]

## This just sets up for plotting in the Supplement (OVERALL.counts name is legacy code when I had both 50k and 200k in here)
OVERALL.counts <- copy(UKBB.counts.200k)
OVERALL.counts[,CSQ:=if_else(CSQ == "PTVs","LOF_HC",
                        if_else(CSQ == "Missense","MIS",
                                if_else(CSQ == "Synonymous", "SYN", CSQ)))]

OVERALL.counts[,CSQ:=if_else(CSQ == "LOF_HC","PTVs",if_else(CSQ == "MIS","Missense",if_else(CSQ == "SYN","Synonymous",as.character(NA))))]
OVERALL.counts[,AF:=if_else(AF=="AC1","Private Vars.","MAF < 1e-3 Vars.")]
OVERALL.counts[,AF:=factor(AF,levels=c("Private Vars.","MAF < 1e-3 Vars."))]

rm(UKBB.plot)
```

#### Plotting Various Counts

Just plotting some simple variant count diagrams for QC purposes. This data will be used to make supplementary figures later.

```{r Plotting Variant Totals, fig.height=4, fig.width=6}

UKBB.plot <- ggplot(UKBB.counts.200k,aes(CSQ,count,colour=AF)) + geom_boxplot() + ggtitle("UKBB") + ylim(0,100) + theme.legend
UKBB.plot

ggplot(UKBB.counts.200k[AF=="AC1" & CSQ=="LOF_HC"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AC = 1 LoF Variants") + theme.legend
ggplot(UKBB.counts.200k[AF=="AC1" & CSQ=="MIS"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AC = 1 Missense Variants") + theme.legend
ggplot(UKBB.counts.200k[AF=="AC1" & CSQ=="SYN"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AC = 1 Synonymous Variants") + theme.legend

ggplot(UKBB.counts.200k[AF=="AF0.1" & CSQ=="LOF_HC"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AF < 0.001 LoF Variants") + theme.legend
ggplot(UKBB.counts.200k[AF=="AF0.1" & CSQ=="MIS"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AF < 0.001 Missense Variants") + theme.legend
ggplot(UKBB.counts.200k[AF=="AF0.1" & CSQ=="SYN"],aes(count,..density..)) + scale_alpha_discrete(range=c(1,0.5)) + geom_histogram(binwidth=1,position="identity") + ggtitle("AF < 0.001 Synonymous Variants") + theme.legend

rm(UKBB.plot)
```

#### Variants per Gene

This is just basic QC to make sure we have a linear relationship between total synonymous variants and gene length. Note that there are a small number of genes that have 0 variants (n = 46), and they are genes:

* where the primary transcript is noncoding
* which do not have baits on the UKBB exome capture kit
* which have a high number of individuals with low allelic depth (< 10 reads). The number of individuals with sufficient depth is ~50k, which seems too coincidental to me and suggests that something went wrong when going from 50k to 200k samples
* which have alignment issues in Hg38 (gnomAD also has poor coverage for these genes)

```{r fig.height=6, fig.width=8}

genes <- fread("rawdata/genelists/hg38.coordinates.txt")
setnames(genes,names(genes),c("hg38.GENE","CHR","START","STOP","HGNC","NAME"))
genes <- merge(genes,gene.translate,by="hg38.GENE")

to.plot <- merge(shet.genes[,c("hg19.GENE")],UKBB.genes.200k[CSQ=="SYN" & maf == 0,c("GENE","UKBB")], by.x="hg19.GENE",by.y="GENE",all.x=T)
setnames(to.plot,"UKBB","new")

to.plot[,new:=ifelse(is.na(new), 0, new)]

to.plot <- merge(to.plot,genes[,c("hg19.GENE","hg38.GENE")],all.x=T,by="hg19.GENE")
to.plot[,ran:=hg19.GENE %in% genes[,hg19.GENE]]

gnomad <- fread("rawdata/genelists/gnomad.v2.1.1.lof_metrics.by_gene.txt")
to.plot <- merge(to.plot,gnomad[,c("gene_id","cds_length","chromosome")],by.x="hg19.GENE",by.y="gene_id",all.x=T)

ggplot(to.plot[chromosome!="X" & chromosome !="Y"],aes(new,cds_length)) + geom_point(size = 0.5) + xlab("# Syn Variants (MAF < 1e-3)") + ylab("Coding Length") + theme

```
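As a rough illustration of the QC idea above, the sketch below fits a straight line to synonymous variant count against coding length and reports the R^2. It uses simulated toy data only; the variable names (`cds_length`, `syn_count`) and the per-base rate of 0.02 are assumptions made for the example, not values from the UKBB data.

```r
## Toy data only: 200 hypothetical genes whose synonymous count scales with length
set.seed(1)
cds_length <- round(runif(200, min = 300, max = 10000))
syn_count  <- rpois(200, lambda = cds_length * 0.02)

## Fit count ~ length; a roughly linear relationship gives a high R^2
fit <- lm(syn_count ~ cds_length)
r2  <- summary(fit)$r.squared

## Flag genes with zero variants for manual follow-up
zero_genes <- which(syn_count == 0)

print(round(coef(fit), 4))
print(r2)
print(length(zero_genes))
```

On real data, genes far from the fitted line (or with zero variants) would be the ones worth inspecting by hand.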

### CNV Data

For how CNV data was prepared prior to this step, please see the document `CNVCalling_Filtering.Rmd`, which handles all of the data collation and QC of CNVs. This section only deals with getting the result of that document into R and attaching/quantifying gene data. The ultimate outputs of that document read in here are:

1. Annotations for each CNV as created with VEP and a custom Perl script.
2. Individuals which actually have CNV data.
3. The CNVs themselves.

#### Downloading Requisite Data From UKBB

Section TBD, but you need to obtain two files and place them at the following locations:

1. Variant calls themselves: `rawdata/cnvresources/ukbb.cnvs.qcd.txt`
2. Field of 'has CNV data' from our derived data. I have already created a fake version at `rawdata/cnvresources/has_cnvs.txt`

This data is not going to be available until final publication of the manuscript. This is due to guidelines from UK Biobank that do not allow returned data fields for studies not yet through peer review. We cannot make these calls available as part of this repository due to patient/subject protection.

```{bash Get CNVs, eval = F}


```

#### Building R Annotations from VEP

This file was created as part of CNV QC in the `CNVCalling_Filtering.Rmd` document. See that document for more information on how this file was generated and CNV annotation in general.

```{r Build Annotations, fig.height=4, fig.width=10}

## Get annotation information:
annotations<-fread("rawdata/cnvresources/cnv_vep_parsed.revision.sorted.header.bed")

# Add a matchable locus:
annotations[,locus:=paste0(chr,":",start,"-",end)]
```

#### Build Data Files

I have created a file that contains the merged loci for MAF calculation, and use it here together with the VEP-annotated file generated by `CNVCalling_Filtering.Rmd`.

UKBB samples dropped here come from the following sources:

1. Individuals who failed CNV QC (n = 2,591)
2. Individuals who were part of a batch with no calls (currently only batch18 is known; n = 4,620)

All other individuals are included if they have broadly European ancestry and are unrelated (as set in the above 'Setting Individuals To Use' section).

```{r Build CNV Calls}

## Get allele frequency information from CNVCalling_Filtering.Rmd:
ukbb.annotated.cnvs.qcd <- fread("rawdata/cnvresources/ukbb.cnvs.qcd.txt")
ukbb.annotated.cnvs.qcd[,eid:=as.character(eid)]

# Add impact information
ukbb.annotated.cnvs.qcd <- merge(ukbb.annotated.cnvs.qcd,annotations[,-c("chr","start","end","plis","shets")],by=c("locus","ct"))

for (g in names(gene.lists)) {
  ukbb.annotated.cnvs.qcd[,eval(g):=get(g)*gt]
}

## Get individuals w/o any CNV data (regardless of QC)
has.cnv.data <- fread("rawdata/cnvresources/has_cnvs.txt")
has.cnv.data[,eid:=as.character(eid)]
```

#### Variant Totals / Individual / Gene Group

Variant totals are computed once here so that all of my main association testing uses the exact same numbers!

```{r CNV Variant Totals}

## Set variables with our "shet pre-calculated" figures to enable easy computation
product.shet.lists <- names(annotations)[grep("product",names(annotations))]
product.shet.lists.calc <- paste0("list(",paste(paste0("1-prod(1-",product.shet.lists,")"),collapse = ","),")")

## Set samples with Phenotype data
samples.UKBB.cnv <- UKBB.phenotype.data[,c("eid")]

## ... And remove bad arrays/missing data:
samples.UKBB.cnv <- samples.UKBB.cnv[eid %in% has.cnv.data[has_cnvs==1,eid]]

get.gene.counts.cnvs <- function (maf, type) {
  
  samp.size <- nrow(samples.UKBB.cnv)
  test <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score == T & ct == type]

  allele.frq <- test[ct==type,sum(gt),by=c("locus")]
  allele.frq[,frq:=V1/(samp.size*2)]
  setnames(allele.frq,"V1","ac")
  
  test <- merge(test,allele.frq[,c("locus","frq","ac")],by=c("locus"))
  
  g.names <- names(gene.lists)
  
  if (maf == 0) {
    test.counted <- test[ac == 1,lapply(.SD, sum),by="eid", .SDcols=g.names]
    test.counted.product <- test[ac == 1, eval(parse(text=product.shet.lists.calc)),by="eid"]
  } else {
    test.counted <- test[frq <= maf,lapply(.SD, sum),by="eid", .SDcols=g.names]
    test.counted.product <- test[frq <= maf, eval(parse(text=product.shet.lists.calc)),by="eid"]
  }
  
  setnames(test.counted.product,paste0("V",1:length(product.shet.lists)),product.shet.lists)
  final.stats <- merge(samples.UKBB.cnv,test.counted,by="eid",all.x=T)
  final.stats <- merge(final.stats,test.counted.product,by="eid",all.x=T)
  final.stats[,allele.freq:=maf]
  final.stats[,type:=type]
  final.stats[is.na(final.stats)] <- 0
  
  setnames(final.stats,"eid","sample_id")
  
  return(final.stats)
  
}

cnv.counts <- bind_rows(get.gene.counts.cnvs(1e-5,"DEL"),
                        get.gene.counts.cnvs(1e-5,"DUP"),
                        get.gene.counts.cnvs(1e-4,"DEL"),
                        get.gene.counts.cnvs(1e-4,"DUP"),
                        get.gene.counts.cnvs(1e-3,"DEL"),
                        get.gene.counts.cnvs(1e-3,"DUP"),
                        get.gene.counts.cnvs(0,"DEL"),
                        get.gene.counts.cnvs(0,"DUP"))

rm(annotations)
```
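The two quantities computed in the chunk above can be checked by hand on a small example. The sketch below uses made-up s~het~ scores for three genes: the burden is one minus the product of (1 - s~het~), matching the `1-prod(1-...)` expression built above, and the allele frequency is the allele count divided by twice the sample size. Note that `maf == 0` in the function above selects singletons (`ac == 1`) rather than a frequency of zero.

```r
## Hypothetical sHET scores for genes hit by one individual's qualifying variants
shet <- c(0.10, 0.05, 0.20)

## Combined burden: one minus the product of (1 - sHET) across affected genes
burden <- 1 - prod(1 - shet)

## Allele frequency as computed above: allele count over 2 * sample size
allele_freq <- function(ac, n_samples) ac / (2 * n_samples)

print(burden)                                  # 1 - 0.9 * 0.95 * 0.8
print(allele_freq(ac = 1, n_samples = 50000))  # a singleton in 50,000 people
print(1 - prod(1 - numeric(0)))                # no variants gives a burden of 0
```

Because the product over an empty set is 1, individuals with no qualifying variants get a burden of 0, which is consistent with filling missing values with 0 after the merge.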

### Combining All Variant Totals

```{r Combine Variants}

## Master table of ALL variants:
# Set SNV column order the same because of the quantification magic I have to do above:
setcolorder(snv.counts.200k,neworder=names(cnv.counts))
variant.counts <- rbind(cnv.counts,snv.counts.200k)

paste0("Number of individuals with CNV data: ", length(unique(variant.counts[type == "DEL" & allele.freq == 0,sample_id])))
paste0("Number of individuals with SNV data: ", length(unique(variant.counts[type == "LOF_HC" & allele.freq == 0,sample_id])))

rm(cnv.counts, snv.counts, snv.counts.200k)
```

#### Burden of variants by Sex

Just to make sure the sexes aren't burdened differently between variant classes among highly constrained genes.

```{r Plotting Quant sHET, fig.height=5, fig.width=4}

## Sex specific burden:
sex.burden.calc <- function(t) {

  sex.specific <- variant.counts[allele.freq == 0 & type == t,c("sample_id","product_sHET")]
  sex.specific <- merge(sex.specific,UKBB.phenotype.data[,c("eid","sexPulse")],by.x="sample_id",by.y="eid")
  print(paste0("Number of Individuals w/",t,"s :", nrow(sex.specific)))

  ## Can do two tests, proportion in each sex with SOME sHET value, and actual testing burden of sHET
  ks <- ks.test(sex.specific[sexPulse == 1,product_sHET],sex.specific[sexPulse == 2,product_sHET])
  chi.sq <- matrix(c(nrow(sex.specific[product_sHET < 0.15 & sexPulse == 1]),
                     nrow(sex.specific[product_sHET < 0.15 & sexPulse == 2]),
                     nrow(sex.specific[product_sHET >= 0.15 & sexPulse == 1]),
                     nrow(sex.specific[product_sHET >= 0.15 & sexPulse == 2])),
                   nrow = 2,
                   dimnames = list(c("Male","Female"),c(paste0("No ",t),paste0("Has ", t))))

  prop <- prop.table(chi.sq,margin = 1)*100
  
  format.text <- paste0(t, " ≥ 0.15 sHET Prop Male (", sprintf("%0.2f",prop[1,2]),"), Female (", sprintf("%0.2f",prop[2,2]),"), p = ", sprintf("%0.2f", ks$p.value))
  return(format.text)

}

format.sex.burden.DEL <- sex.burden.calc("DEL")
format.sex.burden.DEL
format.sex.burden.PTV <- sex.burden.calc("LOF_HC")
format.sex.burden.PTV
```
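For readers who want to see the two comparisons in isolation, here is a minimal sketch on simulated scores. Both groups are drawn from the same (arbitrary) beta distribution, so no sex difference is expected; the group names and the distribution are assumptions for illustration only.

```r
set.seed(42)
## Simulated burden scores for two hypothetical groups drawn from one distribution
male   <- rbeta(500, 1, 12)
female <- rbeta(500, 1, 12)

## Distributional comparison (the kind of p-value reported by the function above)
ks <- ks.test(male, female)

## 2x2 table of individuals below / at-or-above the 0.15 threshold, rows = sex
tab <- matrix(c(sum(male < 0.15), sum(female < 0.15),
                sum(male >= 0.15), sum(female >= 0.15)),
              nrow = 2,
              dimnames = list(c("Male", "Female"), c("Below", "AtOrAbove")))
prop <- prop.table(tab, margin = 1) * 100

print(ks$p.value)
print(round(prop, 2))
```

The function above only reports the KS p-value; running `chisq.test(tab)` on the same table would be one way to also test the proportions directly.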

### Add A Covariate for Having WES Data

This value is needed in the UKBB phenotype data.table so that WES individuals can be filtered out when running deletion models, ensuring the meta-analysis isn't biased in any way. This information is contained in the file `rawdata/snvresources/has_exome.txt`, which is loaded here.

```{r add WES tag}

has.wes <- fread("rawdata/snvresources/has_exome.txt")
has.wes[,eid:=as.character(eid)]

UKBB.phenotype.data <- merge(UKBB.phenotype.data, has.wes, by = "eid", all.x = T)
```

### Add a Covariate for Having a Pathogenic CNV

```{r add Path CNV tag}

ukbb.path.carriers <- unique(ukbb.annotated.cnvs.qcd[path.locus != "null" & filter.0.95.wes.support.score == T, eid])
UKBB.phenotype.data[,has.path.cnv:=if_else(eid %in% ukbb.path.carriers, 1, 0)]

```

# 5. Variant Burden Impact on Traits

## 5A. Run Models on Compute Cluster

### Write Files

This chunk defines a function, `make.table`, which builds a table of model specifications to use as input to our `run.regression` function. It then builds such a table for every model that we run in this section (i.e. section 5).

```{r Write Files}

## Save variants and Phenotype data that is the same for all data:
saveRDS(UKBB.phenotype.data, "rawdata/models/UKBB.phenotype.rdat")
saveRDS(variant.counts, "rawdata/models/variant_counts.rdat")

make.table <- function(gene.list,
                       y.var,
                       model.family,
                       name,
                       add.covars = c(),
                       remove.zeros = F,
                       remove.sequenced = T,
                       cutoff.high = F,
                       num.pcs = 40,
                       num.rare.pcs = 100,
                       indv.to.exclude = "",
                       indv.to.exclude.value = -1, 
                       nagel = F,
                       return.data = F,
                       add.maf = F
                       ) {
  
  if (return.data == F) {
    variants <- c("DEL","DUP","LOF_HC","SYN","MIS")
  } else {
    variants <- c("DEL","LOF_HC")
  }
  
  if (add.maf == T) {
    poss.maf <-  c(0, 1e-3, 1e-4, 1e-5)
  } else {
    poss.maf <- c(0, 1e-3)
  }
  
  model.table <- data.table(crossing(maf = poss.maf,
                                       sex = c(1,2),
                                       variant.type = variants,
                                       gene.list = gene.list,
                                       y.var = y.var,
                                       model.family = model.family,
                                       name = name,
                                       remove.zeros = remove.zeros,
                                       remove.sequenced = remove.sequenced,
                                       cutoff.high = cutoff.high,
                                       num.pcs = num.pcs,
                                       num.rare.pcs = num.rare.pcs,
                                       indv.to.exclude = indv.to.exclude,
                                       indv.to.exclude.value = indv.to.exclude.value,
                                       nagel = nagel,
                                       return.data = return.data
                                       ))
    
  if (is.null(add.covars)) {
    model.table[,add.covars:=list()]
  } else {
    model.table[,add.covars:=list(add.covars)]
  }

  return(model.table)
  
}

lm.master.table <- rbind(
  ## Fertility
  make.table("product_sHET","num.children","binomial","results.fertility", add.maf = T),
  make.table("product_sHET","num.children","quasipoisson","results.fertility.linear"),
  make.table("highsHET","num.children","binomial","results.fertility.genelists",cutoff.high = T),
  make.table("highPLI","num.children","binomial","results.fertility.genelists",cutoff.high = T),
  make.table("product_sHET_old","num.children","binomial","results.fertility.cassa"),
  make.table("product_sHET","num.children","quasipoisson","results.fertility.zero", remove.zeros = T),
  make.table("product_sHET_no_maleInfertilityGenes","num.children","binomial","results.excl.male"),
  make.table("product_sHET_no_mouseInfertilityGenes","num.children","binomial","results.excl.mouse"),
  make.table("product_sHET","num.children","binomial","results.fertility.no.male.infertility",indv.to.exclude="has.male.infertility",indv.to.exclude.value = 1),
  make.table("product_sHET_no_diseaseGenes","num.children","binomial","results.excl.disease"),
  make.table("product_sHET_no_mhdGenes","num.children","binomial","results.excl.mhd"),
  make.table("product_sHET","num.children","binomial","results.fertility.no.path",indv.to.exclude="has.path.cnv",indv.to.exclude.value = 1),
  make.table("product_sHET","num.children","binomial","results.fertility.age",indv.to.exclude="is.age.1940",indv.to.exclude.value = 1),
  make.table("product_sHET","num.children","binomial","results.fertility.age",indv.to.exclude="is.age.1950",indv.to.exclude.value = 1),
  make.table("product_sHET","num.children","binomial","results.fertility.age",indv.to.exclude="is.age.1960",indv.to.exclude.value = 1),
  
  ## Partner
  make.table("product_sHET","partner.in.house","binomial","results.partner"),
  make.table("product_sHET","lives.alone","binomial","results.lives.alone"),
  
  ## Cognition
  make.table("product_sHET","fluid.intel","gaussian","results.cog"),
  
  ## EA
  make.table("product_sHET","completed.college","binomial","results.ea"),
  
  ## Results MHT
  make.table("product_sHET","fi.developmental_disorder","binomial","results.mht"),
  make.table("product_sHET","fi.asd","binomial","results.mht"),
  make.table("product_sHET","fi.add","binomial","results.mht"),
  make.table("product_sHET","fi.scizo","binomial","results.mht"),
  make.table("product_sHET","fi.bipolar","binomial","results.mht"),
  
  make.table("product_sHET","hes.developmental_disorder","binomial","results.mht"),
  make.table("product_sHET","hes.asd","binomial","results.mht"),
  make.table("product_sHET","hes.add","binomial","results.mht"),
  make.table("product_sHET","hes.scizo","binomial","results.mht"),
  make.table("product_sHET","hes.bipolar","binomial","results.mht"),
  
  make.table("product_sHET","mht.binary","binomial","results.mht"),
  
  make.table("product_sHET","num.children","binomial","results.fertility.no.mhq",indv.to.exclude = "mht.binary", indv.to.exclude.value = 1),
  make.table("product_sHET","mhq.answered_mhq","binomial","results.answered.mhq"),
  make.table("product_sHET","has.email","binomial","results.email"),
  make.table("product_sHET","has.gp.data","binomial","results.has.CHOD"),
  
  ## Household Income
  make.table("product_sHET","household.income","gaussian","results.household.income", add.covars = c("partner.in.house")),
  
  ## Same Sex Behaviour
  make.table("product_sHET","same.sex","binomial","results.same.sex"),
  make.table("product_sHET","num.children","binomial","results.fertility.no.same.sex",indv.to.exclude = "same.sex", indv.to.exclude.value = 1),
  
  ## Had Sex
  make.table("product_sHET","had.sex","binomial","results.had.sex"),
  
  ## Neutral Phenotypes
  make.table("product_sHET","fresh.fruit","gaussian","results.fruit"),
  make.table("product_sHET","handedness","binomial","results.handedness"),
  make.table("product_sHET","is.blonde","binomial","results.hair"),
  
  ## Male Infertility Codes
  make.table("product_sHET","fi.fert","binomial","results.fertility.MIC.CHOD"),
  
  ## Townsend Index
  make.table("product_sHET","townsend.index","gaussian","results.townsend"),
  
  ## Joint Models:
  make.table("product_sHET","num.children","binomial","joint.models",return.data = T, nagel = T)

)

## (More) Joint Models
add.traits <- c("mht.binary","partner.in.house","completed.college","fi.fert","had.sex")

add.traits.all <- list()
z <- 1

for (x in c(1:length(add.traits))) {
  combs <- combn(add.traits, x, FUN = list)
  for (y in c(1:length(combs))) {
    add.traits.all[[z]] <- combs[[y]]
    z <- z+1
  }
}

for (i in c(1:length(add.traits.all))) {
  
  curr.cov <- add.traits.all[[i]]
  lm.master.table <- rbind(lm.master.table,
                           make.table("product_sHET","num.children","binomial","joint.models",return.data = T, nagel = T, add.covars = curr.cov))
  
}

## PCs
for (i in seq(10,40)) {
  lm.master.table <- rbind(lm.master.table,
                           make.table("product_sHET","num.children","binomial","results.pcs",num.pcs = i))
}
## For PCs we want to limit our number of models or this will take forever...
lm.master.table <- lm.master.table[name != "results.pcs" | (name == "results.pcs" & sex == 1 & (variant.type == "DEL" | variant.type == "LOF_HC") & maf == 0)]

## Rare PCs
for (i in seq(10,100)) {
  lm.master.table <- rbind(lm.master.table,
                           make.table("product_sHET","num.children","binomial","results.rare.pcs",num.rare.pcs = i))
}
## For PCs we want to limit our number of models or this will take forever...
lm.master.table <- lm.master.table[name != "results.rare.pcs" | (name == "results.rare.pcs" & sex == 1 & (variant.type == "DEL" | variant.type == "LOF_HC") & maf == 0)]

## Expressed Genes in Tissues
for (tissues in tissues.for.regression) {
  lm.master.table <- rbind(lm.master.table,
                           make.table(eval(tissues),"num.children","binomial","results.tissues"))
}
## For tissue-specific lists we want to limit our number of models or this will take forever...
lm.master.table <- lm.master.table[name != "results.tissues" | (name == "results.tissues" & (variant.type == "DEL" | variant.type == "LOF_HC") & maf == 0)]

saveRDS(lm.master.table, "rawdata/models/lm.master.table.rdat")
  
```
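To make the size of these tables concrete, the sketch below reproduces two of the counts from the chunk above: the number of rows in a default model grid, and the number of covariate sets used for the additional joint models. It uses base R `expand.grid` as a stand-in for `crossing`, which is an assumption made purely to keep the example dependency-free.

```r
## Default grid: 2 MAF cutoffs x 2 sexes x 5 variant classes
grid <- expand.grid(maf = c(0, 1e-3),
                    sex = c(1, 2),
                    variant.type = c("DEL", "DUP", "LOF_HC", "SYN", "MIS"),
                    stringsAsFactors = FALSE)
print(nrow(grid))        # 20 rows, one per model

## Every non-empty subset of the five joint-model covariates
add.traits <- c("mht.binary", "partner.in.house", "completed.college", "fi.fert", "had.sex")
subsets <- unlist(lapply(seq_along(add.traits),
                         function(k) combn(add.traits, k, simplify = FALSE)),
                  recursive = FALSE)
print(length(subsets))   # 2^5 - 1 = 31 covariate sets
```

Each call with `return.data = T` uses only DEL and LOF_HC, so every covariate set contributes 2 MAF cutoffs x 2 sexes x 2 variant classes = 8 models.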

### Run Models on Farm

This is an example execution block that runs the script `run_regressions.R` as an LSF job array using the input rdat from above, and then mashes the results together with the script provided at `scripts/combine.R`.

```{bash run models, eval = F}

## Run the regressions:
bsub -q normal -M 3000 -o gridout/MODEL.%J.%I -J 'LM[1-1596]' './run_regressions.R'

## Mash them together into a single Rdat
./combine.R
```
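The sketch below shows one way an array task could select its model from the saved table. It assumes the script reads the LSF array index from the `LSB_JOBINDEX` environment variable; how `run_regressions.R` actually does this is not shown in this document, so treat the approach and the `toy.table` stand-in as hypothetical.

```r
## LSF exposes the array index in the LSB_JOBINDEX environment variable;
## default to 1 here so the sketch also runs outside a job array
job.index <- as.integer(Sys.getenv("LSB_JOBINDEX", unset = "1"))

## Stand-in for the saved model table (the real one has one row per model)
toy.table <- data.frame(y.var = c("num.children", "fluid.intel"),
                        model.family = c("binomial", "gaussian"),
                        stringsAsFactors = FALSE)

## Each array task would pick exactly one row to run
job.index <- min(max(job.index, 1), nrow(toy.table))  # clamp for the toy table
current.model <- toy.table[job.index, ]
print(current.model)
```

With this pattern, the upper bound of the array range in the `bsub` call needs to match the number of rows in the model table.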

### Read Final Models

We then read in all the final models. The code chunks below read this data back in, split it into the individual results, and plot all possible models for each phenotype (i.e. 2 sexes, 5 variant classes, 2 MAF cutoffs).

```{r Read Models}

lm.results.table <- readRDS("rawdata/models/models.rdat")

```

## 5B. Function for Linear Modeling

This code blob builds a function which does linear or logistic modeling for all my variant associations of the format:

$$phenotype \sim s_{het[i,v]} + age + age^2 + birth.cohort + wes.status + PC1_{common} \ldots PC40_{common} + PC1_{rare} \ldots PC100_{rare}$$

where $s_{het[i,v]}$ is the s~het~ burden in individual $i$ for variant class $v$, and $v$ can be DEL, DUP, PTV, Missense, or Synonymous. This block has a number of flags to handle the differing cases we test (i.e. logistic vs linear model, removing all individuals without children, etc.).

This function is *NOT* regularly used in the R doc and I instead use a compute cluster to run most models (see section above), but is here for legacy purposes, debugging, and as an example. This function is, for all intents and purposes, identical to the one included in the script `scripts/run_regressions.R`, with slight modifications for parallel computing purposes.

```{r Linear Regression Function}

run.regression <- function(maf,
                           gene.list,
                           y.var,
                           sex,
                           variant.type,
                           model.family,
                           add.covars = c(),
                           return.data = F,
                           remove.zeros = F,
                           remove.sequenced = T,
                           cutoff.high = F,
                           num.pcs = 40,
                           num.rare.pcs = 100,
                           indv.to.exclude = c(),
                           nagel = F,
                           age.group = "all") {
 
  id.name <- "eid"
  
  cols.to.keep <- c("sample_id",gene.list)
  final.stats <- variant.counts[type == variant.type & allele.freq == maf,..cols.to.keep]

  if (cutoff.high == T & (variant.type == "DEL" | variant.type == "DUP" | variant.type == "LOF_HC")) {
    final.stats <- final.stats[get(gene.list) <= 3]
  }
  
  final.stats <- merge(final.stats,UKBB.phenotype.data,by.x="sample_id",by.y=id.name)
  
  ## Remove WES individuals from CNV analyses for meta-analysis purposes
  if (remove.sequenced == T & (variant.type == "DEL" | variant.type == "DUP")) {
    final.stats <- final.stats[has.wes == 0]
  } else {
    add.covars <- c(add.covars,"has.wes")
  }
  
  ## Remove missing y.var data
  final.stats <- final.stats[!is.na(get(y.var))]
  if (remove.zeros == T) {
    final.stats <- final.stats[get(y.var)>0]
  }

  ## Restrict to a single sex when sex is 1 or 2; any other value keeps both sexes
  if (sex == 1 | sex == 2) {
    final.stats <- final.stats[sexPulse == sex]
  }
  
  ## Model separate birth cohorts
  if (age.group == "first") {
    final.stats <- final.stats[birth.year > median(birth.year)]
  } else if (age.group == "last") {
    final.stats <- final.stats[birth.year <= median(birth.year)]
  }
  
  ## Remove individuals based on some criteria passed to this function
  final.stats <- final.stats[!sample_id %in% indv.to.exclude]
  
  ## Set linear or logistic model
  if (model.family == "binomial") {
    ## And force the phenotype to binary:
    final.stats[,binary.stat:=if_else(get(y.var) > 0,1,0)]
    y.var <- "binary.stat"
  }
  
  ## Remove missing additional covar data
  add.covars <- unlist(add.covars)
  for (cov in add.covars) {
    final.stats <- final.stats[!is.na(get(cov))]
  }
  
  covariates <- c(gene.list,add.covars,"agePulse.squared","agePulse","sexPulse","birth.year.cut")
  ## Set numbers of common/rare ancestry PCs
  if (num.pcs == 40) {
    covariates <- c(covariates,paste0("PC",seq(1,40)))
  } else if (num.pcs > 0)  {
    for (pc.num in c(1:num.pcs)) {
      covariates <- c(covariates, paste0("PC",pc.num))
    }
  }
  if (num.rare.pcs == 100) {
    covariates <- c(covariates,paste0("scaled.rare.PC",seq(1,100)))
  } else if (num.rare.pcs > 0) {
    for (pc.num in c(1:num.rare.pcs)) {
      covariates <- c(covariates, paste0("scaled.rare.PC",pc.num))
    }
  }
  
  cov.string <- paste(covariates, collapse=" + ")
  formated.formula <- as.formula(paste(y.var, cov.string,sep=" ~ "))
  
  test.lm <- glm(formated.formula, data=final.stats, family=model.family)
  coef.lm <- tidy(test.lm) %>% data.table()
  final.stats <- augment(test.lm) %>% data.table()
  total.hits <- final.stats[,sum(get(gene.list))]

  ## Nagelkerke pseudo-R^2: refit without the burden term and compare against the full model
  if (nagel == T) {
    covariates <- c("agePulse.squared","agePulse","sexPulse","birth.year.cut",add.covars,paste0("PC",seq(1,40)),paste0("scaled.rare.PC",seq(1,100)))
  
    cov.string <- paste(covariates, collapse=" + ")
    formated.formula <- as.formula(paste(y.var, cov.string,sep=" ~ "))
  
    test.lm.init <- glm(formated.formula, data=final.stats, family=model.family)
    nag.shet.out <- nagelkerke(test.lm, null = test.lm.init)
    
    ## Return results, including the full coefficient table when return.data is TRUE
    if (return.data == T) {  
      return(list(coef.lm[term==eval(gene.list),estimate],
        coef.lm[term==eval(gene.list),std.error],
        coef.lm[term==eval(gene.list),p.value],
        total.hits,
        nrow(final.stats),
        list(coef.lm),
        nag.shet.out$Pseudo.R.squared.for.model.vs.null[3]))
    } else {
      return(list(coef.lm[term==eval(gene.list),estimate],
        coef.lm[term==eval(gene.list),std.error],
        coef.lm[term==eval(gene.list),p.value],
        total.hits,
        nrow(final.stats),
        NULL,
        nag.shet.out$Pseudo.R.squared.for.model.vs.null[3]))
    }
    
  } else {
    
    if (return.data == T) {
      return(list(coef.lm[term==eval(gene.list),estimate],
        coef.lm[term==eval(gene.list),std.error],
        coef.lm[term==eval(gene.list),p.value],
        total.hits,
        nrow(final.stats),
        list(coef.lm),
        NaN))
    } else {
      return(list(coef.lm[term==eval(gene.list),estimate],
        coef.lm[term==eval(gene.list),std.error],
        coef.lm[term==eval(gene.list),p.value],
        total.hits,
        nrow(final.stats),
        NULL,
        NaN))
    }
    
  }
  
}
```
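A stripped-down version of the same idea, on simulated data, may help when debugging. The sketch below fits a logistic model with and without a burden term, pulls out the estimate, standard error, and p-value for burden, and computes a Nagelkerke pseudo-R^2 of the full model against the covariate-only model. All column names and effect sizes are invented for the example, and the pseudo-R^2 is written out from the standard Cox-Snell/Nagelkerke definition rather than calling `nagelkerke()`, on the assumption that the two agree when a null model is supplied.

```r
set.seed(7)
n <- 2000
## Simulated stand-ins for the real covariates (all names here are hypothetical)
toy <- data.frame(burden = rbeta(n, 1, 10),
                  age    = round(runif(n, 40, 70)))
toy$age.squared <- toy$age^2
## Simulate a binary outcome whose odds fall as burden rises
toy$has.children <- rbinom(n, 1, plogis(1 - 3 * toy$burden))

## Full model (with burden) and covariate-only null model
full <- glm(has.children ~ burden + age + age.squared, data = toy, family = "binomial")
null <- glm(has.children ~ age + age.squared, data = toy, family = "binomial")

## Estimate, standard error, z value, and p-value for the burden term
burden.row <- summary(full)$coefficients["burden", ]

## Nagelkerke pseudo-R^2 of the full model relative to the covariate-only null
ll.full <- as.numeric(logLik(full))
ll.null <- as.numeric(logLik(null))
cox.snell <- 1 - exp(-(2 / n) * (ll.full - ll.null))
nagelkerke.r2 <- cox.snell / (1 - exp((2 / n) * ll.null))

print(burden.row)
print(nagelkerke.r2)
```

Because the null model already contains the covariates, the resulting value describes only the additional variance explained by the burden term, so small values are expected.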

## 5C. Plotting Function

This is a plotting function for generating a nicely formatted plot for initial data visualization purposes. It isn't used for any main text/supplemental data or figures.

```{r Plotting Function, fig.height=7, fig.width=8}

plot.result <- function(data, binary, num.tests, ymin, ymax, y.lab) {

  plottable <- copy(data)
  ## This just makes it so the labels aren't 500 miles long:
  plottable[,gene.list.2:=factor(gene.list,
                                 levels=sort(unique(plottable[,gene.list])),
                                 labels=c(str_wrap(gsub("_"," ",gsub("\\."," ",sort(unique(plottable[,gene.list])))),width=20)))]
  
  ## Checks for significance:
  # Note: We did 140 total tests in UKBB-CHOD data if including both MAFs, both sexes, and all gene lists
  # I think 140 is a bit restrictive as the MAF tests are likely independent, so going with 70 tests (exclude MAF cutoffs from correction)
  sig.threshold <- 0.05/num.tests
  
  ## Confidence Intervals and significance
  if (binary == T) {
    ## Convert to OR (I don't think this should affect the original table...)
    plottable[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
    plottable[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
    plottable[,sig.pos:=if_else(var.beta<0,var.ci.lower-0.1,var.ci.upper+0.1)]
    
    plottable[,var.beta:=exp(var.beta)]
    ylab <- "Odds Ratio"
    yline <- 1

  } else {
    plottable[,var.ci.upper:=var.beta + (1.96*var.stderr)]
    plottable[,var.ci.lower:=var.beta - (1.96*var.stderr)]
    plottable[,sig.pos:=if_else(var.beta<0,var.ci.lower-0.05,var.ci.upper+0.05)]

    ylab <- "Effect Size"
    yline <- 0
  }
  
  ## Set Arrows!
  plottable[,var.ci.lower.symbol:=if_else(var.ci.lower<ymin,25,NaN)]
  plottable[,var.ci.lower:=if_else(var.ci.lower<ymin,ymin,var.ci.lower)]
  plottable[,var.ci.upper.symbol:=if_else(var.ci.upper>ymax,24,NaN)]
  plottable[,var.ci.upper:=if_else(var.ci.upper>ymax,ymax,var.ci.upper)]
  
  plottable[,sig:=if_else(var.p <= sig.threshold, "*", "")]
  
  plottable[,sex:=factor(sex,levels=c(1,2,3),labels=c("Male","Female","Both"))]
  
  betas <- ggplot(plottable,aes(x=gene.list.2,y=var.beta,group=interaction(sex,variant.type,maf),colour=variant.type,linetype=as.factor(maf),shape=as.numeric(sex)+16)) +
    geom_hline(aes(yintercept=yline),linetype=7,colour="red") +
    geom_point(position=position_dodge(width=1)) +
    scale_x_discrete(name="") +
    scale_y_continuous(name=ylab,limits = c(ymin,ymax)) +
    geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),position=position_dodge(width=1),width=0) +
    geom_text(aes(y=sig.pos,label=sig),position=position_dodge(width=1),size=10,hjust="middle") +
    geom_point(aes(y=var.ci.lower,shape=var.ci.lower.symbol,fill=variant.type),position=position_dodge(width=1)) +
    geom_point(aes(y=var.ci.upper,shape=var.ci.upper.symbol,fill=variant.type),position=position_dodge(width=1)) +
    ## This code which uses the 'unique' function is to keep it from breaking if Males aren't present
    scale_shape_identity(guide=guide_legend(title = "Sex"),breaks=as.numeric(unique(plottable[,sex])) + 16,labels=unique(plottable[,sex])) + 
    scale_color_discrete(guide="none") +
    scale_linetype_discrete(guide=guide_legend(title = "MAF Threshold")) +
    theme.legend + theme(axis.text.x = element_blank())
  
  counts <- ggplot(plottable,aes(gene.list.2,n.var,group=interaction(sex,variant.type,maf),fill=variant.type,linetype=as.factor(maf))) +
    geom_col(position=position_dodge(),colour="black") +
    scale_x_discrete(name="") +
    scale_y_log10(name=y.lab) +
    scale_fill_discrete(guide=guide_legend(title="Variant Type")) +
    scale_linetype_discrete(guide="none") +
    theme.legend +
    theme(axis.text.x=element_blank())
  
  plot <- (counts / betas) + plot_layout(heights=c(1,3),guides = "collect")
  
  plot
  
  return(plot)
  
}
```
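The conversions inside the plotting function can be checked on a single made-up estimate. The sketch below turns a log-odds estimate and its standard error into an odds ratio with a 95% Wald confidence interval, and compares a two-sided Wald p-value to a Bonferroni threshold. The values of `beta`, `stderr`, and `num.tests` are assumptions chosen for the example.

```r
## Hypothetical log-odds estimate and standard error from a logistic model
beta   <- -0.5
stderr <- 0.1

## Odds ratio with a 95% Wald confidence interval (exponentiate the log-odds bounds)
or       <- exp(beta)
ci.lower <- exp(beta - 1.96 * stderr)
ci.upper <- exp(beta + 1.96 * stderr)

## Bonferroni threshold for a given number of tests
num.tests     <- 20
sig.threshold <- 0.05 / num.tests

## Two-sided Wald p-value for the same estimate
p.value <- 2 * pnorm(-abs(beta / stderr))

print(round(c(OR = or, lower = ci.lower, upper = ci.upper), 3))
print(p.value <= sig.threshold)
```

The interval is computed on the log-odds scale and then exponentiated, so it is not symmetric around the odds ratio.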

## 5D. Fertility

### Main Regression

```{r Childlessness Regression, fig.height=7, fig.width=12}
results.fertility <- lm.results.table[name == "results.fertility"]

plot.result(results.fertility,T,20,0,1.4, "# of Individuals\nWith sHET\n> 0.15")
```

### Additional Analyses

Note: I also tested whether removing any CNV that overlaps the MHC locus (~chr6:29000000-33000000) changes the effect size. It did not. I have not documented it here because it would require retooling the code base and I did not observe any change in effect. Likely reasons:

- Not many individuals have a rare variant that overlaps MHC (many have a variant but a lot are more common).
- HLA genes don't really have high s~het~ scores.
- Genes in this locus mostly relate to the immune system. I believe our effect is more likely to be modulated via neurodev/brain active genes.
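
For anyone wanting to repeat the MHC check, a minimal sketch of flagging overlapping CNVs is given here. It is not the code used for the test described in this note; the helper `overlaps.mhc`, the toy coordinates, and the chromosome naming ("6" rather than "chr6") are all assumptions for illustration.

```r
## Approximate MHC interval quoted above (chr6:29000000-33000000)
mhc <- list(chr = "6", start = 29000000, end = 33000000)

## TRUE when a CNV on the same chromosome shares at least one base with the MHC interval
overlaps.mhc <- function(chr, start, end) {
  chr == mhc$chr & start <= mhc$end & end >= mhc$start
}

## Toy CNVs (coordinates invented for illustration)
toy.cnvs <- data.frame(chr   = c("6", "6", "1"),
                       start = c(28500000, 40000000, 30000000),
                       end   = c(29100000, 40500000, 31000000),
                       stringsAsFactors = FALSE)
toy.cnvs$in.mhc <- overlaps.mhc(toy.cnvs$chr, toy.cnvs$start, toy.cnvs$end)
print(toy.cnvs)
```

Rows with `in.mhc == TRUE` would then be dropped before recomputing per-individual burden and rerunning the models.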

#### Linear Model Instead of Logistic

This tests the relationship of s~het~ burden with actual number of children:

```{r Linear Childlessness Regression,  fig.height=7, fig.width=8}
results.fertility.linear <- lm.results.table[name == "results.fertility.linear"]

plot.result(results.fertility.linear,F,20,-0.65,0.4, "# of Individuals\nWith sHET\n> 0.15")
```

#### Using Gene Lists Instead of Quantitative sHET

```{r Gene List Regression, fig.height=7, fig.width=8}
results.fertility.genelists <- lm.results.table[name == "results.fertility.genelists"]

plot.result(results.fertility.genelists,T,70,0.5,1.2,"Total Number of\nVariants")

paste0("Number of DEL Individuals Lost, pLI ≥ 0.9  : ", results.fertility[variant.type == "DEL" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "DEL" & maf == 0 & gene.list == "highPLI",sum(n.indvs)])
paste0("Number of DEL Individuals Lost, sHET ≥ 0.9 : ", results.fertility[variant.type == "DEL" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "DEL" & maf == 0 & gene.list == "highsHET",sum(n.indvs)])

paste0("Number of PTV Individuals Lost, pLI ≥ 0.9  : ", results.fertility[variant.type == "LOF_HC" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "LOF_HC" & maf == 0 & gene.list == "highPLI",sum(n.indvs)])
paste0("Number of PTV Individuals Lost, sHET ≥ 0.9 : ", results.fertility[variant.type == "LOF_HC" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "LOF_HC" & maf == 0 & gene.list == "highsHET",sum(n.indvs)])

```

#### Cassa et al. s~het~

```{r cassa shet, fig.height=7, fig.width=8}

results.fertility.cassa <- lm.results.table[name == "results.fertility.cassa"]

plot.result(results.fertility.cassa,T,20,0,1.4, "# of Individuals\nWith Cassa sHET\n> 0.15")
```

#### Excluding Various Genes/Individuals

##### Individuals With >0 Children

```{r GT 0 Children Regression, fig.height=7, fig.width=8}
results.fertility.zero <- lm.results.table[name == "results.fertility.zero"]

plot.result(results.fertility.zero,F,20,-1,0.5,"# of Individuals\nWith sHET\n> 0.15")
```

##### Male Infertility Genes

```{r Exclude Male Infertility Genes, fig.height=7, fig.width=8}

results.excl.male <- lm.results.table[name == "results.excl.male"]

plot.result(results.excl.male,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

##### Mouse Infertility Genes

```{r Exclude Mouse Infertility Genes, fig.height=7, fig.width=8}

results.excl.mouse <- lm.results.table[name == "results.excl.mouse"]

plot.result(results.excl.mouse,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

```

##### Male Infertility Carriers

```{r No Male Infertility Carriers, fig.height=7, fig.width=8}

## Get males that have an infertility code
results.fertility.no.male.infertility <- lm.results.table[name == "results.fertility.no.male.infertility"]

plot.result(results.fertility.no.male.infertility,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

## Total number of individuals with Male Infertility Coding:
## nrow() counts individuals; length() on a data.table would return the number of columns
paste0("# of Individuals with Male Infertility coding: ", nrow(UKBB.phenotype.data[has.male.infertility == 1]))


```

##### Known Disease Genes

```{r Exclude Disease, fig.height=7, fig.width=8}

results.excl.disease <- lm.results.table[name == "results.excl.disease"]

plot.result(results.excl.disease,T,70,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

results.excl.mhd <- lm.results.table[name == "results.excl.mhd"]

plot.result(results.excl.mhd,T,70,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

##### Pathogenic CNV Carriers

```{r No Pathogenic Regression, fig.height=7, fig.width=8}

## Kind of have to do this in a weird way so that I don't have to change my function massively
results.fertility.no.path <- lm.results.table[name == "results.fertility.no.path"]

plot.result(results.fertility.no.path,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

## Total number of individuals with path CNVs:
path.cnv.counts <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score == T & path.locus != "null"]

paste0("CNV Carriers account for ",sprintf("%0.1f",(length(unique(path.cnv.counts[,eid]))/nrow(samples.UKBB.cnv)*100)),"% (", length(unique(path.cnv.counts[,eid])),") of individuals.")
```

##### Expressed Genes in GTEx tissues

```{r gene regression, fig.height=20, fig.width=10}

results.genelists <- lm.results.table[name == "results.tissues"]

```


#### Gene Expression in Testis

```{r Testis Expression, fig.height=5, fig.width=5}

dels.male <- ukbb.annotated.cnvs.qcd[ct == "DEL" & eid %in% samples.UKBB.cnv[,eid] & !eid %in% has.wes[has.wes > 0,eid] & filter.0.95.wes.support.score == T,c("eid","genes","locus","wes.support.score","gt")]

af <- dels.male[,sum(gt),by="locus"]
setnames(af,"V1","ac")
tot.samps <- nrow(samples.UKBB.cnv)
af[,af:=ac/(tot.samps*2)]

dels.male <- merge(dels.male,af,by="locus")
dels.male <- dels.male[ac == 1]

counts.dels <- data.table(table(dels.male[,unlist(genes)]))
setnames(counts.dels,c("V1","N"),c("hg19.GENE","N.del"))

counts.ptvs <- UKBB.genes.200k[CSQ == "LOF_HC" & maf == 0,c("GENE","UKBB")]
setnames(counts.ptvs,c("GENE","UKBB"),c("hg19.GENE","N.ptv"))

shet.genes.expr <- merge(shet.genes[,c("hg19.GENE","sHET.val","GENE")],expression.testis[,c("hg19.GENE","Testis")],by="hg19.GENE")
shet.genes.expr[,male.infertility:=hg19.GENE %in% male.infertility.genes[,hg19.GENE]]

shet.genes.expr <- merge(shet.genes.expr,counts.dels,by="hg19.GENE", all.x = T)
shet.genes.expr <- merge(shet.genes.expr,counts.ptvs,by="hg19.GENE", all.x = T)
shet.genes.expr[,N.del:=ifelse(is.na(N.del),0L,N.del)]
shet.genes.expr[,N.ptv:=ifelse(is.na(N.ptv),0L,N.ptv)]

shet.genes.expr[,has.del:=N.del>0]
shet.genes.expr[,has.ptv:=N.ptv>0]
shet.genes.expr[,log.mean:=log(Testis)]

## Has a Private DEL
ggplot(shet.genes.expr[sHET.val > 0.15], aes(as.factor(has.del), log.mean, group = as.factor(has.del))) + geom_boxplot() + scale_x_discrete(name = "Has a private DEL") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle(expression(s[HET]~`>`~0.15~Genes~Only)) + theme
wilcox.test(log.mean ~ has.del, data = shet.genes.expr[sHET.val > 0.15],alternative=c("less"))

ggplot(shet.genes.expr, aes(as.factor(has.del), log.mean, group = as.factor(has.del))) + geom_boxplot() + scale_x_discrete(name = "Has a private DEL") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle("All Genes") + theme
wilcox.test(log.mean ~ has.del, data = shet.genes.expr,alternative=c("less"))

## Has a Private PTV
ggplot(shet.genes.expr[sHET.val > 0.15], aes(as.factor(has.ptv), log.mean, group = as.factor(has.ptv))) + geom_boxplot() + scale_x_discrete(name = "Has a private PTV") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle(expression(s[HET]~`>`~0.15~Genes~Only)) + theme
wilcox.test(log.mean ~ has.ptv, data = shet.genes.expr[sHET.val > 0.15],alternative=c("less"))

ggplot(shet.genes.expr, aes(as.factor(has.ptv), log.mean, group = as.factor(has.ptv))) + geom_boxplot() + scale_x_discrete(name = "Has a private PTV") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle("All Genes") + theme
wilcox.test(log.mean ~ has.ptv, data = shet.genes.expr,alternative=c("less"))

## Is a male infertility gene
ggplot(shet.genes.expr[sHET.val > 0.15], aes(as.factor(male.infertility), log.mean, group = as.factor(male.infertility))) + geom_boxplot() + scale_x_discrete(name = "Is a male infertility gene") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle(expression(s[HET]~`>`~0.15~Genes~Only)) + theme
wilcox.test(log.mean ~ male.infertility, data = shet.genes.expr[sHET.val > 0.15],alternative=c("less"))

ggplot(shet.genes.expr, aes(as.factor(male.infertility), log.mean, group = as.factor(male.infertility))) + geom_boxplot() + scale_x_discrete(name = "Is a male infertility gene") + scale_y_continuous(name = "Mean Expr. Testis") + ggtitle("All Genes") + theme
wilcox.test(log.mean ~ male.infertility, data = shet.genes.expr,alternative=c("less"))
```

#### Correlation of sHET in Various Tissues

```{r fig.height=14, fig.width=12.5}
expression <- fread("rawdata/genelists/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_median_tpm.gct")
expression[,c("gene","vers"):=tstrsplit(gene_id,".",fixed=T),by=1:nrow(expression)]
setnames(expression,"gene","hg19.GENE")
expression <- expression[,-c("gene_id","vers")]

tissues <- names(expression)[2:54]

expression <- merge(expression,shet.genes[,c("hg19.GENE","sHET.val")], by = "hg19.GENE")

expression <- merge(expression, UKBB.genes.200k[CSQ == "LOF_HC" & maf == 0,c("GENE","UKBB")], by.x = "hg19.GENE", by.y = "GENE", all.x = T)
setnames(expression, "UKBB","LOF_HC")
expression <- merge(expression, UKBB.genes.200k[CSQ == "SYN" & maf == 0,c("GENE","UKBB")], by.x = "hg19.GENE", by.y = "GENE", all.x = T)
setnames(expression, "UKBB","SYN")
expression[,LOF_HC:=ifelse(is.na(LOF_HC), 0, LOF_HC)]
expression[,SYN:=ifelse(is.na(SYN), 0, SYN)]

expression <- merge(expression,gnomad[,c("gene_id","cds_length","chromosome")],by.x="hg19.GENE",by.y="gene_id")

check.expr <- function(tissue) {
  
  cols <- c("hg19.GENE","sHET.val","LOF_HC","SYN","cds_length","chromosome",tissue)
  test.dt <- expression[,..cols]
  test.dt <- test.dt[chromosome != "X" & chromosome != "Y"]
  test.dt[,log.expr:=log(get(tissue))]
  test.dt[,cds_length.kb:=cds_length/1000]
  test.dt <- test.dt[!is.infinite(log.expr)]
  test.dt[,log.expr.norm:=(log.expr - mean(test.dt[,log.expr])) / sd(test.dt[,log.expr])]
  test.dt[,variant.mod.LOF:=(LOF_HC * sHET.val)]
  test.dt[,variant.mod.SYN:=(SYN * sHET.val)]
  
  ## First test sHET ~ expression + gene length
  covariates <- c("cds_length.kb","log.expr.norm")
  cov.string <- paste(covariates, collapse=" + ")
  formated.formula <- as.formula(paste("sHET.val", cov.string,sep=" ~ "))
  expr.lm <- lm(formated.formula, data = test.dt)
  expr.gl <- data.table(glance(expr.lm))
  
  ## Then test sHET * LOF ~ expression + gene length
  covariates <- c("cds_length.kb","log.expr.norm","LOF_HC")
  cov.string <- paste(covariates, collapse=" + ")
  formated.formula <- as.formula(paste("variant.mod.LOF", cov.string,sep=" ~ "))
  LOF.expr.lm <- lm(formated.formula, data = test.dt)
  LOF.expr.gl <- data.table(glance(LOF.expr.lm))
  LOF.expr.dt <- data.table(tidy(LOF.expr.lm))
  
  ## Then test sHET * SYN ~ expression + gene length
  covariates <- c("cds_length.kb","log.expr.norm","SYN")
  cov.string <- paste(covariates, collapse=" + ")
  formated.formula <- as.formula(paste("variant.mod.SYN", cov.string,sep=" ~ "))
  SYN.expr.lm <- lm(formated.formula, data = test.dt)
  SYN.expr.gl <- data.table(glance(SYN.expr.lm))
  SYN.expr.dt <- data.table(tidy(SYN.expr.lm))
  
  return(list(expr.gl$r.squared,
              LOF.expr.gl$r.squared,
              SYN.expr.gl$r.squared))
  
}

expr.test <- data.table(tissue = tissues)
expr.test[,c("r.sqr.shet","r.sqr.LOF","r.sqr.SYN"):=check.expr(tissue),by=1:nrow(expr.test)]
```

#### Effect In Varying Age Groups

```{r Age Groupings, fig.height=4, fig.width=10}

res.age <- lm.results.table[maf == 0 & name == "results.fertility.age" & (variant.type == "DEL" | variant.type == "LOF_HC")]
## Set age variable:
res.age[,age:=if_else(indv.to.exclude == "is.age.1940",1940,
                   if_else(indv.to.exclude == "is.age.1950",1950,1960))]
## Keep only relevant columns:
cols <- c("var.beta","var.stderr","y.var","sex","variant.type","age","n.indvs","var.p")
res.age <- res.age[,..cols]

for (i in c(1940,1950,1960)) {
  
  for (s in c(1,2)) {
    
    meta.table <- res.age[age == i & sex == s]
    
    meta.analy <- metagen(var.beta,
                          var.stderr,
                          studlab = variant.type,
                          method.tau = "SJ",
                          sm = "OR",
                          data = meta.table)

    res.age <- rbind(res.age,data.table(var.beta = meta.analy$TE.fixed,var.stderr = meta.analy$seTE.fixed,y.var = "num.children",sex = s,variant.type = "META",age = i, n.indvs=meta.table[,sum(n.indvs)],var.p=meta.analy$pval.fixed))
    
  }
  
}
  
res.age[,age:=as.character(age)]
## Actual results to compare against:
male <- results.fertility[sex == 1 & maf == 0 & (variant.type == "DEL" | variant.type == "LOF_HC"),c("var.beta","var.stderr","y.var","sex","variant.type","n.indvs","var.p")]
male[,age:="ALL"]
meta.analy.male <- metagen(var.beta,
                          var.stderr,
                          studlab = variant.type,
                          method.tau = "SJ",
                          sm = "OR",
                          data = male)

male <- rbind(male,data.table(var.beta = meta.analy.male$TE.fixed,var.stderr = meta.analy.male$seTE.fixed,y.var = "num.children",sex = 1, variant.type = "META",age = "ALL", n.indvs=male[,sum(n.indvs)],var.p=meta.analy.male$pval.fixed))

female <- results.fertility[sex == 2 & maf == 0 & (variant.type == "DEL" | variant.type == "LOF_HC"),c("var.beta","var.stderr","y.var","sex","variant.type","n.indvs","var.p")]
female[,age:="ALL"]
meta.analy.female <- metagen(var.beta,
                          var.stderr,
                          studlab = variant.type,
                          method.tau = "SJ",
                          sm = "OR",
                          data = female)

female <- rbind(female,data.table(var.beta = meta.analy.female$TE.fixed,var.stderr = meta.analy.female$seTE.fixed,y.var = "num.children",sex = 2,variant.type = "META",age = "ALL",n.indvs=female[,sum(n.indvs)],var.p=meta.analy.female$pval.fixed))

res.age <- rbind(res.age,female,male)

res.age[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
res.age[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
res.age[,var.beta:=exp(var.beta)]

rm(male, female, meta.analy.female, meta.analy.male)
```
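The `META` rows above combine the DEL and PTV estimates. To my understanding, the `TE.fixed` and `seTE.fixed` values returned by `metagen` correspond to the standard fixed-effect inverse-variance estimate, which can be sketched by hand as follows. The input numbers are hypothetical and only there to show the arithmetic.

```r
## Hypothetical log-odds estimates and standard errors for two variant classes
est <- c(DEL = -0.30, LOF_HC = -0.20)
se  <- c(DEL =  0.10, LOF_HC =  0.05)

## Inverse-variance weights: more precise estimates count for more
w <- 1 / se^2

pooled.est <- sum(w * est) / sum(w)
pooled.se  <- sqrt(1 / sum(w))
pooled.p   <- 2 * pnorm(-abs(pooled.est / pooled.se))

## Back-transform to an odds ratio with a 95% confidence interval,
## in the same way as the var.ci.upper / var.ci.lower columns above
pooled.or <- exp(pooled.est)
pooled.ci <- exp(pooled.est + c(-1.96, 1.96) * pooled.se)
```

Because the weights are `1 / se^2`, the pooled estimate sits closer to whichever variant class has the smaller standard error.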

#### Ancestry PC Analysis

##### Effect of PC on Having Children

```{r PCs}

phenotypes <- UKBB.phenotype.data[,c("eid",paste0("PC",seq(1,40)),paste0("scaled.rare.PC",seq(1,100)),"agePulse.squared","agePulse","sexPulse","num.children","has.wes","birth.year.cut")]
phenotypes[,has.children:=if_else(num.children>0,1,0)]
phenotypes <- phenotypes[sexPulse == 1]

covariates <- c("agePulse.squared","agePulse","has.wes","birth.year.cut",paste0("PC",seq(1,40)))
cov.string <- paste(covariates, collapse=" + ")
formated.formula <- as.formula(paste("has.children", cov.string,sep=" ~ "))

test.lm.init.common <- glm(formated.formula, data=phenotypes, family="binomial")
coef.lm.init.common <- tidy(test.lm.init.common) %>% data.table()
coef.lm.init.common[,pc.type:="common"]

covariates <- c("agePulse.squared","agePulse","has.wes","birth.year.cut",paste0("scaled.rare.PC",seq(1,100)))
cov.string <- paste(covariates, collapse=" + ")
formated.formula <- as.formula(paste("has.children", cov.string,sep=" ~ "))

test.lm.init.rare <- glm(formated.formula, data=phenotypes, family="binomial")
coef.lm.init.rare <- tidy(test.lm.init.rare) %>% data.table()
coef.lm.init.rare[,pc.type:="rare"]

coef.lm.init <- rbind(coef.lm.init.common, coef.lm.init.rare)

pc.effects.table <- coef.lm.init[grepl("PC", term),c("term","estimate","std.error","p.value","pc.type")]
setnames(pc.effects.table, names(pc.effects.table), c("PC","var.beta", "var.stderr", "var.p", "pc.type"))

pc.effects.table[,pc.num:=str_split(PC,"PC",simplify=T)[2],by=1:nrow(pc.effects.table)]
pc.effects.table[,pc.num:=as.integer(pc.num)]

pc.effects.table[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
pc.effects.table[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
pc.effects.table[,sig.pos:=if_else(var.beta<0,var.ci.lower-0.1,var.ci.upper+0.1)]
pc.effects.table[,var.beta:=exp(var.beta)]

rm(test.lm.init.common, test.lm.init.rare, coef.lm.init.common, coef.lm.init.rare, coef.lm.init, phenotypes)
```

##### Separate PCs

```{r All PCs}

results.pcs <- lm.results.table[name == "results.pcs"]

get.pc.meta.val <- function(n) {
  
  meta.table <- results.pcs[num.pcs == n]
  
  meta.analy.var <- metagen(var.beta,
                            var.stderr,
                            studlab = variant.type,
                            method.tau = "SJ",
                            sm = "OR",
                            data = meta.table)
  
  return(list(meta.analy.var$TE.fixed,
              meta.analy.var$seTE.fixed,
              meta.analy.var$pval.fixed
              ))
  
}

results.pcs.meta <- unique(results.pcs[,c("maf","gene.list","y.var","sex","num.pcs")])
results.pcs.meta[,c("var.beta","var.stderr","var.p"):=get.pc.meta.val(num.pcs),by=1:nrow(results.pcs.meta)]
results.pcs.meta[,PCs:=paste0("..PC",num.pcs)]
results.pcs.meta[,PCs:=if_else(PCs == "..PC10","PC1..PC10",PCs)]
results.pcs.meta[,pc.type:="common"]

results.pcs.rare <- lm.results.table[name == "results.rare.pcs"]

get.pc.meta.val <- function(n) {
  
  meta.table <- results.pcs.rare[num.rare.pcs == n]
  
  meta.analy.var <- metagen(var.beta,
                            var.stderr,
                            studlab = variant.type,
                            method.tau = "SJ",
                            sm = "OR",
                            data = meta.table)
  
  return(list(meta.analy.var$TE.fixed,
              meta.analy.var$seTE.fixed,
              meta.analy.var$pval.fixed
              ))
  
}

results.pcs.rare.meta <- unique(results.pcs.rare[,c("maf","gene.list","y.var","sex","num.rare.pcs")])
results.pcs.rare.meta[,c("var.beta","var.stderr","var.p"):=get.pc.meta.val(num.rare.pcs),by=1:nrow(results.pcs.rare.meta)]
results.pcs.rare.meta[,PCs:=paste0("..PC",num.rare.pcs)]
results.pcs.rare.meta[,PCs:=if_else(PCs == "..PC10","PC1..PC10",PCs)]
setnames(results.pcs.rare.meta,"num.rare.pcs","num.pcs")
results.pcs.rare.meta[,pc.type:="rare"]

results.pcs.meta <- rbind(results.pcs.meta,
                          results.pcs.rare.meta)

results.pcs.meta[,log.p:=-1*log(var.p,10)]
results.pcs.meta[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
results.pcs.meta[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
results.pcs.meta[,sig.pos:=if_else(var.beta<0,var.ci.lower-0.1,var.ci.upper+0.1)]
results.pcs.meta[,var.beta:=exp(var.beta)]

```

## 5E. Partner at Home

### Main Regression

```{r Partner Regression, fig.height=7, fig.width=8}
results.partner <- lm.results.table[name == "results.partner" | name == "results.lives.alone"]

plot.result(results.partner,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

## 5F. Cognition

### Main Regression

This code tests the overall effect on cognition when using my filter.

```{r Cognition Linear Regression, fig.height=7, fig.width=8}
results.cog <- lm.results.table[name == "results.cog"]

plot.result(results.cog,F,20,-1.4,0.5,"# of Individuals\nWith sHET\n> 0.15")
```

## 5G. Educational Attainment

### Main Regression

```{r Educational Attainment Regression, fig.height=7, fig.width=8}
results.ea <- lm.results.table[name == "results.ea"]

plot.result(results.ea,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

## 5H. Mental Health

### Fertility Ratios

Here we attempt to replicate the result from [Power et al.](https://jamanetwork.com/journals/jamapsychiatry/article-abstract/1390257), which identified differential fertility rates among carriers/non-carriers.

```{r Fertility Ratios, fig.height=4, fig.width=15}

fertility.ratios <- data.table(crossing(phenotype=gsub("hes.","",names(UKBB.phenotype.data)[grep("hes",names(UKBB.phenotype.data))]),
                                        sex=c(1,2),
                                        data.source=c("mhq","hes","fi")))
fertility.ratios <- fertility.ratios[(phenotype == "developmental_disorder" & data.source == "mhq") == F]
fertility.ratios <- fertility.ratios[phenotype != "fert"]

calc.mean.fertility <- function(sex, phenotype, data.source) {
  
  if (phenotype == "infertility" & (sex == 2 | data.source == "mhq")) {
    return(list(1.0,1.0,1L,1.0,1.0,1L,1.0,0.0,0.0,1.0))
  } else {
    col <- paste(data.source,phenotype,sep=".") 
    if (sex == 1) {
      relevant.fertility <- "children.fathered"
    } else {
      relevant.fertility <- "live.births"
    }
    
    cols <- c(relevant.fertility,col,"agePulse","agePulse.squared","PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")
    table.to.use <- UKBB.phenotype.data[sexPulse == sex &!is.na(get(col)) & !is.na(get(relevant.fertility)),..cols]
    
    covariates <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","agePulse.squared","agePulse")
    
    cov.string <- paste(covariates, collapse=" + ")
    formated.formula <- as.formula(paste(relevant.fertility, cov.string,sep=" ~ "))
  
    test.lm <- glm(formated.formula, data=table.to.use, family="gaussian")
    resid.test <- augment(test.lm) %>% data.table()
    resid.test[,eval(col):=table.to.use[,get(col)]]
    setnames(resid.test,".fitted","fitted")
    resid.test[,corrected:=`.resid` - min(resid.test[,`.resid`])]
    formated.formula <- as.formula(paste("corrected",paste0("as.factor(",col,")"),sep=" ~ "))
    ratio.test <- ttestratio(formated.formula,data=resid.test,base=1)
  
    res <- table.to.use[,list(mean(get(relevant.fertility),na.rm=T),sd(get(relevant.fertility),na.rm=T)),by=col]
    return(list(res[get(col)==0,V1],
                res[get(col)==0,V2],
                nrow(UKBB.phenotype.data[sexPulse==sex & get(col) == 0]),
                res[get(col)==1,V1],
                res[get(col)==1,V2],
                nrow(UKBB.phenotype.data[sexPulse==sex & get(col) == 1]),
                ratio.test$estimate[3],
                ratio.test$conf.int[1],
                ratio.test$conf.int[2],
                ratio.test$p.value))
  }
}

fertility.ratios[,c("mean.children.unaffected","sd.children.unaffected","n.unaffected","mean.children.affected","sd.children.affected","n.affected","ratio","ci.lower","ci.upper","p.val"):=calc.mean.fertility(sex,phenotype,data.source),by=1:nrow(fertility.ratios)]

fertility.ratios[,sex:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]
test <- copy(fertility.ratios)
## Set Arrows!
test[,ci.lower.symbol:=if_else(ci.lower<0,25,NaN)]
test[,ci.lower:=if_else(ci.lower<0,0,ci.lower)]
test[,ci.upper.symbol:=if_else(ci.upper>1.4,24,NaN)]
test[,ci.upper:=if_else(ci.upper>1.4,1.4,ci.upper)]

test[,sig:=if_else(p.val <= 0.05/28, "*", "")]
test[,sig.pos:=if_else(ratio<1,ci.lower-0.03,ci.upper+0.03)]

test[,ratio:=if_else(phenotype == "infertility" & (sex == "Female" | data.source == "mhq"), NaN, ratio)]

mhq <- ggplot(test[data.source=="mhq"],aes(phenotype,ratio,group=sex,colour=sex)) +
  geom_point(position = position_dodge(width=0.5)) +
  geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),position=position_dodge(width=0.5),width=0) +    
  geom_text(aes(y=sig.pos,label=sig),position=position_dodge(width=0.8),size=5) +
  ylim(0,1.4) +
  geom_point(aes(y=ci.lower,shape=ci.lower.symbol,fill=sex),position=position_dodge(width=0.5)) +
  geom_point(aes(y=ci.upper,shape=ci.upper.symbol,fill=sex),position=position_dodge(width=0.5)) +
  scale_shape_identity() +
  theme.legend +
  ggtitle("MHQ") +
  coord_flip()

icd <- ggplot(test[data.source=="hes"],aes(phenotype,ratio,group=sex, colour=sex)) +
  geom_point(position = position_dodge(width=0.5)) +
  geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),position=position_dodge(width=0.5),width=0) +
  ylim(0,1.4) +
  geom_text(aes(y=sig.pos,label=sig),position=position_dodge(width=0.8),size=5) +
  theme.legend +
  geom_point(aes(y=ci.lower,shape=ci.lower.symbol,fill=sex),position=position_dodge(width=0.5)) +
  geom_point(aes(y=ci.upper,shape=ci.upper.symbol,fill=sex),position=position_dodge(width=0.5)) +
  scale_shape_identity() +
  ggtitle("HES") +
  coord_flip() +
  scale_x_discrete(name = "")
  # theme(axis.text.y=element_blank())

fi <- ggplot(test[data.source=="fi"],aes(phenotype,ratio,group=sex, colour=sex)) +
  geom_point(position = position_dodge(width=0.5)) +
  geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),position=position_dodge(width=0.5),width=0) +
  ylim(0,1.4) +
  geom_text(aes(y=sig.pos,label=sig),position=position_dodge(width=0.8),size=5) +
  theme.legend +
  geom_point(aes(y=ci.lower,shape=ci.lower.symbol,fill=sex),position=position_dodge(width=0.5)) +
  geom_point(aes(y=ci.upper,shape=ci.upper.symbol,fill=sex),position=position_dodge(width=0.5)) +
  scale_shape_identity() +
  ggtitle("CHOD") +
  coord_flip() +
  scale_x_discrete(name = "")
  # theme(axis.text.y=element_blank())

mhq + icd + fi + plot_layout(ncol = 3, guides = "collect")

rm(mhq,icd,fi)
```
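The core of `calc.mean.fertility` is a covariate-corrected ratio of mean fertility between two groups. The steps can be sketched on simulated data as below. This is a simplified illustration: it only computes the point estimate (the confidence interval and p-value in the chunk above come from `ttestratio`), it uses fewer covariates, and the direction of the ratio (affected over unaffected) is my assumption about how `base=1` is interpreted.

```r
set.seed(42)
n <- 1000

## Simulated data; values and effect sizes are hypothetical
age      <- runif(n, 40, 70)
affected <- rbinom(n, 1, 0.1)
children <- rpois(n, lambda = ifelse(affected == 1, 1.2, 1.8))

## Step 1: regress out covariates (the real model also includes PC1..PC10)
resid.fit <- lm(children ~ age + I(age^2))

## Step 2: shift residuals so that the smallest value is 0,
## mirroring the `corrected` column in the chunk above
corrected <- residuals(resid.fit) - min(residuals(resid.fit))

## Step 3: ratio of mean corrected fertility between the two groups
ratio <- mean(corrected[affected == 1]) / mean(corrected[affected == 0])
ratio
```

A ratio below 1 would indicate that the affected group has fewer children on average than the unaffected group after correcting for the covariates.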

### Main Regressions

The conclusion from the above is that the ICD-10 data are too sparse for us to use them effectively. It also looks like UKBB participants are, in general, healthier than the population as a whole (at least when comparing to Power et al.), which is not surprising. With that in mind, we think we can only really do two different regressions:

* A binary for whether an individual has a disorder that has previously been shown to be associated with [rare variant burden](https://www.sciencedirect.com/science/article/pii/S0002929718301630):
    +  Schizophrenia, Autism, ADHD, Bipolar, ID/DD

This code block adds a binary value for having any of those phenotypes (`mht.binary`) to our primary phenotype table. We also test the impact of s~het~ burden on the individual disorders merged together in our `mht` binary score.

Note: I don't plot the results directly here as I do a custom figure creation in the Supplementary Figures section.

```{r MHQ regression, fig.height=7, fig.width=8}
results.mht <- lm.results.table[name == "results.mht"]

plot.result(results.mht[y.var == "mht.binary"],T,20,0,75,"# of Individuals\nWith sHET\n> 0.15")
```
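The `mht.binary` variable described above is an "any of" indicator over the individual disorders. A minimal sketch of how such a variable can be built is shown here. The per-disorder column names are hypothetical, and the handling of missing data (NA only when nothing at all was recorded) is an assumption for illustration rather than a description of the actual phenotype processing.

```r
## Toy phenotype table; the per-disorder column names are hypothetical
pheno <- data.frame(
  eid           = 1:5,
  schizophrenia = c(0, 1, 0, NA, 0),
  autism        = c(0, 0, 0, NA, 0),
  adhd          = c(0, 0, 1, NA, 0),
  bipolar       = c(0, 0, 0, NA, NA),
  id.dd         = c(0, 0, 0, NA, 0)
)
disorders <- c("schizophrenia", "autism", "adhd", "bipolar", "id.dd")

## Count recorded diagnoses and non-missing entries per individual
n.cases    <- unname(rowSums(pheno[, disorders], na.rm = TRUE))
n.observed <- unname(rowSums(!is.na(pheno[, disorders])))

## 1 if any disorder is recorded, 0 if none is, NA if nothing was recorded at all
pheno$mht.binary <- ifelse(n.observed == 0, NA, as.integer(n.cases > 0))
pheno[, c("eid", "mht.binary")]
```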

### Exploring Issues with MHQ Data

#### Only test non-carriers

If we remove MH patients from our model, do we still see an effect?

```{r Remove MH Patients, fig.height=7, fig.width=8}

results.fertility.no.mhq <- lm.results.table[name == "results.fertility.no.mhq"]

plot.result(results.fertility.no.mhq,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

#### Did Or Did Not Answer the MHQ

```{r Answered MHQ Regression, fig.height=7, fig.width=8}

results.answered.mhq <- lm.results.table[name == "results.answered.mhq"]

plot.result(results.answered.mhq,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

#### Has an Email

```{r Email Regression, fig.height=8, fig.width=7}

results.email <- lm.results.table[name == "results.email"]

plot.result(results.email,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

#### Has GP Records

This is just to show that when we use the binary for having GP records, we don't see the same bias that we see for respondents to the MHQ.

```{r has GP records}

results.has.CHOD <- lm.results.table[name == "results.has.CHOD"]

plot.result(results.has.CHOD,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

## 5I. Household Income

### Main Regression

```{r household income, fig.height=7, fig.width=8}

## First do our standard linear model:
results.household.income <- lm.results.table[name == "results.household.income"]

plot.result(results.household.income,F,20,-1.5,0.5,"# of Individuals\nWith sHET\n> 0.15")
```

## 5J. Same Sex Sexual Behaviour

### Main Regression

```{r same sex, fig.height=7, fig.width=8}

## First do our standard linear model:
results.same.sex <- lm.results.table[name == "results.same.sex"]

plot.result(results.same.sex,T,20,0,5,"# of Individuals\nWith sHET\n> 0.15")

```

### Exclude Same Sex Behaviour Individuals

```{r Exclude same sex behaviour individuals, fig.height=7, fig.width=8}

results.fertility.no.same.sex <- lm.results.table[name == "results.fertility.no.same.sex"]

plot.result(results.fertility.no.same.sex,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")
```

## 5K. Sexual Behaviour Phenotypes

### Main Regression

```{r Sexual Behaviour, fig.height=7, fig.width=8}

## Had sex at all
results.had.sex <- lm.results.table[name == "results.had.sex"]

plot.result(results.had.sex,T,20,0,1.4, "# of Individuals\nWith sHET\n> 0.15")

```

## 5L. Townsend Deprivation Index

### Main Regression

```{r townsend, fig.height=8, fig.width=8.5}

results.townsend <- lm.results.table[name == "results.townsend"]
plot.result(results.townsend,F,20,-1,4, "# of Individuals\nWith sHET\n> 0.15")

```

## 5M. Neutral Phenotypes

### Main Regression

```{r Neutral Phenotypes, fig.height =7, fig.width=8}

results.fruit <- lm.results.table[name == "results.fruit"]
plot.result(results.fruit,F,20,-1,1, "# of Individuals\nWith sHET\n> 0.15")

results.handedness <- lm.results.table[name == "results.handedness"]
plot.result(results.handedness,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

results.hair <- lm.results.table[name == "results.hair"]
plot.result(results.hair,T,20,0,1.4,"# of Individuals\nWith sHET\n> 0.15")

```

# 6. Correcting Childlessness for ICD10 Codes

## 6A. Loading the ICD-10 Tree

This code snippet just loads in the icd10 tree so we can build labels on our data.tables and figures.

```{r load icd10 tree}

## Load the ICD-10 tree
icd.codes <- fread("rawdata/phewas/icd10_tree.tsv")

## Factorize chapters for plotting
chapters <- c("I","II","III","IV","V","VI","VII","VIII","IX","X","XI","XII","XIII","XIV","XV","XVI","XVII","XVIII","XIX","XX","XXI","XXII")
chapters <- paste("Chapter", chapters, sep = " ")

chapters.table <- icd.codes[grepl("Chapter",coding),c("coding","meaning")]
chapters.table[,chapter:=factor(coding,levels=chapters)]
chapters.table[,rn:=str_remove(chapter,"Chapter ")]
setkey(chapters.table,chapter)

```

## 6B. Running Logistic Models

For now, we take a (relatively) naive approach based on the logic of [TreeWAS](https://www.nature.com/articles/ng.3926). This means that we not only test individual codes (like A00), but also groups of related codes (like A00-A09) and entire chapters (like Chapter I). For Hospital Episode Statistics (HES) data, we also test more specific codes (like A00.1), since those data have that granularity. Complete Health Outcomes Data (CHOD) are limited to the disease level (i.e. A00), so we only test at that level. In short, we are just running the model:

$$ has.children \sim s_{het[i,v]} + has.icd.code + age + age^2 + PC1..PC10 $$
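For a single ICD-10 code, fitting this model looks roughly like the sketch below. It uses simulated data and made-up effect sizes, and it is not a copy of `./scripts/run_logistic_model.R`; the column names are assumptions chosen to match the terms in the formula.

```r
set.seed(7)
n <- 3000

## Simulated individuals; all values are hypothetical
dat <- data.frame(
  shet.burden  = rexp(n, rate = 10),
  has.icd.code = rbinom(n, 1, 0.05),
  age          = runif(n, 40, 70)
)
dat$age.squared <- dat$age^2

## Ten stand-in ancestry PCs
pcs <- matrix(rnorm(n * 10), ncol = 10,
              dimnames = list(NULL, paste0("PC", 1:10)))
dat <- cbind(dat, pcs)

dat$has.children <- rbinom(
  n, 1, plogis(1.5 - 1.0 * dat$shet.burden - 0.5 * dat$has.icd.code)
)

## Build the formula in the same way as the rest of this document
covariates <- c("shet.burden", "has.icd.code", "age", "age.squared",
                paste0("PC", 1:10))
model.formula <- as.formula(
  paste("has.children", paste(covariates, collapse = " + "), sep = " ~ ")
)
fit <- glm(model.formula, data = dat, family = "binomial")

## Log-odds estimates for the burden term and the ICD-10 term
coef(summary(fit))[c("shet.burden", "has.icd.code"),
                   c("Estimate", "Std. Error", "Pr(>|z|)")]
```

The two rows extracted at the end correspond to the `var.*` and `icd.*` columns that are read back in later in this section.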

Because we need to test 19,194 separate codings/coding blocks, this method is split into three parts to allow for parallelization: 

1. Writing the input files
2. Running the actual jobs (with the script `./scripts/run_logistic_model.R`)
    + This script runs the primary logistic model defined above, and also meta-analyses the combined PTV and DEL results.
3. Reading back in and getting final tables for analysis.

For the purposes of this R document and reproducibility, we have provided the results of our models in this repository at `./rawdata/phewas/`. If you want to run the models yourselves, please make use of the code below, but adjust to your own computer cluster. We have provided a script at `./scripts/run_logistic_model.R` which you can use to perform all possible logistic regressions. 
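To illustrate the hierarchical testing described above, the sketch below walks from a specific code up through its parents, so that an individual with A00.1 would also count as a case for A00, A00-A09 and Chapter I. The toy tree and the `parent_id` column are assumptions for illustration; check them against the actual `icd10_tree.tsv` before reusing this.

```r
## Toy slice of an ICD-10 tree; node_id / parent_id layout is assumed
icd.tree <- data.frame(
  coding    = c("Chapter I", "A00-A09", "A00", "A00.1"),
  node_id   = c(1, 2, 3, 4),
  parent_id = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

## Return a code together with all of its ancestors, most specific first
get.ancestors <- function(code, tree) {
  out <- character(0)
  current <- match(code, tree$coding)
  while (!is.na(current)) {
    out <- c(out, tree$coding[current])
    ## Move to the parent node; match() returns NA once we pass the root
    current <- match(tree$parent_id[current], tree$node_id)
  }
  out
}

get.ancestors("A00.1", icd.tree)
```

Each level returned by this helper is then a separate `has.icd.code` definition for the logistic model.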

### Preparing Input Data

This section prepares input for the `run_logistic_model.R` script. It is not necessary to run this if you are using our provided data tables. If you do use this, make sure that you adjust the paths within this script for the ICD-10 tree and the rdat files that are created.

```{r PhewasModel, fig.height=6, fig.width=8}

## Make HES Table:
# Convert individual level ICD data into format for TreeWAS-esque logit model:
disease.table <- hes.data.long[,c("eid","icd.code")]
# Add a dummy variable for age that will always be included so that we don't need two scripts for HES and CHOD codes.
disease.table[,age.at.incidence:=0]
saveRDS(disease.table,"rawdata/phewas/samples.hes.rdat")

## Make CHOD Table:
# Convert individual level ICD data into format for TreeWAS/logit model:
# Has to be a TSV here since that's what TreeWAS requires
disease.table <- processed.CHOD[,c("eid","code","age.at.incidence")]
setnames(disease.table,"code","icd.code")
saveRDS(disease.table,"rawdata/phewas/samples.chod.rdat")

```

### Running Model Scripts

These are simply given as an example of how these were run on the Sanger computer cluster and are not intended to be run here. 

```{bash Run logit PheWAS, eval = F}

## HES Models:
bsub -q normal -M 3000 -o gridout/logistic.%J.%I -J 'LOG[1-19154]%500' './run_logistic_model.R rawdata/phewas/samples.hes.rdat 0 100 FALSE outfiles_hes/'
cat outfiles_hes/glm.*.out > glm.hes.out

## CHOD Models:
bsub -q normal -M 2500 -o gridout_fi/logistic.%J.%I -J 'LOGFI[1-19154]%500' './run_logistic_model.R rawdata/phewas/samples.chod.rdat 0 100 FALSE outfiles_chod/'
cat outfiles_chod/glm.*.out > glm.chod.out
```

### Loading Data

```{r icd}

build.ICD.table <- function(file) {
 
  data <- fread(file)

  setnames(data,names(data),c("coding","meaning","node_id","sex","variant.type","var.est","var.err","var.p","icd.est","icd.err","icd.p","N","N.cases","chapter","level"))
  
  data[,chapter:=factor(chapter,levels=chapters.table[,chapter],labels = chapters.table[,rn])]
  
  data[,var.err.upper:=exp(var.est + (1.96*var.err))]
  data[,var.err.lower:=exp(var.est - (1.96*var.err))]
  data[,var.or:=exp(var.est)]
  
  data[,icd.err.upper:=exp(icd.est + (1.96*icd.err))]
  data[,icd.err.lower:=exp(icd.est - (1.96*icd.err))]
  data[,icd.or:=exp(icd.est)]
  
  data[,var.p.log:=-1*log(var.p,10)]
  data[,icd.p.log:=-1*log(icd.p,10)]
  
  ## This just adds a +/- value for which direction the OR is for plotting later
  data[,factor:=if_else(icd.est>0, 1, -1)]
  
  return(data)
   
}

## Load completed LMs
hes.analysis.table <- build.ICD.table("rawdata/phewas/glm.hes.out")
fi.analysis.table <- build.ICD.table("rawdata/phewas/glm.chod.out")

```
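
The transformation applied to each estimate in `build.ICD.table()` can be checked by hand. The following is a minimal sketch in base R; the estimate, standard error and p-value are made up for illustration and are not values from the model output.

```r
## Hypothetical log-odds estimate and standard error (not taken from the PheWAS output)
est <- 0.25
err <- 0.10

or       <- exp(est)                  # odds ratio
ci.lower <- exp(est - (1.96 * err))   # lower bound of the 95% confidence interval
ci.upper <- exp(est + (1.96 * err))   # upper bound of the 95% confidence interval
p.log    <- -1 * log(0.001, 10)       # -log10 of a hypothetical p-value of 0.001

round(c(or = or, lower = ci.lower, upper = ci.upper, p.log = p.log), 3)
```

Because the interval is computed on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio.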

## 6C. Comparing Data Sources

```{r Comparing HES and CHOD data}

phewas.compare <- merge(fi.analysis.table[level == 3 & variant.type == "META",c("coding","meaning","sex","chapter","var.p.log","icd.p.log")], hes.analysis.table[level == 3 & variant.type == "META",c("coding","meaning","sex","var.p.log","icd.p.log")], by = c("coding","sex","meaning"), suffixes = c(".fi",".hes"))

ggplot(phewas.compare[sex == "MALE" & !is.na(var.p.log.fi)], aes(var.p.log.fi,var.p.log.hes,colour=chapter)) + 
  geom_point(size = 0.5) + 
  xlab("CHOD -log10 p.value") +
  ylab("HES -log10 p.value") +
  geom_text(data = phewas.compare[sex == "MALE" & !is.na(var.p.log.fi) & var.p.log.fi < 7.5 & var.p.log.hes < 12.85], aes(label = meaning),size=2) +
  theme.legend
ggplot(phewas.compare[sex == "MALE" & !is.na(var.p.log.fi)], aes(icd.p.log.fi,icd.p.log.hes,colour=chapter)) + 
  geom_point(size = 0.5) + 
  xlab("CHOD -log10 p.value") +
  ylab("HES -log10 p.value") +
  geom_text(data = phewas.compare[sex == "MALE" & !is.na(var.p.log.fi) & icd.p.log.fi > 12], aes(label = meaning),size=2) +
  theme.legend
```

## 6D. Infertility Codes

### Main Regression

Here we are checking the impact of s~het~ burden on having any (male or female) fertility code with both the CHOD data and HES data independently. We restrict our analyses of CHOD individuals to those with complete GP data to make sure we don't introduce additional noise. It is unclear whether the noise added by including individuals without GP data would be greater than, less than, or equal to the loss in power from removing 50% of individuals, but this restriction is the easiest to conceptualize.

```{r MIC, fig.height =7, fig.width=8}

## CHOD Analyses
results.fertility.MIC.CHOD <- lm.results.table[name == "results.fertility.MIC.CHOD"]

plot.result(results.fertility.MIC.CHOD,F,20,-1,10, "# of Individuals\nWith sHET\n> 0.15")

```

### Additional Analyses

#### Childlessness + Infertility Code

This just pulls the ORs out of our PheWAS tables and plots them nicely for infertility codings for males (N46) vs females (N97).

```{r Male Infertility Codes, fig.height=5, fig.width=10}

columns <- c("coding","variant.type","var.or","var.err.lower","var.err.upper","var.p","icd.est","icd.err","icd.or","icd.err.lower","icd.err.upper","icd.p","N","sex")

male.codings.fi <- fi.analysis.table[coding=="N46" & sex == "MALE",..columns]
male.codings.fi[,data:="CHOD"]

male.codings.icd <- hes.analysis.table[coding=="N46" & sex == "MALE",..columns]
male.codings.icd[,data:="ICD"]

female.codings.fi <- fi.analysis.table[coding=="N97" & sex == "FEMALE",..columns]
female.codings.fi[,data:="CHOD"]

female.codings.icd <- hes.analysis.table[coding=="N97" & sex == "FEMALE",..columns]
female.codings.icd[,data:="ICD"]

infertility.codings <- rbind(male.codings.fi, male.codings.icd,female.codings.fi,female.codings.icd)
infertility.codings[,sexPulse:=factor(sex,levels=c("MALE","FEMALE"),labels=c("Male","Female"))]

male.ors.plot <- ggplot(infertility.codings[variant.type == "META"], aes(sexPulse, icd.or, group = interaction(data,sexPulse), linetype = data, colour = sexPulse)) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "OR for the Impact of Having an\nInfertility Code (N46/N97) on Childlessness") +
  geom_hline(yintercept = 1, colour = "red", linetype = 2) +
  geom_point(position = position_dodge(0.5), size = 3) +
  geom_errorbar(aes(ymin = icd.err.lower, ymax = icd.err.upper),position = position_dodge(0.5), width = 0, size = 1) +
  scale_linetype_discrete(guide=guide_legend(title="Data Source")) +
  sex.colours.colour +
  coord_flip() + 
  theme.legend + theme(panel.grid.major.y=element_blank())

male.counts.plot <- ggplot(infertility.codings[variant.type == "META"], aes(sexPulse, N, group = interaction(data,sexPulse), linetype = data)) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "# of Partic.") +
  geom_col(aes(fill = sexPulse), position = position_dodge(0.5),width=0.5,size=1, colour = "black") +
  coord_flip() +
  sex.colours.fill +
  scale_linetype_discrete(guide=guide_legend(title="Data Source")) +
  theme + theme(panel.grid.major.y = element_blank(), axis.text.y = element_blank())
  
male.ors.plot + male.counts.plot + plot_layout(ncol = 2, widths = c(3,1), guides = 'collect')

```

# 7. Modulation of Traits by Variant Burden

This section of code is what was used to estimate the contribution of each of our measured traits to overall childlessness and fitness. We first determine the contribution of [Fertility Alone](#7c._fertility_alone) and then do the same for other traits except for household income (due to issues with how the trait was recorded).

Each section other than the first four includes subheadings:

1. For estimating the effect of a trait on childlessness alone through a generalised linear model.
2. For estimating the effect of a trait on childlessness and overall fitness.

**Note**: All plotting of estimated fitness is done when actually generating [Figures](#7._figures).

## 7A. Function for Testing Simple Regressions Via GLM

This section is used for all traits to address the simple regression of:

$$childlessness \sim phenotype + age + age^2 + PC_{1} + \dots + PC_{10}$$

The function takes a flag, `inc.PCs`, that controls whether PCs are included (they are excluded by default). This lets us use the models generated by the function to estimate the contribution of a phenotype to childlessness in cases where PCs are too complex to simulate. All ORs/effect sizes reported in the manuscript for the effect of a trait on childlessness include PCs and are calculated when creating the Supplementary Figure in which they are reported.

This will also return a 'fit model' for the expected trait that we can then feed into our simulations.

```{r simple lm function for childlessness x phenotype}

run.lm <- function(sex, x.var, add.covars=c(), inc.PCs = F, return.model = T) {

  y.var <- "num.children"
  phenotypes <- UKBB.phenotype.data[sexPulse == sex & !is.na(get(y.var)) & !is.na(get(x.var))]
  
  ## Make sure none of our additional covariates are NA
  for (cov in add.covars) {
    if(grepl("\\*",cov) == F) {
      phenotypes <- phenotypes[!is.na(get(cov))]
    }
  }
  
  phenotypes[,binary.stat:=if_else(get(y.var) > 0,1,0)]
  
  covariates <- c(x.var,"agePulse.squared","agePulse",add.covars)
  if (inc.PCs == T) {
    covariates <- c(covariates,paste0("PC",seq(1,40)),paste0("scaled.rare.PC",seq(1,100)))
  }
  cov.string <- paste(covariates, collapse=" + ")
  form <- as.formula(paste("binary.stat", cov.string,sep=" ~ "))

  test.lm <- glm(form, data=phenotypes[sexPulse == sex], family = "binomial")
  coef.lm <- tidy(test.lm) %>% data.table()
  resid.table <- augment(test.lm) %>% data.table()
  resid.table[,sexPulse:=sex]
  
  ## Use logical indexing so covariates are kept even when no interaction term ("*") is present
  cols <- c(".resid",x.var,"sexPulse",add.covars[!grepl("\\*",add.covars)])
  
  if (return.model == T) {
    return(list(coef.lm[term==eval(x.var),estimate],
                coef.lm[term==eval(x.var),std.error],
                coef.lm[term==eval(x.var),p.value],
                list(resid.table[,..cols]),
                list(test.lm)))
  } else {
    return(list(coef.lm[term==eval(x.var),estimate],
                coef.lm[term==eval(x.var),std.error],
                coef.lm[term==eval(x.var),p.value],
                list(resid.table[,..cols]),
                list()))
  }
  
}

run.glm <- function(sex, x.var, add.covars=c(), inc.PCs = F, return.model = T) {

  y.var <- "num.children"
  phenotypes <- UKBB.phenotype.data[sexPulse == sex & !is.na(get(y.var)) & !is.na(get(x.var))]
  
  ## Make sure none of our additional covariates are NA
  for (cov in add.covars) {
    if(grepl("\\*",cov) == F) {
      phenotypes <- phenotypes[!is.na(get(cov))]
    }
  }
  
  covariates <- c(x.var,"agePulse.squared","agePulse",add.covars)
  if (inc.PCs == T) {
    covariates <- c(covariates,paste0("PC",seq(1,40)),paste0("scaled.rare.PC",seq(1,100)))
  }
  cov.string <- paste(covariates, collapse=" + ")
  form <- as.formula(paste(y.var, cov.string,sep=" ~ "))

  test.lm <- glm(form, data=phenotypes[sexPulse == sex], family = "quasipoisson")
  coef.lm <- tidy(test.lm) %>% data.table()
  resid.table <- augment(test.lm) %>% data.table()
  resid.table[,sexPulse:=sex]
  
  ## Use logical indexing so covariates are kept even when no interaction term ("*") is present
  cols <- c(".resid",x.var,"sexPulse",add.covars[!grepl("\\*",add.covars)])
  
  if (return.model == T) {
    return(list(coef.lm[term==eval(x.var),estimate],
                coef.lm[term==eval(x.var),std.error],
                coef.lm[term==eval(x.var),p.value],
                list(resid.table[,..cols]),
                list(test.lm)))
  } else {
    return(list(coef.lm[term==eval(x.var),estimate],
                coef.lm[term==eval(x.var),std.error],
                coef.lm[term==eval(x.var),p.value],
                list(resid.table[,..cols]),
                list()))
  }
  
}

```
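
The formula-building step in `run.lm()` and `run.glm()` can be tried on synthetic data. The sketch below is a minimal illustration in base R: the data frame `toy` and all of its values are made up, and only the covariate names echo the ones used above. The last lines compare two ways of dropping interaction terms from a covariate vector, which behave differently when no term contains `*`.

```r
set.seed(1)

## Synthetic data; the column names mirror those used above, the values are made up
n <- 500
toy <- data.frame(trait            = rbinom(n, 1, 0.5),
                  agePulse         = runif(n, 40, 70),
                  partner.in.house = rbinom(n, 1, 0.7))
toy$agePulse.squared <- toy$agePulse^2
toy$binary.stat      <- rbinom(n, 1, plogis(-0.5 + 0.8 * toy$trait))

## Build the formula from a character vector of covariates
add.covars <- c("partner.in.house", "partner.in.house*trait")
covariates <- c("trait", "agePulse.squared", "agePulse", add.covars)
form <- as.formula(paste("binary.stat", paste(covariates, collapse = " + "), sep = " ~ "))

## Logistic regression, then pull the estimate, standard error and p-value for the trait
fit <- glm(form, data = toy, family = "binomial")
coef(summary(fit))["trait", c("Estimate", "Std. Error", "Pr(>|z|)")]

## Dropping interaction terms when none are present:
no.interaction <- c("partner.in.house")
no.interaction[!grepl("\\*", no.interaction)]  # logical indexing keeps "partner.in.house"
no.interaction[-grep("\\*", no.interaction)]   # negative indexing returns character(0)
```

The second form fails silently because `grep()` returns an empty integer vector when nothing matches, and indexing with an empty vector selects nothing.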

## 7B. Calculating Base Fertility Statistics for UKBB Participants

This code just generates base-level statistics for individuals in the UK Biobank. It also includes some functions for calculating the expected number of individuals with a given trait at a specified odds ratio.

For the "mean fertility in the UK" value, we downloaded the 2019 table from the [UK Office for National Statistics](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths/datasets/birthsummarytables) and simply computed the average of the "Total Fertility Rate (TFR)" column from Table 1 for years 1970-2000.

```{r Base Stats}

calc.base <- function(s, with.zero, childlessness) {
  covariates <- c("eid","sexPulse","children.fathered","live.births","has.wes")
        
  tab <- UKBB.phenotype.data[,..covariates]
  tab[,children:=if_else(sexPulse==1,children.fathered,live.births)]
  tab <- tab[!is.na(children)]
  tab[,has.children:=if_else(children>0,1,0)]
  
  ## Attach product_sHET for PTVs (LOF_HC) and, for individuals without WES, for deletions (DEL)
  final.data <- merge(tab,variant.counts[type=="LOF_HC" & allele.freq==0,c("sample_id","product_sHET","type")],by.x="eid",by.y="sample_id")
  final.data <- rbind(final.data,merge(tab[has.wes == 0],variant.counts[type=="DEL" & allele.freq==0,c("sample_id","product_sHET","type")],by.x="eid",by.y="sample_id"))
  
  if (childlessness == T) {
    return(nrow(final.data[sexPulse == s & eid %in% variant.counts[,sample_id] & children == 0 & product_sHET == 0])/nrow(final.data[sexPulse == s & eid %in% variant.counts[,sample_id] & children >= 0 & product_sHET == 0]))
  } else {
    if (with.zero == T) {
      return(final.data[sexPulse == s & eid %in% variant.counts[,sample_id] & children >= 0 & product_sHET == 0,mean(children)])
    } else {
      return(final.data[sexPulse == s & eid %in% variant.counts[,sample_id] & children > 0 & product_sHET == 0,mean(children)])
    }
  }
  
}

## Build an object of all base fertilities for Male and Female:
# 1. Base fertility only for individuals with children:
base.fertilities <- data.table(fertility = calc.base(1, F, F),inc.zero = F, sex = 1)
base.fertilities <- bind_rows(base.fertilities, data.table(fertility = calc.base(2, F, F),inc.zero = F, sex = 2))

# 2. Base fertility including all individuals:
base.fertilities <- bind_rows(base.fertilities, data.table(fertility = calc.base(1, T, F), inc.zero = T, sex = 1))
base.fertilities <- bind_rows(base.fertilities, data.table(fertility = calc.base(2, T, F), inc.zero = T, sex = 2))

## Base childlessness for all UKBB Participants
base.childlessness.male <- calc.base(1, NA, T)
base.childlessness.female <- calc.base(2, NA, T)

paste0("Base Childlessness Male             : ", sprintf("%0.1f",base.childlessness.male*100))
paste0("Base Childlessness Female           : ", sprintf("%0.1f",base.childlessness.female*100))

paste0("Base Children Among Males w/Child   : ", sprintf("%0.2f",base.fertilities[sex == 1 & inc.zero == F,fertility]))
paste0("Base Children Among Females w/Child : ", sprintf("%0.2f",base.fertilities[sex == 2 & inc.zero == F,fertility]))

paste0("Base Children Among Males          : ", sprintf("%0.2f",base.fertilities[sex == 1 & inc.zero == T,fertility]))
paste0("Base Children Among Females        : ", sprintf("%0.2f",base.fertilities[sex == 2 & inc.zero == T,fertility]))

## Array of ages to sample from:
ages <- UKBB.phenotype.data[eid %in% variant.counts[,sample_id], agePulse]

## Helper Functions:
calc.prop.indvs <- function(odds.ratio, healthy.ratio) {
  
  healthy.ratio / (odds.ratio + healthy.ratio)
  
}

simulate.proportion <- function(expected) {
  if (expected>=1) {
    return(0L)
  } else {
    return(rbinom(1,1,1-expected))
  }
}

```
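
The helper `calc.prop.indvs()` converts an odds ratio into an expected proportion. A minimal worked sketch follows, with hypothetical inputs. It assumes that `healthy.ratio` is the baseline odds of childlessness and that the odds ratio applies to the odds of having at least one child; that reading is inferred from the formula, not stated in the text.

```r
## Same definition as calc.prop.indvs() above, repeated so the sketch is self-contained
calc.prop.indvs <- function(odds.ratio, healthy.ratio) {
  healthy.ratio / (odds.ratio + healthy.ratio)
}

## Hypothetical baseline: 20% of individuals are childless
prop.affected <- 0.20
aff <- prop.affected / (1 - prop.affected)  # baseline odds of childlessness, 0.25

calc.prop.indvs(1.0, aff)  # an OR of 1 recovers the baseline proportion, 0.20
calc.prop.indvs(0.5, aff)  # halving the odds of having a child gives 1/3 childless
```

Under these assumptions the odds of having a child are `odds.ratio / aff`, so the childless proportion is `1 / (1 + odds.ratio / aff)`, which simplifies to the expression in the function.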

## 7C. Fertility Alone

Here we estimate just the effect of s~het~ on childlessness and, through that estimate, overall fitness.

```{r Fertility Alone, fig.height=4, fig.width=12}

model.fertility <- data.table()

for (s in c(1,2)) {
  
  ## Get meta-analysis OR from the original fertility LM that was calculated above in section 5C.
  meta.analy <- metagen(var.beta,
                        var.stderr,
                        studlab = variant.type,
                        method.tau = "SJ",
                        sm = "OR",
                        data = results.fertility[sex == s & maf == 0 & (variant.type == "DEL" | variant.type == "LOF_HC")])
  
  ## Get actual proportion of individuals without a child from
  ## our data generated above
  if (s == 1) {
    stat <- "children.fathered"
    prop.affected <- base.childlessness.male
  } else {
    stat <- "live.births"
    prop.affected <- base.childlessness.female
  }
  
  ## Baseline odds of childlessness; this is a constant input when calculating
  ## expected childlessness via the function calc.prop.indvs()
  aff <- prop.affected/(1-prop.affected)
  
  ## Determine expected proportions of childlessness at various shet 
  ## values based on meta-analysis OR
  for (modifier in seq(0,1,by=0.1)) {
  
    ## Calculate value for the actual OR, as well as upper and lower confidence intervals
    for (place in c("mid","lower","upper")) {
      
      ## Just modifies the actual OR (actual or upper/lower CI) that 
      ## will be used to get expected proportion of childlessness
      if (place == "mid") {
        or <- exp(meta.analy$TE.fixed*modifier)
      } else if (place == "upper") {
        or <- exp((meta.analy$TE.fixed + (1.96*(meta.analy$seTE.fixed))) * modifier)
      } else if (place == "lower") {
        or <- exp((meta.analy$TE.fixed - (1.96*(meta.analy$seTE.fixed))) * modifier)
      }
      
      ## Get proportion of individuals that are childless based off the OR
      prop.shet <- calc.prop.indvs(or, aff)
      # This works because we know that individuals w/o children are 0 and individuals with children do
      # not deviate from the population mean (see subsection in Fertility - Only Individuals With >0 Children)
      # So we obviously don't have to do (prop.shet * 0) + (1-prop.shet * base.fertility) since 0 * anything == 0 
      # (duh... but here since my brain forgot that for a second and got nervous that I did something wrong) 
      mean.children <- (1 - prop.shet) * base.fertilities[inc.zero == F & sex == s, fertility]
      
      ## Grab a mean fertility value to calculate a fertility ratio from
      if (s == 1) {
        to.use.ratio <- base.fertilities[sex == 1 & inc.zero == T, fertility]
      } else {
        to.use.ratio <- base.fertilities[sex == 2 & inc.zero == T, fertility]
      }
      
      model.fertility <- bind_rows(model.fertility,
                                   data.table(shet = modifier, 
                                              sex = s, 
                                              error = place, 
                                              mean.childlessness = 1 - prop.shet,
                                              mean.children = mean.children, 
                                              ratio = mean.children / to.use.ratio))
    
    }
  
  }
    
}

model.fertility[,mean.childlessness:=1-mean.childlessness]

model.fertility <- data.table(pivot_wider(model.fertility, names_from = error, values_from = c(mean.childlessness,mean.children,ratio)))

model.fertility[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

paste0("Contribution of sHET to Fitness (sex averaged): ",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_mid)])*100)),
       "% (",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_upper)])*100)),
       " - ",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_lower)])*100)),
       "%)")
```
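
The calculation in the loop above can be condensed into a few vectorised lines. This is a sketch under stated assumptions, not a replacement for the chunk: every input below is a placeholder rather than an estimate from the manuscript, and the log odds ratio is assumed to be expressed per unit of s~het~ for the odds of having a child.

```r
## All inputs are hypothetical placeholders
log.or       <- -0.9                # log odds ratio per unit of s_het
se           <- 0.1                 # standard error of the log odds ratio
prop.base    <- 0.2                 # baseline proportion childless
aff          <- prop.base / (1 - prop.base)   # baseline odds of childlessness
mean.w.child <- 2.3                 # mean children among individuals with children
mean.all     <- mean.w.child * (1 - prop.base) # mean children among all individuals

shet   <- seq(0, 1, by = 0.1)
sketch <- data.frame(shet = shet)

for (place in c("mid", "lower", "upper")) {
  ## Shift the log odds ratio to the point estimate or to a 95% bound, then scale by s_het
  shift <- c(mid = 0, lower = -1.96, upper = 1.96)[[place]]
  or    <- exp((log.or + (shift * se)) * shet)

  ## Expected childlessness, then expected children relative to baseline
  prop.childless <- aff / (or + aff)
  sketch[[paste0("ratio_", place)]] <- ((1 - prop.childless) * mean.w.child) / mean.all
}

## First and last rows: s_het of 0 and s_het of 1
sketch[c(1, nrow(sketch)), ]
```

At s~het~ = 0 the odds ratio is 1, so the ratio is 1 by construction; the reduction in fitness at s~het~ = 1 is one minus the ratio in the last row.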

## 7D. Fertility + Partner at Home

### Linear Model

```{r Partner Linear Model, fig.height=4, fig.width=7}

## Actual LM
sex.diff.partner <- data.table(sex=c(1,2))
sex.diff.partner[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"partner.in.house"),by=1:nrow(sex.diff.partner)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.partner[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.partner[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.partner[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.partner[,beta:=exp(beta)]
sex.diff.partner[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.partner, aes(sexPulse,beta,colour=sexPulse)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR", limits=c(0.9,6)) + 
  sex.colours.colour + 
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.partner[1,data][[1]],sex.diff.partner[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("partner.in.house","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(as.factor(partner.in.house),mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Partner In Home",labels=c("False","True")) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(sex.diff.partner)
```
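
The "tabulate the means" step used in this and the following sections can be written more compactly. The sketch below is one possible alternative, assuming `data.table` is installed; the table `toy` is synthetic and only its column names mirror `post.lm.dt`.

```r
library(data.table)
set.seed(2)

## Synthetic residuals; the values are made up
toy <- data.table(.resid           = rnorm(500),
                  partner.in.house = rbinom(500, 1, 0.7),
                  sexPulse         = sample(c("Male", "Female"), 500, replace = TRUE))

## A named list plus .N gives mean, sd and n in one step,
## without a dummy column or a call to setnames()
means <- toy[, list(mean = mean(.resid), sd = sd(.resid), n = .N),
             by = c("partner.in.house", "sexPulse")]

## Half-width of a normal-approximation 95% confidence interval for each group mean
means[, ci := (sd / sqrt(n)) * 1.96]
means
```

The result has the same columns (`mean`, `sd`, `n`, `ci`) as the `means` table built in the chunk above, so the plotting code would not need to change.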

## 7E. Fertility + Educational Attainment

### Linear Model

```{r EA Linear Model}

## Actual LM
sex.diff.ea <- data.table(sex=c(1,2))
sex.diff.ea[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"completed.college"),by=1:nrow(sex.diff.ea)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.ea[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.ea[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.ea[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.ea[,beta:=exp(beta)]
sex.diff.ea[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.ea, aes(sexPulse,beta,colour=sexPulse)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR",limits=c(-0.1,1)) + 
  sex.colours.colour +
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.ea[1,data][[1]],sex.diff.ea[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("completed.college","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(as.factor(completed.college),mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Completed College",labels=c("False","True")) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(sex.diff.ea)
```

## 7F. Fertility + Infertility Codes

### Linear Model

```{r}

sex.diff.infertility <- data.table(sex=c(1,2))
sex.diff.infertility[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"fi.fert"),by=1:nrow(sex.diff.infertility)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.infertility[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.infertility[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.infertility[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.infertility[,beta:=exp(beta)]
sex.diff.infertility[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.infertility, aes(sexPulse,beta,colour=sexPulse)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR", limits=c(0,1.5)) + 
  sex.colours.colour + 
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.infertility[1,data][[1]],sex.diff.infertility[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("fi.fert","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(as.factor(fi.fert),mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Has Infertility Code",labels=c("False","True")) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(sex.diff.infertility)
```

## 7G. Fertility + Household Income

### Linear Model

```{r HHI Linear Model, fig.height=5, fig.width=12}

## Actual LM
sex.diff.hhi <- data.table(sex=c(1,2))
sex.diff.hhi[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"household.income",add.covars=c("partner.in.house","partner.in.house*household.income")),by=1:nrow(sex.diff.hhi)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.hhi[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.hhi[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.hhi[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.hhi[,beta:=exp(beta)]
sex.diff.hhi[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.hhi, aes(sexPulse,beta,colour=sexPulse)) +
  geom_point() +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) +
  scale_y_continuous(name="OR",limits=c(-0.1,1)) +
  sex.colours.colour +
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.hhi[1,data][[1]],sex.diff.hhi[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,household.income.binned:=cut(household.income,breaks=c(seq(0,5,by=1)))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("household.income.binned","sexPulse","partner.in.house")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(household.income.binned,mean,group=interaction(sexPulse,partner.in.house),colour=sexPulse,shape=as.factor(partner.in.house))) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Household Income Bin",labels = c(0:29)) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(mean.plot, beta.plot, sex.diff.hhi)
```

## 7H. Fertility + Same Sex Sexual Behaviour

We also don't estimate increased/decreased childlessness for same sex sexual behaviour as there is no effect due to s~het~ burden.

### Linear Model

```{r Same Sex Linear Model, fig.height=5, fig.width=12}

## Actual LM
sex.diff.same.sex <- data.table(sexPulse=c(1,2))
sex.diff.same.sex[,c("beta","std.error","p.val","data","model"):=run.lm(sexPulse,"same.sex"),by=1:nrow(sex.diff.same.sex)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.same.sex[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.same.sex[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.same.sex[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.same.sex[,beta:=exp(beta)]
sex.diff.same.sex[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.same.sex, aes(sexPulse,beta,colour=sexPulse)) +
  geom_point() +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR",limits=c(-0.1,1)) + 
  sex.colours.colour +
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.same.sex[1,data][[1]],sex.diff.same.sex[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("same.sex","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(as.factor(same.sex),mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Has Had Same Sex Behaviour",labels=c("False","True")) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")
rm(mean.plot, beta.plot, sex.diff.same.sex)
```

## 7I. Fertility + Fluid Intelligence

### Linear Model

Testing the interaction of cognition and fertility via a glm of $Fertility \sim Fluid.Intel + control.covars$

```{r FI Linear Model, fig.height=4, fig.width=10}

## Actual LM
sex.diff.fluidintel <- data.table(sex=c(1,2))
sex.diff.fluidintel[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"fluid.intel"),by=1:nrow(sex.diff.fluidintel)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.fluidintel[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.fluidintel[,var.ci.lower:=exp(beta - (1.96*std.error))]
sex.diff.fluidintel[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so compare against 0
sex.diff.fluidintel[,beta:=exp(beta)]
sex.diff.fluidintel[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.fluidintel, aes(sexPulse,beta,colour=sexPulse)) +
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR", limits=c(0.8,1.5)) +
  sex.colours.colour +
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.fluidintel[1,data][[1]],sex.diff.fluidintel[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,fluid.intel.binned:=cut(fluid.intel,breaks=14)]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("fluid.intel.binned","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(fluid.intel.binned,mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Fluid Intel Score",labels = c(0:29)) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(sex.diff.fluidintel)
```

### Effect on Childlessness

This section uses paired IQ and fertility data from the Swedish birth cohort presented in the study.

#### Load Data and Fit Expected Models

We first load data for both:

1. IQ vs Mean Children: `rawdata/cognitive_data/cognitive_data_raw.txt`
2. IQ vs Increased Childlessness: `rawdata/cognitive_data/cognitive_childlessness_data_raw.txt`

The data loaded here are identical to those presented in the Supplementary Materials of the manuscript; the raw data files are simply provided at the locations above.

##### IQ and Mean Children

```{r Fit Mean Children Data}

## Need to generate a fit for the model -- first load data and correct some errors
cog.raw <- fread("rawdata/cognitive_data/cognitive_data_raw.txt")
cog.raw[,Obs:=as.integer(str_replace(Obs,",",""))]
cog.raw[,SD:=as.numeric(SD)]

## This basically takes a set of input "estimated" parameters that are reasonably close by eye and generates a set of optimized parameters for a sigmoid curve
fit.log <- nls(Mean ~ a/(1 + exp(-b * (newiq - c))), start = list(a = 1.6, b = 0.15, c = 70), data = cog.raw[newiq <= 120])

## Generate a table to predict on that also contains actual data:
cog.raw <- bind_rows(data.table(newiq = c(1:62),Obs=NA,Mean=NA,SD=NA,Min=NA,Max=NA),cog.raw,data.table(newiq = c(140:200),Obs=NA,Mean=NA,SD=NA,Min=NA,Max=NA))
cog.raw[,pred.log:=predict(fit.log,cog.raw)]
cog.raw[,ci:=(SD/sqrt(Obs))*1.96]

## Generate quick plots of actual data:
ggplot(cog.raw,aes(newiq, Mean), colour="blue") +
  geom_line(colour="blue") +
  geom_ribbon(aes(ymin=Mean-ci,ymax=Mean+ci),colour="grey",alpha=0.5) +
  scale_alpha_continuous(range=c(0,1)) +
  scale_x_continuous(name = "IQ", limits=c(0,140)) +
  scale_y_continuous(name = "Average Children", limits=c(-0.1,2)) +
  theme

## The fitted model
ggplot(cog.raw) + 
  geom_line(data = cog.raw[newiq<120],aes(x=newiq, y=Mean),colour="blue") +
  geom_line(aes(x = newiq, y=pred.log),colour="green") +
  scale_x_continuous(name = "IQ", limits=c(0,140)) +
  scale_y_continuous(name = "Average Children", limits=c(-0.1,2)) +
  theme

## The IQ Distribution
ggplot(cog.raw,aes(newiq, Obs)) +
  geom_col() +
  theme

## Generate mean/sd from the actual distributions of Swedish IQ data and UKBB Fluid Intel for Males
iq.table <- data.table(iq = cog.raw[!is.na(newiq) & !is.na(Obs),rep(newiq, Obs)])
mean.cog <- iq.table[,mean(iq)]
sd.cog <- iq.table[,sd(iq)]

mean.fi <- UKBB.phenotype.data[sexPulse == 1 & !is.na(fluid.intel), mean(fluid.intel)]
sd.fi <- UKBB.phenotype.data[sexPulse == 1 & !is.na(fluid.intel), sd(fluid.intel)]

## Plot the FI/IQ distribution
fi.table <- UKBB.phenotype.data[sexPulse == 1 & !is.na(fluid.intel),c("fluid.intel")]
fi.table[,fi.cut:=cut(fluid.intel,breaks = c(seq(mean.fi-(sd.fi*4),mean.fi-(sd.fi*1),by=sd.fi),seq(mean.fi+(sd.fi*1),mean.fi+(sd.fi*4),by=sd.fi)))]
fi.table[,dummy:=1]
means <- fi.table[,sum(dummy)/nrow(fi.table),by=fi.cut]

ggplot(means, aes(fi.cut,V1)) + 
  geom_col() + 
  xlab("Fluid Intel Bin") + 
  ylab("Proportion of Individuals") + 
  theme

iq.table[,iq.cut:=cut(iq,breaks = c(seq(mean.cog-(sd.cog*4),mean.cog-(sd.cog*1),by=sd.cog),seq(mean.cog+(sd.cog*1),mean.cog+(sd.cog*4),by=sd.cog)))]
iq.table[,dummy:=1]
means <- iq.table[,sum(dummy)/nrow(iq.table),by=iq.cut]

ggplot(means, aes(iq.cut,V1)) + 
  geom_col() + 
  xlab("IQ Bin") + 
  ylab("Proportion of Individuals") + 
  theme

paste0("Mean IQ, Swedish Data: ", sprintf("%0.0f", mean.cog))
paste0("SD   IQ, Swedish Data: ", sprintf("%0.0f", sd.cog))
```
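
The sigmoid fit in the chunk above can be tried on toy data. The sketch below is illustrative only: the `toy` table, its parameters and its noise level are invented and are not the Swedish data. It fits the same three-parameter logistic with `nls()` and then predicts outside the observed IQ range.

```r
## Hypothetical, synthetic stand-in for the IQ vs. mean-children table
set.seed(1)
toy <- data.frame(newiq = 63:120)
toy$Mean <- 1.7 / (1 + exp(-0.12 * (toy$newiq - 72))) + rnorm(nrow(toy), sd = 0.02)

## Same functional form as above: a = plateau, b = steepness, c = midpoint
toy.fit <- nls(Mean ~ a / (1 + exp(-b * (newiq - c))),
               start = list(a = 1.6, b = 0.15, c = 70),
               data = toy)

## Extrapolate to IQ values that are not in the observed range
toy.pred <- predict(toy.fit, newdata = data.frame(newiq = c(1, 100, 200)))
round(coef(toy.fit), 2)
toy.pred
```

Starting values only need to be roughly right, but `nls()` can fail to converge if they are far from the data, which is why the estimates above are chosen by eye.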

##### IQ and Childlessness

**Note**: The data provided in this section give the _increase_ in childlessness relative to a baseline at IQ = 100. All values are thus differences (±) from the childlessness at IQ 100, and we adjust for that when making our estimates.

```{r Fit Childlessness Data}

## Read in initial data:
childless.raw <- fread("rawdata/cognitive_data/cognitive_childlessness_data_raw.txt")

## Plot actual distribution
ggplot(childless.raw,aes(iq,inc.childlessness)) + 
  geom_line() + 
  geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper),alpha = 0.5) + 
  theme

## What is actual childlessness?
childless.raw[,pred.childlessness:=inc.childlessness+base.childlessness.male]

## This basically takes a set of input "estimated" parameters that are reasonably close by eye and generates a set of optimized parameters for a sigmoid curve
## We have to invert the data so that it scales to 0 properly...
childless.raw[,inv.pred.childless:=(1 - (pred.childlessness))]
fit.log <- nls(inv.pred.childless ~ a/(1 + exp(-b * (iq - c))), start = list(a = 1.6, b = 0.15, c = 70), data = childless.raw[iq <= 120])
summary(fit.log)

## Generate a table to predict on that also contains actual data:
childless.raw <- bind_rows(data.table(iq = c(1:62),inc.childlessness=NA,`std. err.`=NA,t=NA,p.val=NA,ci.lower=NA,ci.upper=NA,inv.childless=NA,inv.pred.childless=NA,pred.childlessness=NA),childless.raw,data.table(iq = c(140:200),inc.childlessness=NA,`std. err.`=NA,t=NA,p.val=NA,ci.lower=NA,ci.upper=NA,inv.childless=NA,inv.pred.childless=NA,pred.childlessness=NA))

## Flip it back the same way again:
childless.raw[,pred.log:=predict(fit.log,childless.raw)]
childless.raw[,pred.log.inv:=(1-(pred.log))-base.childlessness.male]

## Plot fitted data
ggplot(childless.raw,aes(iq,inc.childlessness)) + 
  geom_line() + 
  geom_ribbon(aes(ymin=ci.lower,ymax=ci.upper),alpha = 0.5) + 
  geom_line(aes(iq,pred.log.inv),colour="blue") + 
  theme
```
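
The invert-then-flip-back step used above is plain arithmetic, and a small round trip makes it easier to check. The baseline and the increases below are invented for illustration and are not values from the input file.

```r
## Hypothetical baseline childlessness and increases relative to IQ = 100
base.toy <- 0.20
inc.toy  <- c(0.30, 0.10, 0.00, -0.03)

## Increase -> actual childlessness -> proportion with at least one child
actual.toy   <- inc.toy + base.toy
inverted.toy <- 1 - actual.toy

## A sigmoid would be fitted to inverted.toy; here we only undo the transform
recovered.toy <- (1 - inverted.toy) - base.toy
all.equal(recovered.toy, inc.toy)
```

Inverting gives a curve that rises from 0 with IQ, which is the shape the sigmoid in the fit expects.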

#### Simulations

##### Helper Functions

The first code block defines a set of helper functions used by the simulations.

```{r FI Helper Functions}

## Simulate a set of individuals with reduced IQ and match their fertility scores
sim.cog <- function(effect.iq) {
  
  ## number to include in sample
  num.random<-100000 
  ## Use the mean and sd from our IQ distributions calculated above to simulate "healthy" individuals 
  IQ.sim <- data.table(norm.iq=round(rnorm(num.random, mean=mean.cog, sd=sd.cog)))
  
  ## Get 'drop' on IQ given effect.iq, where effect.iq 
  ## is the expected decrease in IQ given an sHET score.
  IQ.sim[,changed.iq:=round(norm.iq-effect.iq),by=1:nrow(IQ.sim)]
  
  ## This grabs the expected fertility at each IQ for both the healthy cohort and simulated sHET cohort
  IQ.sim <- merge(IQ.sim,cog.raw[,c("newiq","pred.log")],by.x="norm.iq",by.y="newiq")
  setnames(IQ.sim,"pred.log","norm.fertility")
  IQ.sim <- merge(IQ.sim,cog.raw[,c("newiq","pred.log")],by.x="changed.iq",by.y="newiq")
  setnames(IQ.sim,"pred.log","changed.fertility")
  
  ## And then just return the fertility ratio:
  return(IQ.sim[,mean(changed.fertility)/mean(norm.fertility)])

}

## Simulate a set of individuals with reduced IQ and decide if they are childless or not with random selection
## Function is very similar to above, but just for childlessness instead
sim.childlessness <- function(effect.iq, base.childlessness) {
  
  ## number to include in sample
  num.random<-100000
  ## Use the mean and sd from our IQ distributions calculated above to simulate "healthy" individuals 
  IQ.sim <- data.table(norm.iq=round(rnorm(num.random, mean=mean.cog, sd=sd.cog)))
  
  ## Get 'drop' on IQ given effect.iq, where effect.iq 
  ## is the expected decrease in IQ given an sHET score.  
  IQ.sim[,changed.iq:=round(norm.iq-effect.iq),by=1:nrow(IQ.sim)]
  IQ.sim <- merge(IQ.sim,childless.raw[,c("iq","pred.log.inv")],by.x="changed.iq",by.y="iq")
  setnames(IQ.sim,"pred.log.inv","changed.childlessness")
  
  ## This converts from an increase in childlessness to actual childlessness
  IQ.sim[,changed.childlessness:=changed.childlessness+base.childlessness]
  
  ## This removes VERY high IQ values that sometimes appear due to simulations
  IQ.sim <- na.omit(IQ.sim)
  
  ## Now simulate childlessness for each individual given changed childlessness
  IQ.sim[,has.child:=simulate.proportion(changed.childlessness),by=1:nrow(IQ.sim)]

  ## Return proportion of simulated childless individuals
  return(nrow(IQ.sim[has.child==0])/nrow(IQ.sim))

}

```

##### Actual Calculation

This section runs the actual simulations. It uses the formula $\Delta_{IQ}= \beta_{fluid.intel} \times \sigma_{IQ}$ to determine the expected change in IQ given an individual's expected drop in fluid intelligence as a function of s~het~.
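
As a quick worked example of this formula, the sketch below scales a fluid intelligence effect size by s~het~ and converts it to IQ points. Both input numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
## Hypothetical inputs: effect size (in SDs of fluid intelligence) at sHET = 1,
## and the SD of the IQ distribution
beta.toy  <- -0.2
sd.iq.toy <- 15

## Scale the effect by sHET, then convert to (absolute) IQ points
shet.toy     <- seq(0, 1, by = 0.1)
delta.iq.toy <- abs(beta.toy * shet.toy * sd.iq.toy)

data.frame(shet = shet.toy, delta.iq = delta.iq.toy)
```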

```{r FI Childlessness}

## Datatable for return:
model.cog <- data.table()

for (s in c(1,2)) {

  ## Get the meta-analysis effect size from the linear model results (results.cog) calculated in the section above.
  meta.analy <- metagen(var.beta,
                        var.stderr,
                        studlab = variant.type,
                        sm = "SMD",
                        prediction=T,
                        data = results.cog[sex == s & maf == 0 & (variant.type == "DEL" | variant.type == "LOF_HC")])
  
  ## Get the baseline proportion of childless individuals for this sex from our data generated above
  if (s == 1) {
    stat <- "children.fathered"
    prop.affected <- base.childlessness.male
  } else {
    stat <- "live.births"
    prop.affected <- base.childlessness.female
  }
  
  ## Determine expected proportions of childlessness at various shet
  ## values based on meta-analysis OR
  for (modifier in seq(0,1,by=0.1)) {
  
    ## Calculate value for the actual OR, as well as upper and 
    ## lower confidence intervals
    for (place in c("mid","lower","upper")) {
    
      ## Just modifies the actual OR (actual or upper/lower CI) that will
      ## be used to get expected proportion of childlessness
      if (place == "mid") {
        effect.fi <- meta.analy$TE.fixed*modifier
      } else if (place == "upper") {
        effect.fi <- (meta.analy$TE.fixed + (1.96 * meta.analy$seTE.fixed))*modifier
      } else if (place == "lower") {
        effect.fi <- (meta.analy$TE.fixed - (1.96 * meta.analy$seTE.fixed))*modifier
      }
    
      ## Uses the formula above (effect size * SD of IQ) to determine the drop in IQ
      effect.iq <- abs(effect.fi * sd.cog)
      
      ## Uses above helper functions to determine simulated fertility/childlessness
      actual.effect.fert <- sim.cog(effect.iq)
      actual.effect.child <- sim.childlessness(effect.iq, prop.affected)
      
      ## Make a returnable data.table:
      model.cog <- bind_rows(model.cog,
                             data.table(val = effect.iq, 
                                        ratio = actual.effect.fert[[1]], 
                                        mean.childlessness = actual.effect.child, 
                                        shet = modifier, 
                                        sex = s, 
                                        error = place))
  
    }
      
  }

}

model.cog[,expected.iq:=mean.cog-val]
model.cog[,val:=NULL]

model.cog <- data.table(pivot_wider(model.cog, names_from = error, values_from = c(expected.iq, ratio, mean.childlessness)))

model.cog[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

paste0("Contribution of Cognition to Fitness: ",
       sprintf("%0.1f",(((1 - model.cog[shet == 1 & sex == 1,ratio_mid]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_mid]))*100)),
       "% (",
       sprintf("%0.1f",(((1 - model.cog[shet == 1 & sex == 1,ratio_upper]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_upper]))*100)),
       " - ",
       sprintf("%0.1f",(((1 - model.cog[shet == 1 & sex == 1,ratio_lower]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_lower]))*100)),
       "%)")

paste0("Predicted drop in IQ for sHET = 1 male  : " ,sprintf("%0.2f", model.cog[shet == 1 & sex == 1,100 - expected.iq_mid]))
paste0("Predicted drop in IQ for sHET = 1 female: " ,sprintf("%0.2f", model.cog[shet == 1 & sex == 2,100 - expected.iq_mid]))
```

## 7J. Fertility + Mental Health

### Linear Model

```{r MH Linear Model}

## Actual LM
sex.diff.mhq <- data.table(sex=c(1,2))
sex.diff.mhq[,c("beta","std.error","p.val","data","model"):=run.lm(sex,"mht.binary"),by=1:nrow(sex.diff.mhq)]

## Add CIs, factorize sex, get OR (from log odds ratio)
sex.diff.mhq[,var.ci.upper:=exp(beta + (1.96*std.error))]
sex.diff.mhq[,var.ci.lower:=exp(beta - (1.96*std.error))]
## Convert beta to an OR first so that the comparison against 1 below is on the OR scale
sex.diff.mhq[,beta:=exp(beta)]
sex.diff.mhq[,sig.pos:=if_else(beta<1,var.ci.lower-0.1,var.ci.upper+0.1)]
sex.diff.mhq[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## This is just a plot of the betas from the LM
beta.plot <- ggplot(sex.diff.mhq, aes(sexPulse,beta,colour=sexPulse)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper)) + 
  scale_y_continuous(name="OR",limits=c(-0.1,1)) + 
  sex.colours.colour +
  theme.legend

## This grabs the actual fit values from the LM to plot
post.lm.dt <- rbind(sex.diff.mhq[1,data][[1]],sex.diff.mhq[2,data][[1]])
post.lm.dt[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
post.lm.dt[,dummy:=1]

## Tabulate the means by sex
means <- post.lm.dt[,list(mean(`.resid`),sd(`.resid`),sum(dummy)),by=c("mht.binary","sexPulse")]
setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
means[,ci:=(sd/sqrt(n))*1.96]

## Now plot the means by sex
mean.plot <- ggplot(means,aes(as.factor(mht.binary),mean,group=sexPulse,colour=sexPulse)) +
  geom_point(position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=mean-ci,ymax=mean+ci),width=0,position=position_dodge(0.5)) +
  scale_y_continuous(name="Births Corrected For\nAge & PC1..10") +
  scale_x_discrete(name = "Has Power et al. Disorder",labels=c("False","True")) +
  sex.colours.colour +
  theme.legend

## Use patchwork to mash them together
(mean.plot | beta.plot) + plot_layout(widths=c(3,1),guides = "collect")

rm(mean.plot, beta.plot, sex.diff.mhq)
```

### Effect on Childlessness

As described in the methods, we are using ORs extracted from [Ganna et al.](https://doi.org/10.1016/j.ajhg.2018.05.002) for three MH traits that have an association with rare variant burden, and fertility statistics for those same MH traits from [Power et al.](https://doi.org/10.1001/jamapsychiatry.2013.268). We provide tabulated forms of this data with the Supplementary Materials of the manuscript.

This calculation uses the function $OR_{s_{het}[x,t]}=\exp\left(\frac{\log(OR_{ganna}) \times s_{het}[x]}{0.1618034}\right)$ to convert the OR for one additional high pLI (≥ 0.9) gene into an OR at a given s~het~ value. The calculation of the mean s~het~ value of a 'high pLI' gene (the denominator) is given below.
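
As a rough worked example of the conversion as the code applies it (take the log of the OR, rescale by s~het~, then exponentiate), the sketch below uses the OR of 1.4 tabulated for ASD and the 0.1618034 scaling constant from the formula. The `.toy` variable names are made up for this illustration.

```r
## OR for one additional high pLI gene (ASD point estimate) and the scaling constant
or.toy           <- 1.4
mean.highpLI.toy <- 0.1618034

## Convert to an OR at several sHET values on the log scale, then exponentiate
shet.toy    <- c(0, mean.highpLI.toy, 0.5, 1)
or.shet.toy <- exp((log(or.toy) * shet.toy) / mean.highpLI.toy)

data.frame(shet = shet.toy, or = or.shet.toy)
```

Two sanity checks follow from the form of the equation: an s~het~ of 0 gives an OR of 1, and an s~het~ equal to the scaling constant returns the original OR.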

```{r MH Childlessness}

## Get mean high pLI
mean.highpLI <- merge(pli.genes,shet.genes)[pLI.val>=0.9,mean(sHET.val)]
print(paste0("high pLI genes (≥ 0.9) have a mean sHET value of : ",sprintf("%0.3f",mean.highpLI)))

## These are tabulated values that were extracted from either Ganna et al or Power et al.
modeling <- data.table(or = rep(c(1.4,1.3,1.25),2),
                       or.upper = rep(c(1.5,1.4,1.35),2),
                       n.indv = rep(c(2947,18890,14439),2),
                       sex.ratio = c(2/(2+1),1.5/(1.5+1),1/(1+1.5),1/(2+1),1/(1.5+1),1.5/(1+1.5)),
                       ratio = c(0.25,0.23,0.75,0.48,0.47,0.85),
                       trait = rep(c("asd","scizo","bipolar"),2),
                       sex = c(rep(1,3),rep(2,3)))

## Power et al. lists 1.76 as the mean number of children per person, 
## for which we can extrapolate the mean number of children for each trait
modeling[,mean.children:=1.76*ratio]
modeling[,incidence:=(sex.ratio*n.indv)/1178299]
modeling[,healthy.ratio:=(incidence/(1-incidence))]

## This is to make arbitrarily even CIs (why don't they provide their point estimates!!!)
modeling[,or.lower:=exp((-1*log(or.upper/or)) + log(or))]

## Use this function to calculate the expected proportion of individuals with 
## a MH trait given an OR and the baseline odds of that trait (healthy.ratio)
calc.prop.indvs.mhq <- function(odds.ratio, healthy.ratio) {
  
  x <- odds.ratio * healthy.ratio
  y <- x + 1
  x / y

}

## Datatable for return:
mh.affected <- data.table()

for (s in c(1,2)) {

  ## Determine the expected incidence of each MH trait, and the resulting
  ## mean number of children, at various shet values
  for (shet in seq(0,1,by=0.1)) {
  
    ## Selects which OR (point estimate or upper/lower CI) will be used
    ## to get the expected proportion of affected individuals
    for (place in c("mid","lower","upper")) {
    
      disorder.table <- data.table()
      
      for (disorder in unique(modeling[,trait])) {
        
        if (place == "mid") {
          or <- modeling[trait == disorder & sex == s,or]
        } else if (place == "upper") {
          or <- modeling[trait == disorder & sex == s,or.upper]
        } else if (place == "lower") {
          or <- modeling[trait == disorder & sex == s,or.lower]
        }
      
        ## This is the function that we use to convert from Ganna et al. ORs to sHET ORs.
        or <- (log(or) * shet)/mean.highpLI
        or <- exp(or)
        
        healthy.ratio <- modeling[trait == disorder & sex == s,healthy.ratio]
        children.affected <- modeling[trait == disorder & sex == s,ratio] * base.fertilities[sex == s & inc.zero == T, fertility]
        
        prop.shet <- calc.prop.indvs.mhq(or,healthy.ratio)
      
        disorder.table <- bind_rows(disorder.table,
                                    data.table(prop.affected = prop.shet,
                                               children.affected = children.affected,
                                               error = place,
                                               trait = disorder))
        
      }
      
      prop.affected.total <- disorder.table[,sum(prop.affected)]
      children.unaffected <- base.fertilities[sex == s & inc.zero == T, fertility]
      mean.children <- sum(disorder.table[1:3, prop.affected * children.affected]) + (children.unaffected * (1 - prop.affected.total))
      mh.affected <- bind_rows(mh.affected,
                               data.table(shet = shet, sex = s, error = place, mean.children = mean.children, 
                                          mean.has.disorder = prop.affected.total,
                                          inc.scizo=disorder.table[trait=="scizo", prop.affected],
                                          inc.asd=disorder.table[trait=="asd", prop.affected],
                                          inc.bipolar = disorder.table[trait=="bipolar", prop.affected]))
      
    }
  }
}

## Calculate a fertility ratio for each trait.
mh.affected[,ratio:=if_else(sex == 1,
                            mean.children/base.fertilities[sex==1 & inc.zero == T, fertility],
                            mean.children/base.fertilities[sex==2 & inc.zero == T, fertility])]

## And generate the final model data.table like for other traits
model.mhq <- data.table(pivot_wider(mh.affected[,-c("inc.scizo","inc.asd","inc.bipolar")], names_from = error, values_from = c(mean.children,ratio,mean.has.disorder)))
model.mhq[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

## The following data table is used to plot incidence of each of the 
## MH traits we measure in the study
inc.mht <- data.table(pivot_longer(mh.affected,cols=starts_with("inc."),names_sep="\\.",names_to = c(".value","condition")))
inc.mht <- data.table(pivot_wider(inc.mht[,c("shet","sex","inc","condition","error")], names_from=error,values_from=inc))
inc.mht[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]

paste0("Contribution of MHTs to Fitness: ",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_mid]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_mid]))*100)),
       "% (",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_upper]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_upper]))*100)),
       " - ",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_lower]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_lower]))*100)),
       "%)")
```
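
The odds-to-proportion step inside `calc.prop.indvs.mhq` can be checked in isolation. The sketch below re-implements it under a different, hypothetical name (`odds.to.prop`) and uses the male ASD counts tabulated in the chunk above. The OR of 8 is an arbitrary illustrative value.

```r
## Multiply the baseline odds by an OR, then convert odds back to a proportion
odds.to.prop <- function(odds.ratio, baseline.odds) {
  odds <- odds.ratio * baseline.odds
  odds / (odds + 1)
}

## Baseline incidence and odds for male ASD from the tabulated values above
incidence.toy     <- ((2 / 3) * 2947) / 1178299
baseline.odds.toy <- incidence.toy / (1 - incidence.toy)

## An OR of 1 returns the baseline incidence unchanged
odds.to.prop(1, baseline.odds.toy)

## For a rare trait, an OR of 8 gives slightly less than 8 times the incidence
odds.to.prop(8, baseline.odds.toy) / incidence.toy
```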

## 7K. Joint Model

Here we perform joint multivariate regressions that include different combinations of the terms from the sections above.

### Main Regressions

```{r fig.height=12, fig.width=15}

## Get Nagelkerke results from my parallel model table
final.results <- lm.results.table[name == "joint.models" & maf == 0]

add.traits <- c("mht.binary","partner.in.house","completed.college","fi.fert","had.sex")
add.traits.all <- list()
z <- 2
add.traits.all[[1]] <- c()
for (x in c(1:length(add.traits))) {
  combs <- combn(add.traits, x, FUN = list)
  for (y in c(1:length(combs))) {
    add.traits.all[[z]] <- combs[[y]]
    z <- z+1
  }
}

final.results.matrix <- data.table()

for (i in c(1:length(add.traits.all))) {
  
  curr.cov <- add.traits.all[[i]]
  curr.cov.list <- list(curr.cov)
  
  ## This is the only way I could figure out how to do direct list equivalence in data.table..... XD
  curr.results <- final.results[final.results[,identical(add.covars,curr.cov.list),by=1:nrow(final.results)][V1 == T,nrow]]
  setnames(curr.results,c("add.covars"),c("curr.cov"))
  cols <- c("maf","gene.list","y.var","variant.type","curr.cov","sex","var.beta","var.stderr","var.p","n.var","n.indvs","model","inc.r.shet")
  curr.results <- curr.results[,..cols]
  
  for (s in c(1,2)) {
    meta.table <- curr.results[sex == s]
    meta.analy <- metagen(var.beta,
                          var.stderr,
                          studlab = variant.type,
                          method.tau = "SJ",
                          sm = "OR",
                          data = meta.table)
    meta.table <- data.table(maf = 0,
                             gene.list = "product_sHET",
                             y.var = "num.children",
                             variant.type = "META",
                             curr.cov = curr.cov.list,
                             sex = s,
                             var.beta = meta.analy$TE.fixed,
                             var.stderr = meta.analy$seTE.fixed,
                             var.p = meta.analy$pval.fixed,
                             n.var = NaN,
                             n.indvs = curr.results[sex == s,sum(n.indvs)],
                             model = list(),
                             inc.r.shet = NaN)
    
    curr.results <- rbind(curr.results,meta.table)
    
  }

  final.results.matrix <- rbind(final.results.matrix, curr.results)
  
}

final.results.matrix[,curr.cov.string:=paste0(curr.cov), by = 1:nrow(final.results.matrix)]
final.results.matrix[,n.terms:=if_else(curr.cov.string == "", 0L, length(unlist(curr.cov))),by=1:nrow(final.results.matrix)]

final.results.matrix[,sexPulse:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]
final.results.matrix[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
final.results.matrix[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
final.results.matrix[,or:=exp(var.beta)]

## This gets rid of weird vectorization
final.results.matrix[,curr.cov.string:=str_replace_all(str_replace_all(str_replace(curr.cov.string,"c\\(\"",""),"\"",""),"\\)","")]
## These set the actual names of covariates as human readable:
final.results.matrix[,curr.cov.string:=str_replace(curr.cov.string,"mht.binary","Has MHT")]
final.results.matrix[,curr.cov.string:=str_replace(curr.cov.string,"partner.in.house","Has Partner")]
final.results.matrix[,curr.cov.string:=str_replace(curr.cov.string,"completed.college","Completed College")]
final.results.matrix[,curr.cov.string:=str_replace(curr.cov.string,"fi.fert","Has ICD-10 Infertility Code")]
final.results.matrix[,curr.cov.string:=str_replace(curr.cov.string,"had.sex","Ever Had Sex")]
```
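
The loop at the top of the chunk above enumerates every subset of the additional covariates, including the empty set. A compact way to sanity-check the expected number of models is sketched below. `toy.traits` and `toy.sets` are hypothetical names, and the empty set is represented here as `character(0)` rather than the way the chunk above stores it.

```r
## Same five covariates as above, under a different name
toy.traits <- c("mht.binary", "partner.in.house", "completed.college", "fi.fert", "had.sex")

## All subsets of size 1..5, flattened into a single list, plus the empty set
toy.sets <- c(list(character(0)),
              unlist(lapply(seq_along(toy.traits),
                            function(k) combn(toy.traits, k, simplify = FALSE)),
                     recursive = FALSE))

## Number of covariate sets, and how many sets there are of each size
length(toy.sets)
table(lengths(toy.sets))
```

With five covariates this gives 2^5 = 32 sets, so each sex and variant type should have 32 joint models in `final.results`.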

# 8. Figures

Create a directory to hold all figures and supplementary data.

```{bash Make Figure Dirs}

mkdir -p figures/supplement/

```

## 8A. Universal function to make forest plots

```{r Meta analysis calc and plotting}

## Helper functions for the main function:
get.meta.val.logistic <- function(study, m, s, g) {
  
  meta.table <- study[maf == m & (variant.type == "DEL" | variant.type == "LOF_HC") & sex == s & gene.list == g]
  meta.analy <- metagen(var.beta,
                        var.stderr,
                        studlab = variant.type,
                        method.tau = "SJ",
                        sm = "OR",
                        data = meta.table)
  
    return(list(meta.analy$TE.fixed,
                meta.analy$seTE.fixed,
                meta.analy$pval.fixed))
  
}

get.meta.val.linear <- function(study, m, s, g) {

  meta.table <- study[maf == m & (variant.type == "DEL" | variant.type == "LOF_HC") & sex == s & gene.list == g]
  meta.analy <- metagen(var.beta,
                        var.stderr,
                        studlab = variant.type,
                        sm = "SMD",
                        prediction=T,
                        data = meta.table)

    return(list(meta.analy$TE.fixed,
                meta.analy$seTE.fixed,
                meta.analy$pval.fixed))
    
}

## This formats p.values for the figures
format.p <- function(var.p) {

  if (var.p < 1e-2) {
    match <- sprintf("%0.1e",var.p)
    found <- str_match(match, "(\\d\\.\\d)e(\\-\\d+)")
    prefix <- as.double(found[,2])
    exponent <- as.integer(found[,3])
    return(list(prefix,exponent))
  } else {
    return(list(NaN,NaN))
  }

}

make.meta.table <- function(data, is.linear, 
                            gene.list = "product_sHET",
                            allele.freq = 0,
                            ymin = -0.15, 
                            ymax = 1.25, 
                            b = 0.2, 
                            block = 0, 
                            p.pos = 0.05,
                            title = "",
                            show.x = T,
                            alt.y.axis = NA) {

  meta.result <- data.table(crossing(maf = allele.freq,
                                  sex = c(1,2),
                                  gene.list = gene.list,
                                  variant.type = "META"))

  if (is.linear == T) {
    meta.result[,c("var.beta","var.stderr","var.p"):=get.meta.val.linear(data,maf,sex,gene.list),by=1:nrow(meta.result)]
    y.axis <- expression(bold(Effect~size~at~s[het]~burden == 1))
    plot.breaks <- unique(c(seq(0,ymin,by=-1 * b),seq(0,ymax,by=b)))
    plot.breaks <- plot.breaks[plot.breaks != block]

    meta.table <- rbind(data[maf == allele.freq & (variant.type == "DEL" | variant.type == "LOF_HC"),c("variant.type","sex","var.beta","var.stderr","var.p","n.var","n.indvs")],
                          meta.result[maf == allele.freq,c("variant.type","sex","var.beta","var.stderr","var.p")],
                        fill=TRUE)
    
    meta.table[,var.ci.upper:=var.beta + (1.96*var.stderr)]
    meta.table[,var.ci.lower:=var.beta - (1.96*var.stderr)]
    
  } else {
    meta.result[,c("var.beta","var.stderr","var.p"):=get.meta.val.logistic(data,maf,sex,gene.list),by=1:nrow(meta.result)]

    y.axis <- expression(bold(Odds~ratio~at~s[het]~burden==1))
    plot.breaks <- unique(c(seq(1,ymin,by=-1 * b),1, seq(b+b,ymax,by=b)))
    plot.breaks <- plot.breaks[plot.breaks != block]
    
    meta.table <- rbind(data[maf == allele.freq & (variant.type == "DEL" | variant.type == "LOF_HC"),c("variant.type","sex","var.beta","var.stderr","var.p","n.var","n.indvs")],
                          meta.result[maf == allele.freq ,c("variant.type","sex","var.beta","var.stderr","var.p")],
                        fill=TRUE)
    
    meta.table[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
    meta.table[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
    meta.table[,var.beta:=exp(var.beta)]
    
  }

  if (!is.na(alt.y.axis)) {
    y.axis <- alt.y.axis
  }
  
  n.male <- data[sex == 1 & maf == allele.freq & (variant.type == "DEL" | variant.type == "LOF_HC"), sum(n.indvs)]
  n.female <- data[sex == 2 & maf == allele.freq & (variant.type == "DEL" | variant.type == "LOF_HC"), sum(n.indvs)]
  
  meta.table[,n.indvs:=if_else(variant.type=="META",if_else(sex == 1, n.male, n.female),n.indvs)]
  meta.table[,Sex:=factor(sex,levels=c("1","2"),labels = c("Male","Female"))]
  meta.table[,sex:=NULL]
  meta.table[,variant.type:=factor(variant.type,levels=c("META","LOF_HC","DEL"))]
  meta.table[,variant.shape:=if_else(variant.type=="META",18,15)]
  meta.table[,p.nudge:=if_else(variant.type=="LOF_HC",-0.2,-0.3)]
  meta.table[,var.ci.upper:=if_else(var.ci.upper>ymax,ymax,var.ci.upper)]
  meta.table[,var.ci.lower:=if_else(var.ci.lower<ymin,ymin,var.ci.lower)]
  meta.table[,c("prefix","exponent"):=format.p(var.p),by=1:nrow(meta.table)]
  meta.table[,p.sig:=if_else(var.p < 0.0025, "*","")]
  
  if (show.x == T) {
    add.theme <- theme(panel.grid.major.y = element_blank())
  } else {
    add.theme <- theme(axis.title.x=element_blank(),axis.text.x=element_blank(), panel.grid.major.y = element_blank())
  }
  
  plot <- ggplot(meta.table,aes(variant.type,var.beta,group=Sex,colour=Sex)) +
    geom_hline(aes(yintercept=if_else(is.linear==T,0, 1)),colour="red",linetype=2,size=1) +
    geom_point(aes(size=n.indvs,shape=variant.shape),position=position_dodge(0.7)) +
    geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.7)) +
    geom_text(aes(y = p.pos, label=paste0("p==",if_else(var.p<1e-2,paste0(prefix,"%*%10^",exponent),sprintf("%0.2f",var.p)))), parse = T,position=position_dodge(0.75),size=3.5,hjust=1,show.legend = F) +
    geom_text(aes(y = p.pos, label = p.sig), position=position_dodge(0.65),size=7,hjust=0,show.legend = F) +
    scale_x_discrete(name = title, position = "top",labels=c("Meta","PTVs","Dels")) +
    scale_y_continuous(name=y.axis,limits = c(ymin,ymax), breaks=plot.breaks) +
    scale_shape_identity() +
    scale_size_area(breaks=c(25000,100000,175000),guide=guide_legend(title="# of Indivs."), limits = c(0, 200000)) +
    sex.colours.colour.rev +
    coord_flip(clip = "off") +
    theme.figures.legend + add.theme

  return(list(meta.table,plot))
  
}
```
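
The helper functions above take the fixed-effect estimate, standard error and p value from `metagen`. As a minimal sketch of what that estimate represents, assuming the usual inverse-variance weighted fixed-effect model, the same three quantities can be computed by hand. The betas and standard errors below are invented and do not come from our results.

```r
## Hypothetical per-variant-type estimates (e.g. DEL and LOF_HC)
beta.toy <- c(-0.30, -0.20)
se.toy   <- c(0.10, 0.05)

## Inverse-variance weights
w.toy <- 1 / se.toy^2

## Pooled fixed-effect estimate, its standard error, and a two-sided p value
te.fixed.toy   <- sum(w.toy * beta.toy) / sum(w.toy)
se.fixed.toy   <- sqrt(1 / sum(w.toy))
pval.fixed.toy <- 2 * pnorm(-abs(te.fixed.toy / se.fixed.toy))

c(TE = te.fixed.toy, seTE = se.fixed.toy, p = pval.fixed.toy)
```

The more precise study dominates the pooled estimate, which is why the meta-analysis value usually sits closer to the variant type with more carriers.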

## 8B. Main Text

**Note**: Light manual editing was used to improve the figure legends for main text Figures 1 and 2 and to move the A/B panels in Figure 1.

Figure size widths for _Nature_ are the following:

- 1 column: 89 mm = 3.5 in
- 1.5 column: 118 mm = 4.65 in
- 2 column: 183 mm = 7.2 in

Maximum Depth is:

247 mm = 9.72 in
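
Since `ggsave` is called with `units = "in"` in the chunks that follow, a small converter makes it easy to re-derive these sizes. `mm.to.in` is a hypothetical helper and is not defined elsewhere in this document.

```r
## Convert millimetres to inches (25.4 mm per inch)
mm.to.in <- function(mm) mm / 25.4

## Column widths and maximum depth listed above, in inches
sizes.in <- round(mm.to.in(c(89, 118, 183, 247)), 2)
sizes.in
```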

### Figure 1.

#### Fertility Meta-Analysis

```{r Main Text Figure 1a, fig.height=3, fig.width=8.5}

## This just gets a forest plot for the fertility linear regression
plot.a <- make.meta.table(results.fertility.linear, T, ymin = -1.15, ymax= 0.15, block = -1, p.pos = -0.9, b = 0.25)
plot.a

## This just gets a forest plot for our primary childlessness logistic regression
plot.b <- make.meta.table(results.fertility, F, b = 0.25,p.pos=0.1, ymin = -0.15, ymax = 1.15)
plot.b

```

#### Burden and Proportion Plots 

```{r Main Text Figure 1b-e, fig.height=5, fig.width=7}

## the 'tab' data.frame is relevant only for plot b and c, using old regressions for plot. 
## This is to just tabulate vitality statistics for each individual
tab <- UKBB.phenotype.data[,c("eid","sexPulse","children.fathered","live.births")]

tab[,children:=if_else(sexPulse==1,children.fathered,live.births)]
tab <- tab[!is.na(children)]

## Function to generate plots for Dels and PTVs
plot.dist.mean <- function (v) {
  
  ## Attach sHET burden to each individual
  quants <- merge(tab,variant.counts[type==v & allele.freq==0,c("sample_id","product_sHET")],by.x="eid",by.y="sample_id",all.x=T)
  quants[,sexPulse:=factor(sexPulse,levels=c("1","2"),labels = c("Male","Female"))]
  
  ## Only include individuals for which we have CNV or PTV data
  quants <- quants[eid %in% samples.UKBB.cnv[,eid] | eid %in% has.wes[has.wes>0,eid]]
  
  ## Get whether individuals have children or not:
  quants[,children.binary:=if_else(children>0,1,0)]
  quants[,dummy:=1]
  
  ## Remove individuals without an sHET score
  quants <- quants[!is.na(product_sHET)]
  
  ## The following code is to set axis limits and labels -- purely graphical
  # Bin sizes
  b <- c(-1,((0:4)*0.15),100)
  quants[,product_sHET.cut:=cut(product_sHET,breaks=b)]
  quants[,dummy:=1]
  
  # Quantify proportion of individuals in each sHET bin we create above
  totals <- quants[,sum(dummy),by=c("product_sHET.cut","sexPulse")]
  sums <- quants[,sum(dummy),by=c("sexPulse")]
  setnames(sums,c("sexPulse"),c("s"))
  totals[,prop:=V1/sums[s==sexPulse,V1],by=1:nrow(totals)]
  totals[,prop.2:=prop*100000]
  totals[,ci:=1.96*sqrt((prop.2*(100000-prop.2))/sums[s==sexPulse,V1]),by=1:nrow(totals)]
  
  ## Change the ugly formatting that is the direct output of 'cut()' to something better for a plot label
  x.axis.labels <- str_replace(str_replace(str_replace(str_replace(totals[,levels(product_sHET.cut)],"\\[",""),"\\]",""),",","-"),"\\(","")
  x.axis.labels.2 <- c()
  for (l in x.axis.labels) {
    if (grepl("-100",l)) {
      l <- str_replace(l,"\\-100","")
      l <- paste0(">",l)
      x.axis.labels.2 <- c(x.axis.labels.2,l)
    } else if (grepl("-1-",l)) {
      l <- str_replace(l,"\\-1-","")
      x.axis.labels.2 <- c(x.axis.labels.2,l)
    } else {
      x.axis.labels.2 <- c(x.axis.labels.2,l)
    }
  }
  
  ## Bar plot of the proportion of individuals in each product_sHET bin, by sex
  plot.dist <- ggplot(totals,aes(product_sHET.cut,prop.2,group=sexPulse,fill=sexPulse)) +
    geom_col(position=position_dodge(1),colour="black",size=0.5) +
    scale_x_discrete(name = "",labels=x.axis.labels.2) +
    scale_y_log10(name = "Proportion of individuals",breaks=c(1,10,100,1000,10000,100000),labels=paste0(c(0.001,0.01,0.1,1,10,100),"%"), limits = c(1, 100000)) +
    geom_errorbar(aes(ymin=prop.2-ci,ymax=prop.2+ci),position=position_dodge(1),width=0) +
    sex.colours.fill +
    theme.figures + theme(axis.text.x=element_blank(), line=element_line(size=0.75,colour="black",lineend="round"),panel.grid.major.x=element_blank())

  ## Percentage of Individuals With Children By Sex and Variant Type
  means <- quants[,list(mean(children.binary),sd(children.binary),sum(dummy)),by=c("sexPulse","product_sHET.cut")]
  setnames(means,c("V1","V2","V3"),c("mean","sd","n"))
  
  means[,mean:=mean*100]
  means[,sd:=sd*100]
  means[,ci:=(sd/sqrt(n))*1.96]
  means[,ci.lower:=mean-ci]
  means[,ci.upper:=mean+ci]
  
  means[,ci.lower.symbol:=if_else(ci.lower<0,25,NaN)]
  means[,ci.lower:=if_else(ci.lower<0,0,ci.lower)]
  means[,ci.upper.symbol:=if_else(ci.upper>100,24,NaN)]
  means[,ci.upper:=if_else(ci.upper>100,100,ci.upper)]
  
  means[,sex:=factor(sexPulse,levels=c(1,2,3),labels=c("Male","Female","Both"))]
  
  lines <- means[product_sHET.cut=="(-1,0]"]
  
  if (v == "DEL") {
    label.x <- expression(bold(Deletion~s[het]~Burden))
  } else {
    label.x <- expression(bold(PTV~s[het]~Burden))
  }
  
  plot.means <- ggplot(means,aes(product_sHET.cut,mean,group=sexPulse,colour=as.factor(sexPulse))) +
    geom_hline(data=lines,aes(yintercept=mean,colour=as.factor(sexPulse)),linetype=2) +
    geom_point(position=position_dodge(0.5),shape=16,size=2) +
    geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),width=0,position=position_dodge(0.5),size=0.75) +
    scale_x_discrete(name = label.x,labels=x.axis.labels.2) +
    scale_y_continuous(name="Proportion of individuals",labels = paste0(seq(0,100,by=25),"%"), limits=c(0,100)) +
    sex.colours.colour +
    theme.figures + theme(line=element_line(size=0.75,colour="black",lineend="round"),panel.grid.major.x=element_blank())
  
  return(list(plot.dist,plot.means))
  
}

del.plots <- plot.dist.mean("DEL")
lof.plots <- plot.dist.mean("LOF_HC")
```
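
The error bars in `plot.means` come from a normal-approximation 95% confidence interval on the percentage of individuals with children, clamped to the 0-100% range. A minimal base-R sketch of that arithmetic is shown below; `prop.ci` and its toy input are hypothetical names used only for illustration and are not part of the analysis code.

```r
## Hypothetical helper: mean, SD and normal-approximation 95% CI (in percent)
## for a 0/1 indicator, clamped to [0, 100] as is done for `means` above.
prop.ci <- function(x) {
  n <- length(x)
  m <- mean(x) * 100
  s <- sd(x) * 100
  ci <- (s / sqrt(n)) * 1.96
  lower <- m - ci
  upper <- m + ci
  data.frame(mean = m,
             ci.lower = max(lower, 0),
             ci.upper = min(upper, 100),
             clamped.lower = lower < 0,   # interval ran below 0%
             clamped.upper = upper > 100) # interval ran above 100%
}

## Toy data (invented): 3 of 4 individuals have children
toy <- prop.ci(c(1, 1, 1, 0))
toy
```

With very small groups the interval can overshoot 100%, which is why the chunk above records a marker shape code (24 or 25) whenever a bound is clamped.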

#### Final Figure

For the final figure, I generate two .svg files and merge them in Illustrator, because getting ggplot + patchwork to produce the layout I want under Nature's figure space constraints seems difficult, if not impossible.

##### Top Panel

```{r Main Text Figure 1 Top, fig.height=4.5, fig.width=3.5}

plot.a[[2]] <- plot.a[[2]] + theme(plot.margin = margin(0, -0.5, 0, 0, unit = "cm"), legend.position = "none")
plot.b[[2]] <- plot.b[[2]] + theme(plot.margin = margin(0, -0.5, 0, 0, unit = "cm"), legend.position = "bottom", legend.box = "vertical", legend.margin = margin(-0.35,0,0,0, unit= "cm"), legend.box.just = "left")

figure.1.top <- plot.a[[2]] / plot.b[[2]] +
  plot_layout(guides="keep") + plot_annotation(tag_levels = "a")
  
figure.1.top

ggsave("figures/Figure1.top.svg",figure.1.top,width=3.5,height=4.5,units = "in")

```

##### Bottom Panel

```{r Main Text Figure 1 Bottom, fig.height=4.5, fig.width=3.5}

figure.1.bot <- (del.plots[[1]] + 
                   (lof.plots[[1]] + theme(axis.text.y=element_blank(),axis.title.y=element_blank())) +
                   del.plots[[2]] + 
                   (lof.plots[[2]] + theme(axis.text.y=element_blank(),axis.title.y=element_blank()))) + 
  plot_layout(widths=c(1,1), heights = c(3,4), nrow = 2, ncol = 2)  

figure.1.bot

ggsave("figures/Figure1.bottom.svg",figure.1.bot,width=3.5,height=4.5,units = "in")

```

### Figure 2.

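Effect of rare variant burden on six traits (partner at home, ever had sex, university degree, mental health disorder, household income, and fluid intelligence):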

```{r Main Text Figure 2, fig.height=9.7, fig.width=4.0}

fig.partner <- make.meta.table(results.partner[y.var=="partner.in.house"], F, title = "Partner at Home",show.x=F,b=0.25,ymin=-0.3,p.pos=-.1,block=-.25)
fig.had.sex <- make.meta.table(results.had.sex, F, title = "Ever Had Sex", show.x=F,b=0.25,ymin=-0.3,p.pos=-.1,block=-.25)
fig.ea <- make.meta.table(results.ea, F, title = "University Degree",b=0.25,ymin=-0.3,p.pos=-.1,block=-.25)
fig.hhi <- make.meta.table(results.household.income, T, title = "Household Income",show.x=F, ymin = -2.6, ymax= 0.5, block = -2.5, p.pos = -2.2, b = 0.5)
fig.cog <- make.meta.table(results.cog, T, title = "Fluid Intelligence", ymin = -2.6, ymax= 0.5, block = -2.5, p.pos = -2.2, b = 0.5)
fig.mht <- make.meta.table(results.mht[y.var == "mht.binary"], F, title = "Mental Health Disorder", b = 25, ymin = -0.3, ymax = 75, p.pos = 0.03,alt.y.axis = expression(bold(log(Odds~Ratio)~at~s[het]~burden==1)))

fig.mht[[2]] <- fig.mht[[2]] +
  scale_y_log10(limits = c(0.01,75),breaks = c(0.1,1,10,70), name = expression(bold(Odds~Ratio~at~s[het]~burden==1)))

## And mash it all together...
figure.2 <- fig.partner[[2]] + theme(legend.position = "none",plot.margin=margin(0,0,0,2,unit = "cm")) + 
  fig.had.sex[[2]] + theme(legend.position = "none",plot.margin=margin(0,0,0,2,unit = "cm")) +
  fig.ea[[2]] + theme(legend.position = "none",plot.margin=margin(0,0,0,2,unit = "cm")) +
  fig.mht[[2]] + theme(legend.position = "none",plot.margin=margin(0,0,0,2,unit = "cm")) +
  fig.hhi[[2]] + theme(legend.position = "none",plot.margin=margin(0,0,0,2,unit = "cm")) + 
  fig.cog[[2]] + theme(legend.position = "bottom",plot.margin=margin(0,0,0,2,unit = "cm")) + 
  guide_area() + 
  plot_layout(nrow=7, guides = "collect") + plot_annotation(tag_levels = "a")
figure.2

ggsave("figures/Figure2.svg",figure.2,width=4,height=9.7)
```

### Figure 3.

#### Reduction in Fitness

```{r Reduction in Fitness, fig.height=3, fig.width=4}

fitness.redux <- ggplot(model.fertility, aes(x = shet, group = sexPulse, fill = sexPulse)) + 
  geom_line(aes(y = ratio_mid, colour = sexPulse), size=2) +
  geom_ribbon(aes(ymin= ratio_lower, ymax = ratio_upper), alpha = 0.5) +
  scale_alpha_continuous(range = c(0,1)) + 
  sex.colours.colour + 
  sex.colours.fill + 
  geom_abline(intercept=1,slope=-1,linetype=2) + 
  scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0),position = "top") + 
  scale_y_continuous(name = "Predicted reduction in fitness", limits = c(0,1.05)) + 
  theme.figures.legend + theme(axis.title = element_text(size = 12), axis.text = element_text(size = 12),axis.text.x = element_text(hjust=0))

fitness.redux

```

#### Multiple Regression

```{r Multiple Regression, fig.height=6, fig.width=9}

final.results.matrix[,curr.cov.string:=factor(curr.cov.string, levels = final.results.matrix[,unique(curr.cov.string)])]

figure.4.subset.plot <- final.results.matrix[grepl("Fluid Intelligence", curr.cov.string) == F & (n.terms <= 1 | n.terms == 6 | curr.cov.string == "Has Partner, Ever Had Sex"), ]

figure.4.subset.plot[,c("prefix","exponent"):=format.p(var.p),by=1:nrow(figure.4.subset.plot)]

ors.plot <- ggplot(figure.4.subset.plot[variant.type == "META"], aes(curr.cov.string, or, group = sexPulse, colour = sexPulse)) +
  geom_hline(yintercept = 1, colour = "red", linetype = 2, size = 1) +
  geom_point(aes(size = n.indvs), position = position_dodge(0.6)) +
  geom_errorbar(aes(ymin = var.ci.lower, ymax = if_else(var.ci.upper > 1.25, 1.25, var.ci.upper)), width = 0, position = position_dodge(0.6)) +
  geom_text(aes(y = -0.13, label = paste0("p==",if_else(var.p<1e-2,paste0(prefix,"%*%10^",exponent),sprintf("%0.2f",var.p)))), parse = T, position = position_dodge(0.8), hjust = 0,size = 3.5) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = expression(bold(atop(atop(Odds~ratio~at~s[het]~burden==1,on~having~children),atop("","")))),limits = c(-0.15, 1.21), breaks = seq(0.2,1.2,by=0.2), expand = c(0,0)) +
  scale_size_area(breaks = c(25000,100000,175000),guide=guide_legend(title="# of Indivs."), limits = c(0, 200000)) +
  sex.colours.colour +
  coord_flip() +
  theme.figures.legend + theme(panel.grid.major.y = element_blank(),panel.grid.minor.y = element_line(colour = "grey",size = 2), axis.text.y = element_blank(), axis.title.x = element_text(size = 14),axis.text.x = element_text(size = 12))

r2.plot.table <- figure.4.subset.plot[variant.type == "LOF_HC"]
r2.plot.table[,val:=if_else(sexPulse == "Male",
                            inc.r.shet/r2.plot.table[sex == 1 & curr.cov.string == "NULL",inc.r.shet],
                            inc.r.shet/r2.plot.table[sex == 2 & curr.cov.string == "NULL",inc.r.shet])]
r2.plot.table[,val:=(val)*100]

inc.r.shet.plot <- ggplot(r2.plot.table,aes(curr.cov.string, val, group = sexPulse, fill = sexPulse)) +
  geom_hline(yintercept=100,colour="red",linetype = 2, size = 1) +
  geom_col(position = position_dodge(0.75), size = 1, width = 0.75, colour = "black") +
  scale_y_continuous(name = expression(bold(atop(atop("","Percent of the variance explained by"),atop(s[het]~"compared to the null model:",has.children %~% y.axis.covariates + control.covariates))))) +
  scale_x_discrete(name = "") +
  sex.colours.fill +
  coord_flip() + 
  theme.figures.legend + theme(axis.text.y = element_blank(),panel.grid.major.y=element_blank(), axis.title.x = element_text(size = 14),axis.text = element_text(size = 12))

labels.table <- data.table(current.cov.string=figure.4.subset.plot[variant.type == "META",unique(curr.cov.string)])
labels.table[,shet:=1]
labels.table[,mht:=if_else(str_detect(current.cov.string,"Has MHT") == T, 1, 0)]
labels.table[,partner:=if_else(str_detect(current.cov.string,"Has Partner") == T, 1, 0)]
labels.table[,college:=if_else(str_detect(current.cov.string,"Completed College") == T, 1, 0)]
labels.table[,infertile:=if_else(str_detect(current.cov.string,"ICD-10") == T, 1, 0)]
labels.table[,had.sex:=if_else(str_detect(current.cov.string,"Ever Had Sex") == T, 1, 0)]
labels.table <- data.table(pivot_longer(labels.table, -current.cov.string))
setnames(labels.table, "value","has.covar")

labels.table <- data.table(crossing(labels.table,sex=c(1,2)))

get.sig.code <- function(should.check,mod,covar,s) {
  if (should.check == 0) {
    return("")
  } else {
    ## This is dumb, but have to set the right covar names that match the model data.frame:
    to.check = covar
    if (covar == "college") {
      to.check = "completed.college"
    } else if (covar == "infertile") {
      to.check = "fi.fert"
    } else if (covar == "mht") {
      to.check = "mht.binary"
    } else if (covar == "partner") {
      to.check = "partner.in.house"
    } else if (covar == "shet") {
      to.check = "product_sHET"
    }
    curr.model <- figure.4.subset.plot[curr.cov.string == mod & sex == s & variant.type == "LOF_HC",model][[1]]
    if (curr.model[term == to.check,p.value] < (0.05/20)) {
      return("**")
    } else if (curr.model[term == to.check,p.value] < (0.05)) {
      return("*")
    } else {
      return("")
    }
  }
}
labels.table[,is.sig:=get.sig.code(has.covar,current.cov.string,name,sex),by=1:nrow(labels.table)]

labels.plot <- ggplot(labels.table,aes(interaction(sex,current.cov.string), name, fill = interaction(as.factor(has.covar),as.factor(sex)))) + 
  geom_tile(aes(stat = has.covar),size = 0.25, colour = "black") +
  geom_text(aes(label = is.sig),size=4,colour="white",nudge_x = -0.2) +
  scale_fill_manual(values = c("white",male.col,"white",female.col)) +
  geom_vline(xintercept = seq(0.5,17.5,by=2),colour="black",size = 1) +
  geom_hline(yintercept = seq(-1.5,7.5,by=1),colour="black",size = 1) +
  scale_x_discrete(name = "",expand=c(0,0)) +
  scale_y_discrete(name = "", position = "left",limits = c("shet","mht","partner","college","infertile","had.sex"),labels = c(expression(s[het]~Burden),"MH Disorder","Partner At Home","University Degree","Infertility","Had Sex"),expand=c(0,0)) +
  coord_flip() +
  theme.figures + theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),axis.line = element_blank(), axis.text.x = element_text(hjust = 1,size=11),axis.text.y = element_blank())

joint.plot <- labels.plot + ors.plot + inc.r.shet.plot + plot_layout(ncol = 3, guides = "collect", widths = c(0.4,1.3,0.3))
joint.plot

```
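
The stars drawn on the covariate tiles follow a two-level rule: `**` when a term's p value is below 0.05/20, and `*` when it is below 0.05. The sketch below is a vectorised restatement of that rule in base R; `sig.code` and its arguments are hypothetical and only illustrate the thresholds used in `get.sig.code()`.

```r
## Hypothetical vectorised version of the star coding in get.sig.code():
## "**" below a Bonferroni-style threshold (alpha / n.tests), "*" below alpha.
sig.code <- function(p, n.tests = 20, alpha = 0.05) {
  ifelse(p < alpha / n.tests, "**",
         ifelse(p < alpha, "*", ""))
}

## Invented p values, one per significance level
stars <- sig.code(c(0.001, 0.01, 0.2))
stars
```

Unlike `get.sig.code()`, this version does not look up the model term; it assumes the p values have already been extracted.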

#### Final Figure

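Remember: without TDI, this figure is 5.5 x 13 in (height x width). Legacy variable names in this section (e.g. `figure.4.subset.plot`) still refer to it as Figure 4.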

```{r Main Text Figure 3, fig.height=5.5, fig.width=13}

## Don't ask me how this works, it just does. Magic!
left <- (plot_spacer() + plot_spacer() + plot_spacer() +
  labels.plot + ors.plot + inc.r.shet.plot) + plot_layout(nrow = 2, ncol = 3, guides = "collect", heights = c(0.01,0.99), widths = c(0.4, 1.3, 0.3))
right <- (plot_spacer() + fitness.redux) + plot_layout(ncol = 1, nrow = 2, heights = c(0.00001,2),guides = "collect") 

figure.3 <- (left | right) + plot_layout(ncol = 2, widths = c(2,0.7), guides = "collect")

figure.3

ggsave("figures/Figure3.svg",figure.3,height = 5.5, width = 13)

```

## 8C. Extended Data Figures

### Figure 1.

```{r Ext Data Fig 1, fig.height=7, fig.width=10}

## This gives the ggplot default colours for "n" samples
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

hes.chapters <- c("II","XVIII","XIX","XX","XXI","XXII")

chapter.names.fi <- c("Infectious/Parasitic",
                      "Immune System",
                      "Endocrine, Nutritional,\n& Metabolic",
                      "Mental Health",
                      "Nervous System",
                      "Eye",
                      "Ear",
                      "Circulatory Sys.",
                      "Respiratory Sys.",
                      "Digestive Sys.",
                      "Skin",
                      "Musculoskeletal and\nConnective Tissue",
                      "Genitourinary Sys.",
                      "Pregnancy/Childbirth",
                      "Perinatal Conditions",
                      "Congenital Malformations")

chapter.names.hes <- c("Cancer",
                      "Symptoms, Signs,\nLaboratory Findings",
                      "External Causes",
                      "External Morbidity/\nMortality",
                      "Health Statuses",
                      "Special Codes")

plot.medical <- function(s, l, show.x = F, chod.title = "", hes.title = "") {

  fi.theme <- theme.figures + theme(panel.grid.major.x = element_blank())
  hes.theme <- theme.figures + theme(panel.grid.major.x = element_blank())
  if (s == "MALE") {
    lims = c(15.3,16.4)
    brs = c(15.4,15.7,16.0,16.3)
    p.thresh = 15.75
  } else {
    lims = c(2.7,3.05)
    brs = c(2.75,2.85,2.95,3.05)
    p.thresh = 2.85
  }
  if (show.x == T) {
    fi.theme <- fi.theme + theme(axis.text.x = element_text(colour=gg_color_hue(22)[1:16]))
    hes.theme <- hes.theme + theme(axis.text.x = element_text(colour=gg_color_hue(22)[17:22]))
  } else {
    fi.theme <- fi.theme + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
    hes.theme <- hes.theme+ theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
  }
  
  hes.plot.table <- hes.analysis.table[variant.type == "META" & level == l & chapter %in% hes.chapters & !is.na(var.est)]
  hes.plot.table[,icd.p.log:=if_else(icd.p.log > 100 | icd.p.log == 0,100,icd.p.log)]
  hes.plot.table[,source:="hes"]
  fi.plot.table <- fi.analysis.table[variant.type == "META" & level == l & !chapter %in% hes.chapters & !is.na(var.est)]
  fi.plot.table[,icd.p.log:=if_else(icd.p.log > 100 | icd.p.log == 0,100,icd.p.log)]
  fi.plot.table[,source:="fi"]
  
  plot.table <- rbind(fi.plot.table,hes.plot.table)
  plot.table.x <- plot.table[sex == s]
  setkey(plot.table.x, cols = "chapter","var.p.log")
  plot.table.x[,col:=.I]
  
  plot.fi <- ggplot(plot.table.x[source == "fi"],aes(chapter,var.p.log,colour=chapter,group=col,shape=sex)) + 
    geom_point(position=position_dodge(1)) +
    scale_x_discrete(name = "", labels = chapter.names.fi) +
    scale_y_continuous(name = "",limits = lims, breaks = brs) +
    geom_text(data = plot.table.x[source == "fi" & (var.p.log < p.thresh) ],aes(label=meaning),size=3,hjust=0,nudge_x=-0.3) +
    scale_color_manual(values = gg_color_hue(22)[1:16]) +
    ggtitle(chod.title) +
    theme.figures + fi.theme
    
  plot.hes <- ggplot(plot.table.x[source == "hes"],aes(chapter,var.p.log,colour=chapter,group=col,shape=sex)) + 
    geom_point(position=position_dodge(1)) +
    scale_x_discrete(name = "", labels = chapter.names.hes) +
    scale_y_continuous(name = "",limits = lims, breaks = brs) +
    geom_text(data = plot.table.x[source == "hes" & (var.p.log < p.thresh) ],aes(label=meaning),size=3,hjust=0,nudge_x=-0.3) +
    scale_color_manual(values = gg_color_hue(22)[17:22]) +
    ggtitle(hes.title) +
    theme.figures + hes.theme + theme(axis.text.y=element_blank())

  return(list(plot.fi,plot.hes))
  
}

male.plots <- plot.medical("MALE",3,F,"Complete Health Outcomes Data","Hospital Episode Stats.")
female.plots <- plot.medical("FEMALE",3, T)

y.lab <- grid::textGrob(expression(bold(-log[10]~p~value~`for`~the~effect~of~s[het]~burden==1~on~having~children)), rot=90)

main.plot <- male.plots[[1]] + male.plots[[2]] + female.plots[[1]] + female.plots[[2]] + plot_layout(ncol = 2, nrow = 2, widths = c(16, 6), tag_level = 'new') + plot_annotation(tag_levels = list(c("A","B","C","D")))

extdata.figure.1 <- wrap_elements(y.lab) + main.plot + plot_layout(ncol = 2, nrow = 1, widths = c(0.3,20), tag_level = 'keep')
extdata.figure.1

ggsave("figures/ExtDataFigure1.svg",extdata.figure.1,width=10,height=7)

```
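
The colouring in this figure relies on one 22-colour palette that is split between the two data sources: the first 16 colours go to the CHOD chapters and the remaining 6 to the HES chapters, so the colours stay distinct across the paired panels. A small sketch of that split, repeating `gg_color_hue()` so it runs on its own (`palette.22`, `chod.cols` and `hes.cols` are illustrative names):

```r
## Same function as in the chunk above, repeated so this sketch is self-contained
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

palette.22 <- gg_color_hue(22)
chod.cols <- palette.22[1:16]   # 16 chapters plotted from CHOD ("fi")
hes.cols  <- palette.22[17:22]  # 6 chapters plotted from HES

length(chod.cols)
length(hes.cols)
```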

## 8D. Text-based ORs

```{r Text Based ORs}
## sHET on having infertility
infertility.plot <- make.meta.table(results.fertility.MIC.CHOD, F, gene.list = "product_sHET")

infertility.plot[[1]][variant.type == "META",paste0(Sex, " OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## no male fertility genes
no.male.fertility.plot <- make.meta.table(results.excl.male, F, gene.list = "product_sHET_no_maleInfertilityGenes")

no.male.fertility.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex, " Fertility Gene OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## no mouse fertility genes
mouse.data <- make.meta.table(results.excl.mouse, F, gene.list = "product_sHET_no_mouseInfertilityGenes")

mouse.data[[1]][Sex == "Male" & variant.type == "META",paste0(Sex, " OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## no pathogenic CNV carriers
no.path.cnvs.plot <- make.meta.table(results.fertility.no.path, F)

no.path.cnvs.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex," No Path CNVs OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## Exclude MH patients
no.mh.patients.plot <- make.meta.table(results.fertility.no.mhq, F)

no.mh.patients.plot[[1]][variant.type == "META",paste0(Sex," No MH Patients OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## Exclude Disease Genes
no.disease.plot <- make.meta.table(results.excl.disease, F, gene.list = "product_sHET_no_diseaseGenes", ymax = 1.6)

no.disease.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex," Disease Gene OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]

## Exclude same sex individuals
no.same.sex.plot <- make.meta.table(results.fertility.no.same.sex, F, gene.list = "product_sHET", ymax = 1.6)

no.same.sex.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex," Exclude Same Sex OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
```
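
Each statement above builds the same string: an odds ratio, a 95% confidence interval computed on the log scale, and a p value printed in scientific notation when it is at most 0.01. As a sketch only, the repeated `paste0()` pattern could be wrapped in a helper like the one below. `or.text` is a hypothetical name, and the sketch assumes, as the `exp(log(var.beta) ± 1.96*var.stderr)` pattern implies, that `var.beta` is already an odds ratio and `var.stderr` is the standard error of its logarithm.

```r
## Hypothetical helper mirroring the paste0() calls above.
## Assumes `or` is an odds ratio and `se` is the standard error of log(OR).
or.text <- function(label, or, se, p) {
  lower <- exp(log(or) - 1.96 * se)
  upper <- exp(log(or) + 1.96 * se)
  p.txt <- ifelse(p <= 1e-2, sprintf("%0.1e", p), sprintf("%0.2f", p))
  paste0(label, " OR=", sprintf("%0.2f", or),
         " [95% CI ", sprintf("%0.2f", lower), "-", sprintf("%0.2f", upper), "]",
         " p=", p.txt)
}

## Invented values, for illustration only
out <- or.text("Male", or = 0.50, se = 0.10, p = 0.0004)
out
```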

## 8E. Supplement

### Figures

#### Figure 1.

```{r Supp Fig 1, fig.height=8, fig.width=8}

## Grab relevant columns from phenotype data
birth.stats <- UKBB.phenotype.data[,c("eid","children.fathered","live.births","birth.year","agePulse","sexPulse")]
birth.stats[,dummy:=1]

## Tabulate whether individuals had children, depending on their sex
birth.stats <- birth.stats[!is.na(children.fathered) | !is.na(live.births)]
birth.stats[,has.children:=if_else(sexPulse == 1, if_else(children.fathered>0,1,0), if_else(live.births>0,1,0))]

## Calculate birth stats for...
# men
men <- birth.stats[(children.fathered >= 0 & sexPulse == 1),list(mean(children.fathered),sd(children.fathered),mean(has.children),sd(has.children),sum(dummy)),by="birth.year"]
men[,y.var:="children.fathered"]

# women
women <- birth.stats[(live.births >= 0 & sexPulse == 2),list(mean(live.births),sd(live.births),mean(has.children),sd(has.children),sum(dummy)),by="birth.year"]
women[,y.var:="live.births"]

agePlot <- bind_rows(men,women)
setnames(agePlot,c("V1","V2","V3","V4","V5"),c("mean.births","sd.births","mean.childlessness","sd.childlessness","n"))
agePlot[,ci.births:=1.96*(sd.births/sqrt(n))]
agePlot[,ci.childlessness:=1.96*(sd.childlessness/sqrt(n))]

## Get rid of categories with < 100 individuals
agePlot <- agePlot[n >= 100]

## Invert childlessness for plotting purposes
agePlot[,mean.childlessness:=1-mean.childlessness]

## Factorize sex
agePlot[,sexPulse:=factor(y.var,levels=c("children.fathered","live.births"),labels=c("Male","Female"))]

## Plot of mean number of children
plot.children <- ggplot(agePlot,aes(birth.year,mean.births,group=y.var,colour=sexPulse)) + geom_point(position=position_dodge(0.5)) + geom_errorbar(aes(ymin=mean.births-ci.births,ymax=mean.births+ci.births),width=0,position=position_dodge(0.5)) + xlab("") + ylab("Average Births") + sex.colours.colour + theme.figures.legend + theme(axis.text.x=element_blank())

## Plot of mean childlessness
plot.childlessness <- ggplot(agePlot,aes(birth.year,mean.childlessness*100,group=y.var,colour=sexPulse)) + geom_point(position=position_dodge(0.5)) + geom_errorbar(aes(ymin=(mean.childlessness-ci.childlessness)*100,ymax=(mean.childlessness+ci.childlessness)*100),width=0,position=position_dodge(0.5)) + xlab("") + ylab("% Childless") + sex.colours.colour + theme.figures.legend + theme(axis.text.x=element_blank())

## Plot of birth year cohorts
yearPlot <- birth.stats[,sum(dummy),by=c("birth.year","sexPulse")]
yearPlot[,prop:=if_else(sexPulse == 1, V1/yearPlot[sexPulse == 1,sum(V1)],V1/yearPlot[sexPulse == 2,sum(V1)])]
yearPlot[,prop:=prop*100]
yearPlot[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
yearPlot <- yearPlot[V1 >= 100]

plot.birthyears <- ggplot(yearPlot,aes(birth.year,prop,group=sexPulse,fill=sexPulse)) + 
  geom_col(position=position_dodge()) + 
  scale_x_continuous(name = "Birth Year") +
  scale_y_continuous(name = "% of Participants") +
  sex.colours.fill + 
  theme.figures

## Use patchwork to mash them together
children.plots <- plot.children + plot.childlessness + plot.birthyears + plot_layout(nrow = 3, guides="collect") + plot_annotation(tag_levels = 'A')
children.plots

ggsave("figures/supplement/SuppFig1.png",children.plots, dpi = 300, height = 8, width = 8, units = "in")
```
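
The childlessness panel depends on a sex-specific definition of having children: men are scored on `children.fathered` and women on `live.births`, and the plotted value is one minus the mean of that indicator. The base-R sketch below works through this on a toy table; the column names follow the chunk above, but the values and the helper column `n.children` are invented for illustration.

```r
## Toy phenotype table (values invented); sexPulse 1 = male, 2 = female
toy <- data.frame(sexPulse = c(1, 1, 2, 2, 2),
                  children.fathered = c(0, 2, NA, NA, NA),
                  live.births = c(NA, NA, 1, 0, 3))

## Men are scored on children fathered, women on live births
toy$n.children <- ifelse(toy$sexPulse == 1, toy$children.fathered, toy$live.births)
toy$has.children <- as.numeric(toy$n.children > 0)

## Percent childless per sex = (1 - mean(has.children)) * 100
childless <- sapply(split(toy$has.children, toy$sexPulse),
                    function(x) (1 - mean(x)) * 100)
childless
```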

#### Figure 2.

CNVs Per Individual Figures

```{r Supp Fig 2, fig.height=8, fig.width=8}

ukbb.annotated.cnvs.qcd[,dummy:=1]

totals.delhom <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number==0 & filter.0.95.wes.support.score == T,.(num.del.hom=sum(dummy)),by=c("eid")]
totals.delhet <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number==1 & filter.0.95.wes.support.score == T,.(num.del.het=sum(dummy)),by=c("eid")]
totals.duphet <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number==3 & filter.0.95.wes.support.score == T,.(num.dup.het=sum(dummy)),by=c("eid")]
totals.duphom <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number==4 & filter.0.95.wes.support.score == T,.(num.dup.hom=sum(dummy)),by=c("eid")]
totals.len.del <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number<2 & filter.0.95.wes.support.score == T,.(len.del=sum(Length_bp* abs(Copy_Number - 2))),by=c("eid")]
totals.len.dup <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & Copy_Number>2 & filter.0.95.wes.support.score == T,.(len.dup=sum(Length_bp * abs(Copy_Number - 2))),by=c("eid")]
  
totals.filtered <- merge(totals.delhom,totals.delhet,by="eid",all=T)
totals.filtered <- merge(totals.filtered,totals.duphet,by="eid",all=T)
totals.filtered <- merge(totals.filtered,totals.duphom,by="eid",all=T)
totals.filtered <- merge(totals.filtered,totals.len.del,by="eid",all=T)
totals.filtered <- merge(totals.filtered,totals.len.dup,by="eid",all=T)

totals.filtered <- merge(samples.UKBB.cnv,totals.filtered,by="eid",all.x=T)

totals.filtered[is.na(totals.filtered)] <- 0

totals.filtered[,num.del.sites:=num.del.het+num.del.hom]
totals.filtered[,num.dup.sites:=num.dup.het+num.dup.hom]
totals.filtered[,total:=num.del.sites+num.dup.sites]

## Summary statistics and histogram of total sites per individual:
print(paste0("Total Sites    : ", sum(totals.filtered[,total])))
print(paste0("Mean Total     : ", sprintf("%0.02f",mean(totals.filtered[,total])), "±", sprintf("%0.02f",sd(totals.filtered[,total]))))
print(paste0("Median Total   : ", median(totals.filtered[,total])))
print("")
print(paste0("Total DEL Sites : ", sum(totals.filtered[,num.del.sites])))
print(paste0("Mean DEL        : ", sprintf("%0.02f",mean(totals.filtered[,num.del.sites])), "±", sprintf("%0.02f",sd(totals.filtered[,num.del.sites]))))
print(paste0("Median DEL      : ", median(totals.filtered[,num.del.sites])))
print(paste0("Mean DEL Len    : ", sprintf("%0.01f",mean(totals.filtered[,len.del])/1000), "±", sprintf("%0.01f",sd(totals.filtered[,len.del])/1000)))
print("")
print(paste0("Total DUP Sites : ", sum(totals.filtered[,num.dup.sites])))
print(paste0("Mean DUP        : ", sprintf("%0.02f",mean(totals.filtered[,num.dup.sites])), "±", sprintf("%0.02f",sd(totals.filtered[,num.dup.sites]))))
print(paste0("Median DUP      : ", median(totals.filtered[,num.dup.sites])))
print(paste0("Mean DUP Len    : ", sprintf("%0.01f",mean(totals.filtered[,len.dup])/1000), "±", sprintf("%0.01f",sd(totals.filtered[,len.dup])/1000)))

ggplot(totals.filtered, aes(total)) + geom_histogram(binwidth=1,fill="grey",colour="black",size=2) + geom_vline(aes(xintercept=mean(totals.filtered[,total])),colour="red",size=2) + xlab("Sites Per Individual") + ylab("# of Individuals") + theme.figures

## DELs
x.lim<-max(totals.filtered[,num.del.sites])
y.lim<-3.0e6
plot1 <- ggplot(totals.filtered,aes(num.del.sites,len.del)) + geom_point(colour=del.line,size=0.5) + xlim(-1,x.lim) + ylim(-100000,y.lim) + xlab("Total Deletion Sites") + ylab("Cumulative Deletion Length") + theme.figures
x.hist <- ggplot(totals.filtered,aes(num.del.sites)) + geom_histogram(binwidth=1, colour=del.line, fill=del.fill) + xlim(-1,x.lim) + ylab("Count") + theme.figures + theme(axis.title.x=element_blank(),axis.text.x=element_blank())
y.hist <- ggplot(totals.filtered,aes(len.del)) + geom_histogram(binwidth=100000, colour=del.line, fill=del.fill) + xlim(-100000,y.lim) + ylab("Count") + coord_flip() + theme.figures + theme(axis.title.y=element_blank(),axis.text.y=element_blank())

## All the "empty" plots here are just used to push the graphs together.
del.plot <- x.hist + plot_spacer() + plot1 + y.hist + plot_layout(ncol = 2, nrow = 2, widths = c(2,1), heights = c(1,2), tag_level="keep")
del.plot

## DUPs
x.lim<-max(totals.filtered[,num.dup.sites])
y.lim<- 1e7
plot1 <- ggplot(totals.filtered,aes(num.dup.sites,len.dup)) + geom_point(colour=dup.line,size=0.5) + xlim(-1,x.lim) + ylim(-500000,y.lim) + xlab("Total Duplication Sites") + ylab("Cumulative Duplication Length") + theme.figures
x.hist <- ggplot(totals.filtered,aes(num.dup.sites)) + geom_histogram(binwidth=1,colour=dup.line,fill=dup.fill) + xlim(-1,x.lim) + ylab("Count") + theme.figures + theme(axis.title.x=element_blank(),axis.text.x=element_blank())
y.hist <- ggplot(totals.filtered,aes(len.dup)) + geom_histogram(binwidth=500000,colour=dup.line,fill=dup.fill) + xlim(-500000,y.lim) + ylab("Count") + coord_flip() + theme.figures + theme(axis.title.y=element_blank(),axis.text.y=element_blank())

dup.plot <- x.hist + plot_spacer() + plot1 + y.hist + plot_layout(ncol = 2, nrow = 2, widths = c(2,1), heights = c(1,2))
dup.plot

## Count singleton (AC == 1) and MAF <= 1e-3 variants per individual, as for SNVs:

samp.size <- nrow(samples.UKBB.cnv)
test <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score == T & ct == "DEL"]

allele.frq <- test[ct=="DEL",sum(gt),by=c("locus")]
allele.frq[,frq:=V1/(samp.size*2)]
setnames(allele.frq,"V1","ac")

test <- merge(test,allele.frq[,c("locus","frq","ac")],by=c("locus"))

cnv.counts.plot <- data.table()

counts <- test[ac == 1 & ct == "DEL", sum(dummy), by = "eid"]
counts <- merge(counts, samples.UKBB.cnv, all.y = T, by = "eid")
counts[,V1:=if_else(is.na(V1),0,V1)]
counts[,AF:="AC1"]
counts[,CSQ:="DEL"]
cnv.counts.plot <- rbind(cnv.counts.plot, counts)

counts <- test[frq <= 1e-3 & ct == "DEL", sum(dummy), by = "eid"]
counts <- merge(counts, samples.UKBB.cnv, all.y = T, by = "eid")
counts[,V1:=if_else(is.na(V1),0,V1)]
counts[,AF:="MAF1e-3"]
counts[,CSQ:="DEL"]
cnv.counts.plot <- rbind(cnv.counts.plot, counts)

samp.size <- nrow(samples.UKBB.cnv)
test <- ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score == T & ct == "DUP"]

allele.frq <- test[ct=="DUP",sum(gt),by=c("locus")]
allele.frq[,frq:=V1/(samp.size*2)]
setnames(allele.frq,"V1","ac")

test <- merge(test,allele.frq[,c("locus","frq","ac")],by=c("locus"))

counts <- test[ac == 1 & ct == "DUP", sum(dummy), by = "eid"]
counts <- merge(counts, samples.UKBB.cnv, all.y = T, by = "eid")
counts[,V1:=if_else(is.na(V1),0,V1)]
counts[,AF:="AC1"]
counts[,CSQ:="DUP"]
cnv.counts.plot <- rbind(cnv.counts.plot, counts)

counts <- test[frq <= 1e-3 & ct == "DUP", sum(dummy), by = "eid"]
counts <- merge(counts, samples.UKBB.cnv, all.y = T, by = "eid")
counts[,V1:=if_else(is.na(V1),0,V1)]
counts[,AF:="MAF1e-3"]
counts[,CSQ:="DUP"]
cnv.counts.plot <- rbind(cnv.counts.plot, counts)

cnv.counts.plot[,CSQ:=if_else(CSQ == "DEL","Deletions","Duplications")]
cnv.counts.plot[,AF:=if_else(AF=="AC1","Private Vars.","MAF ≤ 1e-3 Vars.")]
cnv.counts.plot[,AF:=factor(AF,levels=c("Private Vars.","MAF ≤ 1e-3 Vars."))]
setnames(cnv.counts.plot,"V1","count")

del.count.plot <- ggplot(cnv.counts.plot[CSQ == "Deletions"],aes(count,fill=AF)) + geom_histogram(binwidth=1, position = "identity", alpha = 0.5) + scale_y_continuous(name = "Number of Individuals") + scale_x_continuous(name = "# of Deletions",limits = c(-1,6)) + scale_alpha_continuous(range=c(0,1)) + scale_fill_discrete(guide=guide_legend(title="")) + theme.figures.legend
del.count.plot

dup.count.plot <- ggplot(cnv.counts.plot[CSQ == "Duplications"],aes(count,fill=AF)) + geom_histogram(binwidth=1, position = "identity", alpha = 0.5) + scale_y_continuous(name = "") + scale_x_continuous(name = "# of Duplications",limits = c(-1,6)) + scale_alpha_continuous(range=c(0,1)) + scale_fill_discrete(guide=guide_legend(title="")) + theme.figures.legend
dup.count.plot

bottom <- (del.count.plot | dup.count.plot) / guide_area() + plot_layout(guides="collect", heights=c(4,1))

total.plot <- (del.plot | dup.plot) / (bottom) + plot_layout(heights = c(1.7,1))
total.plot

ggsave("figures/supplement/SuppFig2.png",total.plot,dpi = 300,height = 8, width = 8, units = c("in"))

quant.table <- data.table(table(ukbb.annotated.cnvs.qcd[filter.0.95.wes.support.score==T,ct]))
paste0("Number of CNVs (Unfiltered Indiv)  : ",quant.table[,sum(N)], " (DEL: ", quant.table[V1 == "DEL",N], "; DUP: ", quant.table[V1 == "DUP",N],")")
quant.table <- data.table(table(ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score==T,ct]))
paste0("Number of CNVs (Filtered Indiv)    : ",quant.table[,sum(N)], " (DEL: ", quant.table[V1 == "DEL",N], "; DUP: ", quant.table[V1 == "DUP",N],")")
paste0("Number of CNV Loci                 : ",nrow(data.table(table(ukbb.annotated.cnvs.qcd[,locus]))))
rm(quant.table)
```
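
The per-individual counts in the bottom panels rest on a simple calculation: sum the genotype (`gt`) per locus to get an allele count, divide by twice the number of samples to get a frequency, then count each individual's calls at loci that are singletons (AC = 1) or rare (frequency ≤ 1e-3), giving zero to individuals with no such call. The chunk above does this with data.table; the sketch below restates the singleton case in base R on invented data so that it runs on its own. `calls`, `samples` and `singles` are hypothetical names.

```r
## Toy CNV calls: one row per individual x locus; gt = number of non-reference alleles
calls <- data.frame(eid   = c(1, 2, 3, 3),
                    locus = c("A", "A", "B", "C"),
                    gt    = c(1, 1, 1, 2))
samples <- data.frame(eid = 1:5)
samp.size <- nrow(samples)

## Allele count per locus, and frequency over 2N chromosomes
ac <- tapply(calls$gt, calls$locus, sum)
calls$ac  <- as.numeric(ac[as.character(calls$locus)])
calls$frq <- calls$ac / (samp.size * 2)

## Singleton (AC == 1) calls per individual; individuals with none get 0
singles <- calls[calls$ac == 1, ]
counts <- samples
counts$count <- sapply(samples$eid, function(id) sum(singles$eid == id))
counts
```

Keeping every sample in `counts`, including those with a count of zero, matters for the histograms: the original achieves this by merging against `samples.UKBB.cnv` with `all.y = T` and replacing `NA` with 0.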

#### Figure 3.

SNV Counts Per Individual.

```{r Supp Fig 3, fig.height=5, fig.width=8.5}

all.plot <- ggplot(OVERALL.counts[source == "200k"],aes(CSQ,count,colour=AF)) + geom_boxplot() + scale_x_discrete(name = "Variant Class") + scale_y_log10(name = "# of Variants") + scale_colour_discrete(guide=guide_legend(title="")) + theme.figures
all.plot

missense.plot <- ggplot(OVERALL.counts[source == "200k" & CSQ == "Missense"],aes(count,fill=AF)) + geom_histogram(binwidth=1, position = "identity", alpha = 0.5) + scale_y_continuous(name = "Number of Individuals") + scale_x_continuous(name = "# of CADD > 25,\nMPC > 2 Missense Vars.") + scale_alpha_continuous(range=c(0,1)) + scale_fill_discrete(guide=guide_legend(title="")) + theme.figures.legend
missense.plot

ptv.plot <- ggplot(OVERALL.counts[source == "200k" & CSQ == "PTVs"],aes(count,fill=AF)) + geom_histogram(binwidth=1, position = "identity", alpha = 0.5) + scale_y_continuous(name = "") + scale_x_continuous(name = "# of PTVs") + scale_alpha_continuous(range=c(0,1)) + scale_fill_discrete(guide=guide_legend(title="")) + theme.figures.legend
ptv.plot

syn.plot <- ggplot(OVERALL.counts[source == "200k" & CSQ == "Synonymous"],aes(count,fill=AF)) + geom_histogram(binwidth=1, position = "identity", alpha = 0.5) + scale_y_continuous(name = "") + scale_x_continuous(name = "# of Synonymous Vars.") + scale_alpha_continuous(range=c(0,1)) + scale_fill_discrete(guide=guide_legend(title="")) + theme.figures.legend
syn.plot

snv.count.plot <- ((missense.plot | ptv.plot | syn.plot | all.plot) / (guide_area())) + plot_layout(guides = "collect", nrow = 2,heights = c(4,1)) + plot_annotation(tag_levels = 'A')
snv.count.plot

ggsave("figures/supplement/SuppFig3.png",snv.count.plot, dpi = 300, height = 5, width = 8.5, units = "in")
```

#### Figure 4.

```{r Supp Fig 4, fig.height=4, fig.width=8.5}

plottable <- copy(results.fertility)

plottable[,var.ci.upper:=exp(var.beta + (1.96*var.stderr))]
plottable[,var.ci.lower:=exp(var.beta - (1.96*var.stderr))]
plottable[,sig.pos:=if_else(var.beta<0,var.ci.lower-0.1,var.ci.upper+0.1)]

plottable[,var.beta:=exp(var.beta)]
plottable[,Sex:=factor(sex,levels=c("1","2"),labels = c("Male","Female"))]
ylab <- "Odds Ratio"
yline <- 1
y.axis <- expression(bold(Odds~Ratio~`for`~1~Unit~of~Quantified~s[het]))

plot.breaks <- c(seq(1,-0.15,by=-1 * 0.25),seq(1,1.5,by=0.25))
plot.breaks <- plot.breaks[plot.breaks != 0]

plot.all.variants <- ggplot(plottable[maf == 0],aes(variant.type,var.beta,group=Sex,colour=Sex)) +
  geom_hline(aes(yintercept=1),colour="red",linetype=2,size=1) +
  geom_point(aes(size=n.indvs,shape=18),position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.5)) +
  geom_text(aes(y = 0.05, label=paste0("p = ", sprintf("%0.2g",var.p))),position=position_dodge(0.5),size=4,hjust=1,show.legend = F) +
  geom_text(aes(y = 0.05, label = if_else(var.p < 0.0025, "*", "")), position=position_dodge(0.65),size=8,hjust=0,show.legend = F) +
  scale_x_discrete(name = "", position = "top",labels=c("Dels","Dups","PTVs","Missense\n(CADD > 25, MPC > 2)","Synonymous")) +
  scale_y_continuous(name=y.axis,limits = c(-0.15,1.6), breaks=plot.breaks) +
  scale_shape_identity() +
  scale_size_area(breaks=c(50000,100000,150000),guide=guide_legend(title="# of Indivs.")) +
  sex.colours.colour.rev +
  coord_flip(clip = "off") +
  theme.figures.legend + theme(panel.grid.major.y = element_blank())

plot.all.variants

ggsave("figures/supplement/SuppFig4.png",plot.all.variants, dpi = 300, height = 4, width = 8.5, units = "in")

```
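
The chunk above turns each regression coefficient (a log odds ratio) and its standard error into an odds ratio with a Wald 95% confidence interval before plotting. A minimal sketch of that arithmetic follows. The `beta` and `stderr` values are illustrative assumptions, not results from this analysis.

```r
## Illustrative values only; not taken from results.fertility
beta   <- c(-0.50, 0.10)   # log odds ratios
stderr <- c(0.10, 0.20)    # standard errors on the log-odds scale

## Build the interval on the log-odds scale, then exponentiate each bound
odds.ratio <- exp(beta)
ci.lower   <- exp(beta - 1.96 * stderr)
ci.upper   <- exp(beta + 1.96 * stderr)

or.table <- data.frame(beta, stderr, odds.ratio, ci.lower, ci.upper)
print(round(or.table, 3))
```

Because the interval is exponentiated after it is built, it is asymmetric around the odds ratio, which is why the error bars in the figure are not centred on the points.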

#### Figure 5.

```{r Supp Fig 5, fig.height=3, fig.width=8.5}

remove.zero <- make.meta.table(results.fertility.zero, T, ymin = -0.9, ymax= 0.1, block = -0.8, p.pos = -0.7)
remove.zero

ggsave("figures/supplement/SuppFig5.png",remove.zero[[2]], dpi = 300, height = 3, width = 8.5, units = "in")
```

#### Figure 6.

```{r Supp Fig 6, fig.height=8, fig.width=10}

tab.sHET <- UKBB.phenotype.data[,c("eid","sexPulse","num.children")]

## Attach sHET burden to each individual
quants <- merge(tab.sHET,variant.counts[type=="LOF_HC" & allele.freq==0,c("sample_id","product_sHET")],by.x="eid",by.y="sample_id",all.x=T)
setnames(quants,"product_sHET","product_sHET_LOF")

quants <- merge(quants,variant.counts[type=="DEL" & allele.freq==0,c("sample_id","product_sHET")],by.x="eid",by.y="sample_id",all.x=T)
setnames(quants,"product_sHET","product_sHET_DEL")

## Only include individuals for which we have CNV or PTV data
quants <- quants[eid %in% samples.UKBB.cnv[,eid] | eid %in% has.wes[has.wes>0,eid]]

## Remove individuals without an sHET score
quants <- quants[!is.na(product_sHET_LOF) | !is.na(product_sHET_DEL)]

plot.fert.dist <- function(v, is.top = T) {
  
    ## The following code is to set axis limits and labels -- purely graphical
    # Bin sizes
    b <- c(-1,((0:4)*0.15),100)
    ## X-axis labels
    if (v == "product_sHET_DEL") {
     label.x <- expression(bold(Deletion~s[het]~Burden))
    } else {
      label.x <- expression(bold(PTV~s[het]~Burden))
    }
    quants[,product_sHET.cut:=cut(get(v),breaks=b)]
    quants[,dummy:=1]
    
    quants <- quants[num.children >= 1 & num.children <= 3]
    
    # Quantify proportion of individuals in each sHET bin we create above
    totals <- quants[!is.na(get(v)),sum(dummy),by=c("product_sHET.cut","sexPulse","num.children")]
    sums <- quants[!is.na(get(v)),sum(dummy),by=c("sexPulse","num.children")]
    setnames(sums,c("sexPulse","num.children"),c("s","n.c"))
    totals[,prop:=V1/sums[s==sexPulse & n.c == num.children,V1],by=1:nrow(totals)]
    totals[,prop.2:=prop*100000]
    totals[,ci:=1.96*sqrt((prop.2*(100000-prop.2))/sums[s==sexPulse & n.c == num.children,V1]),by=1:nrow(totals)]
    totals[,ci.lower:=prop.2 - ci]
    totals[,ci.upper:=prop.2 + ci]
    totals[,ci.lower:=if_else(ci.lower < 1, 1, ci.lower)]
    
    ## Change the ugly formatting that is the direct output of 'cut()' to something better for a plot label
    x.axis.labels <- str_replace(str_replace(str_replace(str_replace(totals[,levels(product_sHET.cut)],"\\[",""),"\\]",""),",","-"),"\\(","")
    x.axis.labels.2 <- c()
    for (l in x.axis.labels) {
      if (grepl("-100",l)) {
        l <- str_replace(l,"\\-100","")
        l <- paste0(">",l)
        x.axis.labels.2 <- c(x.axis.labels.2,l)
      } else if (grepl("-1-",l)) {
        l <- str_replace(l,"\\-1-","")
        x.axis.labels.2 <- c(x.axis.labels.2,l)
      } else {
        x.axis.labels.2 <- c(x.axis.labels.2,l)
      }
    }
    
    totals[,sexPulse:=factor(sexPulse,levels=c(1,2),labels = c("Male","Female"))]
    totals <- merge(totals,data.table(crossing(product_sHET.cut=totals[,levels(product_sHET.cut)],
                                               sexPulse = totals[,levels(sexPulse)],
                                               num.children = c(1,2,3))),
                    all.y = T, by =c("product_sHET.cut","sexPulse","num.children"))
    
    
    ## Bar plots of the proportion of individuals in each sHET bin, grouped by number of children, one plot per sex
    plot.dist.male <- ggplot(totals[sexPulse == "Male"],aes(product_sHET.cut,prop.2,colour=sexPulse,fill=as.factor(num.children),group=as.factor(num.children))) +
      geom_col(position=position_dodge(1)) +
      scale_x_discrete(name = label.x,labels=x.axis.labels.2, drop = F) +
      scale_y_log10(name = "Proportion of Individuals",breaks=c(1,10,100,1000,10000,100000),labels=paste0(c(0.001,0.01,0.1,1,10,100),"%"), limits = c(1, 100000)) +
      geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),position=position_dodge(1),width=0) +
      scale_fill_grey(guide = guide_legend(title = "Number of Children")) +
      scale_colour_manual(name = "Sex",values=sex.colours, guide="none") +
      ggtitle(if_else(is.top,"Males","")) +
      theme.figures.legend
    plot.dist.female <- ggplot(totals[sexPulse == "Female"],aes(product_sHET.cut,prop.2,colour=sexPulse,fill=as.factor(num.children),group=as.factor(num.children))) +
      geom_col(position=position_dodge(1)) +
      scale_x_discrete(name = label.x,labels=x.axis.labels.2, drop = F) +
      scale_y_log10(name = "Proportion of Individuals",breaks=c(1,10,100,1000,10000,100000),labels=paste0(c(0.001,0.01,0.1,1,10,100),"%"), limits = c(1, 100000)) +
      geom_errorbar(aes(ymin=ci.lower,ymax=ci.upper),position=position_dodge(1),width=0) +
      scale_fill_grey(guide = guide_legend(title = "Number of Children")) +
      scale_colour_manual(name = "Sex",values=sex.colours, guide="none") +
      ggtitle(if_else(is.top,"Females","")) +
      theme.figures.legend
    
    return(list(plot.dist.male,plot.dist.female))
    
}

top.panels <- plot.fert.dist("product_sHET_DEL")
bottom.panels <- plot.fert.dist("product_sHET_LOF", is.top = F)

sHET.dist.num.children <- top.panels[[1]] + top.panels[[2]] + bottom.panels[[1]] + bottom.panels[[2]] + plot_annotation(tag_levels = "A") + plot_layout(guides = "collect", ncol = 2, nrow = 2)
sHET.dist.num.children

ggsave("figures/supplement/SuppFig6.png", sHET.dist.num.children, dpi = 300, height = 8, width = 10, units = "in")

```
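
The error bars in `plot.fert.dist()` are normal-approximation binomial intervals computed directly on the per-100,000 scale used for the log axis. The sketch below, using hypothetical counts (`n.bin` and `n.total` are assumptions for illustration), checks that this scaled formula matches the textbook interval on the proportion scale.

```r
## Hypothetical counts: 150 of 20,000 individuals fall in one s_het bin
n.bin   <- 150
n.total <- 20000

prop   <- n.bin / n.total
prop.2 <- prop * 100000   # per-100,000 scale, as in plot.fert.dist()

## Interval half-width as written in plot.fert.dist()
ci.scaled <- 1.96 * sqrt((prop.2 * (100000 - prop.2)) / n.total)

## Textbook half-width on the proportion scale, rescaled to per-100,000
ci.check <- 1.96 * sqrt(prop * (1 - prop) / n.total) * 100000

all.equal(ci.scaled, ci.check)

## The lower bound is floored at 1 (i.e. 0.001%) so it can be drawn on a log10 axis
ci.lower <- max(prop.2 - ci.scaled, 1)
ci.upper <- prop.2 + ci.scaled
```

The floor at 1 only affects sparse bins, where the symmetric interval would otherwise reach zero or below and could not be drawn on the log scale.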

#### Figure 7.

```{r Supp Fig 7, fig.height=6, fig.width=8.5}

gene.data.pli <- make.meta.table(results.fertility.genelists[gene.list == "highPLI"], F, gene.list = "highPLI", alt.y.axis = "Odds Ratio for loss of 1 high (≥0.9) pLI gene",ymin=0.4,ymax=1.25,p.pos = 0.5,block = 0.4)
gene.data.pli[[2]] <- gene.data.pli[[2]] + theme(legend.position="none")

alt.y <- expression(bold(Odds~Ratio~`for`~loss~of~"1"~high~"(" >= 0.15~")"~s[het]~gene))
gene.data.shet <- make.meta.table(results.fertility.genelists[gene.list == "highsHET"], F, gene.list = "highsHET", alt.y.axis = alt.y,ymin=0.4,ymax=1.25,p.pos = 0.5,block = 0.4)
gene.data.shet

combined.gene <- gene.data.pli[[2]] + gene.data.shet[[2]] + plot_layout(nrow = 2, guides = "collect") + plot_annotation(tag_levels = 'A')
combined.gene

ggsave("figures/supplement/SuppFig7.png",combined.gene, dpi = 300, height = 5, width = 8.5, units = "in")
```

#### Figure 8.

```{r Supp Fig 8, fig.height=8, fig.width=8.5}

high.maf.data.all <- data.table()

for (m in c(0, 1e-5, 1e-4, 1e-3)) {

  high.maf.data <- make.meta.table(results.fertility, F, allele.freq = m, b=0.25)
  high.maf.data[[1]][,maf:=m]
  high.maf.data.all <- rbind(high.maf.data.all, high.maf.data[[1]])
  
}

plot.maf <- ggplot(high.maf.data.all,aes(variant.type,var.beta,group=interaction(Sex,as.factor(maf)),colour=Sex, linetype = as.factor(maf))) +
    geom_hline(yintercept=1,colour="red",linetype=2,size=1) +
    geom_point(aes(size=n.indvs,shape=variant.shape),position=position_dodge(0.7)) +
    geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.7)) +
    geom_text(aes(y = 0.1, label=paste0("p==",if_else(var.p<1e-2,paste0(prefix,"%*%10^",exponent),sprintf("%0.2f",var.p)))), parse = T,position=position_dodge(0.75),size=4,hjust=1,show.legend = F) +
    geom_text(aes(y = 0.1, label = p.sig), position=position_dodge(0.65),size=8,hjust=0,show.legend = F) +
    scale_x_discrete(name = "", position = "top",labels=c("Meta","PTVs","Dels")) +
    scale_y_continuous(name=expression(bold(Odds~Ratio~at~s[het]~burden==1)),limits = c(-0.15,1.25), breaks=seq(0.25,1.25,by = 0.25)) +
    scale_shape_identity() +
    scale_size_area(breaks=c(25000,100000,175000),guide=guide_legend(title="# of Indivs."), limits = c(0, 200000)) +
    sex.colours.colour.rev +
    scale_linetype_discrete(guide=guide_legend(title = "Allele Frequency Cutoff", reverse = T), labels = c("Singletons","< 1e-5","< 1e-4","< 1e-3")) +
    coord_flip(clip = "off") +
    theme.figures.legend + theme(panel.grid.major.y = element_blank())
plot.maf

ggsave("figures/supplement/SuppFig8.png",plot.maf, dpi = 300, height = 8, width = 8.5, units = "in")
```

#### Figure 9.

```{r Supp Fig 9, fig.height=6, fig.width=8.5}

res.age[,Sex:=factor(sex,levels=c(1,2),labels=c("Male","Female"))]
res.age[,agePulse:=factor(age,levels=c("ALL","1960","1950","1940"),labels=c("All Cohorts","1960-1970","1950-1960","1940-1950"))]
res.age[,variantPulse:=factor(variant.type,levels=c("META","LOF_HC","DEL"),labels=c("Meta","PTVs","Dels"))]
res.age[,var.ci.upper:=if_else(var.ci.upper>2.75,2.75,var.ci.upper)] ## Need to set an upper limit for very wide error bars

ylab <- "Odds Ratio"
yline <- 1
y.axis <- expression(bold(Odds~Ratio~`for`~1~Unit~of~Quantified~s[het]))

plot.breaks <- c(seq(1,-0.75,by=-1 * 0.5),seq(1,2.75,by=0.5))
plot.breaks <- plot.breaks[plot.breaks != -0.5]

plot.ages <- ggplot(res.age,aes(agePulse,var.beta,group=interaction(Sex,variantPulse),colour=Sex,linetype=variantPulse)) +
  geom_hline(yintercept=c(1),colour="red",linetype=2,size=1) +
  geom_point(aes(size=n.indvs,shape=18),position=position_dodge(0.65)) +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.65)) +
  geom_text(aes(y = -0.15, label=paste0("p = ", sprintf("%0.2g",var.p))),position=position_dodge(0.65),size=4,hjust=1,show.legend = F) +
  geom_text(aes(y = -0.15, label = if_else(var.p < 0.0025, "*", "")), position=position_dodge(0.65),size=8,hjust=0,show.legend = F) +
  scale_x_discrete(name = "", position = "top") +
  scale_y_continuous(name=y.axis,limits = c(-0.5,2.75), breaks=plot.breaks) +
  scale_shape_identity() +
  scale_size_area(breaks=c(50000,100000,150000),guide=guide_legend(title="# of Indivs.")) +
  scale_linetype_discrete(guide=guide_legend(reverse = T, title = "Variant Class")) +
  sex.colours.colour.rev +
  coord_flip() +
  theme.figures.legend + theme(panel.grid.major.y = element_blank(), legend.background=element_rect(fill="white"))
plot.ages

ggsave("figures/supplement/SuppFig9.png",plot.ages, dpi = 300, height = 6, width = 8.5, units = "in")
```

#### Figure 10.

```{r Supp Fig 10, fig.height=5, fig.width=10.5}

## Has a Private DEL
plot.testis.del.shet.wilcox <- wilcox.test(log.mean ~ has.del, data = shet.genes.expr[sHET.val >= 0.15],alternative=c("less"))
plot.testis.del.shet <- ggplot(shet.genes.expr[sHET.val >= 0.15], aes(as.factor(has.del), log.mean, group = as.factor(has.del))) +
  geom_boxplot() +
  scale_x_discrete(name = expression(bold(atop(s[het]~"≥"~0.15~gene,with~private~Deletion)))) +
  scale_y_continuous(name = expression(bold(Median~ln~Testis~Expr.))) +
  theme

## Has a Private PTV
plot.testis.ptv.shet.wilcox <- wilcox.test(log.mean ~ has.ptv, data = shet.genes.expr[sHET.val >= 0.15],alternative=c("less"))
plot.testis.ptv.shet <- ggplot(shet.genes.expr[sHET.val >= 0.15], aes(as.factor(has.ptv), log.mean, group = as.factor(has.ptv))) +
  geom_boxplot() +
  scale_x_discrete(name = expression(bold(atop(s[het]~"≥"~0.15~gene,with~private~PTV)))) +
  scale_y_continuous(name = "") +
  theme

## Is a male infertility gene
plot.male.wilcox <- wilcox.test(log.mean ~ male.infertility, data = shet.genes.expr,alternative=c("less"))
plot.male <- ggplot(shet.genes.expr, aes(as.factor(male.infertility), log.mean, group = as.factor(male.infertility))) + 
  geom_boxplot() +
  scale_x_discrete(name = "Is Male Infertility Gene?") +
  scale_y_continuous(name = "") + 
  theme

plot.testis <- plot.testis.del.shet + plot.testis.ptv.shet + plot.male + plot_layout(ncol=3, guides = "collect") + plot_annotation(tag_levels='A')
plot.testis

ggsave("figures/supplement/SuppFig10.png",plot.testis, dpi = 300, height = 5, width = 10.5, units = "in")

```
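
The tests above use the formula interface of `wilcox.test()` with `alternative = "less"`. With a formula, the direction of a one-sided test depends on the order of the grouping levels: the first level is treated as `x` and the second as `y`. A minimal sketch on toy data (the `toy` data frame and its `flag` column are made up for illustration):

```r
## Toy data: values are lower in the flag == 0 group than in the flag == 1 group
toy <- data.frame(
  log.mean = c(1:10, 11:20),
  flag     = rep(c(0, 1), each = 10)
)

## Groups are ordered by factor level (0 before 1), so alternative = "less"
## tests whether the flag == 0 group is shifted lower than the flag == 1 group
res <- wilcox.test(log.mean ~ flag, data = toy, alternative = "less")
res$statistic
res$p.value
```

If the grouping variable were coded the other way round, the same call would test the opposite direction, so it is worth checking the level order before interpreting a one-sided p value.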

#### Figure 11.

```{r Supp Fig 11, fig.height=10, fig.width=10}

hes.plot.table <- hes.analysis.table[variant.type == "META" & level == 3]
hes.plot.table[,icd.p.log:=if_else(icd.p.log > 100 | icd.p.log == 0,100,icd.p.log)]

male.icd <- ggplot(hes.plot.table[sex == "MALE"],aes(var.p.log,factor*icd.p.log,colour=chapter)) + 
  geom_point(size=0.5) +
  scale_x_continuous(name = "") +
  scale_y_continuous(name = expression(bold(-log[10]~p~value~`for`~`ICD-10`~code~on~having~children)), limits = c(-100,100), breaks = c(-100,-50,0,50,100), labels = c("≥ 100","50","0","50","≥ 100")) +
  geom_text(data = hes.plot.table[sex == "MALE" & (var.p.log < 15.75 | icd.p.log > 22)],aes(label=meaning),size=3) +
  ggtitle("Males") +
  theme.figures + theme(plot.title = element_text(hjust = 0.5))

male.icd

female.icd <- ggplot(hes.plot.table[sex == "FEMALE"],aes(var.p.log,factor*icd.p.log,colour=chapter)) + 
  geom_point(size=0.5) + 
  scale_x_continuous(name = "") +
  scale_y_continuous(name = "", limits = c(-100,100)) +
  geom_text(data = hes.plot.table[sex == "FEMALE" & (var.p.log < 2.8 | icd.p.log > 22) & chapter != "XV"],aes(label=meaning),size=3) +
  ggtitle("Females") +
  theme.figures + theme(plot.title = element_text(hjust = 0.5), axis.text.y=element_blank())

female.icd

## First do all codes regardless of age of onset
fi.plot.table <- fi.analysis.table[variant.type == "META" & level == 3]
fi.plot.table[,icd.p.log:=if_else(icd.p.log > 100 | icd.p.log == 0,100,icd.p.log)]

male.fi <- ggplot(fi.plot.table[sex == "MALE"],aes(var.p.log,factor*icd.p.log,colour=chapter)) + 
  geom_point(size=0.5) +
  scale_x_continuous(name = "") +
  scale_y_continuous(name = expression(bold(-log[10]~p~value~`for`~`ICD-10`~code~on~having~children)), limits = c(-100,100)) +
  geom_text(data = fi.plot.table[sex == "MALE" & (var.p.log < 15.75 | icd.p.log > 22)],aes(label=meaning),size=3,nudge_y=0.15) +
  theme.figures + theme(plot.title = element_text(hjust = 0.5))

male.fi

female.fi <- ggplot(fi.plot.table[sex == "FEMALE"],aes(var.p.log,factor*icd.p.log,colour=chapter)) + 
  geom_point(size=0.5) + 
  scale_x_continuous(name = "") +
  scale_y_continuous(name = "", limits = c(-100,100)) +
  geom_text(data = fi.plot.table[sex == "FEMALE" & (var.p.log < 2.8 | icd.p.log > 22) & chapter != "XV"],aes(label=meaning),size=3,nudge_y=0.15) +
  theme.figures + theme(axis.text.y = element_blank(), plot.title = element_text(hjust = 0.5))

female.fi

## Mash everything together
combined.fi.top <- male.icd + female.icd + male.fi + female.fi + plot_layout(nrow = 2, ncol = 2)
combined.fi.bottom <- grid::textGrob(expression(bold(-log[10]~p~value~`for`~the~effect~of~s[het]~burden==1~on~having~children)))

combined.fi <- combined.fi.top / combined.fi.bottom + plot_layout(heights=c(10,0.05)) + plot_annotation(tag_levels = list(LETTERS[1:4]))
combined.fi

ggsave("figures/supplement/SuppFig11.svg",combined.fi,width=10,height=10)
```

#### Figure 12/13.

This plots ALL ICD-10 code associations.

```{r Supp Fig 12/13, fig.height=6, fig.width=8}

male.plots.1 <- plot.medical("MALE",1,F,"Chapters - Complete Health Outcomes Data", "Hospital Episode Stats.")
male.plots.2 <- plot.medical("MALE",2,T,"Disease Groups")

y.lab <- grid::textGrob(expression(bold(-log[10]~p~value~`for`~the~effect~of~s[het]~burden==1~on~having~children)), rot=90)

meta.icd.plot.male <- male.plots.1[[1]] + male.plots.1[[2]] + male.plots.2[[1]] + male.plots.2[[2]] + plot_layout(ncol = 2, nrow = 2, widths = c(16, 6), tag_level = 'new') + plot_annotation(tag_levels = "A")

meta.icd.plot.male <- wrap_elements(y.lab) + meta.icd.plot.male + plot_layout(ncol = 2, nrow = 1, widths = c(0.3,20), tag_level = 'keep')
meta.icd.plot.male

ggsave("figures/supplement/SuppFig12.svg",meta.icd.plot.male,width=8,height=6)

female.plots.1 <- plot.medical("FEMALE",1,F,"Chapters - Complete Health Outcomes Data", "Hospital Episode Stats.")
female.plots.2 <- plot.medical("FEMALE",2,T,"Disease Groups")

meta.icd.plot.female <- female.plots.1[[1]] + female.plots.1[[2]] + female.plots.2[[1]] + female.plots.2[[2]] + plot_layout(ncol = 2, nrow = 2, widths = c(16, 6), tag_level = 'new') + plot_annotation(tag_levels = "A")

meta.icd.plot.female <- wrap_elements(y.lab) + meta.icd.plot.female + plot_layout(ncol = 2, nrow = 1, widths = c(0.3,20), tag_level = 'keep')
meta.icd.plot.female

ggsave("figures/supplement/SuppFig13.svg",meta.icd.plot.female,width=8,height=6)

```

#### Figure 14.

```{r Supp Fig 14, fig.height=17, fig.width=10}

tissue.table <- data.table()
tissue.names <- c("Null",names(expression)[4:56])
gene.removed.table <- data.table()

for (i in c(1:54)) {
  
  curr.list <- tissues.for.regression[i]
  tissue <- tissue.names[i]
  
  meta.dt <- make.meta.table(results.genelists[gene.list == curr.list], F, gene.list = curr.list, ymax = 3)
  meta.dt <- meta.dt[[1]][variant.type == "META"]
  
  if (tissue == "Null") {
    n.genes.removed <- nrow(shet.genes)
  } else {
    n.genes.removed <- nrow(expression[get(tissue) <= 0.5, "hg19.GENE"])
  }
  gene.removed.table <- rbind(gene.removed.table, data.table(count = n.genes.removed, tissue = tissue))
  
  meta.dt[,tissue:=eval(tissue)]
  tissue.table <- rbind(tissue.table,meta.dt)
  
}

setkey(gene.removed.table,count)

tissue.table[,tissue:=str_wrap(tissue,width = 20)]
gene.removed.table[,tissue:=str_wrap(tissue,width = 20)]
gene.removed.table[,tissue:=factor(tissue,levels = gene.removed.table[,tissue])]
tissue.table[,tissue:=factor(tissue,levels = gene.removed.table[,tissue])]

y.axis <- expression(bold(Odds~Ratio~at~s[het]~burden==1))
plot.breaks <- unique(c(seq(1,-0.15,by=-1 * 0.5),1, seq(0.5,3,by=0.5)))
plot.breaks <- plot.breaks[plot.breaks != 0]

tissue.effect.plot <- ggplot(tissue.table, aes(tissue, var.beta, group = Sex, colour = Sex)) +
  geom_hline(aes(yintercept=1),colour="red",linetype=2,size=1) +
  geom_point(aes(shape=variant.shape),position=position_dodge(0.8),size=3) +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.8)) +
  geom_text(aes(y = 0.1, label=paste0("p==",if_else(var.p<1e-2,paste0(prefix,"%*%10^",exponent),sprintf("%0.2f",var.p)))), parse = T,position=position_dodge(0.85),size=4,hjust=1,show.legend = F) +
  geom_text(aes(y = 0.1, label = p.sig), position=position_dodge(0.65),size=8,hjust=0,show.legend = F) +
  scale_x_discrete(name = "", position = "top") +
  scale_y_continuous(name=y.axis, limits = c(-0.15, 3), breaks = plot.breaks) +
  scale_shape_identity() +
  scale_size_area(breaks=c(25000,100000,175000),guide=guide_legend(title="# of Indivs."), limits = c(0, 200000)) +
  sex.colours.colour.rev +
  coord_flip(clip = "off") +
  theme.figures.legend + theme(panel.grid.major.y=element_blank())

gene.removed.plot <- ggplot(gene.removed.table, aes(tissue, count)) + 
  geom_col(colour = "black", fill = "grey") +
  geom_text(aes(label = count, y = 8000), hjust = 0) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = "Number of Genes\nIn Model", expand = c(0,0)) +
  coord_flip() +
  theme.figures + theme(axis.text.y = element_blank(), panel.grid.major.y = element_blank())

tissue.plot <- tissue.effect.plot + gene.removed.plot + plot_layout(ncol = 2, nrow = 1, widths = c(0.8,0.2),guides = 'collect')
tissue.plot

ggsave("figures/supplement/SuppFig14.png", tissue.plot, height = 17, width = 15, units = "in", dpi = 450)

```

#### Figure 15.

```{r Supp Fig 15, fig.height=10, fig.width=10}

left <- ggplot(expr.test,aes(factor(tissue), r.sqr.shet)) + 
  geom_point(size = 2) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = expression(bold(r^2))) +
  ggtitle(expression(bold(A)~s[het] %~~%~expression+gene~length)) +
  coord_flip() +
  scale_alpha_continuous(range = c(0,1)) +
  theme.figures.legend + 
  theme(panel.grid.major.y = element_blank(), axis.text.y = element_text(colour=c(rep("black",7),rep("red",13),rep("black",33))), title = element_text(size = 10), axis.title.x = element_text(size = 12))

right <- ggplot(expr.test,aes(x = factor(tissue), r.sqr.LOF)) + 
  geom_rect(xmin = 7.5, xmax = 20.5, ymin = 0.1899, ymax = 0.220, fill = "red", colour = "black", alpha = 0.01, linetype = 2, size = 1) + 
  geom_point(colour = "blue",size = 2) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = expression(bold(r^2))) +
  ggtitle(expression(bold(B)~s[het] %*% `#`~of~PTVs %~~%~expression+gene~length)) + 
  coord_flip(clip = "off") +
  theme.figures.legend + 
  theme(panel.grid.major.y = element_blank(), axis.text.y = element_blank(), title = element_text(size = 10), axis.title.x = element_text(size = 12))

expr.corr <- (left | right) + plot_layout(guides = "collect")
expr.corr

ggsave("figures/supplement/SuppFig15.png", expr.corr, height = 10, width = 10, dpi = 300)

```
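
patchwork layouts are built with ordinary R operators, so R's operator precedence decides how a composition is grouped: `+` binds more tightly than `|`. The base-R sketch below shows how an expression of the form `a | b + c` is parsed. The names `left.plot`, `right.plot` and `layout` are placeholders, and nothing is evaluated.

```r
## Parse, but do not evaluate, a composition written without parentheses
unbracketed <- quote(left.plot | right.plot + layout)
unbracketed[[1]]   # top-level call is `|`
unbracketed[[3]]   # right-hand side is the call right.plot + layout

## With parentheses, `+` becomes the top-level call and applies to the pair
bracketed <- quote((left.plot | right.plot) + layout)
bracketed[[1]]     # top-level call is `+`
```

A `plot_layout()` meant for the whole figure should therefore be added to a parenthesised composition.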

#### Figure 16.

```{r Supp Fig 16, fig.height=8, fig.width=8}

plot.betas <- function(x.var, type, add.covars = c(), ymin = 0, ymax = 2) {
  
  data <- data.table(sexPulse=c(1,2))
  data[,c("beta","std.error","p.val","tab","model"):=run.lm(sexPulse,x.var,add.covars = add.covars, inc.PCs = T, return.model = F),by=1:nrow(data)]
  
  data <- copy(data)
  data[,var.ci.upper:=exp(beta + (1.96*std.error))]
  data[,var.ci.lower:=exp(beta - (1.96*std.error))]
  data[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so the null is 0
  data[,beta:=exp(beta)]
  data[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
  data[,N:=nrow(tab[[1]]),by=1:nrow(data)]

  ylab <- "Effect on having children (Odds Ratio)"
  
  plot <- ggplot(data,aes(sexPulse,beta,colour=sexPulse)) +
    geom_hline(aes(yintercept=1),colour="red",linetype=2) +
    geom_point() +
    geom_errorbar(aes(ymin=var.ci.lower, ymax=var.ci.upper,colour=sexPulse),width=0) +
    scale_x_discrete(name = "") +
    scale_y_continuous(name = ylab, limits = c(ymin,ymax)) +
    sex.colours.colour.rev +
    coord_flip() +
    theme.figures.legend + 
    ggtitle(type) + 
    theme(panel.grid.major.y=element_blank())
    
  return(list(data,plot))
  
}

partner.beta.plot <- plot.betas("partner.in.house", "Having a Partner at Home",ymax=6)
cog.beta.plot <- plot.betas("fluid.intel", "Fluid Intelligence")
ea.beta.plot <- plot.betas("completed.college", "Completing University")
mhq.beta.plot <- plot.betas("mht.binary", "Having a Severe MH Disorder")
hhi.beta.plot <- plot.betas("household.income", "Household Income", c("partner.in.house","partner.in.house*household.income"))
tdi.beta.plot <- plot.betas("townsend.index", "Townsend Dep. Index")
same.sex.beta.plot <- plot.betas("same.sex", "Engaging in Same Sex Sexual Behaviour")

## Have to do had.sex separately because the effect is so strong and the function above doesn't plot within its bounds
data <- data.table(sexPulse=c(1,2))
data[,c("beta","std.error","p.val","tab","model"):=run.lm(sexPulse,"had.sex",add.covars = c(), inc.PCs = T, return.model = F),by=1:nrow(data)]
data[,sexPulse:=factor(sexPulse,levels=c(1,2),labels=c("Male","Female"))]
data[,var.ci.upper:=exp(beta + (1.96*std.error))]
data[,var.ci.lower:=exp(beta - (1.96*std.error))]
data[,sig.pos:=if_else(beta<0,var.ci.lower-0.1,var.ci.upper+0.1)] ## beta is still a log odds ratio here, so the null is 0
data[,beta:=exp(beta)]
data[,N:=nrow(tab[[1]]),by=1:nrow(data)]

ylab <- "Effect on having children (Odds Ratio)"

plot <- ggplot(data,aes(sexPulse,beta,colour=sexPulse)) +
  geom_hline(aes(yintercept=1),colour="red",linetype=2) +
  geom_point() +
  geom_errorbar(aes(ymin=var.ci.lower, ymax=var.ci.upper,colour=sexPulse),width=0) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = ylab) +
  sex.colours.colour.rev +
  coord_flip() +
  theme.figures.legend + 
  ggtitle("Ever Had Sex") + 
  theme(panel.grid.major.y=element_blank())

had.sex.beta.plot <- list(data, plot)

## Do final plot
plot.betas <- partner.beta.plot[[2]] +
  had.sex.beta.plot[[2]] +
  ea.beta.plot[[2]] +
  mhq.beta.plot[[2]] +
  hhi.beta.plot[[2]] +
  cog.beta.plot[[2]] +
  tdi.beta.plot[[2]] +
  same.sex.beta.plot[[2]] + 
  plot_layout(ncol = 2, guides="collect") + plot_annotation(tag_levels = 'A')

plot.betas

ggsave("figures/supplement/SuppFig16.png",plot.betas, dpi = 300, height = 8, width = 8, units = "in")

rm(data, plot)
```

#### Figure 17.

```{r Supp Fig 17, fig.height=3, fig.width=8.5}

same.sex <- make.meta.table(results.same.sex, F, ymin = -0.7, ymax= 4.5, block = -0.5, p.pos = -0.05, b = 0.5)
same.sex

ggsave("figures/supplement/SuppFig17.png",same.sex[[2]], dpi = 600, height = 3, width = 8.5, units = "in")
```

#### Figure 18.

```{r Supp Fig 18, fig.height=3, fig.width = 8.5}

fig.tdi <- make.meta.table(results.townsend, T, ymin = -1.0, ymax= 3.2, block = -1, p.pos = -0.4, b = 1)
fig.tdi

ggsave("figures/supplement/SuppFig18.png", fig.tdi[[2]], dpi = 600, height =3, width = 8.5, units = "in")

```

#### Figure 19.

```{r Supp Fig 19, fig.height = 3, fig.width = 8.5}

## Cognition
plot.iq <- ggplot(model.cog, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = expected.iq_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= expected.iq_lower, ymax = expected.iq_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + geom_abline(intercept=1,slope=-1,linetype=2) + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Predicted IQ") + theme.figures.legend

plot.fert.iq <- ggplot(model.cog, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = ratio_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= ratio_lower, ymax = ratio_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + geom_abline(intercept=1,slope=-1,linetype=2) + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Predicted Fitness", limits = c(0.9,1.05)) + theme.figures.legend

plot.child.iq <- ggplot(model.cog, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mean.childlessness_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= mean.childlessness_lower, ymax = mean.childlessness_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Predicted Childlessness", limits = c(0.1,0.3)) + theme.figures.legend

cognition.fertility.plot <- plot.iq + plot.child.iq + plot.fert.iq + ## Cognition
  plot_layout(nrow = 1, ncol = 3, guides = "collect") + plot_annotation(tag_levels = 'A')

cognition.fertility.plot

ggsave("figures/supplement/SuppFig19.png",cognition.fertility.plot,width=8.5,height=3,units = "in", dpi = 300)
```

#### Figure 20.

```{r Supp Fig 20, fig.height=4, fig.width=8}

cog.fit.plot <- ggplot(cog.raw) + geom_ribbon(data = cog.raw[newiq<120],aes(x=newiq, ymin=Mean-ci,ymax=Mean+ci),colour="grey",alpha=0.3) + geom_line(data = cog.raw[newiq<120],aes(x=newiq, y=Mean),colour="black", size = 1) + geom_line(aes(x = newiq, y=pred.log),colour="red", size = 0.8, linetype = 2) + scale_x_continuous(name = "IQ", limits=c(0,140)) + scale_y_continuous(name = "Average Children", limits=c(-0.1,2)) + scale_alpha_continuous(range=c(0,1)) + theme.figures.legend

child.fit.plot <- ggplot(childless.raw) + geom_ribbon(data = childless.raw[iq<120],aes(x= iq,ymin=ci.lower,ymax=ci.upper),colour="grey",alpha=0.3) + geom_line(data = childless.raw[iq<120],aes(x=iq, y=inc.childlessness),colour="black", size = 1) + geom_line(aes(x = iq, y=pred.log.inv),colour="red", size = 0.8, linetype = 2) + scale_x_continuous(name = "IQ", limits=c(0,140)) + scale_y_continuous(name = "Increased Childlessness from Baseline", limits=c(-0.1,1)) + scale_alpha_continuous(range=c(0,1)) + theme.figures.legend

cog.fit.plots <- cog.fit.plot + child.fit.plot + plot_layout(nrow=1, guides = "collect") + plot_annotation(tag_levels = 'A')

cog.fit.plots

ggsave("figures/supplement/SuppFig20.png",cog.fit.plots, dpi = 300, height = 5, width = 8, units = "in")
```

#### Figure 21.

```{r Supp Fig 21, fig.height=9, fig.width=8.5}

email.data <- make.meta.table(results.email, F, b=0.25, ymax = 2,title = "Has Email?")
email.data

answered.mhq.data <- make.meta.table(results.answered.mhq, F, b=0.25, ymax = 2, title = "Answered MH Questionnaire?")
answered.mhq.data

has.gp.data <- make.meta.table(results.has.CHOD, F, ymax = 2, b=0.25, title = "Has GP Records?")
has.gp.data

bias.plots <- email.data[[2]] + answered.mhq.data[[2]] + has.gp.data[[2]] + plot_layout(guides = "collect", nrow = 3) + plot_annotation(tag_levels = 'A')
bias.plots

ggsave("figures/supplement/SuppFig21.png",bias.plots, dpi = 300, height = 9, width = 8.5, units = "in")
```

#### Figure 22.

```{r Supp Fig 22, fig.height=4, fig.width=8.5}

res <- data.table()

for (trait in c("fi.developmental_disorder","fi.asd","fi.add","fi.scizo","fi.bipolar","hes.developmental_disorder","hes.asd","hes.add","hes.scizo","hes.bipolar")) {
  plot.meta <- make.meta.table(results.mht[y.var == trait], F, b = 100, title = trait, ymin = 1e-4, ymax = 10000, p.pos = 1e-4)
  res.meta <- plot.meta[[1]]
  info <- str_split(trait,"\\.")[[1]]
  res.meta[,source:=info[1]]
  res.meta[,trait:=info[2]]
  res <- rbind(res, res.meta) ## bind the annotated table rather than relying on by-reference updates to plot.meta[[1]]
}

res[,trait:=factor(trait,levels=rev(c("developmental_disorder","asd","add","scizo","bipolar","binary")))]

plot.separate <- ggplot(res[source == "fi" & variant.type == "META"],aes(trait,var.beta,group=Sex,colour=Sex)) +
  geom_hline(aes(yintercept = 1),colour="red",linetype=2,size=1) +
  geom_point(aes(size=n.indvs,shape=variant.shape),position=position_dodge(0.5)) +
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0,position=position_dodge(0.5)) +
  geom_text(aes(y = 5.0e-5, label=paste0("p = ", sprintf("%0.2g",var.p))),position=position_dodge(0.5),size=4,hjust=1,show.legend = F) +
  geom_text(aes(y = 5.0e-5, label = if_else(var.p < 0.0025, "*", "")), position=position_dodge(0.65),size=8,hjust=0,show.legend = F) +
  scale_x_discrete(name = "", position = "top", labels = rev(c("Developmental/\nIntellectual Disability","Autism", "Attention Deficit\nHyperactivity Disorder", "Schizophrenia","Bipolar\nDisorder"))) +
  scale_y_log10(name=expression(bold(Odds~Ratio~at~s[het]~burden==1~of~having~given~disorder)),limits = c(2e-6,10000),breaks = c(1/(10^c(4:1)),10^c(0:4)),labels = c("0.0001","0.001","0.01","0.1","1","10","100","1000","10000")) +
  scale_shape_identity() +
  scale_size_area(breaks=c(50000,100000,150000),guide=guide_legend(title="# of Indivs.")) +
  sex.colours.colour.rev +
  coord_flip() +
  theme.figures.legend + theme(panel.grid.major.y = element_blank(),axis.text.y = element_text(size=10))
plot.separate

ggsave("figures/supplement/SuppFig22.svg", plot.separate, width = 8.5, height = 4, dpi = 300)

```
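
The loop above grows `res` one trait at a time and splits each trait name by hand. A minimal sketch of the same bookkeeping follows, assuming only that trait names follow the `source.trait` pattern used above. The table `toy` and its values are invented for illustration and are not UKBB results.

```r
library(data.table)

## Toy stand-in for the per-trait tables returned by make.meta.table();
## the numbers are invented purely to make the example runnable.
traits <- c("fi.asd", "fi.bipolar", "hes.asd")
per.trait <- lapply(traits, function(tr) {
  data.table(y.var = tr, Sex = c("Male", "Female"), var.beta = c(1.5, 1.2))
})

## Bind once, rather than growing the table inside the loop
toy <- rbindlist(per.trait)

## Split "fi.asd" into source = "fi" and trait = "asd" in one step
toy[, c("source", "trait") := tstrsplit(y.var, ".", fixed = TRUE)]
toy
```

Collecting the pieces in a list and calling `rbindlist()` once avoids repeated copying and keeps the annotation step in a single place.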

#### Figure 23.

```{r Supp Fig 23, fig.height=12, fig.width=10}

final.results.matrix[,curr.cov.string:=factor(curr.cov.string, levels = final.results.matrix[,unique(curr.cov.string)])]

ors.plot <- ggplot(final.results.matrix[variant.type == "META"], aes(curr.cov.string, or, group = sexPulse, colour = sexPulse)) +
  geom_hline(yintercept = 1, colour = "red", linetype = 2, size = 1) +
  geom_point(size = 4, position = position_dodge(0.6)) +
  geom_errorbar(aes(ymin = var.ci.lower, ymax = if_else(var.ci.upper > 1.25, 1.25, var.ci.upper)), width = 0, position = position_dodge(0.6)) +
  geom_text(aes(y = -0.13, label = paste0("p = ",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))), position = position_dodge(0.90), hjust = 0,size = 4) +
  scale_x_discrete(name = "") +
  scale_y_continuous(name = expression(bold(atop(atop(Odds~Ratio~at~s[het]~burden==1,on~having~children),atop("","")))),limits = c(-0.15, 1.21), breaks = seq(0.2,1.2,by=0.2), expand = c(0,0)) +
  scale_size_area(breaks = c(25000,100000,175000),guide=guide_legend(title="# of Indivs."), limits = c(0, 200000)) +
  sex.colours.colour +
  coord_flip() +
  theme.figures.legend + theme(panel.grid.major.y = element_blank(),panel.grid.minor.y = element_line(colour = "grey",size = 2), axis.text.y = element_blank(), axis.title.x = element_text(size = 14),axis.text.x = element_text(size = 12),rect = element_rect(fill = "transparent"))

r2.plot.table <- final.results.matrix[grepl("Fluid Intelligence", curr.cov.string) == F & variant.type == "LOF_HC"]
r2.plot.table[,val:=if_else(sexPulse == "Male",
                            inc.r.shet/r2.plot.table[sex == 1 & curr.cov.string == "NULL",inc.r.shet],
                            inc.r.shet/r2.plot.table[sex == 2 & curr.cov.string == "NULL",inc.r.shet])]
r2.plot.table[,val:=(val)*100]

inc.r.shet.plot <- ggplot(r2.plot.table,aes(curr.cov.string, val, group = sexPulse, fill = sexPulse)) +
  geom_hline(yintercept=100,colour="red",linetype = 2, size = 1) +
  geom_col(position = position_dodge(0.75), size = 1, width = 0.75, colour = "black") +
  scale_y_continuous(name = expression(bold(atop(atop("","Proportion of Variance Explained by"),atop(s[het]~"Compared to the Null Model:",has.children %~% y.axis.covars + control.covariates))))) +
  scale_x_discrete(name = "") +
  sex.colours.fill +
  coord_flip() + 
  theme.figures.legend + theme(axis.text.y = element_blank(),panel.grid.major.y=element_blank(), axis.title.x = element_text(size = 14))

labels.table <- data.table(current.cov.string=final.results.matrix[variant.type == "META",unique(curr.cov.string)])
labels.table[,shet:=1]
labels.table[,mht:=if_else(str_detect(current.cov.string,"Has MHT") == T, 1, 0)]
labels.table[,partner:=if_else(str_detect(current.cov.string,"Has Partner") == T, 1, 0)]
labels.table[,college:=if_else(str_detect(current.cov.string,"Completed College") == T, 1, 0)]
labels.table[,infertile:=if_else(str_detect(current.cov.string,"ICD-10") == T, 1, 0)]
labels.table[,had.sex:=if_else(str_detect(current.cov.string,"Ever Had Sex") == T, 1, 0)]
labels.table <- data.table(pivot_longer(labels.table, -current.cov.string))
setnames(labels.table, "value","has.covar")

labels.table <- data.table(crossing(labels.table,sex=c(1,2)))

get.sig.code <- function(should.check,mod,covar,s) {
  if (should.check == 0) {
    return("")
  } else {
    ## Clunky, but the covariate labels have to be translated into the term names used in the model data.frame:
    to.check = covar
    if (covar == "college") {
      to.check = "completed.college"
    } else if (covar == "infertile") {
      to.check = "fi.fert"
    } else if (covar == "mht") {
      to.check = "mht.binary"
    } else if (covar == "partner") {
      to.check = "partner.in.house"
    } else if (covar == "shet") {
      to.check = "product_sHET"
    }
    curr.model <- final.results.matrix[curr.cov.string == mod & sex == s & variant.type == "LOF_HC",model][[1]]
    if (curr.model[term == to.check,p.value] < (0.05/20)) {
      return("**")
    } else if (curr.model[term == to.check,p.value] < (0.05)) {
      return("*")
    } else {
      return("")
    }
  }
}
labels.table[,is.sig:=get.sig.code(has.covar,current.cov.string,name,sex),by=1:nrow(labels.table)]

labels.plot <- ggplot(labels.table,aes(interaction(sex,current.cov.string), name, fill = interaction(as.factor(has.covar),as.factor(sex)))) + 
  geom_tile(size = 0.25, colour = "black") + ## 'stat' is not a ggplot aesthetic, so the unused aes(stat = has.covar) mapping was dropped
  geom_text(aes(label = is.sig),size=4,colour="white",nudge_x = -0.2) +
  scale_fill_manual(values = c("white",male.col,"white",female.col)) +
  geom_vline(xintercept = seq(-1.5,17.5,by=2),colour="black",size = 1) +
  geom_hline(yintercept = seq(-1.5,6.5,by=1),colour="black",size = 1) +
  scale_x_discrete(name = "",expand=c(0,0)) +
  scale_y_discrete(name = "", position = "right",limits = c("shet","mht","partner","college","infertile","had.sex"),labels = c(expression(s[het]~Burden),"MH Disorder","Partner At Home","University Degree","Infertility","Had Sex"),expand=c(0,0)) +
  coord_flip() +
  theme.figures + theme(axis.ticks = element_blank(), panel.grid.major = element_blank(),axis.line = element_blank(), axis.text.x = element_text(hjust = 0),axis.text.y = element_blank())

joint.plot <- labels.plot + ors.plot + inc.r.shet.plot + plot_layout(ncol = 3, guides = "collect", widths = c(0.25,1,0.4))
joint.plot
ggsave("figures/supplement/SuppFig23.png", joint.plot, width = 10, height = 12, dpi = 300)

```
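
The significance stars in the label panel come from two steps: translating a plotting label into the model's term name, then comparing that term's p-value against 0.05 and the stricter 0.05/20 threshold used above. A minimal sketch with a named lookup vector follows. The names `term.lookup`, `to.term`, and `sig.code`, and the example p-values, are hypothetical and introduced here only for illustration.

```r
## Map plot labels to model term names (same pairs as in get.sig.code above)
term.lookup <- c(college   = "completed.college",
                 infertile = "fi.fert",
                 mht       = "mht.binary",
                 partner   = "partner.in.house",
                 shet      = "product_sHET")

## Return the model term for a label, falling back to the label itself
to.term <- function(covar) {
  if (covar %in% names(term.lookup)) unname(term.lookup[covar]) else covar
}

## "**" below 0.05/n.tests, "*" below 0.05, otherwise an empty string
sig.code <- function(p, n.tests = 20) {
  if (p < 0.05 / n.tests) "**" else if (p < 0.05) "*" else ""
}

to.term("college")   # "completed.college"
to.term("had.sex")   # unchanged: "had.sex"
sig.code(0.001)      # "**"
sig.code(0.01)       # "*"
sig.code(0.2)        # ""
```

Keeping the label-to-term pairs in one vector makes it easier to add a covariate without extending an `if`/`else` chain.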

#### Figure 24.

```{r Supp Fig 24, fig.height=3, fig.width=4}

plot.child.fert <- ggplot(model.fertility, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mean.childlessness_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= mean.childlessness_lower, ymax = mean.childlessness_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Predicted Childlessness", limits = c(0.1,0.6)) + theme.figures.legend

plot.child.fert

ggsave("figures/supplement/SuppFig24.png",plot.child.fert,width=4,height=3,units = "in", dpi = 300)

```

#### Figure 25.

```{r Supp Fig 25, fig.height=7.5, fig.width=8.5}

cassa.data <- make.meta.table(results.fertility.cassa, F, allele.freq = 0, b=0.25, gene.list = "product_sHET_old")

del.shet.r2 <- sprintf("%0.3f",summary(lm(product_sHET ~ product_sHET_old,data = variant.counts[type == "DEL" & allele.freq == 0]))$r.squared)

del.shet.plot <- ggplot(variant.counts[type == "DEL" & allele.freq == 0],aes(product_sHET, product_sHET_old)) + 
  geom_point(size=0.25) + 
  scale_x_continuous(name = expression(bold(s[het]~burden~Weghorn~et~al.)),limits=c(0,1)) +
  scale_y_continuous(name = expression(bold(s[het]~burden~Cassa~et~al.)),limits=c(0,1)) +
  annotate("text",label = bquote(r^2 == .(del.shet.r2)), x = 0.12,y=0.9,size=3.5,colour="red") + 
  ggtitle(expression(bold(Deletion~s[het]))) +
  theme.figures + theme(plot.title = element_text(hjust = 0.5))

ptv.shet.r2 <- sprintf("%0.3f",summary(lm(product_sHET ~ product_sHET_old,data = variant.counts[type == "LOF_HC" & allele.freq == 0]))$r.squared)

ptv.shet.plot <- ggplot(variant.counts[type == "LOF_HC" & allele.freq == 0],aes(product_sHET, product_sHET_old)) + 
  geom_point(size=0.25) + 
  scale_x_continuous(name = expression(bold(s[het]~burden~Weghorn~et~al.)),limits=c(0,1)) +
  scale_y_continuous(name = expression(bold(s[het]~burden~Cassa~et~al.)),limits=c(0,1)) +
  annotate("text",label = bquote(r^2 == .(ptv.shet.r2)), x = 0.12,y=0.9,size=3.5,colour="red") + 
  ggtitle(expression(bold(PTV~s[het]))) +
  theme.figures + theme(plot.title = element_text(hjust = 0.5))

top <- del.shet.plot + ptv.shet.plot + plot_layout(ncol = 2, nrow = 1)
bottom <- cassa.data[[2]] + plot_spacer() + plot_layout(ncol = 2, nrow = 1, widths = c(1,0.001))

shet.comp.plot <- (top / bottom) + plot_layout(heights = c(2,1)) + plot_annotation(tag_levels = 'A')
shet.comp.plot

ggsave("figures/supplement/SuppFig25.png",shet.comp.plot, dpi = 300, height = 7.5, width = 8.5, units = "in")

```
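
The `r^2` annotations above are taken from `summary(lm(...))$r.squared`. For a linear model with a single predictor this equals the squared Pearson correlation, which the sketch below checks on simulated data. The vectors `x` and `y` are simulated stand-ins, not burden scores.

```r
set.seed(1)
x <- runif(200)                      # stand-in for one burden score
y <- 0.8 * x + rnorm(200, sd = 0.1)  # stand-in for the other

r2.lm  <- summary(lm(y ~ x))$r.squared
r2.cor <- cor(x, y)^2

all.equal(r2.lm, r2.cor)  # the two definitions agree
sprintf("%0.3f", r2.lm)   # formatted as in the plot annotation
```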

#### Figure 26.

```{r Supp Fig 26, fig.height=10, fig.width=7}

pc.effect.plot.common <- ggplot(pc.effects.table[pc.type=="common"],aes(pc.num,var.beta)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0) + 
  scale_x_continuous(name = "Common Ancestry Principal Component", limits = c(1,40)) +
  scale_y_continuous(name = "OR for the effect of PC\non having children") +
  theme.figures

pc.num.plot.common <- ggplot(results.pcs.meta[pc.type == "common"],aes(PCs,log.p)) + 
  geom_point(size = 2) + 
  scale_x_discrete(name = "Common PCs Included in Regression", limits=results.pcs.meta[pc.type == "common",PCs]) + 
  scale_y_continuous(name = expression(bold(atop(-log[10]~p~from,meta~analysis))), limits = c(0,20)) +
  theme.figures

pc.effect.plot.rare <- ggplot(pc.effects.table[pc.type=="rare"],aes(pc.num,var.beta)) + 
  geom_point() + 
  geom_errorbar(aes(ymin=var.ci.lower,ymax=var.ci.upper),width=0) + 
  scale_x_continuous(name = "Rare Ancestry Principal Component", limits = c(1,100)) +
  scale_y_continuous(name = "OR for the effect of PC\non having children") +
  theme.figures

pc.num.plot.rare <- ggplot(results.pcs.meta[pc.type == "rare"],aes(PCs,log.p)) + 
  geom_point(size = 2) + 
  scale_x_discrete(name = "Rare PCs Included in Regression", limits=results.pcs.meta[pc.type == "rare",PCs], breaks =results.pcs.meta[pc.type == "rare"][c(1,seq(16,100,by=25)),PCs]) + 
  scale_y_continuous(name = expression(bold(atop(-log[10]~p~from,meta~analysis))), limits = c(0,20)) +
  theme.figures

pc.plots <- pc.effect.plot.common + pc.num.plot.common + pc.effect.plot.rare + pc.num.plot.rare + plot_annotation(tag_levels = 'A') + plot_layout(nrow = 4, heights = c(2,1,2,1))
pc.plots

ggsave("figures/supplement/SuppFig26.png",pc.plots, dpi = 300, height = 10, width = 7, units = "in")

```

#### Figure 27.

```{r Supp Fig 27, fig.height=7, fig.width=8.5}

plot.fruit <- make.meta.table(results.fruit, T, ymin = -0.7, ymax= 0.4, block = -1, p.pos = -0.6, title = "Fresh Fruit Intake\nPer Day")
plot.hands <- make.meta.table(results.handedness, F, b = 0.25, title = "Is Left Handed?", ymax = 3.0)
plot.hair <- make.meta.table(results.hair, F, b = 0.25, title = "Blonde Hair?", ymax = 3.0)

plot.neutrals <- plot.fruit[[2]] + plot.hands[[2]] + plot.hair[[2]] + plot_layout(nrow = 3, ncol = 1, guides = 'collect') + plot_annotation(tag_levels = 'A')
plot.neutrals

ggsave("figures/supplement/SuppFig27.png",plot.neutrals, dpi = 600, height = 7, width = 8.5, units = "in")
```

#### Figure 28.

```{r Supp Fig 28, fig.height = 6, fig.width = 8.5}

plot.mh <- ggplot(model.mhq, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mean.has.disorder_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= mean.has.disorder_lower, ymax = mean.has.disorder_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "", limits = c(0,0.125)) + ggtitle("Combined") + theme.figures + theme(plot.title = element_text(hjust=0.5, size = 10))

plot.fert.mh <- ggplot(model.mhq, aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = ratio_mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= ratio_lower, ymax = ratio_upper), alpha = 0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + geom_abline(intercept=1,slope=-1,linetype=2) + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Predicted Fitness", limits = c(0.8,1.05)) + theme.figures.legend

inc.plot.scizo <- ggplot(inc.mht[condition == "scizo"], aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= lower, ymax = upper),alpha=0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "Incidence", limits = c(0,0.125)) + theme.figures + ggtitle ("Schizophrenia") + theme(plot.title = element_text(hjust=0.5, size = 10))

inc.plot.asd <- ggplot(inc.mht[condition == "asd"], aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= lower, ymax = upper),alpha=0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "", limits = c(0,0.125)) + theme.figures + ggtitle ("ASD") + theme(plot.title = element_text(hjust=0.5, size = 10))

inc.plot.bipolar <- ggplot(inc.mht[condition == "bipolar"], aes(x = shet, group = sexPulse, fill = sexPulse)) + geom_line(aes(y = mid, colour = sexPulse), size=2) + geom_ribbon(aes(ymin= lower, ymax = upper),alpha=0.5) + scale_alpha_continuous(range = c(0,1)) + sex.colours.colour + sex.colours.fill + scale_x_continuous(name=expression(bold(s[het])), limits = c(0,1), expand = c(0,0)) + scale_y_continuous(name = "", limits = c(0,0.125)) + ggtitle ("Bipolar Disorder") + theme.figures + theme(plot.title = element_text(hjust=0.5, size = 10))

mh.fertility.plot.top <- inc.plot.scizo + inc.plot.asd + inc.plot.bipolar + plot.mh + plot_layout(nrow = 1, ncol = 4, guides = "collect") ## MH Incidence
mh.fertility.plot.bottom <- plot_spacer() + plot.fert.mh + guide_area() + plot_layout(nrow = 1, ncol = 3, guides = "collect") ## MH

mh.fertility.plot <- mh.fertility.plot.top / mh.fertility.plot.bottom + plot_annotation(tag_levels = 'A') + plot_layout(guides = "collect")
mh.fertility.plot

ggsave("figures/supplement/SuppFig28.png",mh.fertility.plot, dpi = 300, height = 6, width = 8.5, units = "in")

```

### Tables

#### Table 1

This table was created manually; automating the process was not necessary.

#### Table 2

```{r Supp table 2}

## CHOD Data
supptable2a <- data.table(pivot_wider(fi.analysis.table[variant.type == "META" & level <= 3,c("sex","coding","meaning","chapter","level","var.or","var.err","var.p","icd.or","icd.err","icd.p","N")],names_from=sex, values_from = starts_with(c("var","icd","N")), names_sep = "."))

col.order <- c(names(supptable2a)[grepl("var|icd|N",names(supptable2a)) == F],names(supptable2a)[grepl("\\.MALE",names(supptable2a),perl = T)],names(supptable2a)[grepl("FEMALE",names(supptable2a))])

setcolorder(supptable2a, col.order)

write.table(supptable2a, "figures/supplement/SuppTable2a.tsv",col.names=T,row.names=F,quote=F,sep="\t")

## HES Data
supptable2b <- data.table(pivot_wider(hes.analysis.table[variant.type == "META",c("sex","coding","meaning","chapter","level","var.or","var.err","var.p","icd.or","icd.err","icd.p","N")],names_from=sex, values_from = starts_with(c("var","icd","N")), names_sep = "."))

col.order <- c(names(supptable2b)[grepl("var|icd|N",names(supptable2b)) == F],names(supptable2b)[grepl("\\.MALE",names(supptable2b),perl = T)],names(supptable2b)[grepl("FEMALE",names(supptable2b))])

setcolorder(supptable2b, col.order)

write.table(supptable2b, "figures/supplement/SuppTable2b.tsv",col.names=T,row.names=F,quote=F,sep="\t")
```
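
Both halves of Table 2 are built the same way: a long table with one row per sex is pivoted so that each statistic gets one column per sex, and the columns are then ordered as identifiers, male, female. The sketch below shows that pattern on a toy table. The table `long` and its values are invented, and the anchored patterns `\\.MALE$` and `\\.FEMALE$` are a choice made for this example.

```r
library(data.table)
library(tidyr)

## Toy long-format table: one row per (coding, sex); values are invented
long <- data.table(coding = rep(c("A00", "B00"), each = 2),
                   sex    = rep(c("MALE", "FEMALE"), times = 2),
                   var.or = c(1.1, 1.2, 0.9, 1.0),
                   var.p  = c(0.01, 0.20, 0.50, 0.03))

## With several value columns, new names are <value><names_sep><sex>
wide <- data.table(pivot_wider(long, names_from = sex,
                               values_from = starts_with("var"),
                               names_sep = "."))

## Identifier columns first, then male columns, then female columns.
## The leading "\\." keeps the MALE pattern from matching "FEMALE".
id.cols     <- names(wide)[!grepl("^var", names(wide))]
male.cols   <- names(wide)[grepl("\\.MALE$", names(wide))]
female.cols <- names(wide)[grepl("\\.FEMALE$", names(wide))]
setcolorder(wide, c(id.cols, male.cols, female.cols))
names(wide)
```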

#### Table 3

This table was created manually; automating the process was not necessary.

#### Table 4

Tabulate ORs and Effect Sizes estimated in this manuscript:

```{r Supp Table 4}

format.table <- function(t, var, is.OR, rel.fig) {
  
  t <- copy(t)
  t <- t[,c("variant.type","var.beta","var.stderr","var.p","n.indvs","Sex")]
  t[,variant.type:=if_else(variant.type=="LOF_HC","PTV",
                           if_else(variant.type=="DEL","Deletion",
                                   if_else(variant.type=="DUP","Duplication",
                                           if_else(variant.type=="MIS","Missense",
                                                   if_else(variant.type=="SYN","Synonymous",
                                                           if_else(variant.type=="META","Meta-analysis (PTV+Deletion)","ERR"))))))]
  setnames(t,names(t),c("Variant Type","Effect (OR/Beta)","std. err.","p. value","N","Sex"))
  t[,`Effect Type`:=if_else(is.OR==T,"OR","beta")]
  t[,Phenotype:=var]
  t[,`Relevant Figure`:=rel.fig]
  return(t)
  
}

# shet on phenotype
## Figure 1
supptable4a <- format.table(plot.a[[1]],"Num. of Children",F,"Figure1A") ## Main fertility
supptable4a <- bind_rows(supptable4a,format.table(plot.b[[1]],"Childlessness",T,"Figure1B")) ## Main fertility

## Figure 2
# We use supplemental table 2 for these instead

## Figure 3
supptable4a <- bind_rows(supptable4a,format.table(fig.partner[[1]],"Partner At Home",T,"Figure3A")) ## Partner at home
supptable4a <- bind_rows(supptable4a,format.table(fig.had.sex[[1]],"Ever Had Sex",T,"Figure3B")) ## Ever had sex
supptable4a <- bind_rows(supptable4a,format.table(fig.ea[[1]],"University Degree",T,"Figure3C")) ## EA
supptable4a <- bind_rows(supptable4a,format.table(fig.mht[[1]],"Has MHT",T,"Figure3D")) ## MHT
supptable4a <- bind_rows(supptable4a,format.table(fig.hhi[[1]],"Household Income",F,"Figure3E")) ## HHI
supptable4a <- bind_rows(supptable4a,format.table(fig.cog[[1]],"Fluid Intel.",F,"Figure3F")) ## Cognition

## Sup Fig 4
supptable4a <- bind_rows(supptable4a,format.table(plottable[maf == 0],"All Variant Classes",T,"SupFig4")) ## All variant classes

## Sup Fig 5
supptable4a <- bind_rows(supptable4a,format.table(remove.zero[[1]],"Indv. w/children only",F,"SupFig5")) ## no childless individuals

## Supp Fig 7
supptable4a <- bind_rows(supptable4a,format.table(gene.data.pli[[1]],"high pLI",T,"SupFig7A")) ## pli on childlessness
supptable4a <- bind_rows(supptable4a,format.table(gene.data.shet[[1]],"high sHET",T,"SupFig7B")) ## shet on childlessness

## Sup Fig 8
supptable4a <- bind_rows(supptable4a,format.table(high.maf.data.all[maf == 1e-3],"MAF <1e-3",T,"SupFig8")) ## maf <1e-3 variants
supptable4a <- bind_rows(supptable4a,format.table(high.maf.data.all[maf == 1e-4],"MAF <1e-4",T,"SupFig8")) ## maf <1e-4 variants
supptable4a <- bind_rows(supptable4a,format.table(high.maf.data.all[maf == 1e-5],"MAF <1e-5",T,"SupFig8")) ## maf <1e-5 variants

## Sup Fig 9
supptable4a <- bind_rows(supptable4a,format.table(res.age[age==1940],"Birth Cohort 1940-50",T,"SupFig9")) ## cohort 1940-50
supptable4a <- bind_rows(supptable4a,format.table(res.age[age==1950],"Birth Cohort 1950-60",T,"SupFig9")) ## cohort 1950-60
supptable4a <- bind_rows(supptable4a,format.table(res.age[age==1960],"Birth Cohort 1960-70",T,"SupFig9")) ## cohort 1960-70

## Sup Fig 16
supptable4a <- bind_rows(supptable4a,format.table(same.sex[[1]],"Same Sex Sexual Behaviour",T,"SupFig16")) ## Same sex sexual behaviour

## Sup Fig 17
supptable4a <- bind_rows(supptable4a,format.table(fig.tdi[[1]],"Townsend Deprivation Index",F,"SupFig17"))

## Sup Fig 21
supptable4a <- bind_rows(supptable4a,format.table(email.data[[1]],"Has Email",T,"SupFig21A")) ## email
supptable4a <- bind_rows(supptable4a,format.table(answered.mhq.data[[1]],"Answered MHQ",T,"SupFig21B")) ## answered MHQ
supptable4a <- bind_rows(supptable4a,format.table(has.gp.data[[1]],"Has GP Data",T,"SupFig21C")) ## has GP records

## Sup Fig 22
supptable4a <- bind_rows(supptable4a,format.table(res[trait == "developmental_disorder" & source == "fi"],"Has DD/ID",T,"SupFig22")) ## DD/ID
supptable4a <- bind_rows(supptable4a,format.table(res[trait == "asd" & source == "fi"],"Has ASD",T,"SupFig22")) ## ASD
supptable4a <- bind_rows(supptable4a,format.table(res[trait == "add" & source == "fi"],"Has ADHD",T,"SupFig22")) ## ADHD
supptable4a <- bind_rows(supptable4a,format.table(res[trait == "scizo" & source == "fi"],"Has Schizophrenia",T,"SupFig22")) ## Schizo.
supptable4a <- bind_rows(supptable4a,format.table(res[trait == "bipolar" & source == "fi"],"Has Bipolar Disorder",T,"SupFig22")) ## Bipolar Dis.

## Sup Fig 25
supptable4a <- bind_rows(supptable4a,format.table(cassa.data[[1]],"Cassa sHET",T,"SupFig25C")) ## Cassa et al. sHET

## Sup Fig 27
supptable4a <- bind_rows(supptable4a,format.table(plot.fruit[[1]],"Fresh Fruit Intake",F,"SupFig27A")) ## fresh fruit intake (linear model, so a beta rather than an OR)
supptable4a <- bind_rows(supptable4a,format.table(plot.hands[[1]],"Left Handed",T,"SupFig27B")) ## left handedness
supptable4a <- bind_rows(supptable4a,format.table(plot.hair[[1]],"Blonde Hair Colour",T,"SupFig27C")) ## blonde hair

## Text-based
supptable4a <- bind_rows(supptable4a,format.table(infertility.plot[[1]],"Effect of sHET on Having Infertility",T,"TextOnly")) ## sHET on having infertility
supptable4a <- bind_rows(supptable4a,format.table(no.male.fertility.plot[[1]],"No Male Infertility Genes",T,"TextOnly")) ## result w/o male infertility genes
supptable4a <- bind_rows(supptable4a,format.table(mouse.data[[1]],"No Mouse Infertility Genes",T,"TextOnly")) ## result w/o mouse male infertility genes
supptable4a <- bind_rows(supptable4a,format.table(no.path.cnvs.plot[[1]],"No Pathogenic CNV carriers",T,"TextOnly")) ## no pathogenic CNV carriers
supptable4a <- bind_rows(supptable4a,format.table(no.mh.patients.plot[[1]],"No MH Patients",T,"TextOnly")) ## no MH patients
supptable4a <- bind_rows(supptable4a,format.table(no.disease.plot[[1]],"No Disease Genes",T,"TextOnly")) ## no Disease genes
supptable4a <- bind_rows(supptable4a,format.table(no.same.sex.plot[[1]],"No Same Sex Individuals",T,"TextOnly")) ## no same sex individuals

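## Redefine format.table for the covariate-effect tables below, which use different column names (beta, std.error, p.val, N, sexPulse):
format.table <- function(t, var, rel.fig) {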
  
  t <- copy(t)
  t <- t[,c("beta","std.error","p.val","N","sexPulse")]
  setnames(t,names(t),c("Effect (OR/Beta)","std. err.","p. value","N","Sex"))
  t[,`Variant Type`:='NA']
  setcolorder(t,c("Variant Type","Effect (OR/Beta)","std. err.","p. value","N","Sex"))
  t[,`Effect Type`:="OR"]
  t[,Phenotype:=var]
  t[,`Relevant Figure`:=rel.fig]
  
  return(t)
  
}

supptable4b <- format.table(partner.beta.plot[[1]], "Partner At Home","SupFig15A")
supptable4b <- bind_rows(supptable4b,format.table(had.sex.beta.plot[[1]], "Ever Had Sex","SupFig15B"))
supptable4b <- bind_rows(supptable4b,format.table(ea.beta.plot[[1]], "Educational Attainment","SupFig15C"))
supptable4b <- bind_rows(supptable4b,format.table(mhq.beta.plot[[1]], "Mental Health Traits","SupFig15D"))
supptable4b <- bind_rows(supptable4b,format.table(hhi.beta.plot[[1]], "Household Income","SupFig15E"))
supptable4b <- bind_rows(supptable4b,format.table(cog.beta.plot[[1]], "Fluid Intel.","SupFig15F"))
supptable4b <- bind_rows(supptable4b,format.table(tdi.beta.plot[[1]], "Townsend Deprivation Index","SupFig15G"))
supptable4b <- bind_rows(supptable4b,format.table(same.sex.beta.plot[[1]], "Same Sex Sexual Behaviour","SupFig15H"))

write.table(supptable4a,"figures/supplement/SuppTable4a.tsv",col.names=T,row.names=F,sep="\t",quote=F)
write.table(supptable4b,"figures/supplement/SuppTable4b.tsv",col.names=T,row.names=F,sep="\t",quote=F)
```
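
The nested `if_else()` chain in the first `format.table` is a fixed mapping from internal variant codes to display names. One alternative way to express such a mapping is a named lookup vector, sketched below. The helper `recode.variant` and the table `toy` are hypothetical names introduced for this example; the label pairs are the ones used above.

```r
library(data.table)

## Display names for each internal variant code
variant.labels <- c(LOF_HC = "PTV", DEL = "Deletion", DUP = "Duplication",
                    MIS = "Missense", SYN = "Synonymous",
                    META = "Meta-analysis (PTV+Deletion)")

## Look up the label; any code missing from the map becomes "ERR"
recode.variant <- function(x) {
  out <- unname(variant.labels[x])
  out[is.na(out)] <- "ERR"
  out
}

toy <- data.table(variant.type = c("LOF_HC", "DEL", "META", "OTHER"))
toy[, variant.type := recode.variant(variant.type)]
toy
```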

#### Table 5

```{r Supp Table 5}

supptable5 <- merge(cog.raw[,c("newiq","Obs","Mean","SD","pred.log")],childless.raw[,c("iq","inc.childlessness","std. err.","pred.log.inv")],by.x="newiq",by.y="iq")
supptable5[,sd_childlessness:=`std. err.`*sqrt(Obs)]
supptable5[,`std. err.`:=NULL]

setnames(supptable5,names(supptable5),c("IQ","n_individuals","observed_fertility_mean","observed_fertility_sd","predicted_fertility","observed_increased_childlessness","predicted_increased_childlessness","observed_increased_childlessness_sd"))
setcolorder(supptable5,c("IQ","n_individuals","observed_fertility_mean","observed_fertility_sd","predicted_fertility","observed_increased_childlessness","observed_increased_childlessness_sd","predicted_increased_childlessness"))

write.table(supptable5,"figures/supplement/SuppTable5.tsv",col.names=T,row.names=F,sep="\t",quote=F)

```
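
The `sd_childlessness` column above converts a standard error back into a standard deviation via SD = SE × sqrt(n), with `Obs` as n. This holds if `std. err.` is the standard error of a mean over `Obs` individuals, which is an assumption of the sketch below. The vector `x` is simulated, not taken from the phenotype data.

```r
set.seed(42)
x <- rbinom(500, size = 1, prob = 0.2)  # simulated 0/1 childlessness indicator

n  <- length(x)
se <- sd(x) / sqrt(n)    # standard error of the mean
sd.back <- se * sqrt(n)  # recover the standard deviation from the SE

all.equal(sd.back, sd(x))
```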

#### Table 6

```{r Supp table 6}

supptable6 <- modeling[,c("or","or.upper","n.indv","sex.ratio","ratio","trait","sex","mean.children","incidence","or.lower")]
setnames(supptable6,names(supptable6),c("OR.ganna","OR.upper.ganna","N.power","sex.ratio.power","fertility.ratio.power","trait","sex","mean.children.power","incidence.power","OR.lower.ganna"))
setcolorder(supptable6,c("sex","trait","OR.ganna","OR.lower.ganna","OR.upper.ganna","N.power","sex.ratio.power","incidence.power","fertility.ratio.power","mean.children.power"))
write.table(supptable6,"figures/supplement/SuppTable6.tsv",col.names=T,row.names=F,quote=F,sep="\t")

```

## 8F. Numbers Catalogue

This section catalogues every number quoted in the manuscript, printed in roughly the order in which it appears in the text. The values are simply replicated from the analyses above, for the sake of my sanity when adding them to the manuscript.
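
Most lines in the catalogue below repeat one pattern: an odds ratio, a 95% confidence interval computed as exp(log(OR) ± 1.96 × SE), and a p-value printed in scientific notation when it is at most 0.01. A hypothetical helper, `fmt.or` (not defined anywhere in this repository), could hold that pattern in one place. The sketch assumes the standard error is on the log-odds scale, as the expressions below imply.

```r
## Hypothetical helper: format an OR, its 95% CI, and its p-value as one string.
## Assumes `se` is the standard error of log(OR).
fmt.or <- function(or, se, p, z = 1.96) {
  lower <- exp(log(or) - z * se)
  upper <- exp(log(or) + z * se)
  p.txt <- ifelse(p <= 1e-2, sprintf("%0.1e", p), sprintf("%0.2f", p))
  paste0("OR=", sprintf("%0.2f", or),
         " [95% CI ", sprintf("%0.2f", lower), "-", sprintf("%0.2f", upper), "]",
         " p=", p.txt)
}

fmt.or(or = 2, se = 0.1, p = 0.2)
## "OR=2.00 [95% CI 1.64-2.43] p=0.20"
```

Because every argument is vectorised, the helper could be called inside a `data.table` `j` expression in the same way as the `paste0()` calls below.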

```{r Numbers Catalogue}

## Median age and range
paste0("Median age                                     : ", UKBB.phenotype.data[,median(agePulse)], "; range: ", UKBB.phenotype.data[,min(agePulse)], "-", UKBB.phenotype.data[,max(agePulse)],"; birth years: ", UKBB.phenotype.data[,min(birth.year)], "-", UKBB.phenotype.data[,max(birth.year)])
paste0("")

## Number of Individuals per datatype:
paste0("Number of individuals with CNV data            : ", length(unique(variant.counts[type == "DEL" & allele.freq == 0,sample_id])))
paste0("Number of individuals with SNV data            : ", length(unique(variant.counts[type == "LOF_HC" & allele.freq == 0,sample_id])))
paste0("")

## Sex Burden of sHET:
format.sex.burden.DEL
format.sex.burden.PTV
paste0("")

## Linear Fertility Model (# Children ~ sHET Burden):
## CI bounds are negated so that they sit on the same "fewer children" scale as the point estimate (assumes var.beta is negative):
plot.a[[1]][variant.type == "META",paste0(Sex, "s have ",sprintf("%0.2f",abs(var.beta))," fewer children", " [95% CI ",sprintf("%0.2f",-(var.beta+(1.96*var.stderr))), "-",sprintf("%0.2f",-(var.beta-(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Logistic Fertility Model (Childless ~ sHET Burden):
plot.b[[1]][ variant.type == "META",paste0("Primary Result ", Sex, " OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Sex of participants:
paste0("Percentage of females                          : ", sprintf("%0.0f", (nrow(UKBB.phenotype.data[sexPulse == 2]) / nrow(UKBB.phenotype.data))*100), "%")
paste0("")

## Mean fertility in UKBB separated by sex:
paste0("Mean fertility in the UK  : See description in section 6B for how we calculated this number.")
paste0("Mean children for males   : ", sprintf("%0.2f", base.fertilities[sex == 1, fertility]))
paste0("Mean children for females : ", sprintf("%0.2f", base.fertilities[sex == 2, fertility]))
paste0("")

## No Disease Genes:
no.disease.plot[[1]][variant.type == "META",paste0(Sex," No Disease Gene OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## sHET on having Infertility Code
paste0("OR of sHET on having infertility: ", infertility.plot[[1]][variant.type == "META",paste0(Sex, " OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))])
paste0("")

## No Male Infertility Genes:
print(paste0("Number of male infertility genes: ", nrow(male.infertility.genes)))
no.male.fertility.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex, " Fertility Gene OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Male Infertility -- CHOD
fi.analysis.table[coding == "N46" & sex == "MALE" & variant.type == "META", paste0("Male Infertility CHOD (N46) OR=",sprintf("%0.2f",var.or), " [95% CI ",sprintf("%0.2f",exp(log(var.or)-(1.96*var.err))), "-",sprintf("%0.2f",exp(log(var.or)+(1.96*var.err))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)), " n=",N)]
paste0("Proportion of individuals with N46 - HES : ", sprintf("%0.2f", prop.icd10.coding[condition == "infertility", prop.hes]), "%")
paste0("Proportion of individuals with N46 - CHOD + HES  : ", sprintf("%0.2f",prop.icd10.coding[condition == "infertility", prop.fi.hes]), "%")
paste0("")

## No Mouse Infertility Genes:
print(paste0("Number of mouse infertility genes: ", nrow(mouse.infertility.genes)))
mouse.data[[1]][ variant.type == "META",paste0(Sex, " OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Change in OR between the base model and the model including DD/ID (F81):
paste0("OR for base model    : ", sprintf("%0.2f",plot.b[[1]][variant.type == "META" & Sex == "Male", var.beta]))
paste0("OR for model w/DD:ID : ", sprintf("%0.2f",fi.analysis.table[variant.type == "META" & coding == "F81" & sex == "MALE", var.or]))
paste0("")

## Living with a Partner and having children:
partner.beta.plot[[1]][sexPulse == "Male",paste0(sexPulse, " Partner at Home OR=",sprintf("%0.2f",beta), " [95% CI ",sprintf("%0.2f",exp(log(beta)-(1.96*std.error))), "-",sprintf("%0.2f",exp(log(beta)+(1.96*std.error))),"]", " p",if_else(p.val <= 1e-100," ≤ 1e-100",paste0("=",if_else(p.val<=1e-2,sprintf("%0.1e",p.val),sprintf("%0.2f",p.val)))))]
paste0("")

## sHET on ever having sex
fig.had.sex[[1]][variant.type == "META",paste0(Sex, " Ever having sex OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## No Partner HES
hes.analysis.table[meaning == "Z60.2 Living alone" & variant.type == "META", paste0("Z60.2 ", sex, " OR=",sprintf("%0.2f",icd.or), " [95% CI ",sprintf("%0.2f",exp(log(icd.or)-(1.96*icd.err))), "-",sprintf("%0.2f",exp(log(icd.or)+(1.96*icd.err))),"]", " p=",if_else(icd.p<=1e-2,sprintf("%0.1e",icd.p),sprintf("%0.2f",icd.p)))]
paste0("")

## Same Sex
same.sex.beta.plot[[1]][,paste0(sexPulse, " Same Sex Phenotype OR=",sprintf("%0.2f",beta), " [95% CI ",sprintf("%0.2f",exp(log(beta)-(1.96*std.error))), "-",sprintf("%0.2f",exp(log(beta)+(1.96*std.error))),"]", " p",if_else(p.val <= 1e-100," ≤ 1e-100",paste0("=",if_else(p.val<=1e-2,sprintf("%0.1e",p.val),sprintf("%0.2f",p.val)))))]
same.sex[[1]][Sex == "Male" & variant.type == "META",paste0(Sex, " Same Sex sHET OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
no.same.sex.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex," Exclude Same Sex OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Fluid Intel Contrib to Fitness:
paste0("Number of Fluid Intel Indv: ", sum(fig.cog[[1]][variant.type == "META",n.indvs]), " (Male: ",fig.cog[[1]][Sex == "Male" & variant.type == "META",n.indvs],"; Female: ", fig.cog[[1]][Sex == "Female" & variant.type == "META",n.indvs], ")")
paste0("Contribution of Cognition to Fitness: ",
       sprintf("%0.0f",(((1 - model.cog[shet == 1 & sex == 1,ratio_mid]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_mid]))*100)),
       "% (",
       sprintf("%0.0f",(((1 - model.cog[shet == 1 & sex == 1,ratio_upper]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_upper]))*100)),
       " - ",
       sprintf("%0.0f",(((1 - model.cog[shet == 1 & sex == 1,ratio_lower]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_lower]))*100)),
       "%)")
paste0("")

## UKBB Biases
email.data[[1]][ variant.type == "META",paste0(Sex, " Email OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
answered.mhq.data[[1]][ variant.type == "META",paste0(Sex, " MHQ OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## MH Childlessness:
mhq.beta.plot[[1]][,paste0(sexPulse, " MH OR=",sprintf("%0.2f",beta), " [95% CI ",sprintf("%0.2f",exp(log(beta)-(1.96*std.error))), "-",sprintf("%0.2f",exp(log(beta)+(1.96*std.error))),"]", if_else(p.val <= 1e-100," p ≤ 1e-100",paste0(" p=",if_else(p.val<=1e-2,sprintf("%0.1e",p.val),sprintf("%0.2f",p.val)))))]
paste0("")

## Path CNV Counts/Tests:
paste0("CNV Carriers account for ",sprintf("%0.1f",(length(unique(path.cnv.counts[,eid]))/nrow(samples.UKBB.cnv)*100)),"% (", length(unique(path.cnv.counts[,eid])),") of individuals.")
no.path.cnvs.plot[[1]][Sex == "Male" & variant.type == "META",paste0(Sex," No Path CNVs OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## MH Contribution to Fitness:
no.mh.patients.plot[[1]][variant.type == "META",paste0(Sex," No MH Patients OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("Contribution of MHTs to Fitness: ",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_mid]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_mid]))*100)),
       "% (",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_lower]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_lower]))*100)),
       " - ",
       sprintf("%0.1f",(((1 - model.mhq[shet == 1 & sex == 1,ratio_upper]) / (1 - model.fertility[shet == 1 & sex == 1,ratio_upper]))*100)),
       "%)")
paste0("")

paste0(r2.plot.table[n.terms==5, paste0("Contribution of all Covars to Fitness in ", sexPulse, "s : ",sprintf("%0.0f%%",100-val))])
paste0(r2.plot.table[curr.cov.string == "Has Partner, Ever Had Sex", paste0("Contribution of Partner & Had Sex to Fitness in ", sexPulse, "s : ",sprintf("%0.0f%%",100-val))])
paste0("")

## Overall contribution to Fitness:
paste0("Contribution of sHET to Fitness (sex averaged): ",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_mid)])*100)),
       "% (",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_upper)])*100)),
       " - ",
       sprintf("%0.1f",((1-model.fertility[shet == 1,mean(ratio_lower)])*100)),
       "%)")
paste0("")

## Female OR when corrected for EA:
final.results.matrix[curr.cov == "completed.college" & sex == 2 & variant.type == "META",paste0(sexPulse," sHET corrected by EA OR=",sprintf("%0.2f",exp(var.beta)), " [95% CI ",sprintf("%0.2f",var.ci.lower), "-",sprintf("%0.2f",var.ci.upper),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Number of Individuals
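## Only read the one column needed (field 22006) rather than the full phenotype file
paste0("Number of Broadly Euro Indiv                   : ", table(fread("rawdata/phenofiles/ukbb_phenotypes.txt", select = "22006-0.0")[,`22006-0.0`]))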
paste0("Number of Individuals after filtering relateds : ", nrow(UKBB.phenotype.data))
paste0("")

## Total CNVs:
paste0("Number of raw CNVs: ",nrow(ukbb.annotated.cnvs.qcd[eid %in% has.cnv.data[has_cnvs==1,eid]]))
quant.table <- data.table(table(ukbb.annotated.cnvs.qcd[filter.0.95.wes.support.score==T,ct]))
paste0("Number of CNVs (Unfiltered Indiv)  : ",quant.table[,sum(N)], " (DEL: ", quant.table[V1 == "DEL",N], "; DUP: ", quant.table[V1 == "DUP",N],")")
quant.table <- data.table(table(ukbb.annotated.cnvs.qcd[eid %in% samples.UKBB.cnv[,eid] & filter.0.95.wes.support.score==T,ct]))
paste0("Number of CNVs (Filtered Indiv)    : ",quant.table[,sum(N)], " (DEL: ", quant.table[V1 == "DEL",N], "; DUP: ", quant.table[V1 == "DUP",N],")")
paste0("Number of CNV Loci                 : ",nrow(data.table(table(ukbb.annotated.cnvs.qcd[,locus]))))
paste0("")
rm(quant.table)

## Total SNVs:
paste0("Number of SNV/InDel Variants: ", UKBB.counts.200k[AF == "AF0.1", sum(count)])
paste0("Number of individuals with WES data: ", nrow(UKBB.phenotype.data[has.wes>0]))
paste0("")

## Total sHET Genes:
paste0("Total number of genes with sHET value                       : ", nrow(shet.genes))
paste0("Total number of genes with sHET value in both hg19 and hg38 : ", nrow(merge(shet.genes,gene.translate)))
paste0("")

## Total Disease Genes:
print(paste0("Number of male infertility genes: ", nrow(male.infertility.genes)))
print(paste0("Number of mouse male infertility genes: ", nrow(mouse.infertility.genes)))
print(paste0("Number of disease genes: ", nrow(disease.genes)))
paste0("")

## Num individuals w/o rare ancestry PCs:
print(paste0("Number of individuals without rare PCs: ", nrow(UKBB.phenotype.data[is.na(scaled.rare.PC1)])))
paste0("")

## Total Indiv Lost with >3 filter:
paste0("Number of DEL Individuals Lost, pLI ≥ 0.9  : ", results.fertility[variant.type == "DEL" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "DEL" & maf == 0 & gene.list == "highPLI",sum(n.indvs)])
paste0("Number of DEL Individuals Lost, sHET ≥ 0.9 : ", results.fertility[variant.type == "DEL" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "DEL" & maf == 0 & gene.list == "highsHET",sum(n.indvs)])

paste0("Number of PTV Individuals Lost, pLI ≥ 0.9  : ", results.fertility[variant.type == "LOF_HC" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "LOF_HC" & maf == 0 & gene.list == "highPLI",sum(n.indvs)])
paste0("Number of PTV Individuals Lost, sHET ≥ 0.9 : ", results.fertility[variant.type == "LOF_HC" & maf == 0,sum(n.indvs)] - results.fertility.genelists[variant.type == "LOF_HC" & maf == 0 & gene.list == "highsHET",sum(n.indvs)])
paste0("")

## CHOD Biases
has.gp.data[[1]][variant.type == "META",paste0(Sex," Has CHOD data OR=",sprintf("%0.2f",var.beta), " [95% CI ",sprintf("%0.2f",exp(log(var.beta)-(1.96*var.stderr))), "-",sprintf("%0.2f",exp(log(var.beta)+(1.96*var.stderr))),"]", " p=",if_else(var.p<=1e-2,sprintf("%0.1e",var.p),sprintf("%0.2f",var.p)))]
paste0("")

## Mean sHET of high pLI genes:
paste0("High pLI genes (≥ 0.9) have a mean sHET value of : ",sprintf("%0.3f",mean.highpLI))
paste0("")

## Supplementary Note 1 Numbers:
paste0("Supp Note 1")
paste0("Base Childlessness Male             : ", sprintf("%0.1f",base.childlessness.male*100))
paste0("Base Has Children Male              : ", sprintf("%0.1f",(1-base.childlessness.male)*100))
## Intermediate values are stored once (sn1.* names) instead of repeating the full expression on every line
sn1.or.male <- plot.b[[1]][variant.type == "META" & Sex == "Male", var.beta]
sn1.or.female <- plot.b[[1]][variant.type == "META" & Sex == "Female", var.beta]
paste0("Odds ratio for Male                 : ", sprintf("%0.3f",sn1.or.male))
num <- base.childlessness.male/(1-base.childlessness.male) ## Numerator of equation 1.6
paste0("Numerator equation 1.6              : ", sprintf("%0.3f",num))
sn1.eq16.male <- num / (sn1.or.male + num)
paste0("Solved equation 1.6                 : ", sprintf("%0.3f",sn1.eq16.male))
paste0("Base Children Among Males w/Child   : ", sprintf("%0.3f",base.fertilities[sex == 1 & inc.zero == F,fertility]))
sn1.eq17.male <- (1 - sn1.eq16.male) * base.fertilities[sex == 1 & inc.zero == F,fertility]
paste0("Solved equation 1.7                 : ", sprintf("%0.3f",sn1.eq17.male))
paste0("Base Children Among Males           : ", sprintf("%0.3f",base.fertilities[sex == 1 & inc.zero == T,fertility]))
sn1.eq18.male <- sn1.eq17.male / base.fertilities[sex == 1 & inc.zero == T,fertility]
paste0("Solved equation 1.8 for males       : ", sprintf("%0.3f",sn1.eq18.male))
num.female <- base.childlessness.female/(1-base.childlessness.female)
sn1.eq18.female <- ((1 - num.female / (sn1.or.female + num.female)) * base.fertilities[sex == 2 & inc.zero == F,fertility]) / base.fertilities[sex == 2 & inc.zero == T,fertility]
paste0("Solved equation 1.8 for females     : ", sprintf("%0.3f",sn1.eq18.female))
paste0("Sex averaged                        : ", sprintf("%0.3f",(sn1.eq18.male + sn1.eq18.female) / 2))
paste0("1 - sex averaged                    : ", sprintf("%0.3f",1 - (sn1.eq18.male + sn1.eq18.female) / 2))
rm(sn1.or.male, sn1.or.female, sn1.eq16.male, sn1.eq17.male, sn1.eq18.male, sn1.eq18.female)
paste0("")
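
## Sketch only (not part of the original analysis): the Supp Note 1 arithmetic above,
## collected into one function so each step is visible. The function name and its
## argument names are hypothetical; nothing else in this document calls it.
##   base.childless : baseline proportion of childless individuals (e.g. base.childlessness.male)
##   or             : odds ratio entering equation 1.6 (e.g. the META var.beta from plot.b)
##   fert.parents   : mean number of children among individuals with children (inc.zero == F)
##   fert.all       : mean number of children among all individuals (inc.zero == T)
sketch.relative.fertility <- function(base.childless, or, fert.parents, fert.all) {
  num <- base.childless / (1 - base.childless) # numerator of equation 1.6
  eq16 <- num / (or + num)                     # equation 1.6
  eq17 <- (1 - eq16) * fert.parents            # equation 1.7
  eq17 / fert.all                              # equation 1.8
}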

## Change in cognition:
paste0("Supp Note 2")
paste0("SD change in COG for 1 sHET            : ", sprintf("%0.2f", fig.cog[[1]][Sex == "Male" & variant.type == "META", var.beta]))
paste0("Predicted drop in IQ for sHET = 1 male : " ,sprintf("%0.2f", model.cog[shet == 1 & sex == 1,100 - expected.iq_mid]))
paste0("Expected IQ (100 - drop)               : " ,sprintf("%0.2f", model.cog[shet == 1 & sex == 1,expected.iq_mid]))
paste0("sHET = 1 Children Among Males (num)    : ", sprintf("%0.2f",model.cog[shet == 1 & sex == 1,ratio_mid * base.fertilities[sex == 1 & inc.zero == T,fertility]]))
paste0("Base Children Among Males (denom)      : ", sprintf("%0.2f",base.fertilities[sex == 1 & inc.zero == T,fertility]))
paste0("Solved formula 2.4                     : ", sprintf("%0.3f",model.cog[shet == 1 & sex == 1,ratio_mid]))
paste0("Solved formula 2.5                     : ", sprintf("%0.1f%%",100 * (1 - model.cog[shet == 1 & sex == 1,ratio_mid]) / (1-((1-(num) / ((plot.b[[1]][variant.type == "META" & Sex == "Male", var.beta]) + (num)))*base.fertilities[sex == 1 & inc.zero == F,fertility]) / base.fertilities[sex == 1 & inc.zero == T,fertility])))
paste0("")

```
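
Most of the summary lines in the chunk above repeat one pattern: an odds ratio, a 95% confidence interval computed as `exp(log(OR) ± 1.96 × SE)`, and a p-value printed in scientific notation when it is ≤ 0.01. The block below is a minimal sketch of how that pattern could be factored into a single helper. `or.summary.string` is a hypothetical name that does not exist elsewhere in this repository. It uses base R `ifelse` in place of `dplyr::if_else`, and it assumes the standard error is on the log-odds scale, as the inline expressions do. It is written as a plain fenced block so it is not evaluated when the document is knitted.

```r
# Hypothetical helper (an illustration, not part of the original analysis).
# label  : text placed before "OR=" (e.g. "Male Ever having sex")
# or     : odds ratio (already exponentiated)
# stderr : standard error of log(or) (assumed, matching the inline code)
# p      : p-value
or.summary.string <- function(label, or, stderr, p) {
  ci.lower <- exp(log(or) - 1.96 * stderr)
  ci.upper <- exp(log(or) + 1.96 * stderr)
  # Scientific notation for small p-values, two decimals otherwise
  p.string <- ifelse(p <= 1e-2, sprintf("%0.1e", p), sprintf("%0.2f", p))
  paste0(label, " OR=", sprintf("%0.2f", or),
         " [95% CI ", sprintf("%0.2f", ci.lower), "-", sprintf("%0.2f", ci.upper), "]",
         " p=", p.string)
}

# Made-up numbers, purely to show the output format
or.summary.string("Male Example", or = 1.5, stderr = 0.1, p = 4e-4)
# "Male Example OR=1.50 [95% CI 1.23-1.82] p=4.0e-04"
```

Because every argument is vectorised, the helper could be called inside a `data.table` `j` expression, for example `fig.had.sex[[1]][variant.type == "META", or.summary.string(paste(Sex, "Ever having sex"), var.beta, var.stderr, var.p)]`, provided the helper has been defined in the session first.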