Split Markdown cell at heading by default? #130

Closed
matthew-brett opened this issue Nov 15, 2018 · 10 comments

@matthew-brett
Contributor

I'm having an excellent time using Jupytext, thanks again.

I notice that I often write text, with headings. If I forget to put a double carriage return before the heading, the resulting Markdown cell looks a bit goofy, with a heading in the middle. Do you think it would be sensible to split text cells automatically at headings? Or provide an option to do that?

@mwouts
Owner

mwouts commented Nov 15, 2018

Thanks Matthew, that's much appreciated!

From the above I guess you are using the md or Rmd format, is that correct? And you would prefer the following text:

A paragraph

# H1 Header

## H2 Header

Another paragraph

to be rendered to a total of four cells, is that right? (or three, the last one starting at the H2 header?)

Maybe we could add an option that converts markdown to cells using a single blank line as the cell separator. This should (hopefully) still preserve round trip conversions, but it would split every markdown cell that contains a blank line into multiple cells. Would that be fine with you? Or would you prefer something closer to your suggestion, that is, create a new cell when a header follows a blank line?

@matthew-brett
Contributor Author

I find I often want a cell that has blank lines, so I would not use an option that split on one blank line, I think.

But - I think I always want to break a cell at a heading. So I never want a cell that goes:

Some text.

## A heading

More text

Yes, I am proposing that this cell be split into two:

Some text
## A heading

More text

- so, yes, in your example, the last cell starting at H2 header.
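The rule proposed above (start a new cell at every heading) can be sketched in a few lines of Python. This is only an illustration of the idea, not Jupytext's implementation: the function name `split_at_headings` is hypothetical, only ATX headings (`#` to `######`) are recognized, and fenced code blocks are not handled, so a `#` comment inside a code chunk would be mistaken for a heading.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_at_headings(text):
    """Split Markdown text into cells, starting a new cell at every heading.

    Hypothetical helper for illustration; it does not look inside
    fenced code blocks.
    """
    cells, current = [], []
    for line in text.splitlines():
        # Close the current cell when a heading starts and the cell has content.
        if HEADING.match(line) and any(l.strip() for l in current):
            cells.append("\n".join(current).strip())
            current = []
        current.append(line)
    if any(l.strip() for l in current):
        cells.append("\n".join(current).strip())
    return cells


example = "A paragraph\n\n# H1 Header\n\n## H2 Header\n\nAnother paragraph"
cells = split_at_headings(example)
```

With the four-part example from earlier in the thread, this yields three cells, the last one starting at the H2 header.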

mwouts added a commit that referenced this issue Nov 28, 2018
@mwouts
Owner

mwouts commented Nov 28, 2018

Hello @matthew-brett, I have started working on this in this branch. Would you like to give it a try?

Still pending: an update of the documentation, and the test mode of Jupytext, which should now include the cell break on headings among the expected modifications on the Jupyter notebook, when represented as a (R) markdown file.

@matthew-brett
Contributor Author

Sorry to be slow. Yes, that branch does exactly what I was hoping, thanks.

@mwouts
Owner

mwouts commented Jan 29, 2019

Hello @matthew-brett, this is on its way for Jupytext 1.0. Could you please test the release candidate:

pip install jupytext --pre --upgrade

and review the documentation? Thanks!

@matthew-brett
Contributor Author

Thanks, yes, it works as you describe. The option was a good idea, I can see it would be confusing otherwise.

I added a PR for a more verbose description of the option, please feel free to ignore if it's too verbose.

Meanwhile, I found I could not edit my previous .Rmd files with the RC. I got this message:

File ones_zeros.Rmd is in format/version=rmarkdown/1.1 (current version is 1.0). It would not be safe to override the source of ones_zeros.Rmd with that file. Please remove one or the other file.

Should that get a doc update too? I wasn't immediately sure how to fix it. Deleting the .ipynb didn't help.

@mwouts
Owner

mwouts commented Jan 29, 2019

Thanks @matthew-brett for your feedback, and for the doc update.

The issue you encountered is caused by a version bump of the Rmd format that only existed in development version 0.9.0, when we considered making this the default behavior. Since it was finally implemented as an option, incrementing the Rmd format version number was not required.

I recommend that you remove the version information from all your markdown files. That may be a case where the new --update-metadata command can help. Would you like to try:

jupytext --update-metadata '{"jupytext":{"text_representation": null, "split_at_heading":true}}' ones_zeros.Rmd
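To make the intent of that JSON argument concrete, here is a minimal Python sketch of a recursive metadata merge in which a null value removes a key. The function `apply_update` is hypothetical and the "null deletes the key" semantics is an assumption inferred from the stated goal of removing the version information. It is not taken from Jupytext's code, and the tool may still write fresh version information when it saves the file.

```python
import json


def apply_update(metadata, update):
    """Recursively merge `update` into `metadata`.

    Assumed semantics (illustration only): a None value removes the key,
    a dict is merged recursively, anything else overwrites.
    """
    for key, value in update.items():
        if value is None:
            metadata.pop(key, None)
        elif isinstance(value, dict):
            existing = metadata.get(key)
            if not isinstance(existing, dict):
                existing = {}
            metadata[key] = apply_update(existing, value)
        else:
            metadata[key] = value
    return metadata


# The JSON string passed on the command line above.
update = json.loads(
    '{"jupytext":{"text_representation": null, "split_at_heading":true}}'
)

# Example metadata carrying the stale format version from the error message.
metadata = {
    "jupytext": {
        "text_representation": {
            "extension": ".Rmd",
            "format_name": "rmarkdown",
            "format_version": "1.1",
        }
    }
}

result = apply_update(metadata, update)
```

Under these assumptions the result keeps only the split_at_heading option in the jupytext section.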

@mwouts mwouts closed this as completed Feb 5, 2019
@abalter

abalter commented May 5, 2020

@mwouts I tried this, and unfortunately, it did not work.

(base) balter@spectre:~/bsta513$ jupytext --update-metadata '{"jupytext":{"text_representation": null, "split_at_heading":true}}' bsta_513_midterm_notes.Rmd
[jupytext] Reading bsta_513_midterm_notes.Rmd
[jupytext] Updating notebook metadata with '{"jupytext": {"text_representation": null, "split_at_heading": true}}'
[jupytext] Writing bsta_513_midterm_notes.Rmd in format Rmd (destination file replaced)

Something else to try?

---
jupyter:
  jupytext:
    cell_metadata_filter: -all
    split_at_heading: true
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.4.2
  kernelspec:
    display_name: R
    language: R
    name: ir
---

```{r}
---
title: BSTA 513 Midterm Notes
author: Ariel Balter
output:
  html_notebook:
  toc: true
  toc_float: true
  theme: simplex
---
```

```{r}
library(rmarkdown)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(beeswarm)
```


# Variables

The three main types of data are _categorical_, _continuous_, and _count_. These are mostly well-defined. In some cases one can argue about how to categorize a particular variable. 

## Continuous Variables
Continuous variables take on numbers that can be integers or real numbers. Integer continuous variables live in a grey area with _count_ data.

Examples of continuous, real-valued numbers are: *weight*, *age*, *concentration*, *volume*, *mass*, *time*, etc.

Continuous, integer-valued numbers are numbers that are sequential, but do not accumulate the way _count_ data does. Sometimes these could be equivalent to counts. Sometimes they could be equivalent to _ordinal_ variables. Main examples would be numbers that have been rounded: *age in years*, *number of weeks*, etc. Another example could be a record of events that happen at regular intervals but without any fixed bound such as *first*, *second*, *third*, *fourth*.

## Categorical Variables
Categorical variables are _polychotomous_, meaning they take on a finite set of discrete values. In general, there are two types of categorical variables: _nominal_ and _ordinal_.

### Nominal
Nominal variables generally _name_ something such as *race*, *sex*, *gender*, *color*, *tissue type*, *genotype*, etc.

### Ordinal
Ordinal variables are discrete values that also have a natural order. Examples are:

* Likert scale choices or ratings such as {*none*, *little*, *some*, *a lot*}, or "rate on a scale of 1 to 5.".

* Continuous variables broken into chunks. Examples *age* {*young*, *middle*, *old*}, *weight* {*underweight* , *normal*, *overweight*, *obese*}, *height* {*short*, *medium*, *tall*}. These are discrete, but have a natural order.

* Numbers representing the order in which something happened. Examples: *order of children* {*first*, *second*, ...}, {*before treatment*, *after treatment*}, 

## Count Data
Count data usually record an _accumulation_ of items or observations through events. However, some would consider anything you count as being count data, rather than integer, continuous data. Counts can arrive randomly or continuously. Examples of count data are *number of seizures*, *electrical pulses on a nerve*, *number of sequences for a gene detected in a sample*, *how many bacteria have grown vs. time*, *number of children*.

# Contingency Table

## Definition
A contingency table records the counts of two categorical variables. In general, it can contain any number of cells. However the two-by-two table is extremely common. A table with $R$ rows and $C$ columns is called an $R \times C$ table. This would represent one categorical variable that has $R$ possible values and another that has $C$ possible values.

Looking at the units of measurement in your data, typically patients or experimental subjects, you count the number of units that have the value $i$ of the variable along the rows and value $j$ along the columns, and you write that number in the $i,j$ cell of the table.

Example: a group has 10 people. 5 are male sex and 4 are female sex, and one is intersex (XXY). Of the 5 male, 4 identify as male gender and one as binary. Of the 4 female, 3 identify as female gender, and one as male. The intersex person identifies as male.


```{r}
sex_gender = data.frame(
  sex = c(rep("Male", 5), rep("Female", 4), "Intersex(XXY)"),
  gender = c("Male", "Male", "Male", "Male", "Binary", "Female", "Female", "Female", "Male", "Male")
)

table(sex_gender) %>% addmargins()
```


The row and column totals are called "marginal" values as they fall in the margins of the (extended) table.

## Association
- Know the name and distribution for test of general association for a 2 X 2 contingency table. Be able to write down the null and
alternative hypothesis. Know how to compute the test statistic for a
2 X 2 table. Know the distribution of the test statistic under the
null hypothesis. Know how to interpret R and SAS output for the
test.

- Know when to use Pearson’s chi-square test and when to use Fisher’s
exact test.

## Homogeneity

- Know the test of association for contingency table from a
matched-pair study. Be able to write down the null and alternative
hypothesis. Know how to compute the test statistic for a 2 X 2
table. Know the distribution of the test statistic under the null
hypothesis. Know how to interpret R and SAS output for the test.

- Know the name and distribution for test of general association for a
R X C contingency table. Understand under what situation such a test
is appropriate. Be able to write down the null and alternative
hypothesis. Know the distribution of the test statistic under the
null hypothesis. Know how to interpret R and SAS output for the
test.

## Trend
- Know the name and distribution for test of trend for a 2 X C (or R
X 2) contingency table. Understand under what situation such a test
is appropriate. Be able to write down the null and alternative
hypothesis. Know the distribution of the test statistic under the
null hypothesis. Know how to interpret R and SAS output for the
test.

- Know the name and distribution for test of trend for a R X C
contingency table. Understand under what situation such a test is
appropriate. Be able to write down the null and alternative
hypothesis. Know the distribution of the test statistic under the
null hypothesis. Know how to interpret R and SAS output for the
test.

## Sample Size
- Know how to compute by hand for required sample size for
one-proportion, two-proportion, and matched-pair studies.

# Basic Analysis Procedures (Steps)
## 1. Present Data
* Tables
* Graphs or charts

## 2. Descriptive analysis
* Summary data such as mean, standard deviation, median, min, max, etc.
* Proportions or percentages
* Can include estimation meaning confidence intervals are included.

## 3. Univariate analysis
### Categorical Response and Categorical Explanatory Variables
#### Tests of association:
* Chi-Square
* Fisher's Exact

#### Level of association
* Relative Risk
* Odds Ratio

### Categorical Response and Categorical Explanatory Variables
**Univariate Testing**
* T-test
* Wilcoxon/Mann-Whitney/Kruskal-Wallis
* AN(C)OVA
* Report the overall mean, standard deviation, median, min, max
and the distribution as well in each categorized group (if
necessary).

## 4. Model Based Methods (Regression, Classification, Machine Learning)
* Evaluate association between explanatory variables and a response variable
* If the dependent variable is categorical from **cohort** or **cross-sectional** study, use Logistic regression.
* If the dependent variable is categorical from **case-control** study, use conditional logistic regression.
* If the dependent variable is **counts data**, may use Poisson regression or negative binomial regression.
* If the dependent variable is categorical that are measured multiple times, generalized estimating equation (GEE) method may be used.

# Types of Study
## Cohort Study
• A cohort study is a type of observational study that
follows a group of people who do not have the disease.
– The goal is to find risk factors that are associated with the
disease.
• A prospective cohort study follows subjects over time.
• A retrospective cohort study takes a look back at
events that already have taken place.
• The methodology of prospective and retrospective
cohort studies is fundamentally the same, but the
retrospective study is performed post-hoc, as the
cohort is followed retrospectively.
* Cohort study
  – Prospective cohort study
  – Retrospective cohort study
* Case-control study
* Cross-sectional study
* Experimental study

## Case-Control Study
• A case-control study is also an observational
study, but looks at two existing groups with
and without a certain disease.
– The group with the disease is called case group,
and the group without the disease is called the
control group.
– The goal is to identify factors that may contribute
to the disease.

## Cross-Sectional Study
• A cross-sectional study does not follow
subjects over time. Instead, it evaluates
subjects at one specific point in time.
– The goal is often to provide data on the entire
population, for example, the prevalence of an
illness.

## Experimental Study
• An experimental study intentionally introduces
one or more treatments, procedures, or
programs, and then observes the outcome.
– The goal is to assess the effect of intervention
(treatments) on clinical outcomes.
– It often involves random allocation of subjects to
different interventions.
• Example: Clinical Trials

## Matched Pair

# Distributions
## Continuous
### Normal
### T
### Fisher
### Chi-Square

- Binomial (normal approximation)
- normal
- t
- chi-square

## Types
Types
Parts of 
Models
Test
Uses



## Distributions
- Binomial (normal approximation)
- normal
- t
- chi-square

## Hypothesis testing and inference

- Understand inference and hypothesis testing for proportion, for both
one-sample and two-sample studies.

- Understand likelihood ratio test.

List of tests and uses

## Association

## Homogeneity

Lecture 3: Measurement of association for contingency table (I)

# Odds Ratio and Risk

## Risk Difference

## Relative Risk
- Know how to compute and interpret estimated risk difference,
relative risk and odds ratio with their associated 95% confidence
interval from a 2 X 2 contingency table.

- Know when OR and/or RR is estimable and when RR is not estimable.
Understand the relationship between RR and OR.

- Know how to compute and interpret estimated odds ratio with its
associated 95% confidence interval from a matched-pair study.

- Know how to compute and interpret adjusted OR using Mantel-Haenszel
method. Know when it is appropriate to report Mantel-Haenszel
adjusted OR. Understand Breslow-Day test of homogeneity. Be able to
write down the null and alternative hypothesis. Know the
distribution of the test statistic under the null hypothesis. Know
how to interpret R and SAS output for the test.

- Understand Cochran-Mantel-Haenszel chi-square test. Be able to write
down the null and alternative hypothesis. Know the distribution of
the test statistic under the null hypothesis. Know how to interpret
R and SAS output for the test.

## Attributable Risk
- Know how to compute and interpret sensitivity, specificity, the
positive predictive value (PV+), and the negative
predictive value (PV-). Understand the relationship
between (PV+, PV-) and (sensitivity,
specificity).


# Inter-rater agreement
- Understand Cohen’s Kappa statistics as a measure of agreement among
raters. Know how to compute and test for Cohen’s kappa. Be able to
write down the null and alternative hypothesis. Know the
distribution of the test statistic under the null hypothesis. Know
how to interpret R and SAS output for the test.

# Confounding
- What is confounding? Understand the difference between a positive
confounder and a negative confounder.

  - What is population attributable risk? Know how to compute and
interpret population attributable risk for cohort studies and
case-control studies.

## Sensitivity and Specificity

- What is a ROC curve? What is the widely used summary of the overall
diagnostic accuracy of a test in a ROC curve analysis? How to
interpret that?

# Maximum Likelihood Estimation

# Logistic Regression

## Definition and use
- Understand the three components of generalized linear models.

- Understand why ordinary linear regression is not suitable for a
categorical outcome variable.

- Know the logistic regression function. Know how to write a logistic
regression model.

- Know how the logistic regression model can be expressed in the three
components of GLM.

## Hypothesis Testing and Inference

### MLE
- Know MLE and its nice properties. Be able to write down the
likelihood function for a logistic regression model.

### Tests of Significance
- Know the three tests for the significance of the coefficients.

### Deviance
- Know what deviance is. Know how to compute likelihood ratio test
from outputs of SAS and R. Be able to write down the null and
alternative hypothesis. Know the distribution of the test statistic
under the null hypothesis. Know how to interpret STATA and SAS
output for the test.

### Confidence Intervals
#### Types

- Know how to compute and interpret the confidence intervals for
estimated coefficients in simple logistic regression.

- Know that an alternative confidence interval to Wald confidence
interval is the profile likelihood confidence interval, and know the
difference of that from the Wald CI.

## Analyzing Results

- Know how to compute the predicted probability and its associated 95%
confidence interval using R or SAS output.

- Understand when and how to create design variables. Know there are
different ways of defining design variables. Computing estimated
odds ratio from the estimated coefficients depends on how the design
variables are defined.

- Understand likelihood ratio test and the Wald test in the multiple
logistic regression model setting. Know how to use the tests to
decide which variable(s) should stay in the model.

- Know how to compute and interpret predicted probability and the
corresponding confidence interval in multiple logistic regression
model, including both univariate and multivariable logistic
regression.

- Know how to interpret the coefficients for fitted logistic
regression models, including dichotomous independent variables. Know
how to compute estimated odds ratio from the estimated coefficients.


```{r}

```

@mwouts
Owner

mwouts commented May 7, 2020

Hello @abalter, well I gave it a try, and indeed I found two things that do not work as expected:

  1. I'd like to set the split_at_heading option with just jupytext notebook.Rmd --opt split_at_heading=true, but that does not work in version 1.4.2
  2. With the split_at_heading: true option in the jupytext section, like you have in the document, headers create new cells when they are preceded by a blank line, but not otherwise. For instance, take this part of your document:
    image

I'll fix the first point, but maybe I'd prefer not to change the behavior of the split_at_heading option. Which means that you should have a blank line before the titles if you want them to start a new cell. What do you think?
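For an existing document with many headings that directly follow text, the blank lines could be added with a small script rather than by hand. The sketch below is one possible approach, not part of Jupytext: the function name `add_blank_line_before_headings` is hypothetical, only ATX headings are recognized, and lines inside fenced code blocks are left alone so that `#` comments in R chunks are not treated as headings.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")
# Opening or closing line of a fenced code block.
FENCE = re.compile(r"^(```|~~~)")


def add_blank_line_before_headings(text):
    """Insert a blank line before each heading that lacks one.

    Hypothetical helper for illustration. Lines inside fenced code
    blocks are copied unchanged. A trailing newline is not preserved.
    """
    out = []
    in_fence = False
    for line in text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and out and out[-1].strip():
            out.append("")
        out.append(line)
    return "\n".join(out)


sample = (
    "# Types of Study\n"
    "## Cohort Study\n"
    "text\n"
    "```{r}\n"
    "# a comment\n"
    "x <- 1\n"
    "```\n"
    "## Next"
)
fixed = add_blank_line_before_headings(sample)
```

Headings that already follow a blank line, or that open the document, are left unchanged, so running the function twice gives the same result as running it once.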

@abalter
Copy link

abalter commented May 7, 2020

I think the line before a heading would probably make a document easier to read anyway.

Sounds great!
