CCJS 200 Statistics for Criminology & Criminal Justice

Course Syllabus

  • Catalog description: Introduction to descriptive and inferential statistics, graphical techniques, and the computer analysis of criminology and criminal justice data. Basic procedures of hypothesis testing, correlation and regression analysis, and the analysis of continuous and binary dependent variables. Emphasis upon the examination of research problems and issues in criminology and criminal justice. Prerequisite: CCJS100 or CCJS105; and 1 course with a minimum grade of C- from (STAT100, MATH107, MATH111, MATH120, MATH130, MATH135, MATH140). Restriction: Must be in Criminology and Criminal Justice program; or permission of BSOS-Criminology & Criminal Justice department.
  • Meetings and Contact Information: This course is scheduled to meet on Tuesdays and Thursdays from 11:00am to 12:15pm in LeFrak 2205, and I plan to hold office hours on Wednesdays from 10:00am to 12:00pm (or by appointment). My office is 2139 LeFrak Hall. My email address is rbrame at umd.edu. My preference is that you contact me on ELMS, but you can email me if needed. Please do not email me from non-University email accounts.
  • Discussion sections: This course has 6 weekly 50-minute discussion sections that meet on Fridays at various times (9am, 10am, 11am, 12 noon, 1pm, and 2pm). Please sign up for the time that works best for you (up to the enrollment limits) and plan to attend the section for which you are registered. Discussion section attendance is monitored.
  • Graduate teaching assistants: Marshae Capers (office hours on Thursday: 12:30-2:00) and Xinyi (Sammy) Situ (office hours: Tuesday 12:30-2:00) will be assisting with the class this semester and will be responsible for overseeing the discussion sections. You can contact them on ELMS. Sammy will be teaching the Friday discussion sections at 9, 11, and 1; Marshae will be teaching at 10, 12, and 2. Their office is located in 2163 LeFrak Hall.
  • Course-related policies: In all matters, the class will follow University guidance as outlined here.
  • Accessibility accommodations: If you think you might need one or more academic accommodations, please contact the Accessibility and Disability Service Office (link) for guidance and assistance. Please contact me on ELMS to set up an appointment to discuss any accommodations that are authorized.
  • Required textbook: Weisburd, David and Chester L. Britt (2007). Statistics in Criminal Justice (3rd Edition). New York: Springer-Verlag. This book is available as a free pdf download from the University of Maryland Library (link).
  • Class notes and out-of-class assignments will be posted on this webpage.
  • Letter grades: At the end of the semester, letter grades will be assigned on a 100-point scale (A+ = 98 and higher; A = 92-97; A- = 90-91; B+ = 88-89; B = 82-87; B- = 80-81; C+ = 78-79; C = 72-77; C- = 70-71; D+ = 68-69; D = 62-67; D- = 60-61; and F = any grade less than 60). All numeric grades (including the final numeric grade in the class at the end of the semester) will be rounded to the nearest 1 point (for example, a 78.5 would be rounded to a 79 and a 78.4 would be rounded to a 78).
  • Numeric grades in this class will be based on 3 in-class exams and 3 out-of-class assignments and will all be graded on a 100-point scale. The final numeric grade calculation at the end of the semester will be: 0.25 × Exam 1 + 0.25 × Exam 2 + 0.25 × Exam 3 + 0.25 × Average Assignment Grade.
  • Assignment submission rules: assignments are due on ELMS at 11:59pm on the due date and must be submitted as a pdf file. If you submit your assignment in any form other than a pdf file, you will receive an automatic 10-point deduction. It is your responsibility to make sure the pdf file you submit can be opened and read. File submissions that cannot be opened or read will receive a grade of 0. In the unlikely event that there is an ELMS or Canvas problem at the deadline time, you can submit your assignment to me by email (rbrame@umd.edu), but the time stamp of the email will be marked as the submission time. Such emails must originate from your UMD email account or they will not be accepted.
  • Late submission of assignments: If you submit your assignment after the 11:59pm deadline, there will be a 5-point grade penalty for each hour the submission is late (so, for a submission arriving at midnight up through 12:59am, the penalty is 5 points; if the submission arrives at 1:00am up through 1:59am, the penalty is 10 points, etc.). If an emergency arises and you are unable to submit your assignment, you are expected to notify me no later than the time the assignment is due and to supply appropriate documentation to support your excuse. The documentation must demonstrate incapacity to work on the assignment for a substantial fraction of the time between when the assignment is posted and the time it is due. If the emergency is such that you are unable to notify me by the time the assignment is due, your excuse must also include documentation to justify the delay in notification. Once I have reviewed the documentation, I will make a judgment about whether an extension is warranted.
  • Attendance expectations for classes and discussion sections: My expectation is that you will attend all of the class and discussion sessions. If you have to miss a class or a discussion session, I encourage you to work with other people in the class to get caught up on your notes and contact me or the TAs if you need clarification. Discussion section attendance will be monitored.
  • Attendance expectations for exams: If you have to miss an exam, we will follow the notification and make-up policies specified by the University (see this webpage). Those guidelines require that such notifications be timely and supported by appropriate documentation. The definition of "timely" used in this course is notification before the exam begins. Valid excuses are those listed in the University policies linked above and accompanied by appropriate documentation. Notification after the exam begins will be considered but there must be a compelling reason for the notification delay. Any student who has 3 exams scheduled on the same day may request an accommodation.
  • Key dates: (1) first class day - Tuesday 1/28; (2) spring break - 3/17-3/21; (3) last day of class - Tuesday 5/13; and (4) scheduled final exam period - Thursday 5/15 from 10:30am-12:30pm. Note: exam and assignment due dates are listed in the course outline below.
  • Statistical software: As noted in the catalog description, there is a computer component to this course. We will be using R, which is free for you to install on your own computer and is widely available on University computers across campus.
  • Academic integrity: the guiding principle in this class is that the work you submit should be your own work. For in-class exams, this means you will not look at other students' papers or unauthorized materials when you are working on your exam. For assignments, this means you are not permitted to discuss assignment problems with other students, share R code to solve assignment problems with other students, or use artificial intelligence to obtain R code or to solve assignment problems. If credible evidence of a rule violation materializes, it will be reported to the Office of Student Conduct in accordance with the University's Academic Integrity Policy (website). Please note that an Academic Integrity referral does not mean that you have been found responsible for a violation or that I have concluded you are responsible for a violation. It means that an issue of concern has been noted and that the Office of Student Conduct will examine it carefully. A determination of responsibility will only be reached after the Office of Student Conduct has completed its investigation with appropriate due process; as an instructor, my role in the process is a limited one.

Frequently Asked Questions

  • Why do I have to take this course?

The University of Maryland has a General Education curriculum (link) which all baccalaureate-level degree recipients must complete. Part of this curriculum is called Fundamental Studies and successful completion of this course (CCJS 200) satisfies the Fundamental Studies Analytic Reasoning (FSAR) requirement. So, when you see FSAR labels associated with this course, that is what those labels refer to. The Analytic Reasoning requirement expresses a belief among the University faculty that analytic reasoning is an essential part of what it means to have received a liberal arts education.

  • What does UMD expect us to cover in a class that satisfies the University's analytic reasoning requirement?

According to the Gen-Ed website, "[c]ourses in Analytic Reasoning foster a student's ability to use mathematical or formal methods or structured protocols and patterns of reasoning to examine problems or issues by evaluating evidence, examining proofs, analyzing relationships between variables, developing arguments, and drawing conclusions appropriately. Courses in this category advance and build upon the skills that students develop in Fundamental Mathematics."

  • Why is this course taught in the CCJS department?

The disciplinary focus of this course reflects a belief by the faculty that analytic reasoning skills can be usefully conveyed through the disciplinary lens of the student's major. The alternative would be to teach a course like this in a math or statistics department and, indeed, this is what is done at some universities. There is no right or wrong way to deliver this kind of course; what we have at Maryland reflects the views of the faculty who work here.

  • I don't like math and I'm apprehensive about taking this course. Can you help me feel better about this?

It just so happens that when criminologists do research and evaluation work, they often rely on quantitative data, and so it came to pass that we have this class which emphasizes "statistical analysis." Even though you will be performing some calculations and working with some numeric data in this class, the calculations you will be doing are probably best viewed as arithmetic. This is to emphasize that all of the calculating in this class will be focused on addition, subtraction, multiplication, and division. Discussion sections and office hours are the appropriate venues for getting help with the practice problems. The University also offers the Math Success Program, which may be helpful to you.

  • What will I be able to do when this course is over?

These are the course learning outcomes which I have defined as your being able to: (1) explain the meaning and limitations of commonly used statistics related to crime and criminal justice; (2) describe the concepts of point estimation, interval estimation, and hypothesis testing; (3) analyze certain kinds of quantitative criminological evidence, considering both the strengths and weaknesses of methodological approaches often used in our discipline; and (4) perform basic statistical calculations related to criminologically interesting phenomena.

  • Will there be homework?

Yes, I will regularly give you example questions and problems to complete outside of class. Some will come from the textbook and I will create some myself.

  • What will the assignments be like in this class?

You will have 3 out-of-class assignments in this course; each assignment will consist of substantive and computer-related applications of key concepts presented in class. If you have a question about an assignment, you are free to ask me or the TAs, up until 12 hours before the assignment is due. If a question we receive seems to be of interest to the entire class, we will post that question and the answer to the question on the course webpage so everyone can see the question and the answer.

  • What will the exams be like?

Exams will involve pencil and paper questions, problems, and calculations that will be similar to those you work on outside of class. For these exams, you will need to have a calculator which can take square roots. The exams are closed-book but a formula sheet will be provided. Before each exam, I will set aside class time for review. On exam days, the full class period will be devoted to the exam. The exams in this class are not designed to be cumulative but the material we cover later in the class does build on concepts learned earlier, so please keep that in mind.

  • Do I have to show my work to get credit?

When complex or multi-step calculations are involved, you must show your work (including intermediate steps) to receive full credit on an exam or assignment question. If you get an incorrect answer on an assignment or an exam, you may still be able to get partial credit if we judge that important intermediate steps were done correctly.

  • What happens in discussion sections?

Marshae and Sammy will be the CCJS 200 teaching assistants this semester and they will oversee the discussion sections each Friday. They are both Ph.D. students in our program who are well versed in the concepts and materials we are studying this semester. During the discussion sections, Marshae and Sammy will be answering lecture- and assignment-related questions. They will also go over some of the practice problems at the end of the textbook chapters. They are a valuable resource for you this semester and I encourage you to fully engage with the discussion section each week. Please note that the discussion sections are designed to have small class sizes so you can get help with homework problems and questions pertaining to the class, so it is important that you attend the section for which you registered. Note added on 2/5/25: If an unavoidable conflict arises, it is permissible to change discussion sections, but you must send a written request to your TA on ELMS so that we have a record of the change. I do ask you to keep such changes to a minimum because we want the discussion sections to be small so everyone can get the help they need.

  • How do I obtain and use R?

R is available at this website. R is an open-source implementation of the S language, which was originally developed at Bell Labs (S-PLUS was a later commercial version); you can read more about R's history at the Wikipedia page. To work with R, you will need to use a plain text editor (Notepad on Windows or TextEdit on Mac OS). You can use R on your own computer or you may access it on the OACS workstations in the basement of LeFrak Hall (website). The R system serves as a reminder that some important things in science and in life are still free.

Outline

The various topics I plan to discuss in this course are listed below (expected dates are in parentheses next to each topic). Here are the expected assignment distribution and due dates:

  • Assignment #1: distributed on Thursday 2/13; due on Thursday 2/20 (originally Wednesday 2/19) at 11:59pm (ET)
  • Assignment #2: distributed on Thursday 3/13; due on Wednesday 3/26 at 11:59pm (ET)
  • Assignment #3: distributed on Thursday 4/24; due on Wednesday 4/30 at 11:59pm (ET)

And, here are the expected exam dates:

Note: I'm not designing the assignments and exams to be cumulative but some level of cumulation is inherent, so please keep that in mind. Also, I will try to stick to the schedule that is described here; if changes need to be made, I will notify you as soon as possible. Notice that we skip over chapter 3 in the outline below (graphs and charts) but we will be encountering some graphing and charting issues throughout the semester and I will refer to parts of chapter 3 as appropriate when those issues arise.

Chapter 1: Introduction

  1. introductory material (1/28)
  2. some example criminology and criminal justice statistics (1/30)

Chapter 2: Measurement

  1. levels of measurement (2/4)
  2. validity of measurement (2/6)
  3. reliability of measurement (2/6)

Chapter 4: Central Tendency

  1. mode (2/11)
  2. median (2/11)
  3. mean (2/13)
  4. least squares (2/13)
  5. skewness (2/13)

Chapter 5: Dispersion

  1. proportion in the modal category (2/18)
  2. percentage in the modal category (2/18)
  3. variation ratio (2/18)
  4. index of qualitative variation (2/18)
  5. range (2/18)
  6. variance (2/20)
  7. standard deviation (2/20)
  8. coefficient of relative variation (2/20)
  9. mean absolute deviation (2/20)

Note: for topic 13, there are some comments in the chapter on the bounds for the variation ratio. This commentary will not be covered on the exam.

Chapter 6: Statistical Inference

  1. samples and populations (2/25)
  2. statistics and parameters (2/25)
  3. research questions (2/25)
  4. research hypotheses (2/25)
  5. nondirectional and directional hypotheses (2/27)
  6. null hypothesis (2/27)
  7. errors in hypothesis testing (2/27)
  8. statistical significance (2/27)

Chapter 7: Binomial Distribution

  1. coin flipping (3/6)
  2. sampling distribution (3/6)
  3. probability distribution (3/6)
  4. multiplication rule (3/11)
  5. ordering and arrangement (3/11)
  6. permutations and combinations (3/11)
  7. binomial distribution (3/11)

Chapter 8: Statistical Tests

  1. measurement type (3/13)
  2. assumptions about the population (3/13)
  3. sampling methodology (3/13)
  4. hypotheses (3/13)
  5. specifying the sampling distribution (3/25)
  6. rejection region (3/25)
  7. making a decision (3/25)

Chapter 9: Hypothesis Testing with Categorical Data

  1. checking on equality of frequencies across categories (3/27)
  2. checking on whether 2 categorical variables are statistically independent (3/27-4/1)
  3. analyses with ordinal data (3/27-4/1)
  4. chi-square tests with small sample sizes (3/27-4/1)

Chapter 10: Normal Distribution

  1. overview of normal distributions (4/1)
  2. the standard normal distribution (4/3)
  3. z-scores (4/3)
  4. percentiles of the standard normal distribution (4/3)
  5. standard error of the estimated sample mean (4/3)
  6. single-sample z-test (4/8)
  7. central limit theorem (4/10)
  8. single-sample test with a skewed variable (different from the book; 4/10)
  9. 1-sample t-test (4/17)

Note: I am not presenting the proportion examples in Chapter 10.

Chapter 20: Confidence Intervals

  1. point and interval estimation (4/22)
  2. margin of error (4/22)
  3. confidence intervals for sample proportions (4/24)

Note: I am not covering confidence intervals for the other statistics mentioned in this chapter.

Chapter 11: Comparing Means Between 2 Samples

  1. 2-sample t-test (4/29)
  2. pooled vs. separate variances (4/29)
  3. dependent-samples t-test (5/1)

Note: I am not covering the material on proportions in this chapter.

Chapter 14: Correlation

  1. covariation and Pearson correlation (5/6-5/8)
  2. scatterplots (5/13)
  3. linearity and influential cases (5/13)

Note: I am not covering the material on Spearman correlations in this chapter.

Lesson 1 - Tuesday 1/28/25 (topic 1 - introductory material)

  • 1.1: is criminology and criminal justice a scientific discipline?
  • 1.2: defining the terms "criminology" and "criminal justice"
  • 1.3: the line between these isn't always clear
  • 1.4: normative and empirical statements
  • 1.5: on page 3, your book argues that "statistics" can help us "simplify and clarify"; how so?
  • 1.6: examples of key "criminology" and "criminal justice" statistics (more on this on Thursday).
  • 1.7: criminal justice example: waiting time on death row before execution
  • 1.8: criminology example: the relationship between age and crime
  • 1.9: page 4-5: striking a balance between simplicity and accuracy.
  • 1.10: substance of a statement by Einstein: "things should be as simple as possible -- but no simpler."
  • 1.11: pp. 5-6: statistics and measurements should use the information that is available in the data.
  • 1.12: p. 6: pay attention to unusual (and I would add, missing) cases
  • 1.13: pp. 6-7: be transparent about the science (promotes understanding and reproducibility)
  • 1.14: science of statistics: point estimation, interval estimation, hypothesis testing
  • 1.15: science of probability: measuring and quantifying uncertainty
  • 1.16: pp. 7-10: descriptive and inferential statistics

Lesson 2 - Thursday 1/30/25

Announcement: This is a note I received from the Office of Accessibility and Disability Services (ADS):

Do you take well-organized, comprehensive notes? Do you have good penmanship or do you currently
type your notes? Why not get paid to share your notes with classmates who are eligible to
receive course lecture notes? 

If you are interested in providing this much needed service to a fellow student, please go to
https://go.umd.edu/adsNoteTakers to apply. If you are selected by an eligible student, the Accessibility
and Disability Service (ADS) will compensate you with a one-time payment at the end of the semester. 

Staff at ADS are available to answer any questions you may have. Feel free to contact us at adsnotetaking@umd.edu.

Steps to Apply:

* Go to adsonline.umd.edu and click on Current Students
* Select the Note Takers icon
* You will be directed to the Central Authentication Service (CAS); sign in with your UMD Directory ID and Password
* Set up your Note Taker Profile to apply to be a Note Taker.
* If previously completed, skip to step 6 by clicking Courses/Notes tab in upper left menu
* Read and acknowledge (check box) the Confidentiality Agreement
* Select in which course(s) you would like to serve as a Note Taker for the semester
* Upload your Sample Notes – your application will not be complete until this step is finished
* It is important to understand, if you are applying to serve as a Note Taker in more than one class,
  sample notes are required for each individual class. 
* After ADS reviews your application and it is determined to be complete, you will be considered an eligible Note Taker.
* If selected by an ADS student who is seeking a Note Taker, you will receive an email from ADS that you have been chosen
  and should immediately begin uploading notes. The ADS student will see your name and email address. You will not have
  their information unless they choose to contact you.
* After receiving the email confirming you are chosen as a Note Taker, please complete the paperwork listed at
  https://go.umd.edu/adsNoteTakers to receive your compensation following the conclusion of the semester.
  • 1.17: p. 9: samples and populations: often we can't study an entire population so we study a sample.
  • 1.18: pp. 10-11: multivariate statistics: measuring cause and effect often requires that we pay attention to multiple factors at the same time. Example: job attachment and crime (need to account for age)

Topic 2: Example Criminology/Criminal Justice Statistics

  • 2.1: # of murders for Philadelphia in 2018 (link)
  • 2.2: we can use the information in 2.1 to calculate the Philadelphia murder rate in 2018.

  • 2.3: now, let's look at the murders in Philadelphia for the year 2019 (link):

  • 2.4: Participation in the FBI's Uniform Crime Reporting (UCR) program is voluntary.
  • 2.5: Each year, the Bureau of Justice Statistics (BJS) publishes a set of crime victimization statistics for the nation.
  • 2.6: These statistics are based on the National Crime Victimization Survey (NCVS).
  • 2.7: The NCVS is survey data based on a sample of U.S. households.
  • 2.8: An important feature of the NCVS is its measure of crimes reported to the police.
  • 2.9: Here is a link to the most recent (2023) report.
  • 2.10: The crimes reported to the police table is on page 6 of the report.
  • 2.11: Notice the change in the percentage of robberies reported to the police from 2022 to 2023.

  • 2.12: From time to time, the Bureau of Justice Statistics publishes reports documenting recidivism patterns among people being released from prison. The table below summarizes the recidivism patterns for people released from prison in 15 states during the year 1994 (link):

Week 1 Practice Questions (no questions at the end of Chapter 1 so I'm making up a few)

  • The death penalty is morally wrong. (empirical or normative?)
  • Death penalty states have higher murder rates than non-death penalty states (empirical or normative?)
  • When a new neighborhood police station opened in neighborhood X, reported robberies declined by 10% (empirical or normative?)
  • Average age of first criminal offense (criminology or criminal justice?)
  • Percentage of cases in a domestic violence court resolved with a plea bargain (criminology or criminal justice?)
  • Suppose we survey every 5th house on a 3-mile street. Is this a sample or a population?
  • Is the National Crime Victimization Survey (NCVS) based on a sample or a population?
  • Consider the following dataset of El Paso murders:
Year N =
2014 21
2015 17
2016 17
2017 19
2018 23
2019 40
2020 23

Which of these years seems to be an outlier? How does it affect our ability to summarize the yearly number of murders in El Paso during this particular 7-year time period? How should we address it in an analysis?

  • Consult the Uniform Crime Reports for Charlotte-Mecklenburg North Carolina in 2018 and 2019. Calculate the murder rate for each year. What conclusion do you draw about the change in the murder rate for Charlotte from 2018 to 2019?
  • Review the National Crime Victimization Survey Report for 2019 (link). Find the table presenting crime victimizations reported to the police. Describe the change you see in the percentage of robberies reported to the police from 2018 to 2019.

Lesson 3 - Tuesday 2/4/25

  • 2.13: When reviewing the recidivism table presented last class, two key patterns are evident: (1) how recidivism rates accumulate over time; and (2) how recidivism rates vary depending on the measure of recidivism (arrest, conviction, imprisonment). The chart below presents the same information graphically:

  • 2.14: For all of the crime statistics I have presented so far, the emphasis has been on point estimation; that is, estimation of a single number or a small set of numbers. Interval estimates provide a range of uncertainty for the point estimate. This is a more advanced topic which we will address later in the course.
  • 2.15: Let's look at Table 8 in the same report. Among other things, this table shows us the relationship between age and recidivism. What can we conclude about that relationship based on the information in the table?

  • 2.16: Murders in Oklahoma City during the 1990's:
Year N =
1992 61
1993 80
1994 65
1995 227
1996 67
1997 59
1998 56

Clearly, 1995 is an outlier (due to the bombing of a federal building in April 1995). How should we deal with it? We could calculate an average both with and without the outlier included:
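Average including the outlier: (61+80+65+227+67+59+56)/7 = 615/7 = 87.857 murders per year.
Average excluding the outlier: (61+80+65+67+59+56)/6 = 388/6 = 64.667 murders per year.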

Note: this is a case where it would be reasonable to present 2 answers instead of 1.

  • 2.17: Crime clearances are a significant research topic; they relate to accountability, justice, police performance, and police-community relations. Let's look at murder clearances in Baltimore for the year 2023. (Note: these data are based on information compiled by the FBI and tabulated by https://crimedatatool.com).

The total number of murders was 233 and the number of clearances was 82. This produces a clearance rate of (82/233) x 100 $\approx$ 35.193 clearances per 100 murders.
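Note: in R, this is a one-line calculation (a small sketch, not part of the posted notes):

murders=233
clearances=82
(clearances/murders)*100    # about 35.193 clearances per 100 murders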

  • 2.18: Clearances are regularly tabulated for so-called Part I index crimes: murder, rape, robbery, aggravated assault, burglary, larceny, motor vehicle theft, and arson. We'll take a look at the thefts and theft clearances in Buffalo, NY in 2017 (again, tabulated by https://crimedatatool.com):

The yearly clearance rate cannot be calculated from these numbers since two months are missing clearance information. It is also important to note that a murder or theft occurring in one year might not be cleared until a later year, so it is not strictly accurate to interpret a single year's clearance rate as the fraction of that year's crimes that were solved.

Note: these are all examples of descriptive statistics: they tell us what is typical and they tell us how much variability there is. When we try to learn what is typical, we use measures of central tendency. When we try to learn about the variation in the data, we use measures of dispersion (these terms are discussed on pp. 7-9 in your textbook).

Chapter 2/Topic 3 (Levels of Measurement)

Criminologists often converse about fairly broad concepts. Terms like crime rates, recidivism, homicide, delinquency, incarceration, and plea bargaining are often used in fairly loose and informal ways. To do research, we must become more specific in defining the terms that we use. Toward this end, chapter 2 considers the idea of measurement: the process by which we transition from abstract concepts to actual research exercises like classification and counting.

  • 3.1: Positivism: the empirical study of scientifically interesting phenomena; as opposed to advocacy or opinions (normative reasoning).
  • 3.2: Measurement involves placing study units into categories or on a continuum so that we can see the variation that exists.
  • 3.3: Variables are traits, characteristics, or attributes that can be used to differentiate study units.
  • 3.4: A unit or unit of analysis is the main focus of a study.
  • 3.5: Example: the unit of analysis in a study describing the variation in crime across neighborhoods within a city is the neighborhood.
  • 3.6: Example: the unit of analysis in a study examining the probability that an individual person has been a victim of a crime within the past year is the individual person.
  • 3.7: The categories or distinct values of some variables represent qualitative differences or differences in kind.
  • 3.8: The categories or distinct values of other variables represent quantitative differences or differences in degree.
  • 3.9: This leads us to the topic of levels of measurement which corresponds to the meaning we can assign to the differences between categories.
  • 3.10: 4 levels of measurement: nominal, ordinal, interval, and ratio.
  • 3.11: A nominal scale variable represents purely qualitative differences; the arrangement or ordering of the categories is not important.
  • 3.12: Examples of nominal scale variables: sex, race, religious affiliation.
  • 3.13: An ordinal scale variable involves qualitative differences but the ordering of the categories is significant. An ambiguity in ordinal variables is that the distance between the categories is not well defined.
  • 3.14: Examples of ordinal scale variables: Likert-type scales (1=strongly agree, 2=agree, 3=disagree, 4=strongly disagree); severity of head injuries in motorcycle accidents (Weiss, 1992:49; link).

  • 3.15: Interval scale variables have many ordered values with equal distances between adjacent values, but there is no true zero.
  • 3.16: Examples of interval scale variables: IQ scores, certain kinds of risk assessment scores (for example, the Federal Bureau of Prisons Risk Assessment, link):

Note: this particular example illustrates how we can collapse or reduce an interval level variable to an ordinal level of measurement.

Note: A problem is that the dividing line between an ordinal variable and an interval variable can be somewhat ambiguous (i.e., how many categories do we need to have before we decide that a variable is no longer an ordinal variable and is now an interval variable?).

  • 3.17: Ratio scale variables have many ordered values plus there is a true zero (which represents the absence of the phenomenon).
  • 3.18: Examples of ratio scale variables: speed at which a car is driving; the number of months an offender is sentenced to prison; age at which a person is arrested for the first time.
  • 3.19: The scale of measurement is, therefore, based on the amount of information we have plus the nature of the phenomenon (variable) that is being studied.
  • 3.20: Please see the other examples in Tables 2.1-2.3 of your textbook.
  • 3.21: The textbook (pp. 22-23) urges us to collect data at the highest level possible. For example, if we collect data at the interval or ratio level, we can always combine categories to get an ordinal scale; but if the data are collected at the ordinal level, we cannot increase the level of measurement to the interval or ratio level.
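Note: if you are curious how levels of measurement show up in R, here is a small sketch (using made-up data, not data from the textbook) showing one way to represent a nominal variable and an ordinal variable with the factor() function:

# nominal: the categories have no inherent order
sanction=factor(c("fine","community service","civics class","fine"))
table(sanction)

# ordinal: the categories are ordered, but the distances between them are not defined
safety=factor(c("very unsafe","somewhat safe","very safe","somewhat safe"),
  levels=c("very unsafe","somewhat unsafe","somewhat safe","very safe"),
  ordered=TRUE)
table(safety)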

Lesson 4 - Thursday 2/6/25

Topic 4: Validity

  • 4.1: Validity (pp. 23-24): does the measurement accurately reflect the concept that is being measured (acronym: MARC). So validity pertains to accuracy of measurement.
  • 4.2: Textbook example: measuring criminal offending: (1) self-report surveys; or RAP (record of arrest and prosecution) sheet information including - (2) arrest records; (3) conviction records; and (4) incarceration records.
  • 4.3: On page 24, the textbook argues that "self-report surveys are generally considered to provide the most valid measure of frequency of offending." Yet this measure is not perfect. What could go wrong?
  • 4.4: Would a measure based on arrests or convictions be better? Why or why not?
  • 4.5: Let's look at an example problem:

  • 4.6: Consider the following research question: how many murders/homicides occurred in Maryland in the year 2019?
  • 4.7: There are two major data sources for measuring murders/homicides in the United States: the Federal Bureau of Investigation (FBI) measure of murder and non-negligent manslaughter; and the Centers for Disease Control (CDC) measure of death by assault (homicide).
  • 4.8: Here is the number of murder/non-negligent manslaughter incidents based on the FBI data:

  • 4.9: Here is the number of homicides based on the Centers for Disease Control (CDC) data:

  • 4.10: What can account for the difference?
  • 4.11: Different kinds of validity: (1) content and construct validity; (2) criterion-related and multiple measures validity.
  • 4.12: Here is the long-term national comparison of the two data series.

Topic 5: Reliability

  • 5.1: Reliability (pp. 24-25): repeatability or consistency of the measurement (example: bathroom scale).
  • 5.2: Example: suppose I sit near an intersection with a 4-way stop sign for 3 hours. I count the number of cars that go through the intersection and I also count the number of cars that fail to come to a complete stop. Do you have any concerns about the reliability of the measurement? How could we improve it?
  • 5.3: Relationship between reliability and validity: a measure can be reliable but not valid; if a measure is unreliable it is also invalid.

Introduction to R

  • Here is where you go to get R software for your computer: link.
  • If you don't have a computer, you can use R in the Office of Academic Computing Services Lab on the bottom floor of LeFrak Hall.
  • R is also widely available on computers across campus.
  • Once you have R downloaded on your computer, you are ready to launch the application.
  • You should write your code in a plain text editor (Notepad on Windows or TextEdit on MacOS; if you use TextEdit, make sure it is set to plain-text mode).
  • Do not use Microsoft Word or any other word processor to write R code. It will cause problems!
  • You can paste your output into a word processor as you prepare your assignments so I can see your work and your results.
  • Let's revisit the Oklahoma City murder data above and enter this data into R.
murders=c(61,80,65,227,67,59,56)
year=1992:1998
data.frame(year,murders)
  • Here is the output we get:
> murders=c(61,80,65,227,67,59,56)
> year=1992:1998
> data.frame(year,murders)
  year murders
1 1992      61
2 1993      80
3 1994      65
4 1995     227
5 1996      67
6 1997      59
7 1998      56
>
  • Now suppose we want to know the total number of murders that occurred over the entire 7-year period:
sum(murders)

which gives:

> sum(murders)
[1] 615
>
  • Now, let's suppose I want to calculate the average yearly number of murders:
sum(murders)/7

which gives:

> sum(murders)/7
[1] 87.85714
>
  • The average, or mean, is a measure of typicality or central tendency.
  • You should verify that this is the same number we got before.
  • We could also calculate the average removing the outlier:
(sum(murders)-227)/6

which gives us this result:

> (sum(murders)-227)/6
[1] 64.66667
>
  • Finally, let's suppose that we want to calculate the murder rate for each of the 7 years.
  • Let's start a new R session.
  • To calculate the rate, we need to include the population for each of the 7 years.
  • We go to crimedatatool.com to get the data.
year=1992:1998
murders=c(61,80,65,227,67,59,56)
population=c(454255,457448,461271,466232,469632,472046,463637)
rate=(murders/population)*100000
data.frame(year,murders,population,rate)
> year=1992:1998
> murders=c(61,80,65,227,67,59,56)
> population=c(454255,457448,461271,466232,469632,472046,463637)
> rate=(murders/population)*100000
> data.frame(year,murders,population,rate)
  year murders population     rate
1 1992      61     454255 13.42858
2 1993      80     457448 17.48833
3 1994      65     461271 14.09150
4 1995     227     466232 48.68821
5 1996      67     469632 14.26649
6 1997      59     472046 12.49878
7 1998      56     463637 12.07841
>
  • Which year has the highest murder rate?
  • Which year has the lowest murder rate?
  • Considering the 7 years, which year has the median murder rate?
sort(rate)

which gives us the following output:

> sort(rate)
[1] 12.07841 12.49878 13.42858 14.09150 14.26649 17.48833
[7] 48.68821
>
  • Notice that the rates are sorted in ascending order.
  • The median murder rate is the middle score -- another measure of typicality or central tendency.
  • With 7 scores, the middle score is the 4th score (3 scores below the 4th score and 3 scores above the 4th score).
  • So, the median is 14.09150.
  • And 14.09150 was the murder rate for the year 1994.
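  • Note: R also has built-in mean() and median() functions. They are not required for this class, but as a check they should reproduce the hand calculations above:
mean(rate)      # average of the 7 yearly murder rates
median(rate)    # should return 14.09150, matching the sorted-list approach above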

Week 2 Practice Questions

  • Consider the following dataset of the number of murders in New York City reported to the FBI by the New York City Police Department (NYPD):
Year N =
1997 770
1998 633
1999 671
2000 673
2001 3472
2002 587
2003 597

The year 2001 is an obvious outlier due to the terrorist attacks of September 11, 2001. Calculate and report the average yearly number of murders in New York City during this period including and excluding the outlier.

  • What is the unit of analysis for the New York City table?
  • Examine the BJS recidivism study (link) for people released from prison in 1994. Looking at Table 12 of the report, we see evidence of a relationship between the number of prior arrests and the fraction of people who recidivated within 1 and 3 years. Based on the information in this table, what pattern do you see?
  • Calculate the theft clearance rate (# of clearances per 100 cases) for Buffalo for the first 4 months (combined) of 2017. Then, calculate the theft clearance rate for the last 6 months (combined) of 2017. Which of the two clearance rates is greater?
  • Problems 2.1-2.8 at the end of chapter 2.

R Practice Assignment (I will go over this in class on Tuesday 2/11/25)

  • Enter the New York City murders (1997-2003) and years from the table above.
  • Go to crimedatatool.com to get the population sizes for New York City for each of the years.
  • Enter the population sizes.
  • Calculate the murder rate for each of the years.
  • Calculate the average murder rate.
  • Find the median murder rate.
  • Which year corresponds to the median murder rate?
  • Calculate the murder rate for each of the years (using a calculator).
  • Calculate the average murder rate (using a calculator)
  • Verify that your calculations match those from R out to the 3rd decimal place.
  • Remember that you should only round the final answer (not intermediate calculations).
  • For this analysis, do you think the mean or the median would be a better measure of central tendency? Why?

Lesson 5 - Tuesday 2/11/25

Note: First assignment will be distributed on Thursday 2/13/25. It will be due on Thursday 2/20/25 (originally Wednesday 2/19/25) at 11:59pm (ET).

Chapter 4 Begins Here

  • A primary purpose of descriptive statistics is to transmit information about what is typical.
  • Considering a specific variable, we summarize typicality through measures of central tendency (mode, median, and mean)

Topic 6: Mode

  • 6.1: First measure of central tendency: the mode (pp. 60-62). The modal category is the most frequently occurring category, level, or value of a variable.
  • 6.2: Mode Example: Sanction for a sample of 100 people cited for littering: (1) Fine (N=22); (2) Civics class (N=25); (3) Community Service (N=53). What is the mode?
- the mode is the most frequently occurring category, level or value of a variable.
- in this case, the community service category has the largest number of cases.
- therefore, the mode is "community service"
  • 6.3: A common mistake in the previous example is to say that the mode is 53. It is correct to say that the mode is community service and the number of cases in the modal category is 53.
  • 6.4: It is most natural to use the mode as a measure of central tendency for nominal variables (categorical variables with no logical way to order the categories).
  • 6.5: The mode can be used with ordinal, interval, and ratio variables but it is often not a particularly meaningful exercise to do so.
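Note: here is a small R sketch (using made-up labels for the littering example in 6.2) showing how a frequency table identifies the modal category:

sanction=c(rep("fine",22),rep("civics class",25),rep("community service",53))
table(sanction)    # community service has the most cases (53), so the mode is "community service"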

Topic 7: Median

  • 7.1: For variables with order, the median (pp. 62-68) will often be more useful than the mode.
  • 7.2: Suppose we ask a sample of 97 people to answer a survey question about how safe they feel walking in their neighborhood in the early evening. Among the 97 people, here are the answers we get back: (1) very unsafe (N=16); (2) somewhat unsafe (N=21); (3) somewhat safe (N=38); (4) very safe (N=22). What is the median safety assessment?
- If the number of cases is odd, then the median observation is (n+1)/2
- So, the median observation for our sample is (97+1)/2 = 49
- 16+21 (the sum of the number of people in the first two categories) is 37; 37 < 49
- 16+21+38 (the sum of the number of people in the first three categories) is 75; 75 > 49
- So, the 49th observation lies within the 3rd category (somewhat safe).
- The median score in this example is "somewhat safe".
  • 7.3: Let's say we conduct a survey of 50 people and we present them with the following statement: Capital punishment (the death penalty) should be a punishment option for the crime of first degree murder. We receive the following responses: (1) strongly disagree (N=7); (2) disagree (N=9); (3) neutral (N=15); (4) agree (N=11); (5) strongly agree (N=8). What is the median level of agreement with the statement in our sample?
- The number of cases is even so there is no single median observation (p. 64).
- We can still use the median observation formula: (n+1)/2
- Since (50+1)/2 = 25.5, we recognize that the median lies between the 25th and 26th observations.
- 7+9 = 16 is the sum of the number of people in the first 2 categories; 16 < 25
- 7+9+15 = 31 is the sum of the number of people in the first 3 categories: 31 > 26
- Since the total number of people in the first 3 categories is greater than 26, the median agreement is "neutral."
  • 7.4: Notice that our median calculations in the previous 2 examples presume that there is a logical ordering of the categories.
  • 7.5: In both of these examples, the mode would also have been a meaningful number (and would have been equal to the median; convince yourself!).
  • 7.6: The median can also be used as a measure of central tendency or typicality for interval and ratio level variables (pp. 64-68).
  • 7.7: Suppose we study the last 7 people to be released from the local prison; our goal is to calculate the median time served in prison (in years) before release.
- Data: 7,3,9,5,2,2,8
- Sorted data: 2,2,3,5,7,8,9
- Median Observation: (n+1)/2 = (7+1)/2 = 8/2 = 4
- This means the score of the (sorted) 4th observation is the median: 5
- So the median time served in prison for our sample is 5 years.
- Notice that the mode of this distribution is 2. 
  • 7.8: A DWI checkpoint was set up for 1 hour last Friday night. The blood alcohol content levels for the 8 people who were cited for DWI were recorded. Calculate the median blood alcohol content level for the sample.
- Data: 0.12,0.09,0.15,0.13,0.09,0.1,0.11,0.12
- Sorted data: 0.09,0.09,0.1,0.11,0.12,0.12,0.13,0.15
- Median observation: (8+1)/2 = 9/2 = 4.5
- Since there is no 4.5th observation, we have to look at both the 4th and 5th observations: 0.11 and 0.12.
- Standard practice is to find the midpoint between the two middle observations.
- In this case, the midpoint would be (0.11+0.12)/2 = 0.23/2 = 0.115.
- So, the median blood alcohol level would be 0.115.
  • 7.9: Notice that in this example, there is no single mode since 0.09 occurs 2 times and 0.12 occurs two times.
  • 7.10: This means the distribution is bimodal.
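Note: here is a small R sketch (not required for the course) that checks the median examples above:

time.served=c(7,3,9,5,2,2,8)                     # example 7.7
sort(time.served)
median(time.served)                              # returns 5

bac=c(0.12,0.09,0.15,0.13,0.09,0.1,0.11,0.12)    # example 7.8
median(bac)                                      # returns 0.115

# example 7.2 (ordinal data): locate the category containing the median (49th) observation
counts=c(16,21,38,22)    # very unsafe, somewhat unsafe, somewhat safe, very safe
cumsum(counts)           # 16 37 75 97; the 49th observation falls in the 3rd category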

Solution to Last Week's Practice R Exercise

  • Enter the New York City murders (1997-2003) and years from the table above.
murders=c(770,633,671,673,3472,587,597)
year=1997:2003
data.frame(year,murders)
> murders=c(770,633,671,673,3472,587,597)
> year=1997:2003
> data.frame(year,murders)
  year murders
1 1997     770
2 1998     633
3 1999     671
4 2000     673
5 2001    3472
6 2002     587
7 2003     597
>
  • Go to crimedatatool.com to get the population sizes for New York City for each of the years.
  • Enter the population sizes.
population=c(7320477,7357745,7429263,8008278,8023018,8084693,8098066)
  • Calculate the murder rate for each of the years.
murder.rate=(murders/population)*100000
data.frame(year,murders,population,murder.rate)
> murder.rate=(murders/population)*100000
> data.frame(year,murders,population,murder.rate)
  year murders population murder.rate
1 1997     770    7320477   10.518440
2 1998     633    7357745    8.603179
3 1999     671    7429263    9.031851
4 2000     673    8008278    8.403804
5 2001    3472    8023018   43.275486
6 2002     587    8084693    7.260634
7 2003     597    8098066    7.372131
>
  • Calculate the average murder rate.
sum(murder.rate)/7
> sum(murder.rate)/7
[1] 13.49508
>
  • Find the median murder rate.
sort(murder.rate)
> sort(murder.rate)
[1]  7.260634  7.372131  8.403804  8.603179  9.031851
[6] 10.518440 43.275486
> 
- The median observation is in the 4th position (3 scores above, 3 scores below).
- So the median murder rate is 8.603179.
  • Which year corresponds to the median murder rate?
The year corresponding to the median murder rate is 1998.
  • Calculate the murder rate for each of the years (using a calculator).
  • Calculate the average murder rate (using a calculator)
  • Verify that your calculations match those from R out to the 3rd decimal place.
  • Remember that you should only round the final answer (not intermediate calculations).

  • For this analysis, do you think the mean or the median would be a better measure of central tendency? Why?

Lesson 6 - Thursday 2/13/25

Topic 7 (Continued)

  • 7.11: I want to make one last point about the sample mode and median. There are times when we have incomplete data as in Problem 4.7 in this week's practice questions. It is still possible in some instances to make progress in estimating the mode and median when some of the data are missing. Here is an example dataset with measures of the speed with which 9 cars passed a speeding camera. One of the observations was missing, however.
- Data: 57,52,63,48,52,52,59,58, ?
- Sorted Data: 48,52,52,52,57,58,59,63, ?
- Notice that it doesn't matter what the missing value is. The mode will still be 52.
- The middle position of the distribution of 9 cases is (9+1)/2 = 10/2 = 5.
- To estimate the median, we need to identify the 5th observation in the sorted dataset.
- If the missing case has a low value, say 45, the median would be 52.
- If the missing case has a high value, say 70, the median would be 57.
- This means that the median could be no lower than 52 and no greater than 57.
- We don't know the exact number but we know it must lie within that range.
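Note: you can verify the bounds on the median in R by substituting a low and a high value for the missing observation (a sketch; the values 45 and 70 are just the illustrative guesses used above):

median(c(57,52,63,48,52,52,59,58,45))    # missing value assumed low: median is 52
median(c(57,52,63,48,52,52,59,58,70))    # missing value assumed high: median is 57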

Topic 8: Mean/Average

  • 8.1: The arithmetic average or mean (pp. 68-71) is an important measure of central tendency.
  • 8.2: The mean considers the precise value of each score in estimating the "typical" score.
  • 8.3: An average is based on the sum of the scores divided by the number of scores (equation 4.2 on page 68).
  • 8.4: Consider the following example of a set of recidivism risk scores:
Data: 7.7, 8.1, 4.5, 3.2, 7.5, 8.8, 5.1
Sum of the Scores: 7.7+8.1+4.5+3.2+7.5+8.8+5.1 = 44.9
Number of Scores: 7
Average or Mean Score: 44.9/7 = 6.414286 (or 6.414)

Note: see discussion of rounding on page 71. The rounding rule we will follow this semester is that intermediate calculations should be rounded as little as possible. Some rounding will be inevitable due to the limits of your calculators but intermediate calculations should be as precise as possible. The issue is that if you're doing multiple calculations, the rounding error starts to become a significant problem. Your final result when solving a problem can be rounded (usually to the 3rd decimal place).

  • 8.5: There are two different ways to write the equation used to calculate the mean, $\overline{X}$. The book emphasizes the first way but both are acceptable and both are regularly used:
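$$\overline{X} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \mbox{or, equivalently,} \quad \overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$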

Note: the examples we have used so far rely on the first equation (which is what is used in the book). However, here is the example solved above but now based on the second formula:

Data: 7.7, 8.1, 4.5, 3.2, 7.5, 8.8, 5.1
Sum of the Scores: 7.7+8.1+4.5+3.2+7.5+8.8+5.1 = 44.9
Number of Scores: 7
Average or Mean Score: 1/7 x 44.9 = 6.414286 (or 6.414).
  • 8.6: It is often useful to compare the mean and the median. Let's calculate the median on the same dataset:
Data: 7.7, 8.1, 4.5, 3.2, 7.5, 8.8, 5.1
Sorted Data: 3.2, 4.5, 5.1, 7.5, 7.7, 8.1, 8.8
Number of Scores: 7
Middle Position: (7+1)/2 = 8/2 = 4
Score at the 4th position: 7.5 (this is the median)
  • 8.7: We return briefly to the example discussed above in 7.11. In this example, there was a missing observation; we determined that it was possible to still learn something about the mode and the median even if one of the observations was missing. However, this is only possible for the mean if the variable being studied has a logical lower bound and a logical upper bound. In the case of the example in 7.11, there is a logical lower bound (a car cannot go slower than 0 mph) but there is no logical upper bound. So, the mean value is unknown even if only one of the data points is missing.
  • 8.8: An edge case presented in Practice Problem 4.7d, raises the question of identifying a single missing observation if we know the overall mean score. I will illustrate how this works with a new dataset showing the number of murders in a local county in the last 5 months:
- Data: 15,17,21,10,?
- The 5th month is missing but we know that the monthly average (mean) is 17.6.
- Can we figure out the missing observation? -- Yes!
- (15+17+21+10+?)/5 = 17.6
- (63+?)/5 = 17.6
- Multiply both sides of the equation by 5.
- (63+?) = 17.6 x 5
- (63+?) = 88
- Subtract 63 from both sides
- ? = 88-63 = 25
- So, the missing observation is 25.
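Note: here is a short R sketch (not required) checking the calculations in 8.4, 8.6, and 8.8:

risk=c(7.7,8.1,4.5,3.2,7.5,8.8,5.1)    # recidivism risk scores from 8.4
sum(risk)                              # 44.9
sum(risk)/7                            # 6.414286
mean(risk)                             # same answer using the built-in function
median(risk)                           # 7.5, as in 8.6

# example 8.8: recover the missing 5th month from the known mean of 17.6
17.6*5-(15+17+21+10)                   # 25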

Topic 9: Least Squares Property

  • 9.1: One way to formalize our understanding of the mean's ability to summarize the information in the data is to study its least squares property (pp. 74-76).
  • 9.2: Let's think about a difference score between each individual data point, $x_i$, and a measure of central tendency, $CT$.
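$$\Delta_i = x_i - CT$$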

  • 9.3: This difference score makes complete sense for an individual observation. When we use this statistic for all of the cases, however, we will usually have some $\Delta$ values that are positive and some that are negative (when the mean is used these differences will always sum to zero as discussed on page 74).
  • 9.4: And we don't really care about the sign of the differences (positive or negative) as much as we care about the magnitude of the differences. For this reason, we usually square the differences and sum them across all of the observations. This gives us:
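$$\Delta_{\mbox{sq}} = \sum_{i=1}^{n} (x_i - CT)^2$$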

  • 9.5: We can calculate this equation for each of our measures of central tendency.
  • 9.6: Let's go back to our speeding example from 7.11 and let's say we are now able to see the speeds for all the cars (the previously missing car had a speed of 59):
- Data: 57,52,63,48,52,52,59,58,59
- Sorted Data: 48,52,52,52,57,58,59,59,63
- Mode: 52
- Median: 57
- Sum of the speeds: 500
- Number of speeds: 9
- Mean: 500/9 = 55.5555556
  • 9.7: What is $\Delta_{\mbox{sq}}$ for the mode? Note: when you see x^2 in the notes below, that means $x^2$ in arithmetic notation.
(57-52)^2 =  25
(52-52)^2 =   0
(63-52)^2 = 121
(48-52)^2 =  16
(52-52)^2 =   0
(52-52)^2 =   0
(59-52)^2 =  49
(58-52)^2 =  36
(59-52)^2 =  49

sum of the squares = 25+0+121+16+0+0+49+36+49 = 296
or
sum of the squares = 25+(3x0)+121+16+(2x49)+36 = 296
  • 9.8: What is $\Delta_{\mbox{sq}}$ for the median?
(57-57)^2 =   0
(52-57)^2 =  25
(63-57)^2 =  36
(48-57)^2 =  81
(52-57)^2 =  25
(52-57)^2 =  25
(59-57)^2 =   4
(58-57)^2 =   1
(59-57)^2 =   4

sum of the squares = 0+25+36+81+25+25+4+1+4 = 201
or
sum of the squares = 0+(3x25)+36+81+(2x4)+1 = 201
  • 9.9: What is $\Delta_{\mbox{sq}}$ for the mean?
(57-55.5555556)^2 =   2.086419
(52-55.5555556)^2 =  12.64198
(63-55.5555556)^2 =  55.41975
(48-55.5555556)^2 =  57.08642
(52-55.5555556)^2 =  12.64198
(52-55.5555556)^2 =  12.64198
(59-55.5555556)^2 =  11.864197
(58-55.5555556)^2 =   5.975308
(59-55.5555556)^2 =  11.864197

sum of the squares =  2.086419+
                     12.64198+
                     55.41975+
                     57.08642+
                     12.64198+
                     12.64198+
                     11.864197+
                      5.975308+
                     11.864197 = 182.2222
  • 9.10: Notice that the mean minimizes the sum of the squared differences. This is a demonstration of the mean's least squares property.
  • 9.11: A friendly reminder from your earlier arithmetic classes: when you have parentheses, do the operations within the parentheses first and then do other things afterward.
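Note: here is a quick R check (a sketch, not required) of the three sums of squared differences calculated above:

speed=c(57,52,63,48,52,52,59,58,59)
sum((speed-52)^2)             # mode:   296
sum((speed-57)^2)             # median: 201
sum((speed-mean(speed))^2)    # mean:   182.2222 (the smallest of the three)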

Topic 10: Skewness

  • 10.1: The mean and median will often be very close to each other -- but sometimes not.
  • 10.2: New Term: A distribution refers to the way the cases or units in a study are allocated across the categories. For example, in a sample of people sentenced to prison, we might have a variable that measures the most serious offense for which each offender was convicted. The values for this variable could be: (1) violent; (2) property; (3) drug; or (4) other. The number of cases in each category is the variable's distribution. The bar chart below displays the distribution of the most serious conviction offense variable for a sample of 1000 inmates.

Note: this is a technical point but a bar chart is not really a summary. A bar chart shows the actual number of cases in each category or group. As such, bar charts are only useful for variables with a small number of populated categories.

  • 10.3: If there are outliers (pp. 73-74) or the data are skewed (page 77), the variable's distribution will be affected. In these situations, there will be larger differences between the two statistics. Generally, in these situations, the median is regarded as a better measure of central tendency or typicality.
  • 8.10: We have already discussed outliers but we have not yet considered the issue of skewness. Let's consider 3 histograms. A histogram is a type of chart (pp. 34-40 in Chapter 3 discusses histograms) that summarizes the distribution of a variable:

Note: a histogram is a summary. Histograms are used to display the distributions of interval or ratio level variables where a relatively small number of cases can have the exact same score or where there are many categories to display (so the categories can be grouped in ways that effectively summarize the pattern). The reason a histogram is considered a summary is that we cannot usually see the exact score an individual case has; we can only see the approximate score. We have a summary whenever some of the information in the raw data is lost.

  • 10.4: The histogram on the left shows a symmetric distribution (IQ scores famously have symmetric distributions); symmetric means that the left half of the distribution is a mirror image of the right half. The other two charts reveal distributions that are skewed in different ways.
  • 10.5: When the distribution is symmetric, the mean and median will be the same value. When the distribution is skewed the mean and median will be different (as shown in the figure above).

Example: this website discusses the issue of whether to use the mean or the median as a measure of central tendency for income which has a distribution that looks like the far-right histogram above.

  • 10.6: As a general rule, the mean is the preferred measure of central tendency with interval or ratio level data (your book has a brief statement about using the mean on an ordinal scale on page 76 but this practice needs to be justified). When the data are highly skewed or there are outliers that induce a large difference between the mean and the median, we have to make a judgment call about which measure is to be preferred. Here is what your textbook says about this issue on page 78: "How should you decide when a distribution is so skewed that it is preferable to use the median as opposed to the mean? You should begin by comparing the mean and the median. When there is a very large difference between them, it may be the result of skewness. In such cases, you should look at the distribution of the scores to see what is causing the mean and median to differ widely. But there is no solid boundary line to guide your choice."

Week 3 Practice Questions

  • You should work on problems 4.1-4.10 at the back of Chapter 4.

Assignment #1 - Due on ELMS at 11:59pm ET on Thursday 2/20/25 (revised from Wednesday 2/19/25)

Instructions: Please complete each of the problems listed below. You are required to submit your assignment as a single pdf file on ELMS. Please review all assignment guidelines and rules in the syllabus above. We will accept questions about the assignment up until 11:59am ET on Wednesday 2/19/25. Please note that any questions we judge to be of interest to the entire class will be posted on this webpage so everyone has access to the same information in a timely fashion. Assignments can be submitted beginning at 12:01am (ET) on Wednesday 2/19/25.

  1. Go to crimedatatool.com and look at the Charlotte-Mecklenburg Police Department (North Carolina) yearly murders and population sizes from 2013-2019. Enter the data into R and print out a table showing each year, the number of murders, and the population size (15pts).
  2. Go to crimedatatool.com and look at the Nashville Metro Police Department (Tennessee) yearly murders and population sizes from 2013-2019. Enter the data into R and print out a table showing each year, the number of murders, and the population size (15pts).
  3. Use R to calculate the murder rate for each year for Charlotte (10pts).
  4. Use R to calculate the murder rate for each year for Nashville (10pts).
  5. Calculate the mean murder rate by hand and in R for Charlotte; verify your calculations agree to the third decimal place (15pts).
  6. Calculate the mean murder rate by hand and in R for Nashville; verify your calculations agree to the third decimal place (15pts).
  7. Calculate the median murder rate for Charlotte (10pts).
  8. Calculate the median murder rate for Nashville (10pts).

Questions from Students:

  1. I'd like to know if we had to do the questions for Assignment #1 in order. For example, if I wanted to do all the Charlotte, NC questions first, then Nashville, TN after, could I do that, or should it be in numerical order from questions 1-8?

Answer: I don't have any concern about the order you do the work in but when it is submitted it needs to be in the order of the questions on the assignment.

  1. I have screenshots of some of my R work and photographs of some of my written work. Do I put all of this into a single pdf file?

Answer: Yes. Please organize all of your materials so that they are in the same order as the questions on the assignment -- in a single pdf file. Remember to check your file closely before you submit it -- to make sure everything is in its proper place.

  1. I'm not sure whether I'm formatting my document correctly. Can you review it for me?

Answer: I can't preview/pregrade documents. What I can tell you is that as long as your document is a single pdf file we can open and read, and it is organized according to the order of the items in the assignment, there should not be any problems.

  1. For Questions 5/6, are the uppermost values in the datasets extreme enough to be considered outliers? If so, it is necessary that we present the adjusted mean murder rates for outliers in addition to simply the mean murder rate, or was that not an expectation for this particular assignment?

Answer: I would recommend that you stay focused on exactly what the questions are asking you to do.

  1. Note: There is an option on the assignment for you to submit a revised version of the assignment (up to 3 revisions will be accepted) if you wish to do so (for example, if you submit the assignment and you find a mistake before the submission deadline). If you choose to submit a revision, the revision will be the one that is graded. Also, if you submit a revision after the submission deadline, it will be subject to the late submission point reduction rules. So I would suggest not submitting a revision after the deadline has passed unless you think the improvement will be worth the point loss penalty for a late submission.

  2. For questions 3 and 4 on the homework should the murder rate be added to the table from the previous questions or is having the information in a separate set ok?

Answer: I didn't specify this in the question so it is up to you.

Lesson 7 - Tuesday 2/18/25

  • Reminder: your first assignment is due at 11:59pm (ET) on Thursday 2/20/25 (revised from Wednesday 2/19/25).
  • We will accept questions about the assignment up until 11:59am tomorrow. After that point, we will not be able to answer any further questions.
  • If you submit your assignment and then you decide you need to make a change, you can resubmit up to 2 additional times. The last version you submit will be the one that is graded.

Chapter 5 (Dispersion) and Topic 11: Proportion in Modal Category

  • 11.1: As a reminder, the mode is the most frequently occurring category, level, or value, of a variable.
  • 11.2: Recall the example from Topic 6.2: Sanction for a sample of 100 people cited for littering: (1) Fine (N=22); (2) Civics class (N=25); (3) Community Service (N=53). What is the mode?
- the mode is the most frequently occurring category, level or value of a variable.
- in this case, the community service category has the largest number of cases.
- therefore, the mode is "community service"
  • 11.3: As noted, the mode is "community service" (not 53). The number of cases in the modal category as a fraction of the total number of cases provides us with a sense of the variation in the data. Your book (p. 88) refers to this concept as "the proportion in the modal category."
  • 11.4: Calculation:
- Proportion in modal category = Number of cases in modal category / Total number of cases
- In our example, we have 53 cases in the modal category
- We have 100 total cases
- The proportion of cases in the modal category is 53/100 = 0.53

Topic 12: Percentage in Modal Category

  • We might choose to express the proportion in the modal category as a percentage instead (pp. 89-90). To do this, we would simply multiply 0.53 by 100, which would lead us to the conclusion that 53% of the cases were in the modal category.

Topic 13: Variation Ratio

  • 13.1: The proportion/percentage of cases in the modal category gives us a sense of how common it is for our study cases to be in the most commonly occurring category. The other side of this coin -- the variation ratio -- is the proportion of cases that are not in the modal category.
  • 13.2: So, we can obtain the variation ratio by subtracting the proportion of cases in the modal category from 1.
  • 13.3: For our littering sanctions example, we get the variation ratio by calculating 1-0.53 = 0.47.
  • 13.4: Another example:

  • 13.5: An issue with the variation ratio is that the upper bound will in general be less than 1 (p. 91).
  • Note: I have decided that I'm not going to test you on the bounds on the variation ratio. This means you are not responsible for the material after the first sentence on page 91. I just want you to be aware of point 13.5 so you know that the variation ratio is sensitive to the unique features of the data on which it is based.
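
Note: pulling Topics 11-13 together, here is a minimal R sketch that reproduces the littering-sanction calculations (the counts are the ones recalled in 11.2):

sanction = c(fine=22, civics_class=25, community_service=53)
n = sum(sanction)                  # 100 cases in total
p_modal = max(sanction) / n        # proportion in the modal category = 0.53
p_modal * 100                      # percentage in the modal category = 53
1 - p_modal                        # variation ratio = 0.47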

Topic 14: Index of Qualitative Variation

  • 14.1: Index of qualitative variation (IQV) describes variation for a nominal or ordinal (categorical) variable on a 0 to 100 scale (pp. 92-94).
  • 14.2: The basic question this measure answers is how much variation exists in our data compared to the maximum dispersion that could have existed (if cases were equally spread out among the categories).
  • 14.3: The formula in the textbook is too complicated; here is another approach that is easier to calculate.
  • 14.4: We will work the example in the textbook.
Fear of Crime Among Students:

- 1: Not Concerned at All: N=3
- 2: A Little Concerned: N=4
- 3: Quite Concerned: N=6
- 4: Very Concerned: N=7
- Total: N=20

- To calculate the IQV, we first turn the N's in this table into p's:

- p1 = 3/20 = 0.15 
- p2 = 4/20 = 0.20
- p3 = 6/20 = 0.30
- p4 = 7/20 = 0.35

- Next, we square each of the p's:

- p1sq = 0.15*0.15 = 0.0225
- p2sq = 0.20*0.20 = 0.0400
- p3sq = 0.30*0.30 = 0.0900
- p4sq = 0.35*0.35 = 0.1225

- Then, we add up the squares:

p1sq+p2sq+p3sq+p4sq = 0.0225+0.0400+0.0900+0.1225 = 0.275

- Then we subtract this sum from 1.0: 1.0-0.275 = 0.725 (this is called the diversity index, D)
- So, D = 1 minus the sum of the squared p's
- Next, we let k = the number of categories -- in this case the number of categories is 4
- Index of qualitative variation = k/(k-1) x D = 4/(4-1) x 0.725 = 0.9666667
- (multiplied by 100 is 96.66667 which matches the answer in the book).
- In arithmetic symbols, here are the 2 equations:

Note: I simplified the summation notation in the equation for the diversity index, D, above.
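
Note: here is a small R sketch (not required for the class) that reproduces the fear-of-crime IQV calculation above. It follows the two steps just described: D = 1 minus the sum of the squared p's, and IQV = k/(k-1) × D (times 100 to put it on a 0-100 scale):

counts = c(not_concerned=3, a_little=4, quite=6, very=7)
p = counts / sum(counts)           # proportions: 0.15, 0.20, 0.30, 0.35
D = 1 - sum(p^2)                   # diversity index = 0.725
k = length(counts)                 # number of categories = 4
(k / (k - 1)) * D * 100            # IQV = 96.66667, matching the book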

Topic 15: Range

  • 15.1: A basic measure of dispersion for interval and ratio level data is the range.
  • 15.2: Here is the formula:

  • 15.3: And, here is a worked example:
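
Note: here is a quick R version of the range calculation using hypothetical data (not the worked example from class):

x = c(3, 7, 2, 9, 14, 6)           # hypothetical interval-level scores
max(x) - min(x)                    # range = 14 - 2 = 12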

Lesson 8 - Thursday 2/20/25

  • Today, we continue working on Chapter 5 (Dispersion).
  • We ended our last class by discussing the range which only uses limited information in the data.
  • The measures we will discuss today use information from each observation in the dataset.
  • A preliminary concept we need to discuss is the idea of mean difference scores.
  • This will be fundamental information for Topics 16, 17, and 19.
  • We consider an example:

Note: a similar example is discussed in your textbook on pp. 95-96.

Topic 16: Variance

  • 16.1: The variance and the standard deviation (topic 17) overcome the problem of deviations summing to zero discussed above.
  • 16.2: Both of these measures consider the variation of individual data points about the mean or average value of the data.
  • 16.3: Here is the formula for the variance (equation 5.4 on page 96):

  • 16.4: Here is a worked example (similar to the worked example on pages 97-99):

Note: in the special case where the variance is 0, that means there is no variation in the variable.

Topic 17: Standard Deviation

  • 17.1: The standard deviation (mentioned above) is simply the square root of the variance (discussed in the book on pages 99 and 102-103).
  • 17.2: Main benefit of the standard deviation is that it expresses variation in terms of the original scale of the variable.
  • 17.3: Remember we squared the differences between the individual scores and the mean; the standard deviation helps us get back to a more understandable metric.
  • 17.4: As with the variance -- if the standard deviation of a variable is 0, then there is no variation in the variable (i.e., it is a constant).
  • 17.5: For our worked example above, we calculate the standard deviation by taking the square root of the variance that we already estimated:
sqrt(3.142857143) = 1.77281052, which rounds to 1.773.

Note: there is a section on pp. 100-101 that provides a shortcut formula for the variance and standard deviation. I am not going to go over the shortcut formula in this class but you are welcome to use it if you want to.
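
Note: here is a minimal R sketch of the Topic 16 and 17 calculations using hypothetical scores (these are not the numbers from the worked example in class). Be aware that R's built-in var() and sd() functions divide by N-1 rather than N, so they will not exactly match a by-hand calculation that divides by N:

x = c(2, 4, 4, 5, 7, 8)              # hypothetical scores
m = mean(x)                          # the mean = 5
dev = x - m                          # mean difference scores (these sum to 0)
sum(dev^2) / length(x)               # variance, dividing by N: 24/6 = 4
sqrt(sum(dev^2) / length(x))         # standard deviation (square root of the variance) = 2
var(x)                               # R's var() divides by N-1: 24/5 = 4.8
sqrt(var(x))                         # same as sd(x)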

Topic 18: Coefficient of Relative Variation

  • 18.1: When two distributions have similar means, the standard deviations of those distributions can be usefully compared (pp. 104-105).
  • 18.2: When two distributions have different means, it is customary to divide the standard deviation by the mean in each group.
  • 18.3: This adjustment is called the coefficient of relative variation and facilitates the comparison of variation across groups when the means of the two groups are different from each other.
  • 18.4: Example: Suppose we have 2 datasets measuring the number of prior convictions for violent and property offenders entering prison this month. The violent offenders had a mean of 5 prior convictions (with a standard deviation of 1) and the property offenders had a mean of 12 prior convictions with a standard deviation of 3. So, the standard deviation for the property offenders is three times the standard deviation of the violent offenders. How does consideration of the mean differences between the two groups affect our interpretation of the dispersion in the two groups? If we compute the coefficient of relative variation for each group, we get:
CRV for violent offenders: s/mean = 1/5 = 0.20
CRV for property offenders: s/mean = 3/12 = 0.25

Conclusion: After adjusting each group's dispersion for its mean, the relative variation in the two groups is quite similar -- slightly greater for the property offenders (0.25) than for the violent offenders (0.20) -- even though the raw standard deviation for the property offenders is three times as large.

Topic 19: Mean Absolute Deviation

  • 19.1: Earlier we considered mean difference scores for individual observations.
  • 19.2: We noted that when these scores are computed that they add up to zero.
  • 19.3: We also noted that the average value of the scores would be equal to zero.
  • 19.4: On pp. 105-107, your textbook discusses a concept called "mean deviation."
  • 19.5: The idea of mean deviation is closely connected to the mean difference scores discussed earlier.
  • 19.6: Nevertheless, there are some important distinctions which we will now discuss.
  • 19.7: As we have seen, if we sum the mean differences the total is 0.
  • 19.8: The variance and standard deviation address this problem by squaring the differences.
  • 19.9: The mean deviation (probably more accurately referred to as the mean absolute deviation) overcomes this issue as well but does so by evaluating the absolute values of the difference scores:
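
Note: in symbols, the mean absolute deviation is the sum of the absolute mean difference scores divided by N. Here is a minimal R sketch using the same hypothetical scores as the variance sketch above (these are not the data from the practice problems below):

x = c(2, 4, 4, 5, 7, 8)              # hypothetical scores
m = mean(x)                          # the mean = 5
abs(x - m)                           # absolute mean difference scores: 3, 1, 1, 0, 2, 3
mean(abs(x - m))                     # mean absolute deviation = 10/6, about 1.67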

Chapter 6 and Topic 20: Samples and Populations

  • 20.1: Scientific knowledge is often based on a sample rather than a full population or universe of cases (p. 116). Your book describes this as a dilemma.
  • 20.2: We study a sample and then use the sample information to develop an inference about what is occurring in a scientifically interesting population.
  • 20.3: Example: physicists haven't studied the motion of every rock that orbits the sun. Yet physicists are confident that they can make predictions about the future pathways of these objects based on what they have learned from the samples of rocks that have been carefully studied. The practice of developing inferences about a population or universe based on what is observed in a sample is called extrapolation.
  • 20.4: Another Example: Take a globe off its stand. Throw it in the air, catch it, and look at whether your right index finger is touching land or water. Then do this another 100 or so times. What you have is a sample of experiments where each experiment has one of two outcomes, land or water. What fraction of the time do you think your finger will be touching water? (A small simulation sketch at the end of this list illustrates the idea.)
  • 20.5: There is no single way to specify a population. What is a reasonable population or universe for one problem may be quite different than the population for a different problem (p. 117).
  • 20.6: Here is a good example of a sample-and-population problem: the National Crime Victimization Survey (link to 2023 report). In this report, the population is described on pages 19 and 35.
  • 20.7: Here is another example of a sample-and-population problem: the Bureau of Justice Statistics Recidivism Studies (here is a link to the 1994 BJS State Prisoner Recidivism Study); population and sample selection methodology are described on pp. 10-11 of this report.
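
Note: to make the globe example in 20.4 concrete, here is a small simulation sketch. It assumes that roughly 71% of the globe's surface is water (the exact figure is beside the point); rbinom() just plays the role of the repeated tosses:

set.seed(1)                               # so the simulation is reproducible
tosses = rbinom(100, size=1, prob=0.71)   # 1 = finger touching water, 0 = finger touching land
table(tosses)                             # how many land vs. water outcomes we got
mean(tosses)                              # fraction of the 100 tosses touching water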

Practice Exercises for this Week

  • Problems 5.1a, 5.1c, 5.2, 5.3a, 5.3b, 5.4, 5.5, 5.6, 5.7, 5.8, 5.10, and 5.11
  • Excluded problems which you do not need to do: 5.1b, 5.3c, and 5.9 (there will not be any exam questions like these questions).
  • There is no mean (absolute) deviation problem in the book, so I will give you a couple here (I will go over these on Tuesday 2/25/25):
- Mean (Absolute) Deviation Problem #1 - Consider the
following waiting times (in years) on death row for a
sample of 6 executed offenders: 5,7,11,18,9,10.
Calculate the mean (absolute) deviation for these data.

- Mean (Absolute) Deviation Problem #2 - A sample of 9
kids who have all been on probation for 1 year are
studied to measure the number of violations of
supervision conditions during that 1-year time frame.
The data are: 3,4,4,5,1,2,2,0,6. Calculate the mean
(absolute) deviation for this dataset.

Lesson 9 - Tuesday 2/25/25

  • Reminder: first exam is scheduled for Tuesday 3/4/25
  • Mean (Absolute) Deviation Problem #1:

  • Mean (Absolute) Deviation Problem #2:

Topic 20: Samples and Populations (Continued)

  • 20.8: There are many ways that samples can be drawn from a population.
  • 20.9: The sampling methodology is the approach that is used to select a sample from the population.
  • 20.10: Two broad methodologies: (1) random or probability samples; and (2) nonprobability samples.
  • 20.11: Both are used quite often in criminology and criminal justice research.
  • 20.12: Here is an example application:

  • 20.13: A simple random sample is a special type of probability sample; we will devote most of our attention in this course to simple random sampling problems. Here is some more information about simple random samples (see also the small R sketch at the end of this list):

  • 20.14: Simple random samples are a special case of probability samples:

  • 20.15: What does the term probability mean?

  • 20.16: An overview of different types of probability sampling:

  • 20.17: An overview of nonprobability sampling:
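
Note: returning to 20.13 above, here is a minimal R sketch of drawing a simple random sample. The "population" is hypothetical (ages for 1,000 people released from prison); sample() gives every case the same chance of being selected:

set.seed(1)
population = round(runif(1000, min=18, max=70))   # hypothetical population of release ages
srs = sample(population, size=50)                 # simple random sample of 50 cases
mean(population)                                  # a population quantity
mean(srs)                                         # the corresponding sample quantity (an estimate)

Notice that the sample mean will generally be close to, but not exactly equal to, the population mean -- a preview of Topic 21.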

Topic 21: Parameters and Statistics

  • 21.1: A parameter is a quantity that is based on an entire population (or universe) of cases (p. 117).
  • 21.2: A statistic is a sample quantity that is meant to be an estimate of a population parameter.
  • 21.3: Let's return to our recidivism example:

  • 21.4: Compare this exercise to the exercise in Chapter 6 (pp. 118-119 and Table 6.2).
  • 21.5: We have a known population parameter, 67.5% and we have the recidivism rates that were calculated in individual samples.
  • 21.6: When we use sample statistics to learn about population parameters, we are engaged in the process of statistical inference (p. 119).

Topic 22: Research Questions

  • 22.1: The first part of the inferential process is deciding on a research question.
  • 22.2: A research question is a question that can be answered by way of the scientific method.
  • 22.3: In other words, if we can use scientific methods to answer a question that question is a research or researchable question.
  • 22.4: A research question involves an effort to either describe or explain some scientifically interesting phenomenon (possibly both).
  • 22.5: Research questions are often expressed in fairly general terms.
  • 22.6: Descriptive research is designed to answer questions like "what's going on?"
  • 22.7: Here is an example of population and sample data on age at the time of release from prison from the North Carolina Department of Corrections (NCDOC) for the year 1978.

  • 22.8: Explanatory research is designed to answer questions like "why do the data look the way they do?"
  • 22.9: Here is an example from the death penalty literature. An interesting question is whether the death penalty reduces murder. Consider the case of South Dakota:

  • Now, let's consider what happens when we look at North Dakota:

Practice Questions for this Week

  • Problems 6.1-6.5 at the back of Chapter 6.
  • Suppose I tell you that a police chief decides to open a neighborhood police station. Then, we find that the average monthly number of robberies in the year before the police station opened was 12.2; in the year after it opened the average monthly number of robberies was 8.7. Based on this information, what conclusions can be drawn about the causal effect of the police station on the monthly robbery rates? Explain your reasoning.
  • If I go to a school and pass out a survey to the students who are present, do I have a probability or a nonprobability sample? Explain.
  • Considering the 2 charts of North Carolina prison releasees above, I tell you that the mean of the first chart is 30 and the mean of the second chart is 29. What could explain the difference between the 2 means?

Lesson 10 - Thursday 2/27/25

Announcements

  • Reminder: Exam 1 is scheduled for Tuesday 3/4/25.
  • Please review all guidelines and rules for the exam stated in the syllabus above.
  • You should bring a calculator that can take square roots (no phones, computers, or other technology are permitted).
  • The exam will include problems and questions like the ones we have been working on each week since the beginning of the semester (please see homework-related slides posted in the ELMS announcements each week).
  • I am providing a draft formula sheet for exam 1 (link). The formula sheet will be handed out with the exam.

Topic 22: Research Questions (Continued)

  • 22.10: Recall from last time that research questions are usually expressed in general terms like, "what is the age distribution of offenders?" or "does the death penalty reduce murder?".
  • 22.11: In the context of a specific study, we need to become more specific about the precise empirical issue our study is addressing; this effort to become more specific often takes the form of a hypothesis.
  • 22.12: There are several different types of hypotheses: research hypotheses, directional/nondirectional hypotheses, and null hypotheses.

Topic 23: Research Hypothesis

  • 23.1: A research hypothesis is a conjecture or idea that can be empirically tested within the context of a specific study.
  • 23.2: We usually base a research hypothesis on theory or prior research (i.e., it is an expectation derived from information that is available before our study is conducted).
  • 23.3: As your book notes on page 120, we do not assume that a research hypothesis is accurate or correct. Quite to the contrary, we are conducting the study to see whether the research hypothesis is consistent with the data we observe.
  • 23.4: An example -- Deterrence theory predicts that offending rates will drop when sanctions become more severe. This might lead us to contemplate a fairly general research question asking whether the death penalty reduces murder (thereby increasing public safety). Then, considering our South Dakota example from 22.9 (above), we might propose the following (more specific) research hypothesis -- If deterrence theory is correct, we expect that reinstatement of the death penalty in South Dakota would be followed by a reduction in homicides.
  • 23.5: What I described in 23.4 is a progression from a more general way of thinking (articulating a research question) to a more specific way of thinking (specifying an empirically testable research hypothesis); this is an example of deductive reasoning.

Topic 24: Directional and Non-Directional Hypotheses

  • 24.1: Sometimes, our research hypotheses specify a particular direction. The example in 23.4 (above) specifies a direction of the effect or the change we expect to see based on theory or prior research. So, 23.4 specifies a directional hypothesis.
  • 24.2: Other times, however, our expectations are not as well defined, leading to a nondirectional hypothesis.
  • 24.3: Example of a nondirectional hypothesis: based on theory and prior research, we might question whether there is some kind of important causal connection between unemployment and crime. However, some theories predict that unemployment induces strain which increases offending while other theories predict that unemployment increases guardianship which decreases offending. Since we have different viable perspectives making different predictions we might specify a nondirectional hypothesis that expects there will be a change in offending when unemployment rates change.
  • 24.4: Notice that 23.4 uses the word reduction while 24.3 uses the directionally neutral term, change.
  • 24.5: When we specify a directional hypothesis we are saying we do not expect to see an outcome in the opposite direction. Using our death penalty example, we are saying that we do not believe there is a reasonable possibility that reinstatement of the death penalty could increase homicide rates (p. 120 discusses this issue).

Topic 25: Null Hypothesis

  • 25.1: Notice that the terms reduction (23.4 above) and change (24.4 above) are inexact and vaguely stated.
  • 25.2: When researchers are doing empirical work, they will often investigate and test a more precisely stated null hypothesis.
  • 25.3: Considering our death penalty example, a testable null hypothesis states that the difference between homicide rates before and after reinstatement of the death penalty will be equal to zero.
  • 25.4: Considering our unemployment and crime example, a testable null hypothesis states that there will be zero change in crime rates when unemployment rates change.
  • 25.5: Once we have stated our null hypothesis, we are faced with the task of deciding what evidence would convince us that our precisely stated null hypothesis is wrong (i.e., that it should be rejected). We will turn to this issue next week.

This is the end of the material covered on Exam 1.

  • Question from a student: I'm trying to prepare myself for this upcoming exam and was wondering if you had a practice test or an old exam you could share with me just so I have a better idea what to expect and to use it as more practice? Please let me know, thank you!

Answer: I've addressed this issue in class but in case some of you weren't there when I did, here is my response: I don't circulate old exams. This week's exam and all exams in this class are closely connected to the homework problems you've been working on each week. We have been going over these problems and questions each week during discussion sections and the slides showing the solutions to those problems (covered in the discussion sections) are posted on ELMS. Please review them if you have not done so. Finally, before each exam (including this one), I have been and will be taking time to review all of the material covered on the exam. There are many resources available to help you do well on this exam and I wish you all the best as you prepare. Good luck!

Lesson 11 - Thursday 3/6/25

  • Reminder: your second assignment will be distributed next Thursday 3/13/25. It will be due at 11:59pm on Wednesday 3/26/25.
  • This week's practice questions: you should complete the remaining questions for Chapter 6 (6.6-6.11); we will go over these problems in discussion sections this week.

Topic 25 (Continued): Null Hypothesis (pp. 121-123)

  • 25.6: Last week, we considered the concept of the null hypothesis -- the hypothesis that is empirically tested.
  • 25.7: The abbreviation or symbol we usually use for the null hypothesis is Ho.
  • 25.8: A hypothesis test leads to a decision of whether to reject Ho.
  • 25.9: Since Ho is usually worded in a precise way (for example, that some population parameter is exactly equal to 0), we either reject Ho or fail to reject Ho.
  • 25.10: By the way, if we fail to reject Ho, that doesn't mean that Ho is accurate; it simply means the evidence isn't strong enough to reject it.
  • 25.11: For this reason, we usually prefer to say "fail to reject Ho" instead of saying that we "accept Ho."
  • 25.12: This seems like semantics but it is a question of emphasis; by using this phrasing we are emphasizing our ignorance. If the evidence is insufficient to reject Ho, we can't leap from that to the conclusion that Ho is correct.
  • 25.13: Let's use a criminal trial as an example.
  • 25.14: Suppose Ho embodies the common presumption that someone is innocent until they are demonstrated to be guilty.
  • 25.15: Generally, in a criminal trial, we would require very strong ("beyond a reasonable doubt") evidence of guilt to rebut the presumption of innocence.
  • 25.16: Suppose, after hearing the evidence we are fairly certain the defendant is guilty but not certain enough to meet the "beyond a reasonable doubt" threshold.
  • 25.17: Based on this evidence, an ethical jury would still vote to acquit (assuming all jurors viewed the evidence the same way).
  • 25.18: But such a decision does not mean the defendant is innocent (Ho); it just means the evidence wasn't strong enough to convict (i.e., reject Ho).

Topic 26: Errors in Hypothesis Testing (pp. 123-125)

  • 26.1: When we make a decision to either reject Ho or fail to reject Ho, we recognize that we might be making a mistake.
  • 26.2: Rejecting Ho when Ho is false is a correct decision (i.e., convicting a guilty person).
  • 26.3: Failing to reject Ho when Ho is true is also a correct decision (i.e., acquitting an innocent person).
  • 26.4: If we reject Ho, when Ho is true, we make a Type I error (i.e., convicting someone when they are really innocent).
  • 26.5: If we fail to reject Ho, when Ho is false, we make a Type II error (i.e., acquitting someone when they are really guilty).

Topic 27: Error Risk and Statistical Levels of Significance (pp. 125-128)

  • 27.1: It is a convention in the social sciences to say that Ho embodies simplicity.
  • 27.2: Simplicity and parsimony are high priorities in science (Einstein saying: "things should be as simple as possible -- but no simpler.").
  • 27.3: In order to abandon simplicity for something more complicated, we require strong evidence -- because we value simplicity.
  • 27.4: This demand for strong evidence to reject Ho is meant to reduce the chance of making a Type I error (rejecting Ho when Ho is true).
  • 27.5: In other words, to reject Ho, the chance of making a Type I error (reject Ho when Ho is true) must be small (because of the value we place on simplicity).
  • 27.6: A significance level corresponds to the chance or probability of a Type I error a researcher is willing to tolerate.
  • 27.7: The observed significance level corresponds to the chance of a Type I error occurring if we reject Ho in a specific sample dataset.
  • 27.8: In this class, we will primarily focus on the concept of rejecting or failing to reject a population Ho based on the evidence that we observe in a particular sample.
  • 27.9: The risk of Type II errors is generally managed through study planning efforts (i.e., deciding how many cases will be studied, for example).

New R Code

  • Suppose I give you the following dataset consisting of the last 300 jail sentences (in months) handed down by the county court.
  • We denote the jail/prison sentence lengths by the variable, x.
x = c(rep(1,5),rep(2,29),rep(3,51),rep(4,61),
  rep(5,61),rep(6,44),rep(7,22),rep(8,17),rep(9,5),rep(10,4),11)
table(x)
  • This gives us the following output:
> x = c(rep(1,5),rep(2,29),rep(3,51),rep(4,61),
+   rep(5,61),rep(6,44),rep(7,22),rep(8,17),rep(9,5),rep(10,4),11)
> table(x)
x
 1  2  3  4  5  6  7  8  9 10 11 
 5 29 51 61 61 44 22 17  5  4  1 
>
  • Next, we create a barplot giving us a visual representation of this distribution:
barplot(table(x),
  main="Distribution of Jail Sentences",
  ylab="Number of People",
  xlab="Number of Months Sentenced to Jail")
  • Here is the output:

  • Calculate the mean, median, minimum, maximum, and range of this distribution.
mean(x)
median(x)
min(x)
max(x)
max(x)-min(x)
  • Here are the results:
> mean(x)
[1] 4.716667
> median(x)
[1] 5
> min(x)
[1] 1
> max(x)
[1] 11
> max(x)-min(x)
[1] 10
>
  • Now, let's look at a similar dataset from the next-door neighbor county. We will call the sentence length variable in this dataset, y
y = c(1,rep(2,7),rep(3,28),rep(4,45),rep(5,55),rep(6,57),
  rep(7,39),rep(8,28),rep(9,22),rep(10,12),rep(11,3),rep(12,3))
table(y)

and here is the output:

> y = c(1,rep(2,7),rep(3,28),rep(4,45),rep(5,55),rep(6,57),
+   rep(7,39),rep(8,28),rep(9,22),rep(10,12),rep(11,3),rep(12,3))
> table(y)
y
 1  2  3  4  5  6  7  8  9 10 11 12 
 1  7 28 45 55 57 39 28 22 12  3  3 
>
  • As before, we can calculate some descriptive information about this variable:
mean(y)
median(y)
min(y)
max(y)
max(y)-min(y)

which gives us the following output:

> mean(y)
[1] 5.933333
> median(y)
[1] 6
> min(y)
[1] 1
> max(y)
[1] 12
> max(y)-min(y)
[1] 11
>
  • How do these numbers compare to those for the first county?
  • Let's create a side-by-side barchart for the 2 distributions:
par(mfrow=c(1,2))
barplot(table(x),
  main="Distribution of Jail Sentences in County #1",
  ylab="Number of People",
  xlab="Number of Months Sentenced to Jail")
barplot(table(y),
  main="Distribution of Jail Sentences in County #2",
  ylab="Number of People",
  xlab="Number of Months Sentenced to Jail")
  • Barcharts are discussed in your textbook beginning on page 40.
  • Be sure to save this chart so you can look at it again.
  • Here are the 2 charts together:

  • One thing that becomes apparent is that comparing two barcharts side-by-side is not the easiest task in the world. Is there something better?
  • Answer: Yes!

Lesson 12 - Tuesday 3/11/25

  • Reminder: your second assignment will be distributed on Thursday -- 3/13/25. It will be due at 11:59pm on Wednesday 3/26/25.
  • We begin today's class by finishing up our R example from last class.
  • Introducing the boxplot:
x = c(rep(1,5),rep(2,29),rep(3,51),rep(4,61),
  rep(5,61),rep(6,44),rep(7,22),rep(8,17),rep(9,5),rep(10,4),11)
table(x)
y = c(1,rep(2,7),rep(3,28),rep(4,45),rep(5,55),rep(6,57),
  rep(7,39),rep(8,28),rep(9,22),rep(10,12),rep(11,3),rep(12,3))
table(y)

boxplot(x,y,
  main="Distribution of Jail Sentences by County",
  ylab="Jail Sentence Length (in months)",
  names=c("County 1","County 2"))

which gives us the screen output:

> x = c(rep(1,5),rep(2,29),rep(3,51),rep(4,61),
+   rep(5,61),rep(6,44),rep(7,22),rep(8,17),rep(9,5),rep(10,4),11)
> table(x)
x
 1  2  3  4  5  6  7  8  9 10 11 
 5 29 51 61 61 44 22 17  5  4  1 
> y = c(1,rep(2,7),rep(3,28),rep(4,45),rep(5,55),rep(6,57),
+   rep(7,39),rep(8,28),rep(9,22),rep(10,12),rep(11,3),rep(12,3))
> table(y)
y
 1  2  3  4  5  6  7  8  9 10 11 12 
 1  7 28 45 55 57 39 28 22 12  3  3 
>

and the following chart:

  • There is some complexity to these plots but they are useful for comparing distributions side-by-side.
  • One point of ambiguity in the chart is the Interquartile Range (IQR) which I will mention briefly here.
  • The median is the 50th percentile of the distribution.
  • But we also have other percentiles that receive a good deal of attention from statisticians.
  • The first and third quartiles of the distribution are examples (Q1 = 25th percentile and Q3 = 75th percentile).
  • The IQR is simply Q3-Q1 which gives us a measure of dispersion for a distribution.
  • An important use of the IQR is that its length corresponds to the vertical length of the "box" in the boxplot.
  • The top of the box is Q3 and the bottom of the box is Q1.
  • Note: I am not going to test you on the IQR but you need to know what it is to properly interpret a boxplot.
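
Note: using the x and y vectors defined above, here is a small R sketch that computes the quartiles and the IQR just described (R's boxplot draws its hinges with a slightly different method than quantile()'s default, so tiny discrepancies with the plot are possible):

quantile(x, probs=c(0.25, 0.50, 0.75))   # Q1, median, and Q3 for County 1
quantile(y, probs=c(0.25, 0.50, 0.75))   # Q1, median, and Q3 for County 2
IQR(x)                                   # Q3 - Q1 for County 1
IQR(y)                                   # Q3 - Q1 for County 2
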
  • Note that we might also want to see what the differences between the means and the medians of the distributions are.
mean(y)-mean(x)
median(y)-median(x)

which gives us these results:

> mean(y)-mean(x)
[1] 1.216667
> median(y)-median(x)
[1] 1
>
  • Therefore, the boxplot and both measures of central tendency point to the conclusion that the distribution of sentences in the second county is a little higher than in the first county.
  • Let's do another example of comparing the distributions of two groups.
  • Suppose we have two states that share a long border; North State has 22 counties along the border and South State has 17 counties along the border.
  • Suppose further that North State has a new mandatory minimum sentencing law for armed robberies; South State has had a more lenient law on the books for many years.
  • The year after the law took effect in North State, we record the following armed robbery rates in the border counties in each state.
ns = c(30.4,22.7,26.5,26.2,25.9,33.6,22.9,31.9,28.3,24.9,29.7,
        24.1,22.0,23.1,24.0,28.3,27.9,27.8,25.9,25.3,24.0,28.5)
table(ns)
ss = c(38.6,38.8,36.8,44.6,33.2,33.4,30.3,34.2,30.8,36.0,29.0,
        35.6,24.9,28.1,32.4,27.3,34.2)
table(ss)
  • Here are the resulting frequency tables:
> table(ns)
ns
  22 22.7 22.9 23.1   24 24.1 24.9 25.3 25.9 26.2 26.5 27.8 
   1    1    1    1    2    1    1    1    2    1    1    1 
27.9 28.3 28.5 29.7 30.4 31.9 33.6 
   1    2    1    1    1    1    1 
> ss <- c(38.6,38.8,36.8,44.6,33.2,33.4,30.3,34.2,30.8,36.0,29.0,
+         35.6,24.9,28.1,32.4,27.3,34.2)
> table(ss)
ss
24.9 27.3 28.1   29 30.3 30.8 32.4 33.2 33.4 34.2 35.6   36 
   1    1    1    1    1    1    1    1    1    2    1    1 
36.8 38.6 38.8 44.6 
   1    1    1    1 
>
  • Now, suppose we want to look at histograms of the two distributions:
par(mfrow=c(1,2))
hist(ns,main="Armed Robbery Rates in North State Border Counties")
hist(ss,main="Armed Robbery Rates in South State Border Counties")

which gives us this chart:

  • We have the same problem here that we had with the barplots in the last class -- it's not easy to compare the 2 distributions.
  • Here is a boxplot of the same data:
boxplot(ns,ss,
  main="Armed Robbery Rates",
  ylab="# of Armed Robberies per 100,000 Population",
  names=c("North State Counties","South State Counties"),
  ylim=c(0,45))

which gives us this chart:

  • showing that the armed robbery rate seems generally higher in the South State counties and lower in the North State counties (but there are exceptions!).
  • Note we can also calculate the mean and median differences:
mean(ns)-mean(ss)
median(ns)-median(ss)

which gives us the following differences:

> mean(ns)-mean(ss)
[1] -6.88262
> median(ns)-median(ss)
[1] -7.35
>
  • Notice that the average and median differences are both negative which indicates that the North State counties typically had lower armed robbery rates.

Topic 28 & Beginning of Chapter 7 (pp. 136-139).

  • 28.1: Question for this topic: "when should we begin to suspect that a coin used in a coin toss is unfair or biased?"
  • 28.2: Generally, if we are flipping a coin to make a decision, we assume that the coin is fair unless we have a good reason to suspect it's not fair.

Lesson 13 - Thursday 3/13/25

  • Reminder: your second assignment will be distributed today (Thursday 3/13/25). It will be due at 11:59pm on Wednesday 3/26/25.
  • For this week's homework, you should be looking at Chapter 7 homework problems 7.1-7.5; these will be covered in tomorrow's discussion section.

Topic 28: Coin Flipping (Continued; pp. 137-138).

  • 28.3: What would be a "good reason" to reject the null hypothesis that a coin is fair?
  • 28.4: The "good reason" would probably have to be based on evidence.
  • 28.5: For coin flipping, the evidence would probably have to take the form of the coin coming up either heads or tails too often.
  • 28.6: What evidence would convince us that the coin is not fair?
  • 28.7: The outcomes "H" and "T" are the mutually exclusive and exhaustive outcomes in the sample space of a single coin-flipping experiment.
  • 28.8: Suppose that we flip a coin 100 times and we get 97 heads and 3 tails. Would you reject the null hypothesis that the coin is a fair coin?
  • 28.9: Now suppose that we flip a coin 100 times and we get 60 heads. What would you do then?
  • 28.10: Again, we want to know if the evidence is strong enough to reject the null hypothesis that the coin is fair.
  • 28.11: Let's see how this applies to something we really care about.

Example: State Homicide Rates

  • Let's read some data into R:
st = c("alabama","alaska","arizona","arkansas","california",
  "colorado","connecticut","delaware","florida","georgia",
  "hawaii","idaho","illinois","indiana","iowa","kansas",
  "kentucky","louisiana","maine","maryland","massachusetts",
  "michigan","minnesota","mississippi","missouri","montana",
  "nebraska","nevada","new hampshire","new jersey",
  "new mexico","new york","north carolina","north dakota",
  "ohio","oklahoma","oregon","pennsylvania","rhode island",
  "south carolina","south dakota","tennessee","texas",
  "utah","vermont","virginia","washington","west virginia",
  "wisconsin","wyoming")

h18 = c(11.5,7.6,5.7,8.5,4.7,4.7,2.4,5.5,6.2,7.5,
  2.8,2.1,7.6,7.0,2.5,5.3,5.8,12.9,1.5,8.7,2.1,
  6.1,2.2,12.5,11.2,4.0,2.1,7.5,1.8,3.3,9.9,3.0,
  6.0,3.0,6.4,6.6,2.6,6.2,1.7,9.3,2.9,9.2,5.3,2.1,
  2.1,4.8,3.5,5.5,3.4,3.3)

h19 = c(11.8,10.4,5.7,8.7,4.5,4.3,3.1,5.1,6.1,8.2,
  2.6,1.3,7.4,6.5,2.5,4.5,5.7,13.7,1.7,9.6,2.1,6.0,
  2.7,14.1,10.6,3.6,2.9,5.3,2.6,3.0,11.0,3.0,6.6,
  2.9,6.1,8.5,2.9,5.7,2.5,10.0,3.4,9.2,5.8,2.5,1.9,
  5.1,3.1,5.5,3.6,3.6)

hdata = data.frame(st,h18,h19)
hdata
  • Now that we've read this data into R, let's look at the dataset:
> hdata
               st  h18  h19
1         alabama 11.5 11.8
2          alaska  7.6 10.4
3         arizona  5.7  5.7
4        arkansas  8.5  8.7
5      california  4.7  4.5
6        colorado  4.7  4.3
7     connecticut  2.4  3.1
8        delaware  5.5  5.1
9         florida  6.2  6.1
10        georgia  7.5  8.2
11         hawaii  2.8  2.6
12          idaho  2.1  1.3
13       illinois  7.6  7.4
14        indiana  7.0  6.5
15           iowa  2.5  2.5
16         kansas  5.3  4.5
17       kentucky  5.8  5.7
18      louisiana 12.9 13.7
19          maine  1.5  1.7
20       maryland  8.7  9.6
21  massachusetts  2.1  2.1
22       michigan  6.1  6.0
23      minnesota  2.2  2.7
24    mississippi 12.5 14.1
25       missouri 11.2 10.6
26        montana  4.0  3.6
27       nebraska  2.1  2.9
28         nevada  7.5  5.3
29  new hampshire  1.8  2.6
30     new jersey  3.3  3.0
31     new mexico  9.9 11.0
32       new york  3.0  3.0
33 north carolina  6.0  6.6
34   north dakota  3.0  2.9
35           ohio  6.4  6.1
36       oklahoma  6.6  8.5
37         oregon  2.6  2.9
38   pennsylvania  6.2  5.7
39   rhode island  1.7  2.5
40 south carolina  9.3 10.0
41   south dakota  2.9  3.4
42      tennessee  9.2  9.2
43          texas  5.3  5.8
44           utah  2.1  2.5
45        vermont  2.1  1.9
46       virginia  4.8  5.1
47     washington  3.5  3.1
48  west virginia  5.5  5.5
49      wisconsin  3.4  3.6
50        wyoming  3.3  3.6
> 
  • We can calculate the mean, median, and range of the homicide rates for each year, just like we've done before:
mean(h18)
mean(h19)
median(h18)
median(h19)
max(h18)-min(h18)
max(h19)-min(h19)
  • Here is our output:
> mean(h18)
[1] 5.402
> mean(h19)
[1] 5.584
> median(h18)
[1] 5.3
> median(h19)
[1] 5.1
> max(h18)-min(h18)
[1] 11.4
> max(h19)-min(h19)
[1] 12.8
> 
  • As you can see, there is a lot of similarity in the statistics for the 2 years.
  • We can also look at histograms and a boxplot for each year:
par(mfrow=c(1,3))
hist(h18,
  main="2018 State Homicide Rates",
  xlab="# of Homicides per 100k Population",
  ylab="# of States")
hist(h19,
  main="2019 State Homicide Rates",
  xlab="# of Homicides per 100k Population",
  ylab="# of States")
boxplot(h18,h19,
  main="2018 & 2019 Homicide Rates",
  names=c("Year = 2018","Year=2019"),
  ylab="# of Homicides per 100k Population")

  • As interesting as all this is, there is an important feature of this dataset that we need to consider.
  • Each state is measured in 2 different years -- 2018 and 2019.
  • So we have the opportunity to study the change in homicide rates for each state.
  • If there was no overall change in homicide rates, we could expect that some states would experience an increase and others would drop.
  • In other words, if there was no change, the states that increased would tend to cancel out the states that dropped.
  • So, our null hypothesis is that a state is equally likely to experience an increase or a decrease.
  • Let's consider the evidence. Here is some R code:
delta = h19-h18
hdata = data.frame(st,h18,h19,delta)
hdata
mean(delta)
median(delta)
max(delta)-min(delta)
> delta = h19-h18
> hdata = data.frame(st,h18,h19,delta)
> hdata
               st  h18  h19 delta
1         alabama 11.5 11.8   0.3
2          alaska  7.6 10.4   2.8
3         arizona  5.7  5.7   0.0
4        arkansas  8.5  8.7   0.2
5      california  4.7  4.5  -0.2
6        colorado  4.7  4.3  -0.4
7     connecticut  2.4  3.1   0.7
8        delaware  5.5  5.1  -0.4
9         florida  6.2  6.1  -0.1
10        georgia  7.5  8.2   0.7
11         hawaii  2.8  2.6  -0.2
12          idaho  2.1  1.3  -0.8
13       illinois  7.6  7.4  -0.2
14        indiana  7.0  6.5  -0.5
15           iowa  2.5  2.5   0.0
16         kansas  5.3  4.5  -0.8
17       kentucky  5.8  5.7  -0.1
18      louisiana 12.9 13.7   0.8
19          maine  1.5  1.7   0.2
20       maryland  8.7  9.6   0.9
21  massachusetts  2.1  2.1   0.0
22       michigan  6.1  6.0  -0.1
23      minnesota  2.2  2.7   0.5
24    mississippi 12.5 14.1   1.6
25       missouri 11.2 10.6  -0.6
26        montana  4.0  3.6  -0.4
27       nebraska  2.1  2.9   0.8
28         nevada  7.5  5.3  -2.2
29  new hampshire  1.8  2.6   0.8
30     new jersey  3.3  3.0  -0.3
31     new mexico  9.9 11.0   1.1
32       new york  3.0  3.0   0.0
33 north carolina  6.0  6.6   0.6
34   north dakota  3.0  2.9  -0.1
35           ohio  6.4  6.1  -0.3
36       oklahoma  6.6  8.5   1.9
37         oregon  2.6  2.9   0.3
38   pennsylvania  6.2  5.7  -0.5
39   rhode island  1.7  2.5   0.8
40 south carolina  9.3 10.0   0.7
41   south dakota  2.9  3.4   0.5
42      tennessee  9.2  9.2   0.0
43          texas  5.3  5.8   0.5
44           utah  2.1  2.5   0.4
45        vermont  2.1  1.9  -0.2
46       virginia  4.8  5.1   0.3
47     washington  3.5  3.1  -0.4
48  west virginia  5.5  5.5   0.0
49      wisconsin  3.4  3.6   0.2
50        wyoming  3.3  3.6   0.3
>
> mean(delta)
[1] 0.182
> median(delta)
[1] 0
> max(delta)-min(delta)
[1] 5
> 
  • So, it looks like 20 states dropped while 24 states experienced an increase and 6 states had no change.
  • A problem is that the 6 "no change" states really did change; the zero is just an artifact of rounding.
  • I can report to you that I've looked at the actual calculations out to many decimal places; I found that 4 of them increased and 2 dropped.
  • The final result is that 28 states increased (56%) and 22 states dropped (44%).
  • If you flipped a coin 50 times and you got 28 heads and 22 tails, would you conclude that the coin is unfair?
  • Here, we are asking whether the evidence is strong enough to reject the hypothesis that a state was equally likely to experience an increase or a decrease in its homicide rate.
  • The point here is to see that real criminology problems can be approximated by a coin flipping experiment.
  • Make sure you close your previous chart window. Now, let's look at a new boxplot showing the distribution of change scores:
boxplot(delta,
  main="Change in Homicide Rates from 2018 to 2019",
  ylab="2019 Rate - 2018 Rate")

which gives

  • Many people would look at this evidence and say there is not strong evidence to reject the null hypothesis but based on the tools we have so far, this would be a subjective judgment.
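
Note: here is a small simulation sketch (not from the textbook) that revisits the coin-flip question above: if each state really were equally likely to increase or decrease, how often would we see 28 or more increases out of 50 just by chance?

set.seed(1)
increases = rbinom(10000, size=50, prob=0.5)   # number of "increases" in each of 10,000 sets of 50 fair coin flips
mean(increases >= 28)                          # fraction of sets with 28 or more increases (roughly 0.24)

Seeing 28 increases out of 50 is something a "fair coin" produces fairly often, which is consistent with the judgment that the evidence here is not strong; Chapter 7 develops tools for making this kind of assessment in a less subjective way.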

Topic 29: Sampling Distributions (pp. 137-138)

  • 29.1: Let's think some more about our example of drawing a small random sample of people from the entire population of people who were released from prison in the year 2021.
  • 29.2: Now, let's suppose we follow that sample up for 3 years to the year 2024.
  • 29.3: For each person in the sample, we look at that person's criminal history record and make a determination of whether that person has been arrested for a new crime or not during the 3 year follow-up period (kind of like a heads or a tails!).
  • 29.4: Then, we divide the number of people who were rearrested by the total number of people in the sample to obtain a recidivism rate (note: we might decide to multiply the fraction by 100 to convert it to a percentage).
  • 29.5: We recognize that if we had drawn a different sample than the one we happened to draw, we probably would have gotten a different numerical result (unless the second sample we drew happened by chance to have exactly the same number of successes and failures).
  • 29.6: Suppose, based on prior research, we specify that the null hypothesis rearrest recidivism rate is 68%.
  • 29.7: Even if the null hypothesis is correct, we often will not see a recidivism rate of exactly 68% in the samples we draw.
  • 29.8: Instead the recidivism rate would bounce around from sample to sample.

  • 29.9: Note that in one of the samples of size 50, we happened to see a recidivism rate that is close to 80%.
  • 29.10: We need some principled, transparent, and objective way of deciding whether to reject the null hypothesis when we are presented with actual empirical evidence from a sample.
  • 29.11: As explained in your book on p. 139, we use probability distributions to do this.
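
Note: here is a small simulation sketch (not part of the textbook) of the idea in 29.5-29.9. If the true recidivism rate really is 68%, the recidivism rate calculated in repeated samples of 50 people will still bounce around from sample to sample:

set.seed(1)
rates = rbinom(1000, size=50, prob=0.68) / 50   # recidivism rate in each of 1,000 samples of size 50
summary(rates)                                  # the sample rates vary around 0.68
hist(rates,
  main="Simulated Sampling Distribution of the Recidivism Rate (n = 50)",
  xlab="Sample Recidivism Rate",
  ylab="Number of Samples")
mean(rates >= 0.78)                             # how often a rate of 78% or higher occurs just by chance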

Topic 30: Probability Distributions (p. 139)

  • 30.1: A probability distribution can apply to the results in a specific sample or it can be used to approximate a sampling distribution.
  • 30.2: Each person in the unusual sample (S2) of 50 cases (where the recidivism rate is 78%) can be viewed as an experimental study with two outcomes: rearrest or no rearrest.
  • 30.3: Thus, the sample space of each experiment consists of two outcomes: yes (recidivism) or no (desistance). Note that these outcomes are mutually exclusive and exhaustive.
  • 30.4: When we look at the sample as a whole, we find that 39 of the people in the sample recidivated while 11 did not.
  • 30.5: So, within the sample, S2, we see that p(recidivism) = 39/50 = 0.78 and p(desistance) = 11/50 = 1-p(recidivism) = 0.22; this is the probability distribution of outcomes in our sample.

Assignment #2 - Due on ELMS at 11:59pm ET on Wednesday 3/26/25

Instructions: Please complete each of the problems listed below. You are required to submit your assignment as a single pdf file on ELMS. Please review all assignment guidelines and rules in the syllabus above. We will accept questions about the assignment up until 11:59am ET on Monday 3/24/25. Please note that any questions we judge to be of interest to the entire class will be posted on this webpage so everyone has access to the same information in a timely fashion. If you submit your assignment and then discover a mistake and you want to fix it before it is due you can resubmit your assignment on ELMS (up to 3 submissions are permitted); if you do this, only the last submission will be graded. Assignments can be submitted beginning at 12:01am (ET) on Tuesday 3/25/25.

  1. For each month of last year, the local police department made the following numbers of drug arrests:
Month N
Jan 19
Feb 19
Mar 23
Apr 21
May 17
Jun 18
Jul 22
Aug 28
Sep 21
Oct 20
Nov 23
Dec 22
  • 1a. enter the data into R and create a frequency table (5pts).
  • 1b. create a reasonably labeled barplot for these data (5pts).
  • 1c. use R's mean() and median() and min()/max() functions to calculate the mean, median, and range of the number of drug arrests (5pts).
  2. The following dataset provides state names along with each state's homicide rate in the years 2018 and 2022 based on data from the Centers for Disease Control.
st = c("alabama","alaska","arizona","arkansas","california",
  "colorado","connecticut","delaware","florida","georgia",
  "hawaii","idaho","illinois","indiana","iowa","kansas",
  "kentucky","louisiana","maine","maryland","massachusetts",
  "michigan","minnesota","mississippi","missouri","montana",
  "nebraska","nevada","new hampshire","new jersey",
  "new mexico","new york","north carolina","north dakota",
  "ohio","oklahoma","oregon","pennsylvania","rhode island",
  "south carolina","south dakota","tennessee","texas",
  "utah","vermont","virginia","washington","west virginia",
  "wisconsin","wyoming")

h18 = c(11.5,7.6,5.7,8.5,4.7,4.7,2.4,5.5,6.2,7.5,
  2.8,2.1,7.6,7.0,2.5,5.3,5.8,12.9,1.5,8.7,2.1,
  6.1,2.2,12.5,11.2,4.0,2.1,7.5,1.8,3.3,9.9,3.0,
  6.0,3.0,6.4,6.6,2.6,6.2,1.7,9.3,2.9,9.2,5.3,2.1,
  2.1,4.8,3.5,5.5,3.4,3.3)

h22 = c(13.8,10.1,8.5,10.6,5.7,7.1,4.0,5.7,6.6,11.2,
  2.7,2.5,10.0,8.1,2.8,5.9,7.7,18.5,2.3,9.6,2.4,7.6,
  3.6,18.7,12.0,5.2,3.4,7.7,1.8,3.3,13.7,4.2,8.8,3.6,
  7.7,7.9,5.1,8.1,2.3,10.9,6.0,10.8,7.6,2.2,3.7,7.2,
  5.3,6.3,5.6,2.6)

hdata = data.frame(st,h18,h22)
hdata

Now, with this dataset in hand, you should work on each of the following tasks (except 2p and 2q) using R:

  • 2a. Print out a summary table showing each state and its homicide rates for the 2 years (5pts).
  • 2b. Use the mean() function to calculate the average homicide rate for the year 2018 (5pts).
  • 2c. Use the mean() function to calculate the average homicide rate for the year 2022 (5pts).
  • 2d. Calculate the difference between the means for the 2 years (5pts).
  • 2e. Use the median() function to calculate the median homicide rate for the year 2018 (5pts).
  • 2f. Use the median() function to calculate the median homicide rate for the year 2022 (5pts).
  • 2g. Calculate the difference between the medians for the 2 years (5pts).
  • 2h. Present a reasonably labeled histogram for each year (5pts).
  • 2i. Create a reasonably labeled boxplot showing the homicide rate distribution for each year (5pts).
  • 2j. On your boxplot, clearly identify the median and the 1st and 3rd quartiles of the distribution for each year (you can circle and label them on your boxplot); (5pts).
  • 2k. Calculate delta values for each state (where delta is the homicide rate for 2022 minus the homicide rate for 2018, calculated for each state); print out the summary table you did in #1 above except now include the delta values in the table (i.e., there should be a delta value on each row of the table; 5pts).
  • 2l. Create a reasonably labeled boxplot to summarize the distribution of the delta values (5pts).
  • 2m. Calculate the average of the delta values; what does this average tell us (5pts)?
  • 2n. Calculate the median of the delta values; what does this median tell us (5pts)?
  • 2o. Use the min() and max() functions to calculate the range of the delta values (5pts).
  • 2p. There are two values of delta that are very close to zero (so their delta value looks like it is zero). It turns out that these 2 delta values belong to New Hampshire and New Jersey. New Jersey's delta value actually showed a slight increase in the homicide rate from 2018 to 2022 while New Hampshire's showed a slight decrease. Keeping this in mind, count the number of states where delta increased and the number of states where delta decreased (5pts)
  • 2q. If our null hypothesis specifies that the chance a state experiences an increase in its homicide rate from 2018 to 2022 is 0.5, would you say the evidence to reject the null hypothesis is stronger or weaker than the evidence we discussed in class for the 2018 to 2019 comparison? Explain your reasoning (5pts).

Questions from Students About Assignment #2

  1. Question asked on 3/14/25: "In class when we discussed whether the evidence was strong enough to reject the null in the 50-states example, I was unclear on the conclusion we came to. Was the evidence strong enough to reject the null? Or is that judgement subjective to the researcher? If the evidence was strong enough, what makes it so?" My response: I would suggest looking at the last bullet point before Topic 29.

  2. Question asked on 3/14/25: "How do I calculate the average, median and ranges for the delta values; I went back to view course material and it wasn't there." My response: I'm sorry about that oversight. The code is there now.

  3. Question asked on 3/17/25: "Will we have to include the coding for the charts (histograms, barcharts, box plots) when submitting the assignment, or just the graphics themselves? Also, I was struggling a bit with the coding for the boxplots ( par(mfrow=c(1,3))) when I would put it into R. When I use that code, the histograms would appear, but not the boxplots. Upon doing some research, the code (par(mfrow=c(2,2))) worked for me, but the formatting of graphics changed a little bit. Let me know if that's okay, or if I should stick to the original coding used in class." My response: Yes, you need to show the R code you used to answer assignment questions; there is no requirement that you use the par(mfrow=c(x,x)) code for the assignment. You are free to simply generate the individual charts. Of course, you are welcome to create the side-by-side charts if you want to do that but I will not deduct any points if you just create the individual charts.

  4. Question asked on 3/20/25: "The first question I have [about the first question on the assignment] is that I can't find in my notes or on the website which table the frequency table is referring to - is that the table where we use the x = c(rep(a,b)) code and then print out table(x)? I'm just confused because I don't see the word frequency in the notes so I'm having trouble pinpointing which kind of table we are referring to." My response: I put the term "frequency table" in the appropriate place in the notes so you can clearly see the relevant example.

  5. Question asked on 3/20/25: "If I denote the months as their numbers (for example, January is 1, February is 2) and [enter the data as x=c(rep(a,b))] and put the month number in the "a" spot in the code and put the number of drug arrests in the "b" spot in the code, then when I print min(x), max(x), and median(x), they are all according to the months and not the number of drug arrests. I'm wondering if I am just using the wrong code for the tables or doing something wrong within the code?" My response: I am sorry for this ambiguity in the assignment. You can resolve this issue by letting x be the number of drug arrests for each month:

x = c(19,19,23,21,17,18,22,28,21,20,23,22)
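
With x defined this way, the summary calculations follow directly. Here is a minimal sketch (table(), min(), max(), and median() are all base R functions):

table(x)     # frequency table of the monthly drug arrest counts
min(x)       # smallest monthly count
max(x)       # largest monthly count
median(x)    # median monthly count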

Lesson 14 - Tuesday 3/25/25

  • Reminder: your second assignment is due at 11:59pm on Wednesday 3/26/25 (submit on ELMS).
  • We are currently in Chapter 7.

Topic 31: Multiplication Rule

  • 31.1: Let's say we have an experiment where we flip a coin and the coin lands heads or tails with equal probability (1/2).
  • 31.2: Now, we conduct this same experiment a second time. Again, the coin lands heads or tails with equal probability (1/2).
  • 31.3: If the outcomes of these 2 experiments are independent then we can calculate the chance of getting HH when we flip the coin twice, p(H and H), as 1/2 × 1/2 = 1/4.
  • 31.4: Two outcomes are independent if the outcome of the first experiment has nothing to do with the outcome of the second experiment.
  • 31.5: Multiplication rule with independent experimental outcomes is: p(A and B) = p(A) × p(B)
  • 31.6: This can be extended to: p(A and B and C) = p(A) × p(B) × p(C), etc.
  • 31.7: Flip a coin 3 times. What is the chance of 3 heads? Solution: p(H and H and H) = p(H) × p(H) × p(H) = 1/2 × 1/2 × 1/2 = 1/8.
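
These products are easy to verify in R; a quick sketch (plain arithmetic -- the name p_heads is just for illustration):

p_heads = 1/2
p_heads * p_heads            # p(H and H) = 1/4
p_heads * p_heads * p_heads  # p(H and H and H) = 1/8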

Topic 32: Ordering and Arrangement

  • 32.1: Suppose we study a group of people on community supervision who regularly get drug tested. Historically, the probability of a positive drug test result is 1/2. Now, let's consider the next 3 people. What is the probability of getting 3 positive results?

Step 1: what is the number of ways (arrangements) we could get 3 positive results in 3 tests?

Answer: there is only 1 way to get 3 positive results in 3 tests -- each test would have to be positive.

Step 2: use the multiplication rule for independent events:

p(+ and + and +) = p(+) × p(+) × p(+) = 1/2 × 1/2 × 1/2 = 1/8
  • 32.2: Suppose we look at the next 4 people and ask: what is the probability of getting 2 positive results?

Step 1: what is the number of ways (arrangements) we could get 2 positive results in 4 tests?

4C2 = 4!/(2!2!) = 24/(2×2) = 24/4 = 6

Note: use Appendix 1 on page 651 to calculate factorials.

Step 2: use the multiplication rule for independent events:

p(--++) = p(-) × p(-) × p(+) × p(+) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16
p(-+-+) = p(-) × p(+) × p(-) × p(+) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16
p(+--+) = p(+) × p(-) × p(-) × p(+) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16
p(+-+-) = p(+) × p(-) × p(+) × p(-) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16
p(-++-) = p(-) × p(+) × p(+) × p(-) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16
p(++--) = p(+) × p(+) × p(-) × p(-) = 1/2 × 1/2 × 1/2 × 1/2 = 1/16

Step 3: use the addition rule to add up the probabilities for the different ways we could get 2 positive results in 4 tests:

p(2+ out of 4 tests) = 1/16 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 = 6/16 = 3/8
  • 32.3: Now, suppose we go to another jurisdiction where the probability of getting a positive test result is 1/3. What is the probability that there will be 2 positive results in 4 tests in the new jurisdiction?

Step 1: we still have 6 ways of getting 2 positive results in 4 tests (combination formula).

Step 2: use the multiplication rule for independent events:

p(--++) = p(-) × p(-) × p(+) × p(+) = 2/3 × 2/3 × 1/3 × 1/3 = 4/81
p(-+-+) = p(-) × p(+) × p(-) × p(+) = 2/3 × 1/3 × 2/3 × 1/3 = 4/81
p(+--+) = p(+) × p(-) × p(-) × p(+) = 1/3 × 2/3 × 2/3 × 1/3 = 4/81
p(+-+-) = p(+) × p(-) × p(+) × p(-) = 1/3 × 2/3 × 1/3 × 2/3 = 4/81
p(-++-) = p(-) × p(+) × p(+) × p(-) = 2/3 × 1/3 × 1/3 × 2/3 = 4/81
p(++--) = p(+) × p(+) × p(-) × p(-) = 1/3 × 1/3 × 2/3 × 2/3 = 4/81

Step 3: use the addition rule to add up the probabilities for the different ways we could get 2 positive results in 4 tests:

p(2+ out of 4 tests) = 4/81 + 4/81 + 4/81 + 4/81 + 4/81 + 4/81 = 24/81 = 8/27
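
Both of the worked examples above (2 positives in 4 tests with p = 1/2, and with p = 1/3) can be checked in R with the built-in choose() and dbinom() functions; a minimal sketch:

choose(4, 2)                       # number of arrangements: 6
choose(4, 2) * (1/2)^2 * (1/2)^2   # p = 1/2 case: 6/16 = 3/8
choose(4, 2) * (1/3)^2 * (2/3)^2   # p = 1/3 case: 24/81 = 8/27
dbinom(2, size = 4, prob = 1/3)    # same answer from R's binomial probability function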

Topic 33: Permutations and Combinations

  • 33.1: A permutation is an arrangement of experimental outcomes in a specific order. Considering the example of 32.3, the outcome --++ is not equivalent to the outcome ++--.
  • 33.2: Combinations are indifferent to ordering. Again, considering the example of 32.3, each of the experimental outcomes listed in Step 2 is an equivalent qualifying event. We are indifferent to the ordering of the outcomes.
  • 33.3: In applied criminological research, we are generally interested in the number of ways we could get a set of results which implies an emphasis on combinations rather than permutations.
  • 33.4: To count the number of different ways you could get r "events" in N "experiments" or "trials", we generally use the combination formula.

Topic 34: The Binomial Distribution (pp. 145-149)

  • 34.1: We have already been using the binomial probability distribution above but we haven't made it explicit. We will do so now.
  • 34.2: The binomial probability distribution is well suited for problems where there is a fixed number of experiments or trials and each trial results in either a "success" or a "failure".
  • 34.3: These trials have a name -- they are called Bernoulli trials. So a single Bernoulli trial is like a coin flip where the coin can land on "heads" or "tails".
  • 34.4: In criminology, a single Bernoulli trial might be a person who is released from prison; then we observe whether that person is rearrested within some well-defined period of time (a failure) or not (a success).
  • 34.5: If we then study a group of people released from prison, we can view the number of people in the study as the number of Bernoulli trials and the number of people who fail as the binomial outcome.
  • 34.6: Here is another example: we survey a random sample of N people and ask each of them whether they have been victimized or not. Then the number of people who have been victimized is a binomial outcome variable out of N Bernoulli trials.
  • 34.7: Another example: we observe a judge in a courtroom and count the number of people who appear in her courtroom for sentencing (this is the number of Bernoulli trials); then we can study the number of people who are sent to prison (instead of probation) as the binomial outcome variable.
  • 34.8: Let's consider an example based on real data: the data come from Cook and Zarkin (1985; link). In this paper, the authors examined 9 business cycles and found that robbery rates increased (relative to their change during the preceding growth phase) when the economy tipped into a recession in 8 of the 9 cycles; this means the robbery rates decreased in 1 of the 9 cycles. The null hypothesis in this study was that when the economy tips into a recession then robbery rates are just as likely to decrease as they are to increase (like flipping a coin!).

Step 1: how many different ways could we get 1 "event" in 9 "trials"?

1. ++++++++-
2. +++++++-+
3. ++++++-++
4. +++++-+++
5. ++++-++++
6. +++-+++++
7. ++-++++++
8. +-+++++++
9. -++++++++

Note: we could also use the combination formula:

9C1 = 9!/(1!8!) = 362880/40320 = 9

Step 2: What is the probability of each permutation occurring if the robbery rate is equally likely to increase or decrease when the economy tips into a recession (i.e., p0 = 1/2)? We use the multiplication rule:

1. p(++++++++-) = 1/2^9 = 1/512
2. p(+++++++-+) = 1/2^9 = 1/512
3. p(++++++-++) = 1/2^9 = 1/512
4. p(+++++-+++) = 1/2^9 = 1/512
5. p(++++-++++) = 1/2^9 = 1/512
6. p(+++-+++++) = 1/2^9 = 1/512
7. p(++-++++++) = 1/2^9 = 1/512
8. p(+-+++++++) = 1/2^9 = 1/512
9. p(-++++++++) = 1/2^9 = 1/512

Step 3: Use the addition rule to add up the probabilities:

1/512 + 1/512 + 1/512 + 1/512 + 1/512 + 1/512 + 1/512 + 1/512 + 1/512 = 9/512

Or, we can use the binomial probability mass function to calculate the same probability:

p(x decreases out of N trials) = # of combinations × p0^x × (1-p0)^(N-x)

where the number of combinations comes from the formula used above; the number of economic cycles is N = 9; the number of times robbery decreased is x = 1; and (N-x) = 8 is the number of increases. Note that p0 is the null-hypothesis probability that robbery rates decrease when the economy tips into a recession (p0 = 1/2).

p(1 decrease out of 9 trials) = 9 × (1/2)^1 × (1-1/2)^8

and this is equal to

p(1 decrease out of 9 trials) = 9 × 1/2^9 = 9/512

which is the same number we got above.
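
As a check, the same calculation can be done in R (choose() and dbinom() are base R functions):

choose(9, 1)                      # 9 arrangements of 1 decrease in 9 recessions
choose(9, 1) * (1/2)^1 * (1/2)^8  # 9/512, about 0.0176
dbinom(1, size = 9, prob = 1/2)   # same value from the binomial probability function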

  • 34.9: How can we tell if this result is statistically significant?
  • 34.10: We need to decide what evidence would convince us that the null hypothesis, Ho, (p0 = 1/2) is wrong.
  • 34.11: A convention in the field is to say that we want the probability of a type 1 error (rejecting Ho when Ho is true) to be a very small number; the cutoff often used is 0.05. We will use this same cutoff.
  • 34.12: How do we figure out whether the evidence in our study is strong enough to reject Ho?
  • 34.13: Well, our sample estimate of the probability that robbery rates decrease when the economy tips into a recession is 1/9 (about 0.111); under the null hypothesis, the probability of observing just 1 decrease in 9 recessions is 9/512, which is 0.01757812.
  • 34.14: What is the margin of error or confidence interval around our sample estimate (1/9)?
  • 34.15: We can look up the confidence interval (or margin of error) on a table (link).
  • 34.16: If you use this table and look at the column headed "n=9" and "x=1", you will see that the 95% confidence interval (or margin of error) for the probability of a decrease in robbery rates when a recession hits is [0.013,0.414]. Notice that this 95% confidence interval (or margin of error) does not include the number 1/2.
  • 34.17: Based on this evidence, we would reject the hypothesis that p0 = 1/2.
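
If you want to run this kind of test in R, binom.test() will do it. One caveat: its confidence interval is based on the exact (Clopper-Pearson) method, which is not necessarily the same method used to build the course look-up table, so the interval endpoints will differ somewhat from [0.013,0.414] even though the conclusion (reject Ho at the 0.05 level) is the same here:

binom.test(x = 1, n = 9, p = 0.5)   # exact binomial test of Ho: p0 = 1/2, with 1 decrease in 9 recessions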

Lesson 15 - Thursday 3/27/25

  • We are currently in Chapter 7; continuing with Topic 34 (Binomial Distribution).
  • This week's practice/homework questions are problems 7.6-7.8.
  • 34.18: In our last class, we looked at changes in robbery when the economy tips into a recession.
  • 34.19: Today, we will study the robbery data from an updated version of the Cook and Zarkin study.
  • 34.20: Between the early 1980s and the 2010s, additional recessions occurred beyond those studied by Cook and Zarkin; Bushway, Cook, and Phillips (2013; link) studied all 13 recessions.
  • 34.21: They found that, when the economy transitioned from a growth phase to a recession, robbery rates decreased in 3 of the 13 recessions and increased in the other 10. Can we reject the hypothesis that p0 = 1/2?
  • 34.22: We conduct our test with a Type I error risk of p < 0.05 (probability of rejecting Ho when Ho is true).
  • 34.23: The sample estimated fraction of recessions where robbery decreased is 3/13 (which is 0.2307692).
  • 34.24: The hypothesis to be tested is that p0 = 1/2.
  • 34.25: Does the 95% confidence interval (or margin of error) around 3/13 include the number 1/2?
  • 34.26: Based on the table of binomial confidence intervals, we find that the margin of error when we have 13 "experiments" and 3 "events" (in this case, an event is defined as "the robbery rate declines") is [0.070,0.497].
  • 34.27: Since this interval does not include the number 1/2, we conclude that the evidence is strong enough to reject Ho.
  • 34.28: Based on the data we observe in our sample, we conclude that p0 is not equal to 1/2.
  • 34.29: Notice that I framed this discussion in terms of p(robberies decrease when a recession arrives) = 3/13 but we could also have framed it in terms of p(robberies increase when a recession arrives).
  • 34.30: If we did this, our sample estimate would be 10/13 instead of 3/13. But the substantive conclusion would be the same.
  • 34.31: The confidence interval (or margin of error) is [1-0.497 = 0.503,1-0.070 = 0.930] or more simply, [0.503,0.930]. Notice that the 95% confidence interval still does not include 0.5, so we still reject Ho.
  • 34.32: So, our conclusion does not depend on how we define what an "event" is in a binomial problem. Let's look at another example.

New Example

  • What happens to homicide rates when a state abolishes the death penalty?
  • We can see when states abolished the death penalty by going to the Death Penalty Information Center (DPIC) website.
  • Next, we look up the homicide rates at the Centers for Disease Control and Prevention website.
State Year of Abolition Pre-Rate Post-Rate Sign
Colorado 2020 4.3 6.3 +
Connecticut 2012 3.8 2.8 -
Delaware 2016 6.9 6.5 -
Illinois 2011 6.1 6.5 +
Maryland 2013 6.9 6.5 -
Massachusetts 1984 3.4 3.5 +
New Hampshire 2019 1.5 1.0 -
New Jersey 2007 10.1 9.2 -
New Mexico 2009 7.5 7.4 -
New York 2007 5.0 4.6 -
North Dakota 1973 1.0 1.9 +
Rhode Island 1984 3.8 3.9 +
Vermont 1972 2.2 1.5 -
Virginia 2021 6.2 7.5 +
Washington 2023 5.4 3.2 -
  • We treat each of these 15 abolition cases as a naturally occurring experiment.
  • The event of interest is whether the homicide rate increases after the death penalty is abolished.
  • Our null hypothesis, p0 = 1/2, asserts that homicide rates are equally likely to increase or decrease after the death penalty is abolished.
  • Our table shows that in 6 of the 15 experiments, the homicide rate increased after the death penalty was abolished.
  • This means that our evidence in this sample is 6/15 = 0.4.
  • Is the evidence in our sample strong enough to reject the p0 = 1/2 (null) hypothesis?
  • We consult our table which is based on setting the probability of getting a Type I error at 0.05.
  • If we look at the column for N = 15 and the row with 6 events, we see that the confidence interval associated with our sample estimate (0.4) is [0.188,0.647].
  • Since this confidence interval includes the number 1/2 (our p0 value), we fail to reject the hypothesis that p0 = 1/2.
  • 34.33: In these examples, we have been using the binomial probability distribution as a sampling distribution, a theoretical probability distribution.
  • 34.34: If we flip a fair coin 10 times (Table 7.5 in the textbook) there is a decent chance that we will not get exactly 5 heads. Suppose we get 7 heads; as you can see from Table 7.5, there is a greater than 10% chance that we could get 7 heads when we flip a fair coin 10 times.
  • 34.35: In our examples, we are doing something similar. We are trying to discern the chance that we could get results like the ones we got if the true underlying probability of an event occurring is equal to 1/2. In the case of robbery rates when the economy tips into a recession, we conclude that it is unlikely we could have gotten the results we did if the true underlying chance of robbery decreasing was really 1/2. We reached the opposite conclusion in our death penalty abolition/homicide rates example.

Lesson 16 - Tuesday 4/1/25

  • We have been considering conceptual issues related to hypothesis testing.
  • The steps we have examined include: (1) specify a hypothesis; (2) before looking at the data, state the evidence that would convince us that the hypothesis should be rejected; (3) collect appropriate data; and (4) discern whether the evidence is strong enough to reject the hypothesis.

Review of Last Week's Procedure

  • Let's review the test we conducted last week pertaining to the abolition of the death penalty in 15 states.
  • Step 1: specify the hypothesis to be tested; we generally call this the null hypothesis. Generally, we test a null hypothesis about a population parameter value based on sample data. In this case, we will say that the population parameter is p0 and the hypothesis to be tested is that p0 = 1/2. This implies that homicide rates are equally likely to increase or decrease when a state abolishes the death penalty. Note that we do not observe p0; we can only infer whether the evidence is strong enough to reject the hypothesis that p0 is 1/2.
  • Step 2: What evidence would convince us that p0 is not 1/2? Last week, we approached this problem by thinking about a 95% confidence interval (or 95% margin of error interval) for the sample estimate of the probability that a state's homicide rate increases when it abolishes the death penalty. If we divide the number of states where the homicide rate increased after abolition by the total number of states that abolished the death penalty, we get this sample estimate, which we will call ps. We will reject the hypothesis that p0 = 1/2 if the 95% confidence interval (or margin of error) around ps does not include 1/2.
  • Step 3: When we collect our data we find that 15 states abolished the death penalty after 1970; among these 15 states, we see that in 6 instances, the homicide rate increased after abolition while it decreased in the other 9 instances. This means that ps = 6/15 (which reduces to 2/5 or 0.40).
  • Step 4: Based on our lookup table, we see that the 95% confidence interval associated with this estimate is [0.188,0.647]. Since this interval includes the number 1/2, we fail to reject the hypothesis that p0 = 1/2.

The Textbook Procedure (pp. 149 and 152-153).

  • Step 1: p0 = 1/2.
  • Step 2: What evidence would convince us that p0 is not 1/2? Our first step is to create a table of probabilities. We use the binomial probability function to make these calculations. As an example, consider the probability of getting 5 increases out of 15 states if p0 = 1/2 (this is review from last week):
p(5 increases out of 15 states if p0 = 1/2) = 15!/(5!10!) × (1/2)^5 × (1/2)^10 = 0.09164429

Let's do another one, just to make sure we've got it:

p(9 increases out of 15 states if p0 = 1/2) = 15!/(9!6!) × (1/2)^9 × (1/2)^6 = 0.1527405

Note: we can perform similar calculations for each of the possible events in the sample space. Table 7.5 on page 148 in your textbook has an example of how these calculations are carried out using the binomial probability formula.

# of Increases p(# of Increases if p0 = 1/2)
0 0.00003051758
1 0.0004577637
2 0.003204346
3 0.0138855
4 0.04165649
5 0.09164429
6 0.1527405
7 0.1963806
8 0.1963806
9 0.1527405
10 0.09164429
11 0.04165649
12 0.0138855
13 0.003204346
14 0.0004577637
15 0.00003051758

Based on this table, we see that p(0 or 1 or 2 or 3 or 12 or 13 or 14 or 15 increases) = 0.00003051758 + 0.0004577637 + 0.003204346 + 0.0138855 + 0.0138855 + 0.003204346 + 0.0004577637 + 0.00003051758 = 0.03515625 (the addition rule from 34.8 above). Note that this number is less than 0.05 so it constitutes our critical region. If we get a number of increases in this range, we would conclude that our evidence is strong enough to reject the Ho that p0 = 1/2 because the probability of getting a number of increases in this range is small if p0 is really 1/2. Also note that if we allowed 4 increases and/or 11 increases to be in the critical region, our chance of making a Type 1 error would become larger than 0.05.
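
A minimal R sketch that reproduces this table and the critical-region probability (dbinom() gives binomial probabilities):

x = 0:15
probs = dbinom(x, size = 15, prob = 1/2)   # p(# of increases if p0 = 1/2)
round(probs, 8)                            # reproduces the table above
sum(probs[x <= 3 | x >= 12])               # critical-region probability: 0.03515625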

  • Step 3: we collect our data and see that the homicide rate increased after abolition in 6 states while it decreased after abolition in 9 states.
  • Step 4: since 6 increases is not in the critical region, we fail to reject the hypothesis that p0 = 1/2.
  • It is important to observe that using both procedures we failed to reject the hypothesis that p0 = 1/2.

Another Example

  • Suppose you are a crime analyst at a local police department. As a result of a grant, the police department is able to open 12 new district substations. After the substations have been operating for a year, the police chief asks you to compare robbery rates from the year before each station opened to the year after the station opened using a binomial significance test (with the probability of a Type I error set to be less than 0.05). The null hypothesis is that each district is equally likely to experience an increase or a decrease after the new substation opens (p0 = 1/2).
  • Evidence required to reject Ho: the 95% confidence interval for ps (our sample estimate) does not include 1/2; this is equivalent to conducting our test at the p < 0.05 significance level (i.e., the probability of a Type 1 error is less than 0.05).
  • Reminder: a Type 1 error means we reject Ho, when Ho is true.
  • Collect data: 2 of the districts experienced a decrease while 10 of the districts experienced an increase.
  • Specify the distribution for the number of districts experiencing an increase:
Example: p(3 increases out of 12 districts if p0 = 1/2) = 12!/(3!9!) × (1/2)^3 × (1/2)^9 = 0.05371094

Let's do another one:

Example: p(7 increases out of 12 districts if p0 = 1/2) = 12!/(7!5!) × (1/2)^7 × (1/2)^5 = 0.1933594
# of Increases p(# of Increases if p0 = 1/2)
0 0.0002441406
1 0.002929688
2 0.01611328
3 0.05371094
4 0.1208496
5 0.1933594
6 0.2255859
7 0.1933594
8 0.1208496
9 0.05371094
10 0.01611328
11 0.002929688
12 0.0002441406
  • Identify the evidence that would cause us to reject Ho at the p < 0.05 significance level.
  • p(0 or 1 or 2 or 10 or 11 or 12 increases) = 0.0002441406 + 0.002929688 + 0.01611328 + 0.01611328 + 0.002929688 + 0.0002441406 = 0.03857422 (this is the critical region).
  • Why do we focus on these outcomes?
  • Because they are the outcomes that are least likely to occur if Ho is true (i.e., p0 = 1/2).
  • If we were to reject Ho when there are 3 increases or 9 increases, our p-value (probability of a type I error) would exceed 0.05.
  • These 6 outcomes are the ones that would lead us to reject Ho at the p < 0.05 significance level.
  • So, these 6 outcomes constitute the critical region for our test.
  • Since 10 increases occurred and 10 is in the critical region, we reject Ho and conclude that p0 is not equal to 1/2.
  • Note: we could also frame the test in terms of decreases: ps = 2/12 = 1/6 (or 0.1666667). Based on our 95% confidence interval look-up table we can see that the confidence interval or margin of error around this estimate is [0.036,0.436]. Since this interval does not include 1/2 we reject Ho and conclude that p0 is not 1/2.
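
A similar R sketch for the substation example (n = 12 districts, p0 = 1/2):

x = 0:12
probs = dbinom(x, size = 12, prob = 1/2)   # p(# of increases if p0 = 1/2)
sum(probs[x <= 2 | x >= 10])               # critical-region probability: 0.03857422
dbinom(10, size = 12, prob = 1/2)          # probability of exactly 10 increases under Ho: 0.01611328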

Topic 35: Measurement Type (Chapter 8 Begins Here)

  • 35.1: The type of statistical hypothesis test we will do will vary depending on the type of data we have.
  • 35.2: So far, we have been working with data that is measured at the nominal level (either an event happens or it doesn't; there is no logical ordering of the categories).
  • 35.3: The example in Chapter 8 also involves nominal data, but it looks at the effect of a hot-spots intervention; in this study, there is a treatment group and a control group. As Table 8.1 on page 161 shows, there were 11 experimental comparisons, and in 10 of the 11 comparisons the experimental site had a better outcome than the control site. This leads us to say we have 10 "pluses" and 1 "minus". So the sign (+/-) is a nominal, two-category variable (i.e., it doesn't matter whether you order the outcomes +/- or -/+). On page 164, your book refers to this as a "nominal binary scale."

Topic 36: Assumptions About the Population

  • 36.1: The binomial tests we have been using do not make strong assumptions about the population, so they are examples of nonparametric tests.
  • 36.2: Other tests that we will do later in the semester do make strong assumptions about the population; they will be called parametric tests.
  • 36.3: The reason the binomial test is nonparametric is that it is based on a physical process. If we conduct a 10-coin-flip experiment many times, we can say with certainty what the distribution of the number of heads (from 0-10) will be. We don't have to make an assumption about it.

Topic 37: Sampling Method

  • 37.1: External validity: when we conduct a study we are concerned with the generalizability of the results.
  • 37.2: Sampling: the process by which cases are included in a specific study (probability (random) and non-probability samples)
  • 37.3: Sampling Frame: the list of population members from which a sample is chosen.
  • 37.4: Independent random sampling: one case's inclusion in the sample has nothing to do with whether another case is included in the sample.
  • 37.5: Sampling with replacement: once a case is sampled it is put back into the population so that it is possible it could be drawn again.
  • 37.6: For the binomial test, we assume that the cases we are studying are a representative or random sample from the population.
  • 37.7: The textbook notes on page 168 that our samples are often not random samples from a population, "However, we can ask whether our sample is likely to provide valid inferences to those populations."
  • 37.8: This is actually a deep philosophical topic. Can we use a binomial test when our sample is not a random sample from a well-defined population?
  • 37.9: For an answer to this question, it is useful to return to the concept of a 10 coin-flip experiment. If the true underlying p0 = 1/2, then we would expect to see a particular probability distribution for the number of heads (between 0 and 10). Depending on how many heads we get, we can say whether we have strong enough evidence to reject Ho that p0 = 1/2. Here we are asking what is the probability of getting the result we got if p0 = 1/2. The question is a reasonable question even if the sample is not chosen randomly.
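
A small simulation makes 37.9 concrete: if we repeat the 10-coin-flip experiment many times, the simulated distribution of the number of heads matches the theoretical binomial distribution. A minimal sketch (the seed and the number of replications are arbitrary choices for illustration):

set.seed(509)                                  # any seed works; chosen for reproducibility
heads = rbinom(100000, size = 10, prob = 1/2)  # 100,000 replications of the 10-flip experiment
table(heads) / 100000                          # simulated relative frequencies of 0-10 heads
dbinom(0:10, size = 10, prob = 1/2)            # exact binomial probabilities for comparison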
