MTH 245 (Statistics I) students are required to do the Inferential Statistics project.

According to the Mathematics Department at BRCC, our students are required to use free software.

As of today (07/15/2023), the only free statistical software is R/RStudio.

__Update:__ As of 2024, the statistical software required by the BRCC Mathematics Department is:

(1.) R/RStudio __or__

(2.) Statistical Package for the Social Sciences (SPSS).

SPSS is available virtually for free to VCCS users.

(*Please see Virtual Computing Lab for SPSS*)

The inferential statistics project is designed to:

(1.) Meet the requirements of the VCCS detailed course outline for MTH 245

(2.) Meet one of the learning objectives of the VCCS (Virginia Community College System) standards for:

(a.) MTH 155: Statistical Reasoning

*(Presents elementary statistical methods and concepts including visual data presentation, descriptive statistics, probability, estimation, hypothesis testing, correlation and linear regression.)*

(b.) MTH 245: Statistics I

(3.) Draw statistical inferences from a sample onto a population. This is achieved by any of these tasks:

(a.) Estimate a population parameter using a sample statistic.

(b.) Conduct a hypothesis test of a population parameter using a sample statistic.

(4.) Meet the QM (Quality Matters) and USDOE (United States Department of Education) requirements for distance education as regards the provision of RSI (Regular and Substantive Interaction).

Federal Register: Distance Education and Innovation

St. John's University: New Federal Requirements for Distance Education: Regular and Substantive Interaction (RSI)

Student – Content Interaction: Very high

Student – Student Interaction: Flexible

Student – Faculty Interaction: Very High

(1.) **Data (Dataset):** What dataset(s) should you use?

There are five approved ways to get the dataset(s).

Use any one of them or a combination of them.

(a.) RStudio Datasets:

You may use any of the *applicable* built-in datasets from RStudio.

(b.) Textbook (eBook) Datasets

You may use any of the *applicable* datasets from your eBook

(c.) MyLab Math (MLM) Datasets

You may use any of the *applicable* datasets from your MLM assignments if the sample size is at least 30. (*n* ≥ 30)

(d.) Datasets from the U.S Government website: United States Government's Open Data: Datasets

You may use any of the *applicable* open datasets from the U.S government.

(e.) Data Collection from my Students. (*This option is for onsite (traditional/in-class) students only.*)

Please see me in the Office during Student Engagement hours so we can discuss the data collection methods and other requirements.

**Any other dataset besides the ones mentioned should be pre-approved by me.**
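For option (a.), here is a minimal sketch of how one might inspect a built-in dataset before choosing it. The `women` dataset is used only as an illustration, and the variable names `sampleSize` and `variableNames` are chosen for this sketch.

```r
# To list the datasets that ship with R, run data() in the console.

# Load one built-in dataset and inspect its structure
data(women)
str(women)

# Check the sample size (number of rows) and the names of the variables
sampleSize <- nrow(women)
variableNames <- names(women)
print(sampleSize)
print(variableNames)
```

A check like this shows right away whether a dataset has enough observations and which variables are available for the parameter you intend to estimate or test.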

(2.) **Parameters**

At least two parameters are required.

Please choose any two parameters.

If you choose to do more than two parameters, the best two would be used for your project grade.

The parameters to be estimated (Inferential Statistics) or tested (Hypothesis Testing) are:

(a.) Population Mean (Estimate or Test)

(b.) Population Proportion (Estimate or Test)

(c.) Population Variance (Estimate or Test)

(d.) Population Standard Deviation (Estimate or Test)

(e.) Correlation (Test)

The data values for the variable(s) could be from one sample or from more than one sample. Please ensure you specify the number of samples.

Please review the examples/samples I did for you.

**Inferential Statistics**

(a.) Using Sample Mean to estimate Population Mean

(b.) Using Sample Proportion to estimate Population Proportion

(c.) Using Sample Variance to estimate Population Variance

(d.) Using Sample Standard Deviation to estimate Population Standard Deviation

**Hypothesis Testing**

(a.) Hypothesis Test about a Population Proportion

(b.) Hypothesis Test about a Population Mean

(c.) Hypothesis Test about a Population Variance

(d.) Hypothesis Test about a Population Standard Deviation

(e.) Hypothesis Test about a Correlation

Please __NOTE:__ For **any Hypothesis Tests, at least two approaches are required.**

(3.) This is an individual project.

You may collaborate with one another. However, it is not a group project.

No two students should use the same dataset

I understand you are not Computer Science/Programming students. So, I spent time to write several notes and codes on R/RStudio.

Also, I provided you with several resources which I cited as references.

Be that as it may, I am here to help you. You have my number. Feel free to text me anytime. We can arrange Zoom sessions so I can share my screen and walk you through any questions you have. Please do this as soon as possible. Keep the due dates in mind.

(a.) Please submit the two datasets (names of the datasets including the sources) and at least two parameters that you intend to estimate/test in the

(b.) Once I give you the approval, please send your draft to me via email (if you prefer my review to be seen by you alone) or submit your draft in the

(c.) When everything is fine (after you make changes as applicable based on my feedback), please submit your work in the appropriate area (

Only projects submitted in the appropriate area (

Draft projects are not graded. In other words, projects submitted via email and/or in the

Submitting drafts is highly recommended. If your professor gives you an opportunity to submit a draft, please use that opportunity.

Submitting drafts is not required. It is highly recommended because I want to give you the opportunity to do your project very well and make an excellent grade in it.

(4.) (a.) The deliverables for the project draft are: Google Docs or Microsoft Office Word containing the

(b.) The deliverables for the project are:

(I.) Google Docs or Microsoft Office Word containing the

(II.) The

Zip the document and the folder and submit as a zip folder in the

You may choose to use a table format or other appropriate format that contains the

(5.) (a.) For the RStudio screenshots and RStudio settings, please set the font size in the editor to at least 14. Also, use a transparent background (default

In other words, please do not change the theme. If you do need to change the theme, use a light/transparent theme.

Change only the font size to at least 14.

(b.) As a student, you have access to Microsoft Office suite of apps.

(a.) You can download and install these apps on your laptop/desktop.

You also have access to Google apps.

(c.) For all English terms/work (entire project): use Times New Roman; font size of 14; line spacing of 1.5.

Further, please make sure you have appropriate spacing between each heading and/or section as applicable.

Your work should be well-formatted and visually appealing.

(d.) For all Math terms/work: symbols, variables, numbers, formulas, expressions, equations and fractions among others, the Math Equation Editor is required.

(i.) The font is set to Cambria Math by default (set it to that font if it is not); font size of 14, and align accordingly (preferably left-aligned).

(ii.) To ensure appropriate spacing between your Math work, use a line spacing of 2.0.

Alternatively, you may use line spacing of 1.5 but insert a space after each equation as applicable.

Your work should be well-formatted, organized, well-spaced (not compact), and visually appealing.

(e.) Include page numbers. You may include at the top of the pages or at the bottom of the pages but not both.

(6.) All work must be turned in by the final due date to receive credit.

Please note the due dates listed in the course syllabus for the submission of the draft and the actual project. In the course syllabus, we have the:

(a.) Initial due date for the Project Draft: Please turn in your draft.

(b.) Initial due date for the Project: If your draft is not ready for submission, keep working with me. Make changes based on my feedback and keep working with me until I give you the

If you prefer not to turn in a draft, please review all the resources provided for you and do your project well and submit.

(c.) Final due date for the Project Draft: This is necessary if you want written feedback on your draft.

After this date, written feedback would not be provided for your draft. However, verbal feedback would still be provided during Office Hours/Student Engagement Hours/Live Sessions.

(d.) Final due date for the Project: All work must be turned in by this date to receive credit.

After this date, no work may be accepted.

Name: | Your name |

Date: | The date |

Instructor: | Samuel Chukwuemeka |

Project: |
(Please choose one) Inferential Statistics or Hypothesis Testing |

1st Dataset: (Please write the name of the dataset and describe it.) |
1st Source: (Please write your source) |

1st Parameter: |
(Please choose one) (a.) Population Mean (Estimating or Testing) (b.) Population Proportion (Estimating or Testing) (c.) Population Variance (Estimating or Testing) (d.) Population Standard Deviation (Estimating or Testing) (e.) Correlation (Testing) (1.) Specify the sample, number of unique sample type(s), and the sample size. (2.) Write the population. (3.) Write the type of estimation or type of test. (4.) Verify the requirements/conditions for your estimation/test or assume that requirements/conditions are satisfied. (5.) Make reliable assumptions for any missing requirement and support it with sources if possible. (6.) State the reasons for your estimation/test. (7.) Write and run the codes for your estimation/test. For hypothesis tests, at least two approaches are required. (8.) Write comments in your codes accordingly. (9.) Explain the results. (10.) Interpret the results in the context of the specific objectives. |

2nd Dataset: (Please write the name of the dataset and describe it) |
2nd Source: (Please write your source) |

2nd Parameter: |
(Please choose one) (a.) Population Mean (Estimating or Testing) (b.) Population Proportion (Estimating or Testing) (c.) Population Variance (Estimating or Testing) (d.) Population Standard Deviation (Estimating or Testing) (e.) Correlation (Testing) (1.) Specify the sample, number of unique sample type(s), and the sample size. (2.) Write the population. (3.) Write the type of estimation or type of test. (4.) Verify the requirements/conditions for your estimation/test or assume that requirements/conditions are satisfied. (5.) Make reliable assumptions for any missing requirement and support it with sources if possible. (6.) State the reasons for your estimation/test. (7.) Write and run the codes for your estimation/test. For hypothesis tests, at least two approaches are required. (8.) Write comments in your codes accordingly. (9.) Explain the results. (10.) Interpret the results in the context of the specific objectives. |

Objectives: |
(Please write specific objectives) (1.) (2.) (3.) (4.) (5.) |

References: |
Please cite your sources accordingly.
Indicate the citation format. |

**Dataset:** 2011 D75 School Surveys | NYC (New York City) Open Data

**Source:** Data.gov: The Home of the U.S. Government's Open Data

**General Description:** (Taken from: https://data.cityofnewyork.us/Education/2011-D75-School-Surveys/t9nb-zfe4)

NYC Department of Education 2011 District 75 School Surveys.

Every year, all parents, all teachers, and students in grades 6 – 12 take the NYC School Survey.

The survey ranks among the largest surveys of any kind ever conducted nationally.

Survey results provide insight into a school's learning environment and contribute a measure of diversification that
goes beyond test scores on the Progress Report.

NYC School Survey results contribute 10% - 15% of a school's Progress Report grade (the exact contribution to the Progress Report is dependent on school type).

Survey questions assess the community's opinions on academic expectations, communication, engagement, and safety and respect.

School leaders can use survey results to better understand their own school's strengths and target areas for improvement.

**Specific Description:**

The data set is the NYC Department of Education 2011 District 75 School Surveys.

Specifically, we are interested in determining the proportion of students who **Strongly agree** that they
feel welcome in their school.

So, in the file, we look at the

Importing the Excel file into RStudio so we can analyze the data

Let us clear the console window so we have more space to write our code

We are only interested in:

(a.) the number of students who responded (Column: Number of Student Responses).

There are NA and N_s also in that column. But we are interested in the numeric values.

(b.) the number of students who strongly agree that they feel welcome in their schools (Column: 1a. I feel welcome in my school.).

We shall deal with the numeric values only.

(1.) Construct a 95% confidence interval for the population proportion of New York City (NYC) Students in Grades 6 – 12 who strongly agree that they feel welcome in their schools in the year 2011.

(2.) Interpret the confidence interval.

(3.) Estimate the population proportion of New York City (NYC) Students in Grades 6 – 12 who strongly agree that they feel welcome in their schools in the year 2011.

(1.) The sample is a simple random sample.

(2.) The sample is large enough for the sampling distribution of the sample proportion to be approximately normal.

$ \hat{p} = \dfrac{x}{n} \\[5ex] $ where:

p̂ = sample proportion

We have to write the codes now: the codes to:

(a.) determine the number of all the students who strongly agree that they feel welcome in their schools in the year 2011.

This is the numerator: $x$ (1a. I feel welcome in my school.)

It is the number of individuals in the sample with the specified characteristic.

(b.) determine the total number of all the students who responded.

This is the denominator: $n$ (Number of Student Responses)

It is the sample size.

Let me explain and hopefully, this explanation will make some sense to you. If it does not, please let me know so I'll try another approach/example.

The "Number of Eligible Students" is the population size.

As you may have noticed, not every eligible student responded to the survey.

So, we want to use the ones who responded to determine everyone's response

In other words, we want to use the sample proportion (the proportion of the ones who responded) to

This is known as Inferential Statistics: using the results from a sample to

Let us do some explanations before we write the codes:

(1.) It is always good to work with a copy so we do not mess up the original should we need to use it again.

(a.) So, we shall make a copy of the dataset.

(b.) Then, we shall use appropriate variables to represent each of the two columns, beginning with the column for the numerator: (1a. I feel welcome in my school.)

(2.) There is one header row.

There are 59 rows.

The first four rows of the two columns that we need, contain non-numeric values. We do not need those values.

So, we shall focus on the values from the 5th row up to the 59th row for the two columns.

(3.) There is at least one "NA" (Not Available) non-numeric value in each of the two columns.

We shall replace those "NA" values with 0's using the

`gsub()`

function. This is important so we can add them with the numeric values.

(4.) We shall determine the class of the data values and make sure they are numeric.

This is done using the

`class()`

function. If they are not numeric, we shall convert them to numeric values using the

`as.numeric()`

function.

(5.) Then, we shall add the data values in each of the two columns.
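Steps (1.) to (5.) can be sketched as follows for the numerator column. This is only a sketch: the data frame `survey` and its values are made up to stand in for the imported file, and the column name `welcome` is a hypothetical short name for the column (1a. I feel welcome in my school.).

```r
# Hypothetical stand-in for the imported survey file (values are made up).
# The first four rows hold non-numeric text, as described in step (2.).
survey <- data.frame(
  welcome = c("text", "text", "text", "text", "12", "NA", "30", "8"),
  stringsAsFactors = FALSE
)

# (1.) Work with a copy so the original stays untouched
surveyCopy <- survey

# (2.) Keep only the rows that hold data values (5th row to the last row)
welcomeValues <- surveyCopy$welcome[5:nrow(surveyCopy)]

# (3.) Replace the text "NA" with "0" so the values can be added
welcomeValues <- gsub("NA", "0", welcomeValues)

# (4.) Check the class, then convert the text values to numeric values
print(class(welcomeValues))
welcomeNumeric <- as.numeric(welcomeValues)

# (5.) Add the data values: this is the numerator x
x <- sum(welcomeNumeric)
print(x)
```

The same steps apply to the denominator column (Number of Student Responses).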

Some students probably liked the use of several variables.

Some students probably did not. I can imagine some programmers that will be frustrated at me for using many variables.

Can we write all these using few variables and few lines of code? Of course, we can.

So, let us write the code for the denominator (finding the sample size) using only two variables and few lines of code.
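A compact sketch of that idea is shown here. The vector `responsesColumn` is a made-up stand-in for the column (Number of Student Responses); apart from that stand-in, only the two variables `responses` and `n` are used.

```r
# Hypothetical stand-in for the "Number of Student Responses" column
responsesColumn <- c("text", "text", "text", "text", "20", "NA", "55", "25")

# Variable 1: drop the four non-numeric rows, replace "NA" with "0",
# and convert the text values to numeric values in one line
responses <- as.numeric(gsub("NA", "0", responsesColumn[-(1:4)]))

# Variable 2: the sample size n is the sum of the responses
n <- sum(responses)
print(n)
```

Nesting the function calls saves variables, at the cost of making each line a little harder to read.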

(6.) Run the

`prop.test()`

function. The sample size is greater than 30. So, we shall use the argument:

`correct = FALSE`

If the sample size were less than 30, we would use the argument:

`correct = TRUE`

Also, if the confidence level is not specified in the argument (`conf.level`), it is 95% by default.
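A minimal sketch of the call is shown here. The counts `x` and `n` are hypothetical values chosen for illustration; they are not the totals from the survey file.

```r
# Hypothetical counts (not the survey totals)
x <- 520    # number in the sample with the specified characteristic
n <- 1000   # sample size

# 95% confidence interval for the population proportion,
# without the continuity correction
result <- prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)

pHat <- unname(result$estimate)        # sample proportion (point estimate)
confidenceInterval <- result$conf.int  # lower and upper bounds
print(pHat)
print(confidenceInterval)
```

Storing the result in a variable lets us pull out the point estimate and the bounds separately when we write the interpretation.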

Based on the results:

The sample proportion (point estimate) is 0.5213829

The 95% confidence interval is (0.5088406, 0.5338984)

This implies that New York City Department of Education District 75 is 95% confident that the population percentage of students who strongly agree they feel welcome in their schools in the year 2011 is between 50.88406% and 53.38984%.

**Dataset:** women

**Source:** R/RStudio

**Description:**

This data set gives the average heights and weights for American women aged 30 – 39.

The data set appears to have been taken from the American Society of Actuaries *Build and Blood Pressure Study*
for some (unknown to us) earlier year.

The World Almanac notes: “The figures represent weights in ordinary indoor clothing and shoes, and heights with
shoes”.

**Variable:** Height

**Unit of the variable:** inches

**Sample Size:** 15

**Sample:** 15 American women aged 30 – 39 in the year 2000

**Population:** All American women aged 30 – 39 in the year 2000

**Assume Year:** 2000 (*The year is unknown, hence we assume the year.*)

**Objectives:**

(1.) Construct a 95% confidence interval for the population mean of the heights of American women aged 30 – 39 in the year 2000.

(2.) Interpret the confidence interval.

(3.) Estimate the population mean of the heights of American women aged 30 – 39 in the year 2000.

(*In other words, we want to use the heights of 15 American women aged 30 – 39 in the year 2000 to estimate the average height of all American women aged 30 – 39 in the year 2000*)

**Parameter to estimate:** Population Mean

**Test:** *t-test*

**Reason for Test:** The population standard deviation was not given.

**Verify Requirements for Test**

(1.) The sample is a simple random sample.

(2.) The population is normally distributed.

(3.) The sample size is less than 5% of the population size.

95% Confidence Interval for Population Mean from Sample Data

We shall use the `t.test()`

function

Because we are only interested in the heights, we shall use: `t.test(women$height)`

*If the confidence level is not specified in the argument (`conf.level`), it is 95% by default.*
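A short sketch of the call, with the confidence level written out explicitly, is shown here. The variable names are chosen for this sketch.

```r
# One-sample t procedure on the heights in the built-in women dataset
result <- t.test(women$height, conf.level = 0.95)

testStatistic <- unname(result$statistic)     # t test statistic
degreesOfFreedom <- unname(result$parameter)  # sample size minus 1
sampleMean <- unname(result$estimate)         # mean of the heights
confidenceInterval <- result$conf.int         # lower and upper bounds
print(result)
```

Extracting the pieces from `result` makes it easier to quote each value when explaining the results.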

Based on the results:

The test statistic is 56.292

The degrees of freedom is 14

The sample mean is 65 inches

The 95% confidence interval is (62.52341, 67.47659) inches
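As a check, the same interval can be rebuilt by hand from the sample mean, the sample standard deviation, and the critical *t* value. This is a sketch; the variable names are chosen for illustration.

```r
heights <- women$height

n <- length(heights)   # sample size
xBar <- mean(heights)  # sample mean
s <- sd(heights)       # sample standard deviation

# Critical t value for a 95% confidence level (0.025 in each tail)
tCritical <- qt(p = 0.025, df = n - 1, lower.tail = FALSE)

# Margin of error and the bounds of the confidence interval
E <- tCritical * s / sqrt(n)
lowerBound <- xBar - E
upperBound <- xBar + E
print(c(lowerBound, upperBound))
```

The bounds agree with the interval reported by `t.test(women$height)`.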

We are 95% confident the population mean of the heights of American women aged 30 – 39 in the year 2000 is
between 62.52341 inches and 67.47659 inches.

This implies that if we took many samples of 15 American women aged 30 – 39 in the year 2000 and constructed a 95% confidence interval from each sample, about 95% of those intervals would contain the population mean.
The interval from our sample is (62.52341, 67.47659) inches.

**Dataset:** MatchedWeights (*the name we shall give it because it does not have a name*)

**Source:** MyLab Math (MLM)

**1st Column:** Reported Weights

**2nd Column:** Measured Weights

**Description:**

The data set gives the measured and reported weights (in pounds) of 127 female subjects.

**Question:**

Listed in the accompanying table are 127 measured and reported weights (lb) of female subjects.

Use the listed paired sample data, and assume that the samples are simple random samples and that the differences have a distribution that is approximately normal.

(a.) Use a 0.05 significance level to test the claim that for females, the measured weights tend to be higher than the reported weights.

In this example, μ_{d} is the mean value of the differences d for the population of all pairs of data, where each individual difference *d* is defined as the measured weight minus the reported weight.

What are the null and alternative hypotheses for the hypothesis test?

The difference is the: measured weights minus the reported weights.

**Null Hypothesis:** H_{0}: μ_{d} = 0 (*because the measured weights are assumed to be equal to the reported weights.*)

**Alternative Hypothesis:** H_{1}: μ_{d} > 0 (*because the measured weights tend to be higher than the reported weights.*)

(b.) Test the claim that the measured weights tend to be higher than the reported weights for females.

Use at least two approaches.

Interpret your results. This includes your decision and your conclusion.

**Variable of both subjects:** Weight

**Unit of the variable(s):** pounds

**Sample Size for both subjects:** 127

**Sample:** 127 American females in the year 2023 (*the nationality and year are assumed.*)

**Population:** All American females in the year 2023 (*the nationality and year are assumed.*)

**Objectives:**

(1.) Test the claim that the measured weights of 127 American females in year 2023 are higher than their reported weights using the Critical Value Method (Classical Approach).

(2.) Test the claim that the measured weights of 127 American females in year 2023 are higher than their reported weights using the P-Value (Probability-Value) Approach.

(3.) Test the claim that the measured weights of 127 American females in year 2023 are higher than their reported weights using the Confidence Interval Method.

(4.) Interpret the results.

(5.) Write the decision.

(6.) State the conclusion.

**Parameter to test:** Population Mean

**Test:** *t-test*

**Direction of Test:** Right-tailed test (*because of the greater than symbol: > in the alternative hypothesis*)

**Reason for Test:** The population standard deviation was not given.

**Verify Requirements for Test**

(1.) The sample data are matched pairs and equal sample size.

(2.) The matched pairs are simple random samples.

(3.) The sample size is large (at least a sample size of 30 for each pair).

(4.) The population from which the pairs of values were drawn is normally distributed.

Download the dataset. Rename it to **MatchedWeights**

Import the dataset into RStudio

1st Approach: **Critical Value (Classical) Approach**

Define and Assign Variables. Use appropriate names for the variables.

$
t = \dfrac{\bar{x}_d - \mu_d}{\dfrac{s_d}{\sqrt{n}}} \\[7ex]
$
where:

$d$ is the differences for the paired sample data (difference between the measured weight and the reported weight)

$t$ is the *t* test statistic

$\bar{x}_d$ is the mean of the differences for the paired sample data

$s_d$ is the standard deviation of the differences for the paired sample data

$\mu_d$ is the mean value of the differences for the population of all the pairs of data

$n$ is the sample size of either sample (because of matched pair of samples)
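The formula can be sketched in code as follows. The two vectors are made-up weights used only for illustration; they are not the MLM data, and the variable names are chosen for this sketch.

```r
# Hypothetical paired weights in pounds (made up; not the MLM data)
measured <- c(152.4, 130.1, 141.7, 168.9, 125.3, 159.0)
reported <- c(150.0, 128.0, 142.0, 165.0, 124.0, 157.5)

d <- measured - reported  # differences: measured minus reported
n <- length(d)            # number of pairs
xBarD <- mean(d)          # mean of the differences
sD <- sd(d)               # standard deviation of the differences
muD <- 0                  # value claimed by the null hypothesis

# t test statistic for matched pairs
testStatistic <- (xBarD - muD) / (sD / sqrt(n))
print(testStatistic)
```

The value agrees with the statistic reported by `t.test(measured, reported, paired = TRUE)`.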

*Did you notice the:
(a.) exact value of the test statistic?
(b.) approximate value of the test statistic?*

We have determined the test statistic

We need to determine the critical value of the

The level of significance is 0.05 (given by the question)

The test is a Right-tailed test (because of the alternative hypothesis)

The degrees of freedom for this *t* test is 1 less than the sample size (sample size − 1), whatever the direction of the test.

The

`qt()`

function is used to determine the critical value.

For a one-tailed, left-tailed test, the code is:

`qt(p = significanceLevel, df = sampleSize - 1, lower.tail = TRUE)`

or `qt(p = significanceLevel, df = sampleSize - 1)`

(because `lower.tail = TRUE` is the default, omitting that argument gives the left-tail value). If we want a right-tailed test, then we set the lower tail to the Boolean value of FALSE.

For a one-tailed, right-tailed test, the code is:

`qt(p = significanceLevel, df = sampleSize - 1, lower.tail = FALSE)`

For a two-tailed test, the code is:

`qt(p = significanceLevel / 2, df = sampleSize - 1, lower.tail = FALSE)`

So, the code we shall use is:

`qt(p = 0.05, df = 126, lower.tail = FALSE)`
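A sketch of the critical value calculation, followed by the decision rule for a right-tailed test, is shown here. The value assigned to `testStatistic` is a hypothetical placeholder, not the value from the MLM data.

```r
significanceLevel <- 0.05
sampleSize <- 127

# Critical t value for a right-tailed test
criticalValue <- qt(p = significanceLevel, df = sampleSize - 1, lower.tail = FALSE)
print(criticalValue)

# Decision rule for a right-tailed test:
# reject the null hypothesis if the test statistic exceeds the critical value
testStatistic <- 2.5  # hypothetical placeholder
rejectNull <- testStatistic > criticalValue
print(rejectNull)
```

Replace the placeholder with the test statistic computed from your own differences.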

(a.) exact value of the test statistic?

(b.) approximate value of the test statistic?

Critical Value Method for Right-tailed test:

The

The critical

The test statistic is greater than the critical value

This implies that the test statistic falls in the critical region

2nd Approach:

Let us determine the probability value (P-value): the probability that a *t* value (written as criticalT in the expressions that follow) is greater than the test statistic

To determine this probability, we shall use the

`pt()`

function.

For a one-tailed, left-tailed test, P(criticalT < −testStatistic) is given by:

`pt(q = -1 * testStatistic, df = sampleSize - 1, lower.tail = TRUE)`

or `pt(q = -1 * testStatistic, df = sampleSize - 1)`

(because `lower.tail = TRUE` is the default, omitting that argument gives the left-tail probability). If we want a right-tailed test, then we set the lower tail to the Boolean value of FALSE.

For a one-tailed, right-tailed test, P(criticalT > testStatistic) is given by:

`pt(q = testStatistic, df = sampleSize - 1, lower.tail = FALSE)`

For a two-tailed test, [P(criticalT < −testStatistic) + P(criticalT > testStatistic)] is given by:

`2 * pt(q = abs(testStatistic), df = sampleSize - 1, lower.tail = FALSE)`

So, the code we shall use is:

`pt(q = testStatistic, df = 126, lower.tail = FALSE)`
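A sketch of the probability value calculation is shown here, with a cross-check against the built-in paired test. The two vectors are made-up weights, not the MLM data, and the variable names are chosen for this sketch.

```r
# Hypothetical paired weights in pounds (made up; not the MLM data)
measured <- c(152.4, 130.1, 141.7, 168.9, 125.3, 159.0)
reported <- c(150.0, 128.0, 142.0, 165.0, 124.0, 157.5)

d <- measured - reported
n <- length(d)
testStatistic <- mean(d) / (sd(d) / sqrt(n))

# Right-tailed probability value from the t distribution
pValue <- pt(q = testStatistic, df = n - 1, lower.tail = FALSE)

# Cross-check with the built-in paired t test (right-tailed)
builtIn <- t.test(measured, reported, paired = TRUE, alternative = "greater")
print(pValue)
print(builtIn$p.value)

# Decision: reject the null hypothesis if the probability value
# is less than the significance level
significanceLevel <- 0.05
rejectNull <- pValue < significanceLevel
print(rejectNull)
```

Getting the same probability value from both routes is a useful check that the test statistic and the degrees of freedom were set up correctly.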

(a.) exact value of the probability value?

(b.) approximate value of the probability value?

Probability Value Method for Right-tailed test:

The significance level is: 0.05

The probability value is: 0.00592803278263704

The probability value is less than the significance level

3rd Approach:

Define and Assign Variables. Use appropriate names for the variables.

$ CL = 1 - \alpha ...one-tailed\;\;test \\[3ex] CL = 1 - 0.05 \\[3ex] CL = 0.95 = 95\% \\[3ex] E = t_{\dfrac{\alpha}{2}} \cdot \dfrac{s_d}{\sqrt{n}} \\[5ex] \underline{Confidence\;\;Interval} \\[3ex] \bar{x}_d - E \lt \mu_d \lt \bar{x}_d + E \\[3ex] $ where:

$\alpha$ is the significance level

$CL$ is the confidence level

$E$ is the margin of error

$t_{\dfrac{\alpha}{2}}$ is the critical

$\bar{x}_d - E$ is the lower bound of the confidence interval

$\bar{x}_d + E$ is the upper bound of the confidence interval
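The margin of error and the bounds can be sketched in code as follows. The two vectors are made-up weights, not the MLM data, and `confidenceLevel` is left as a variable so it can be changed as needed.

```r
# Hypothetical paired weights in pounds (made up; not the MLM data)
measured <- c(152.4, 130.1, 141.7, 168.9, 125.3, 159.0)
reported <- c(150.0, 128.0, 142.0, 165.0, 124.0, 157.5)

d <- measured - reported
n <- length(d)

confidenceLevel <- 0.95
alpha <- 1 - confidenceLevel

# Critical t value with alpha / 2 in the upper tail
tCritical <- qt(p = alpha / 2, df = n - 1, lower.tail = FALSE)

# Margin of error and the bounds of the confidence interval
E <- tCritical * sd(d) / sqrt(n)
lowerBound <- mean(d) - E
upperBound <- mean(d) + E
print(c(lowerBound, upperBound))

# Does the interval contain 0?
containsZero <- lowerBound <= 0 && upperBound >= 0
print(containsZero)
```

With these settings, the bounds agree with `t.test(d, conf.level = confidenceLevel)$conf.int`.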

The 95% confidence interval is: (0.652670811106761, 3.06543942511371)

The lower bound of the confidence interval is: 0.652670811106761 pounds

The lower bound is greater than 0

The upper bound of the confidence interval is: 3.06543942511371 pounds

The upper bound is also greater than 0

The confidence interval does not contain 0.

Both bounds are positive. Therefore, the data support the claim that the mean of the differences is greater than 0.

Let students know you are willing to help.

Chukwuemeka, Samuel Dominic (2023). *R and RStudio Statistics Software.*
Retrieved from https://statistical-science.appspot.com/

McNeil, D. R. (1977). *Interactive Data Analysis.* Wiley.

*2011 D75 School Surveys.* (2019, May 9). Data.gov; data.cityofnewyork.us. https://catalog.data.gov/dataset/2011-d75-school-surveys

Triola, M. F. (2022). *Elementary Statistics* (14th ed.). Hoboken: Pearson.

*R Guides* (n.d.). Statology. https://www.statology.org/r-guides/