# Sampling Distribution of the Mean

#### brian.field

##### Well-Known Member
Subscriber
I always found this to be a bit confusing, so I spent some time today looking at it some more. I also felt it would be helpful for some others, so I have included my thoughts below.

@David Harper CFA FRM - please let me know if you have any concerns with my thoughts.

Assume we have a population with a mean of 3.5 and a variance of 2.916667.
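These two numbers match a single fair six-sided die, which is the population used later in this post. As a quick check, here is a minimal sketch that computes them exactly; the variable names are just illustrative.

```python
from fractions import Fraction

# Assumption: the population is one fair six-sided die (faces 1..6, equally likely).
faces = range(1, 7)
p = Fraction(1, 6)

# Population mean: E[X] = sum of p * x over all faces.
pop_mean = sum(p * x for x in faces)

# Population variance: E[(X - mu)^2], dividing by N (not N - 1) since this is the whole population.
pop_var = sum(p * (x - pop_mean) ** 2 for x in faces)

print(pop_mean, float(pop_mean))  # 7/2 3.5
print(pop_var, round(float(pop_var), 6))  # 35/12 2.916667
```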

Now, let's assume we take a sample of size 2 from this population. The actual mean of this specific sample may or may not be 3.5, but its expected value is still 3.5. Similarly, the sample mean is itself a random variable, and its variance is the variance of the population divided by n, or 2.916667/2.

Let's repeat this 1000 times. The resulting 1000 samples of size 2 produce 1000 different means, each of which is a point estimate of the population mean. Further, the mean of the 1000 point estimates is also a point estimate of the population mean! This mean of the 1000 individual means estimates the mean of the sampling distribution of the mean (which is exactly 3.5)!

Note that the sample size is still 2 and NOT 1000 in the above scenario! (I always found it difficult to decide if the sample size was 2 or 1000)!

The variance of the 1000 sample means (each from a sample of size 2) is expected to be close to the variance of the original population divided by 2 - remember n = 2 not 1000!

Now, let's assume we take samples of size 100 and we repeat the above exercise. Now, n = 100 and we still produce 1000 point estimates, each from a sample of size 100. Again, the average of the 1000 means will be close to 3.5 - it should be closer to 3.5 than the average associated with the 1000 point estimates from samples of size 2, but in both cases the "expected value" is 3.5. Similarly, the variance of the distribution of means for the 1000 point estimates from samples of size 100 is expected to be 2.916667/100 - i.e., the variance is much lower than in the first example. This is intuitive since the impact of an outlier on a sample of size 100 is significantly less than the impact an outlier would have on a sample of size 2!
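The two scenarios above (1000 samples of size 2, then 1000 samples of size 100) can be sketched as follows. This is only an illustration, assuming the population is a fair die sampled with replacement; `simulate_sample_means` is a made-up helper name, and the seed is arbitrary.

```python
import random
import statistics

POP_VAR = 35 / 12  # variance of one fair die roll, about 2.916667


def simulate_sample_means(n, replications, rng):
    """Return `replications` sample means, each from n die rolls (with replacement)."""
    return [
        statistics.fmean([rng.randint(1, 6) for _ in range(n)])
        for _ in range(replications)
    ]


rng = random.Random(42)  # arbitrary seed, only so the run is repeatable
for n in (2, 100):
    means = simulate_sample_means(n, 1000, rng)
    print(
        f"n={n}: mean of means={statistics.fmean(means):.4f}, "
        f"variance of means={statistics.variance(means):.4f}, "
        f"population variance / n={POP_VAR / n:.4f}"
    )
```

Note that the divisor in the last column is the sample size n (2 or 100), never the 1000 repetitions; the repetitions only determine how closely the simulated figures track the theoretical ones.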

Now is the interesting part... and this is where I have had trouble in the past. We know via the CLT that the sampling distribution of the sample mean will be approximately normal for a sufficiently large n, regardless of the original population's distribution (provided its variance is finite).

I often wondered if the sampling in the above scenarios was with or without replacement. My analysis suggests to me that it must be with replacement (particularly if the population is discrete and small)!

Consider a single die! Then we know that the expected value is 3.5 and the variance is 2.9166667, as I elected to use above. Using a random number generator in Excel, I generated 100 samples of size 2, or in other words, 200 rolls, since they are all independent! Then I took the average of each pair to arrive at a sampling distribution of the mean. The average of the 100 sample means was 3.4250 (fairly close to the true population mean of 3.5). I also calculated the variance of the 100 averages, which turned out to be 1.3619. This is pretty close to the population variance divided by 2, or 2.9166667/2 = 1.45833.

Then, I repeated the exercise with 100 samples of size 6! This is where it gets interesting. Since I am using a single die as my population, if I take a sample of size 6 without replacement, then my average for every single iteration will equal the population average AND the variance of the sampling distribution of sample means will be 0, since every sample point estimate for the population mean would be 3.5! This is not consistent with the rule that the variance of the sampling distribution of the sample mean is equal to the population variance divided by n! i.e., this rule suggests that the variance of the sampling distribution of sample means (which are all 3.5) should be 2.9166667/6 = 0.486111 and not 0!
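The contrast described above can be reproduced with a short sketch. It treats the six faces as the whole population and compares samples of size 6 drawn without replacement (`random.sample`) against samples drawn with replacement (`random.choices`). The seed and variable names are arbitrary choices for illustration.

```python
import random
import statistics

population = [1, 2, 3, 4, 5, 6]  # assumption: the six faces are the entire population
rng = random.Random(0)           # arbitrary seed for repeatability

# Without replacement: a sample of size 6 always contains every face exactly once.
means_without = [statistics.fmean(rng.sample(population, 6)) for _ in range(100)]

# With replacement: six independent rolls, so the sample mean varies.
means_with = [statistics.fmean(rng.choices(population, k=6)) for _ in range(100)]

print("distinct means without replacement:", set(means_without))
print("variance without replacement:", statistics.pvariance(means_without))
print("variance with replacement:", round(statistics.variance(means_with), 4))
print("population variance / 6:", round((35 / 12) / 6, 4))
```

As far as I know, the variance/n rule assumes independent draws. For draws without replacement from a finite population of size N, the usual adjustment multiplies the variance/n figure by (N - n)/(N - 1), which is exactly 0 when n = N = 6, in line with the result above.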

I have also been bothered by the fact that the variance of the sampling distribution of the mean approaches 0 as n approaches infinity! How can n approach infinity if the population has only 6 elements!!!!

Hence, I have concluded that the sampling must be "with replacement"!

So, I then generated 100 samples of size 500 each (obviously with replacement since we are dealing with a single die with 6 elements)! Now, the average of the 100 samples of size 500 is 3.5123 (closer to the population mean of 3.5) and the variance of this distribution of sample means is 0.0052 which is very close to the population variance divided by 500, or 2.916667/500 = 0.0058 (and incidentally, pretty close to 0 I might add)!

So, as n approaches infinity, the variance of the sampling distribution does approach 0! Or, in other words, the sampling distribution approaches a normal distribution with mean equal to the population mean and variance approaching 0!

One last question for @David Harper CFA FRM!

How can we have a normal distribution with variance 0?

Thanks for letting me put this down....

Brian

#### brian.field

##### Well-Known Member
Subscriber
Now I am confused again....

#### brian.field

##### Well-Known Member
Subscriber
After reading some other posts on this topic, I fear I may be incorrect on the definition of n!

If we have m random samples of size n, what is the mu-hat and what is sigma-hat?

Say we have a sample of 10 and we repeat this 100 times; then n=10 and m=100. Does this mean we have 100 different mu-hats, or do we have 1 mu-hat which is the average of the 100 means from the samples of size 10?

Similarly, what is the denominator for the variance of the sampling distribution of the sample mean? Is it 10 or 100?

So confused....

Now I am wondering if I should stop reviewing material that I am supposed to already know!

#### brian.field

##### Well-Known Member
Subscriber
What a crapshoot! I think I get it now. We only need one actual sample of size n. Then, the "expected value" of the sampling distribution of the sample mean is equal to the population mean. This is clear.

Again, assume only 1 sample. The variance of the sampling distribution of the sample mean is the population variance divided by n where n is the number of elements in the specific sample, not the number of samples taken!
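For reference, the standard derivation behind this conclusion is short. It assumes the n observations X_1, ..., X_n in the single sample are independent draws from a population with mean mu and variance sigma^2 (the notation is mine, not from the posts above):

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\qquad
E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu}{n} = \mu
\qquad
\operatorname{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
```

The number of repeated samples never appears in these expressions; only the sample size n does. For the die, sigma^2 = 35/12, so n = 2 gives 35/24, or about 1.4583.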

#### brian.field

##### Well-Known Member
Subscriber
@David Harper CFA FRM - I know I was kind of all over the place above but could you offer your thoughts so that I can be sure that I am thinking about things correctly?

Thanks!

Brian

#### David Harper CFA FRM

##### David Harper CFA FRM
Staff member
Subscriber
Hi @brian.field sorry for the delay responding ... As I was reading/contemplating your post, I similarly parroted your single die experiment in Excel at https://www.dropbox.com/s/al3rzoauvzi0s0e/sample-mean-die.xlsx?dl=0 (it's quick and dirty, not pretty). I can tell you I just assumed the CLT applies "with replacement". My first thought is: if it were "without replacement," then our sample size can't approach infinity, it is limited by the population size (right?); second, if "without replacement," might the independence criterion be violated? (I didn't take time to research this, but it just seems intuitive to me that without-replacement samples aren't "independent").

To confirm your first post, we can define a matrix of (n) columns by (r) rows, where n (i.e., columns) is the sample size and the number of rows is the number of simulations. In my Excel, the tab "n=2" has two columns and 100 rows (simulations). The tab "n=10" has ten columns and 100 rows. The CLT tells us to expect the variance of the sample mean to be ~= 1.46 = 2.92/2 when the sample size is two; and to expect the variance of the sample mean to be ~= 0.29 = 2.92/10 when the sample size is ten. The number of simulations (rows) won't change this; increasing the number of rows should only produce an actual simulated value which is nearer to the expected (analytical) values of 1.46 or 0.29. What's the key difference? The key difference is that we are computing a sample mean over the observations in each row. Like you, I find it challenging to distinguish between sample size (columns) and simulation size (rows) but the key, it seems to me, is: what's the random variable? The random variable is the sample mean. How is that produced? It is produced by a calculation on the row. So the size of the row is the key (in CLT).
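The row/column layout can be sketched in a few lines of Python. This is not the linked spreadsheet, just a rough equivalent under the same assumptions (fair die, with replacement); `simulate_matrix` is a hypothetical helper name and the seed is arbitrary.

```python
import random
import statistics

POP_VAR = 35 / 12  # variance of one fair die roll, about 2.92


def simulate_matrix(n_cols, r_rows, rng):
    """Build r_rows simulated samples, each a row of n_cols die rolls."""
    return [[rng.randint(1, 6) for _ in range(n_cols)] for _ in range(r_rows)]


rng = random.Random(7)  # arbitrary seed for repeatability
for n in (2, 10):
    matrix = simulate_matrix(n_cols=n, r_rows=100, rng=rng)
    # The random variable is the row mean, so each row yields one observation of it.
    row_means = [statistics.fmean(row) for row in matrix]
    print(
        f"n={n}: simulated variance of row means={statistics.variance(row_means):.3f}, "
        f"analytical value={POP_VAR / n:.3f}"
    )
```

Raising `r_rows` should pull the simulated variance closer to the analytical value without changing the analytical value itself, which depends only on the row length `n_cols`.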

In case it is interesting, I had to report on this as part of my Johns Hopkins Data Science curriculum; my brief report is here @ https://rpubs.com/bionicturtle/jh-ds-st-p1
i.e., this demonstrates the CLT with an exponential distribution, assuming a sample size of 40 and 1,000 simulations.