WBM STATS Solutions: Understanding Stratified Sampling with R

Let X be the characteristic that we want to study. If the population is homogeneous with respect to X, a sample selected by simple random sampling will also be homogeneous, and the sample mean will be a good and reliable estimate of the population mean. Such a sample is more likely to be representative of its population than one drawn from a heterogeneous population. The variance of the sample mean depends not only on the sample size and the sampling fraction but also on the population variance. Therefore, to increase the precision of the estimates, it helps to draw the sample from populations that are homogeneous with respect to the characteristic under study. One sampling technique that achieves this is stratified sampling.
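For reference, the dependence on sample size, sampling fraction and population variance can be written out explicitly. The notation here is introduced for illustration and is not part of the original exercise: N is the population size, n the sample size and S² the population variance of X. Under simple random sampling without replacement, the variance of the sample mean is:

```latex
\operatorname{Var}(\bar{x}) = \left(1 - \frac{n}{N}\right)\frac{S^{2}}{n},
\qquad
S^{2} = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^{2}
```

For fixed n and N, the only way to make this variance smaller is to make S² smaller, which is what sampling within homogeneous groups aims at.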

Stratified sampling works as follows:

1. Divide the population into smaller groups or sub-populations (known as strata) such that the sampling units are homogeneous with respect to X within each stratum but heterogeneous between strata.
2. Treat each stratum as a separate population and draw a sample from it using simple random sampling.
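The two steps can be sketched in base R on a small simulated population. Everything in this sketch (the data frame pop, the strata "A" and "B", the sample size n_h) is made up for illustration and is not the dataset used later in this post.

```r
# Hypothetical population: two strata with clearly different means
set.seed(1)
pop <- data.frame(
  stratum = rep(c("A", "B"), times = c(600, 400)),
  x       = c(rnorm(600, mean = 10, sd = 2), rnorm(400, mean = 30, sd = 2))
)

# Units are much more alike within a stratum than in the population as a whole
var(pop$x)
tapply(pop$x, pop$stratum, var)

# Step 1: divide the population into strata
strata <- split(pop, pop$stratum)

# Step 2: draw a simple random sample (without replacement) from each stratum
n_h  <- 50
samp <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), n_h), ]))

# Number of sampled units per stratum
table(samp$stratum)
```

The within-stratum variances should come out far smaller than the overall variance, which is the situation in which stratification pays off.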

Here, we use an example dataset named sample_data.csv. The dataset is based on a hypothetical region and has six variables: Sl_No (serial number), region (region code), cc (cluster code), por (place of residence), VoW (village or ward code) and dist (distance of the district hospital from the village or ward, in km). Place of residence is coded 1 for Rural and 2 for Urban, and the two categories are treated as separate strata. We will draw a sample of size 100 from each of the two strata using simple random sampling and estimate the mean distance of the district hospital from the village or ward. [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km.]

The process of estimating the mean distance of the district hospital from the village or ward using the stratified sampling technique in R is given below. Here X is the distance of the district hospital from the villages/wards.

First, set the working directory for the exercise. The working directory is set using the command "setwd",

> setwd("path of the directory/folder")

In order to sample the data using standard sampling techniques, you need a package called "samplingbook". This package can be installed and loaded using the following commands,

> install.packages("samplingbook")

> library(samplingbook)

Then import the data using the command "read.csv"

> sdata = read.csv("sample_data.csv" , header = TRUE)

Treat place of residence (por = 1 "Rural" and por = 2 "Urban") as the two strata. The data are first sorted by por so that, going by the stratum counts from table(), rows 1 to 1156 are rural and rows 1157 to 2378 are urban. Then 100 sampling units are selected at random from each stratum by simple random sampling without replacement (SRSWoR). "dsts" is the final stratified sample dataset (200 sampling units). The R commands are as follows,

> table(sdata$por)

> sdatast = sdata[order(sdata$por) , ]

> sts1 = sample(1:1156 , 100 , replace = FALSE)

> sts2 = sample(1157:2378 , 100 , replace = FALSE)

> sts = sort(c(sts1 , sts2))

> dsts = sdatast[sts , ]
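The row ranges 1:1156 and 1157:2378 are typed in by hand, so they would have to be changed for any other dataset. A minimal sketch of one way to avoid that is given below. Because sample_data.csv is not reproduced here, the sketch uses a simulated stand-in (sdata_demo, with made-up dist values); the names sdata_demo, rows_by_stratum, picked and dsts_demo are illustrative only.

```r
# Simulated stand-in for sample_data.csv (values are made up for illustration)
set.seed(123)
sdata_demo <- data.frame(
  por  = sample(rep(1:2, times = c(1156, 1222))),  # shuffled stratum codes
  dist = round(runif(2378, min = 1, max = 30), 1)
)

# Row positions belonging to each stratum (no sorting of the data needed)
rows_by_stratum <- split(seq_len(nrow(sdata_demo)), sdata_demo$por)

# Draw 100 row positions without replacement within each stratum
picked <- unlist(
  lapply(rows_by_stratum, function(r) r[sample.int(length(r), 100)]),
  use.names = FALSE
)

# Stratified sample of 200 units
dsts_demo <- sdata_demo[sort(picked), ]
table(dsts_demo$por)
```

Indexing through sample.int(length(r), ...) rather than sample(r, ...) is a deliberate choice: sample() treats a single number as the range 1 to that number, which would give wrong rows for a stratum with only one unit.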

To estimate the mean distance of the district hospital from the village or ward under stratified sampling, "svydesign" needs the population size of each stratum, both to weight the strata and to apply the "finite population correction" factor to the standard error. This is supplied by creating a new variable sN with values equal to 1156 for rural areas and 1222 for urban areas.

> dsts$sN[dsts$por == 1] = 1156

> dsts$sN[dsts$por == 2] = 1222
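The stratum sizes can also be counted from the full data instead of being typed in. The sketch below assumes a simulated stand-in for the full data and for the sample (sdata_demo and dsts_demo, both hypothetical), since the original file is not available here.

```r
# Simulated stand-in for the full data and the stratified sample
set.seed(42)
sdata_demo <- data.frame(por = rep(1:2, times = c(1156, 1222)))
dsts_demo  <- sdata_demo[c(sample(1:1156, 100), sample(1157:2378, 100)), ,
                         drop = FALSE]

# Stratum population sizes N_h, counted from the full data
N_h <- table(sdata_demo$por)

# Look up N_h for every sampled row by its stratum code
dsts_demo$sN <- as.numeric(N_h[as.character(dsts_demo$por)])

# Each stratum should carry a single sN value
unique(dsts_demo[, c("por", "sN")])
```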

Finally, a stratified survey design is specified on the sampled data "dsts" using the "svydesign" command, and the required mean distance of the district hospital from the village or ward is estimated using the "svymean" command. Both commands belong to the "survey" package, on which "samplingbook" depends.

> sts = svydesign(id = ~1 , data = dsts , strata = ~por , fpc = ~sN)

> svymean(~dist , design = sts)

       mean    SE
dist 15.739 0.537

The estimated mean distance of the district hospital from the village or ward using stratified sampling is 15.74 km, compared with 15.75 km for all villages and wards of the region.
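The standard error gives a sense of how close the estimate is expected to be. As a rough sketch, an approximate 95% confidence interval can be formed from the reported estimate and standard error, assuming a normal approximation is adequate for a sample of 200 units (that assumption is ours, not something checked in this exercise).

```r
# Values reported by svymean() above
est <- 15.739
se  <- 0.537

# Approximate 95% confidence interval (normal approximation)
z  <- qnorm(0.975)
ci <- c(lower = est - z * se, upper = est + z * se)
round(ci, 2)

# Does the interval cover the population mean quoted earlier (15.75 km)?
unname(ci["lower"] <= 15.75 & 15.75 <= ci["upper"])
```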

The sampled dataset is then exported in .csv format using the following command,

> write.csv(dsts , "data_sts.csv" , row.names = FALSE)

The outcome can be verified using the following commands,

For sample mean,

> st_mean = aggregate(dsts$dist , list(dsts$por) , mean)

> str_mean = (1156/2378)*st_mean[1 , 2] + (1222/2378)*st_mean[2 , 2]; str_mean

[1] 15.739

For sample standard error,

> st_var = aggregate(dsts$dist , list(dsts$por) , var)

> str_var = ((1156/2378)^2)*(1/100-1/1156)*st_var[1 , 2] + ((1222/2378)^2)*(1/100-1/1222)*st_var[2 , 2]

> sqrt(str_var)

[1] 0.537
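The two checks above follow the usual stratified formulas: the estimate is the sum of W_h times the stratum sample means, and its variance is the sum of W_h^2 (1/n_h - 1/N_h) s_h^2, where W_h = N_h/N. A minimal sketch of a helper that applies them to any number of strata is given below. The function name strat_mean and the simulated values are hypothetical and are not part of the samplingbook or survey packages.

```r
# Stratified estimate of the mean and its standard error
#   y       : observed values in the sample
#   stratum : stratum label of each sampled unit
#   N_h     : named vector of stratum population sizes
strat_mean <- function(y, stratum, N_h) {
  stratum <- as.character(stratum)
  labs <- names(N_h)
  n_h  <- tapply(y, stratum, length)[labs]   # sample size per stratum
  ybar <- tapply(y, stratum, mean)[labs]     # sample mean per stratum
  s2   <- tapply(y, stratum, var)[labs]      # sample variance per stratum
  W_h  <- N_h / sum(N_h)                     # stratum weights
  est  <- sum(W_h * ybar)
  v    <- sum(W_h^2 * (1 / n_h - 1 / N_h) * s2)
  c(mean = est, SE = sqrt(v))
}

# Made-up sample: 100 units from each of two strata
set.seed(7)
y  <- c(rnorm(100, mean = 18, sd = 6), rnorm(100, mean = 13, sd = 5))
st <- rep(c(1, 2), each = 100)
strat_mean(y, st, N_h = c("1" = 1156, "2" = 1222))
```

Applied to the real sample, the equivalent call would be strat_mean(dsts$dist, dsts$por, c("1" = 1156, "2" = 1222)), which should agree with the svymean output.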

Saturday, 6 March 2021
