Sunday, 8 March 2020

Understanding Two-Stage Sampling (both SRSWoR) with R

Two-stage sampling is the process of sampling the "sampling units" at two different stages. The units selected at the first stage of sampling are called the first stage units and the units or group of units within first stage units are called the second stage units or subunits.
The process of two-stage sampling works as follows: 
  1. we divide the whole population units into different clusters
  2. we select n clusters out of the N clusters (first stage selection)
  3. a sample of size mi is selected from the ith selected cluster, i.e. we select a specified number of units from each selected cluster (second stage selection)
It is a more flexible sampling technique than one-stage sampling (commonly known as cluster sampling), and it reduces to one-stage sampling when the number of units sampled from each cluster equals the number of units in that cluster. For a given total number of sampled units, two-stage sampling usually gives higher statistical precision than one-stage sampling because the sample is spread over more clusters. That precision comes at a cost, though: covering more clusters generally makes this technique more expensive than one-stage sampling.
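The three steps above can be sketched in a few lines of base R. This is only a minimal illustration on a made-up population: the data frame "pop", the cluster sizes and the helper "pick_units" are assumptions for the sketch, not part of the exercise dataset.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy population: 6 clusters with unequal sizes
pop <- data.frame(cc = rep(1:6, times = c(4, 6, 5, 8, 3, 7)))
pop$unit <- seq_len(nrow(pop))

# Stage 1: select n = 3 of the N = 6 clusters by SRSWoR
first_stage <- sample(unique(pop$cc), size = 3, replace = FALSE)

# Stage 2: select m_i = 2 units by SRSWoR inside each selected cluster
# (pick_units is a hypothetical helper written for this sketch)
pick_units <- function(k) {
  rows <- which(pop$cc == k)
  rows[sample.int(length(rows), size = 2)]
}
second_stage <- pop[unlist(lapply(first_stage, pick_units)), ]
second_stage
```

Setting the second-stage size equal to the full cluster size would turn this into one-stage (cluster) sampling.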

Figure 1 shows the number of villages/wards in each cluster of a particular geographical region (each square box represents a cluster). The region has 16 clusters with cluster codes (cc) from 1 to 16, and cluster i has Mi (i = 1, 2, ..., 16) villages/wards. We use an example dataset named sample_data.csv for this exercise. The dataset consists of six variables, viz. Sl_No - Serial Number, region - region code, cc - cluster code, por - Place of Residence, VoW - Village or Ward code and dist - distance of the district hospital from the village or ward (in km). [The mean distance of the district hospital from the village or ward for this particular region is 15.75 km]

The process of estimating the mean distance of the district hospital from the village or ward using two-stage sampling technique in R is given below.
Figure 1: No. of villages/wards by cluster codes of a region

First, set the working directory for the exercise. The command "setwd" does not create a folder; it points R to an existing one,

> setwd("path of the directory/folder")

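To sample the data using standard sampling techniques, we use a package called "samplingbook". Loading it also attaches the "sampling" and "survey" packages it depends on, which provide the "srswor", "svydesign" and "svymean" commands used below. The package can be installed and loaded using the following commands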

> install.packages("samplingbook")
> library(samplingbook)

Then import the data using the command "read.csv"

> sdata = read.csv("sample_data.csv" , header = TRUE)

Five of the 16 clusters are selected randomly (SRSWoR) in the first stage using the commands below. The command "srswor" returns a vector of sixteen 0/1 values with exactly five 1s, one value per cluster. Here, "fs" is a new variable taking values 0 and 1 (0 means the cluster is not included in the sample while 1 means the cluster is included in the sample). Finally, "dfst" is the required sample dataset at the first stage, containing only the selected clusters.

> table(sdata$cc); cc = 1:16
> c = srswor(5 , 16); df = data.frame(cc , c); df
> sdata = sdata[order(sdata$cc) , ]

> fs = 0
> for(i in 1:16){
      fs[sdata$cc == df$cc[i]] = c[i]
   }

> sdata$fs = fs
> dfst = subset(sdata , fs == 1)
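For comparison, the same first-stage selection can be sketched with base R only, using "sample" to draw the cluster codes and "%in%" to build the 0/1 indicator. The data frame "toy" and its cluster sizes are hypothetical stand-ins for sample_data.csv, so this illustrates the idea and is not a replacement for the commands above.

```r
set.seed(2020)

# Hypothetical stand-in for sample_data.csv: 16 clusters of varying size
toy <- data.frame(cc = rep(1:16, times = rep(c(10, 20, 15, 5), 4)))

# Draw 5 of the 16 cluster codes by SRSWoR
chosen <- sample(1:16, size = 5, replace = FALSE)

# Indicator (1 = cluster selected, 0 = not), then keep the selected clusters
toy$fs <- as.integer(toy$cc %in% chosen)
toy_first <- subset(toy, fs == 1)
table(toy_first$cc)
```

The vectorised "%in%" comparison avoids the explicit loop over the 16 clusters.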

Now, in the second stage, a total of 200 sampling units are selected randomly (SRSWoR) across the selected clusters. The number of units to draw from the ith selected cluster is its share of the units in the first stage sample dataset multiplied by 200 (proportional allocation). Here, "t" holds the number of units in each selected cluster, i.e. table(dfst$cc). The R commands are as follows,

> t = table(dfst$cc)
> s1 = round(t[1]/length(dfst$fs)*200)
> s2 = round(t[2]/length(dfst$fs)*200)
> s3 = round(t[3]/length(dfst$fs)*200)
> s4 = round(t[4]/length(dfst$fs)*200)
> s5 = round(t[5]/length(dfst$fs)*200)
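The allocation can also be computed for all clusters at once. The sketch below uses made-up cluster sizes (the vector "M" is an assumption, not taken from the dataset) and shows one thing to watch for: after rounding, the sizes do not always add up to exactly 200.

```r
# Hypothetical sizes M_i of five selected clusters, named by cluster code
M <- c("2" = 120, "5" = 80, "9" = 200, "11" = 60, "14" = 140)

# Proportional allocation of a total second-stage sample of 200 units
m <- round(M / sum(M) * 200)
m       # 40 27 67 20 47
sum(m)  # 201, one more than the target because of rounding
```

If the total must be exactly 200, one of the mi has to be adjusted by hand after rounding.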

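The 200 sampling units are then drawn using the mi (s1 to s5) computed above for the 5 selected clusters. Because "dfst" is sorted by cluster code, each cluster occupies a consecutive block of rows, so the units are sampled from each block of row positions. "dtwost" is the final sample dataset selected using the two-stage sampling technique. The R commands are as follows,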

> sst1 = sample(1:t[1] , s1 , replace = FALSE)
> sst2 = sample((cumsum(t)[1] + 1):cumsum(t)[2] , s2 , replace = FALSE)
> sst3 = sample((cumsum(t)[2] + 1):cumsum(t)[3] , s3 , replace = FALSE)
> sst4 = sample((cumsum(t)[3] + 1):cumsum(t)[4] , s4 , replace = FALSE)
> sst5 = sample((cumsum(t)[4] + 1):cumsum(t)[5] , s5 , replace = FALSE)
> sst = sort(c(sst1 , sst2 , sst3 , sst4 , sst5))
> dtwost = dfst[sst , ]
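The same within-cluster draw can be sketched without tracking cumulative row positions by splitting the row numbers by cluster. The data frame "first" and the sizes in "m" below are hypothetical, chosen only to keep the example small.

```r
set.seed(42)

# Hypothetical first-stage data: three selected clusters, rows sorted by cc
first <- data.frame(cc = rep(c(3, 7, 12), times = c(10, 20, 30)),
                    dist = round(runif(60, 1, 30), 1))
m <- c("3" = 2, "7" = 4, "12" = 6)   # assumed second-stage sample sizes

# Row positions grouped by cluster, then SRSWoR inside each group
rows_by_cluster <- split(seq_len(nrow(first)), first$cc)
picked <- unlist(Map(function(rows, size) rows[sample.int(length(rows), size)],
                     rows_by_cluster, m[names(rows_by_cluster)]))
second <- first[sort(picked), ]
table(second$cc)
```

Since each cluster's rows are handled separately, a row can never be drawn for the wrong cluster or picked twice.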

Specifying the "finite population correction" requires the population size at each stage, and it is needed for computing the mean distance of the district hospital from the village or ward under a two-stage sampling technique. We therefore create two new variables: N1, with all values equal to 16 (the total number of clusters), and N2, with values equal to the number of sampling units in the corresponding selected cluster at the first stage of sampling.

> dtwost$N1 = 16
> dtwost$N2[dtwost$cc == as.numeric(names(t)[1])] = t[1]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[2])] = t[2]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[3])] = t[3]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[4])] = t[4]
> dtwost$N2[dtwost$cc == as.numeric(names(t)[5])] = t[5]

Finally, a two-stage survey design is specified on the sampled data "dtwost" using the "svydesign" command. Then, using the "svymean" command, the required mean distance of the district hospital from the village or ward is estimated.

> twost = svydesign(id = ~ cc + Sl_No , data = dtwost , fpc = ~N1+N2)
> svymean(~dist , design = twost)
           mean
   dist  15.038

The estimated mean distance of the district hospital from the villages or wards using the two-stage sampling technique is 15.04 km, compared with 15.75 km for all the villages or wards of the region. The estimate is close to the actual mean distance.
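To see what such an estimate is built from, here is a hand computation on a tiny made-up sample. It assumes the estimator is a weighted mean whose weights are the inverse sampling fractions at the two stages, (N1/n) * (N2/mi), which is our understanding of how the design above weights the data when only "fpc" is supplied. All numbers are hypothetical.

```r
# Hypothetical two-stage sample: 2 of N = 4 clusters, unequal subsample sizes
samp <- data.frame(
  cc   = c(1, 1, 2, 2, 2, 2),
  dist = c(10, 14, 20, 22, 18, 24),
  N1   = 4,                          # clusters in the population
  N2   = c(10, 10, 40, 40, 40, 40)   # units in each selected cluster
)

n1 <- length(unique(samp$cc))                # clusters sampled
m  <- ave(samp$dist, samp$cc, FUN = length)  # units sampled per cluster

# Assumed design weight: inverse of the overall sampling fraction
w <- (samp$N1 / n1) * (samp$N2 / m)

weighted_mean <- sum(w * samp$dist) / sum(w)
weighted_mean      # 19.2
mean(samp$dist)    # 18, the unweighted mean for comparison
```

With proportional allocation, as in the exercise, the ratio N2/mi is roughly the same in every cluster, so the weighted and unweighted means will be close.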

The sampled dataset is then exported in .csv format using the following command,

> write.csv(dtwost , "data_dtwost.csv" , row.names = FALSE)
