Survey data analysis in Stata: Basic introduction step by step with copiable commands

Stata is powerful data analysis software, but it is commercial, not open source.

Please download a copy of Stata from here.

Link STATA 14

Link for MAC: STATA 14

Link for STATA 12: STATA 12

How to install a portable version of STATA?

Watch Video

************************************

// Note: lines that begin with // or * are comments. All other lines are Stata commands.

//By Mr. Rohan Byanjankar

Supporting materials

Excel file: Download

Do file: Download

//Easily copiable Do file Syntax

PDF file with Syntax: Download 

//clear
clear	
//importing data
 *syntax 
 *import excel "path_to_file.xlsx", sheet("sheet_name") firstrow
import excel "D:\~~~SPSS session\Materials\Files\2077.06.18 SPSS dataset sudal.xlsx", sheet("Data") firstrow	
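//optional check (an added sketch, not part of the original do file):
//describe lists the variable names and types, which confirms that firstrow
//read the first row of the sheet as variable names.
describe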
//browse data
 //syntax
 *browse
 //syntax for browse selecting cases
 *browse in 1/20
browse in 1/20	
//editing data
 //syntax
 *edit
edit	
//labeling or describing a variable
 //syntax
*label variable var_name "description of variable"
*Note: label variable is a Stata command.
label variable Household "Household ID"	
//renaming variable
 //syntax
 *rename old_name new_name
rename Membersinfamily hhsize	
//generating a new variable
 *syntax: gen new_var = expression
gen new_var=0	
//dropping variable
 *syntax: drop var_list
drop new_var	
//count
count if Age>=40	
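//note (an added sketch): Stata treats missing values as larger than any number,
//so Age>=40 also counts observations where Age is missing.
//To leave them out, add an explicit check:
count if Age>=40 & !missing(Age)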
//sort dataset in ascending order
 //syntax *sort
sort Age	
//labeling the values of variables
 //syntax
 *Step 1: define the label with label define.
 *Step 2: attach the label to the variable with label values.
 *label define label_name 1 "text_1" 2 "text_2"
 *label values var_name label_name
*Note: label define and label values are Stata commands.
label define Gender 1"Male" 2"Female"	
label values Gender Gender 	
//label list
label list Gender	
//Labeling for Religion
label define Religion 1"Hinduism" 2"Kirat" 3"Buddhist"
label values Religion Religion	
label list Religion	
//Labeling for family type
label define Familytype 1"Nuclear" 2"Joint"	
label values Familytype Familytype	
//Labeling for Education
label def Education 1"Never Attended School" 2"Attended School" ///	
3"SLC" 4"Intermediate" 5"Bachelors" 6"Masters"	
label values Education Education	
//labeling for Area
label def Area 1"Sudal" 2"Koteshwor"	
label values Area Area	
//Same process of labeling other variables
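//Recoding variable Age
//syntax
*recode var_name (range = code "label") ..., gen(new_var_name)
*Note: when ranges overlap, recode applies the first matching rule, so the ranges below do not share endpoints (ages are assumed to be whole numbers).
recode Age (min/19=1 "Below 20") (20/29=2 "20-29") ///
(30/39=3 "30-39") (40/49=4 "40-49") (50/59=5 "50-59") ///
(60/max=6 "60+"), gen(age_group)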
list Age age_group	
label var age_group "Grouping of Age"	
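//optional check (an added sketch): show the smallest and largest Age inside
//each age_group to confirm that every age landed in the intended category.
tabstat Age, by(age_group) statistics(min max n)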
//Producing tables
 *Frequency tables are meant for categorical variables (nominal and ordinal)
 //one-way tables
tab Gender	
tab Gender,missing //include missing values as a category
tab Occupation	
tab Occupation,missing	
tab Religion //display with label name	
//to display codes instead of labels
tab Religion, nol	
tab Gender,nol //do not display labels	
tab Education	
//all one-way tables
 *syntax: tab1 var_1 var_2 var_n
tab1 Gender Education Familytype	
//two-way tables
 *command: tab row_var col_var
tab Gender Education	
tab Gender Area	
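//all two-way tables of Gender against the other variables
 *tab2 tabulates every pair in the list; firstonly keeps only the pairs that include the first variable
tab2 Gender Area Education, firstonly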
//two-way table with row percent
tab Gender Education, row	
tab Gender Area, row	
//two-way table with column percent
tab Gender Education, col	
tab Gender Area, col	
//two-way table with row percent but no frequency
tab Gender Education, row nofreq	
tab Gender Area, row nofreq	
//two-way table with column percent but no frequency
tab Gender Education, col nofreq	
tab Gender Area, col nofreq	
//graph
 *bar chart, pie chart, box plot...
//bar chart
graph bar,over(Area) ///	
blabel(bar,position(outside) size(11pt)) ///	
ylabel(none) ytitle("Percentage") yscale(r(0,70)) ///	
title("Percentage of respondents by Area",margin(b=4)) ///	
caption("Source: Author's own data") ///	
asyvars	
//pie chart
graph pie,over(Area)
//box plot
graph box Food_today,over(Area) ///	
ylabel(none) ytitle("Food expenses in Rs.") ///	
title("Box plot of Food expenses in Sudal and Koteshwor",margin(b=4)) ///	
asyvars ///	
box(1,color(red)) ///	
box(2,color(navy)) ///	
legend(stack size(small) ///	
title(Area of respondents,size(medium)))	
	
graph export boxplot.png,replace	
//summary statistics
 *we will find summary statistics for numerical variables
 //it shows obs, mean, sd, min, max
 *syntax: sum var_list
sum Food_today Food_10_years_ago Total_expense TotalIncome AverageIncome
 //for detailed statistics (percentiles, skewness, kurtosis)
sum Food_today,detail	
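//summary statistics by group (an added sketch using the same variables):
//tabstat reports the chosen statistics of Food_today separately for each Area.
tabstat Food_today, by(Area) statistics(n mean sd median)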
//correlation
 *syntax: corr var_list
corr Food_today Food_10_years_ago	
//let's try another command
 *syntax: pwcorr var_list //normal correlation
pwcorr Food_today Food_10_years_ago	
//correlation with a significance level
pwcorr Food_today Food_10_years_ago,sig	
//correlation with a star marking significance
pwcorr Food_today Food_10_years_ago,star(0.05) //show a star if sig.<0.05
//hypothesis test
//null hypothesis: hypothesis of no difference
//rule: do not reject the null if p>0.05
//tests covered: t test, z test, F test (ANOVA), chi2
//t test
//one sample t test (variable: numerical)
//null: food expenses=10000
ttest Food_today==10000
//one sample t-test by group
by Area,sort:ttest Food_today == 10000	
//independent sample t test
//null: there is no difference between food expenses in Sudal and Koteshwor.
//numerical variable and categorical variable
*syntax: ttest numerical_var, by(categorical_var) //equal variance
ttest Food_today,by(Area)	
//unequal variance
ttest Food_today,by(Area) unequal	
//paired t test
//both variables numerical
 //null/claim: Food expenses today=Food expenses 10 years ago
ttest Food_today==Food_10_years_ago	
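//the paired t test is the same as a one sample t test on the difference
//between the two variables (an added sketch; food_diff is a new, illustrative variable name):
gen food_diff=Food_today-Food_10_years_ago
ttest food_diff==0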
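//z test
//the z test assumes the population standard deviation is known; in large samples
//the sample standard deviation is commonly used in its place.
//the t test is valid for both large and small samples.
//note: ztest assumes sd(1) unless sd() is specified, so pass the standard deviation explicitly.
//null: food expenses=10000
//if p<0.05, we reject the null.
quietly sum Food_today
ztest Food_today==10000, sd(`r(sd)')
//independent sample z test
//null: there is no difference between food expenses in Sudal and Koteshwor.
//numerical variable and categorical variable
*syntax: ztest numerical_var, by(categorical_var) sd1(#) sd2(#)
//run the next five lines together so the local macros stay defined
quietly sum Food_today if Area==1
local sd1 = r(sd)
quietly sum Food_today if Area==2
local sd2 = r(sd)
ztest Food_today, by(Area) sd1(`sd1') sd2(`sd2')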
//ANOVA
//extension of the independent sample t test
//ANOVA: one numerical variable and one categorical variable with more than two categories
//t test: one numerical variable and one categorical variable with only two categories
 *syntax: anova num_var cat_var
//null: there is no difference in food expenses among religious groups
//if p<0.05, we reject the null.
anova Food_today Religion	
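//ANOVA only tells us whether at least one group differs. As an added sketch,
//oneway reports the same F test plus Bonferroni-adjusted pairwise comparisons:
oneway Food_today Religion, tabulate bonferroni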
//Chi2
//when to use it? when both variables are categorical
//null: there is no association between Gender and Area
tab Gender Area,chi2	
//for expected count
tab Gender Area,chi2 expected 	
//if any expected count is less than 5, Pearson chi2 is unreliable, so we move to Fisher's exact test.
 //for Fisher's exact test
tab Gender Area,chi2 expected exact
//for likelihood-ratio chi2
tab Gender Area,chi2 expected exact lrchi2	
//regression
//multiple linear regression
 *syntax: reg num_dep_var ind_vars
//generating new variables
 *syntax: gen new_var_name = ln(old_var_name) //in case of log transformation
gen ln_food_today=ln(Food_today)	
gen ln_avg_income=ln(AverageIncome)	
//regression begins
reg ln_food_today i.Area i.Gender Age ln_avg_income	
//result interpretation
//area: The food expenses in Koteshwor are 65 percent higher than the food expenses in Sudal.
//income: If income increases by 10 percent, then food expenses increase by 1.9 percent.
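//how such percentages are read from the coefficients (an added sketch; run it
//right after the regression, and note that 2.Area relies on Koteshwor being coded 2):
//for a dummy variable in a log model, the exact percent difference is 100*(exp(b)-1)
display 100*(exp(_b[2.Area])-1)
//for a log-log variable, a 10 percent rise in income changes food expenses by about 10*b percent
display 10*_b[ln_avg_income]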
 //tests for multicollinearity and heteroskedasticity
 //for multicollinearity
vif

//rule of thumb: if every VIF is less than 10, multicollinearity is unlikely to be a serious problem
//for heteroskedasticity	
estat hettest	
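//if estat hettest rejects constant variance, one common remedy (an added
//sketch, not part of the original do file) is to re-estimate the same model with robust standard errors:
reg ln_food_today i.Area i.Gender Age ln_avg_income, vce(robust)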
		
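//for autocorrelation: only for time series data
//estat dwatson requires the data to be declared with tsset first, so it is left as a comment for this cross-sectional survey
*estat dwatson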
		
//logistic regression
 *syntax: logit cat_dep_var ind_vars
//the dependent variable must be coded 0/1: gender1=1 for Male, 0 for Female (missing stays missing)
gen gender1=(Gender==1) if !missing(Gender)
//logit coefficients
//logit coefficients cannot be interpreted like OLS coefficients. Their signs only show the direction of the relationship between the dependent and independent variables.
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype	
//odds ratio
//It is the odds ratio that matters the most. The odds of an event are the probability that the event happens divided by the probability that it does not happen; an odds ratio compares the odds between two groups (or for a one-unit increase in a variable).
//If an odds ratio is greater than 1, the odds of the event are higher. Suppose the odds ratio of "landless" for poverty in a logit regression is 1.2: the odds of being poor are then 20 percent [(1.2-1)*100] higher for "landless" households than for households with "land".
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype,or	
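//an odds ratio is the exponential of the logit coefficient. As an added
//sketch, this reproduces the odds ratio of ln_avg_income from the model just estimated:
display exp(_b[ln_avg_income])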
//Specification tests for logit model
//test: misspecification test
//linktest is built into Stata, so it does not need to be installed
linktest
		
//goodness of fit	
estat gof	
		
//fitstat (user-written command; install it once with: ssc install fitstat)
fitstat
		
//heteroskedasticity and multicollinearity
reg gender1 ln_food_today ln_avg_income i.Area i.Familytype	
estat hettest	
vif	
//logit coefficients	
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype	
//odds ratio	
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype,or	
//for marginal effect	
mfx,force	
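//mfx is an older command. In Stata 11 and later, margins is the usual
//alternative (an added sketch); run after the logit model, it reports
//average marginal effects rather than effects at the means:
margins, dydx(*)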
//export work from Stata to Word
//installing package
 *syntax: ssc install package_name
//we will use outreg2
ssc install outreg2, replace
reg ln_food_today i.Area i.Gender Age ln_avg_income	
outreg2 using word.doc,label	
//logit coefficients
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype	
outreg2 using word1.doc,label ctitle(Logit Coeff) replace	
//odds ratio
logit gender1 ln_food_today ln_avg_income i.Area i.Familytype,or	
outreg2 using word1.doc,label ctitle(Odds ratio) append eform	
//for the marginal effect
mfx,force	
outreg2 using word1.doc,label ctitle(mfx) append mfx	
