Code – Steven Guo

Stata Codes

open -n /Applications/Stata/StataMP.app
set more off, permanently

///////////////// section
//sub-section
xxxx // desc
* used to describe a step

scatter Y X
regress Y X
predict yhat
predict yhat if diff>=-5 & diff<=5 // for an event study, but this needs a loop
predict residuals, res
scatter residuals x
tab x
tab x y
tab1 x y
// generate dummy variables for x
tab x, generate(dummyname) // creates dummyname1, dummyname2, ... one per value of x

br varA varB varC

drop varname1 varname2 varname3
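drop if missing(x1) // no space between missing and (; works for numeric and string variables; for a string, missing means an empty cell ("")
drop if x1==. // same effect for a numeric variable, whose missing cells show as a dot (.); does not catch extended missing values .a-.z
drop if varname >=. // catches every numeric missing value (., .a, ..., .z)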
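
The three drop conditions above differ only when extended missing values (.a to .z) are present. A minimal sketch on a made-up five-row dataset; the variable names x1 and s1 and the values are illustrative only:

```stata
clear
input x1 str5 s1
1   "a"
.   ""
.a  "b"
4   ""
5   "c"
end

count if x1 == .        // only the system missing value .
count if x1 >= .        // . and .a
count if missing(x1)    // same rows as x1 >= .
count if missing(s1)    // empty strings count as missing

drop if missing(x1, s1) // drop a row if either variable is missing
list
```

missing() accepts several arguments and is true if any of them is missing, which keeps the condition short when cleaning on many variables.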

des var1 var2 // show the properties (storage type, display format, labels) of var1 and var2

list x y z in 1/10
list x if y==3 & x==2 in 1/300, N
help
gsort +var7 +var8…

rename var1 var2
rename (v1-v18) (var1 var2…etc)

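reg price weight headroom // this is a comment
reg price weight /// this is the line-continuation marker; the command continues on the next line
headroom

control+shift+D // run the current line or the selection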

preserve

gen f=.
replace f=1 if

replace x=x-1

gen time=date(obs,"YMD")
gen time=monthly(obs,"YM")
gen time=yearly(obs,"Y") // the mask uses capital letters

gen time=year(var1)

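* If a variable displays text in the Data Editor, it is shown in blue or red:
* blue is a numeric variable (with value labels), red is a string variable

* If a variable displays numbers, it is shown in black or red:
* black is numeric, red is a string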

destring obs, generate(Month) ignore(":")
destring varname, replace
destring varname, generate(newvar)

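gen var1=substr(var2,3,9) // start at the 3rd character of var2; the new variable is 9 characters long
gen long var1 = floor(var2/10) // for a numeric variable, divide by 10 to drop the last digit, by 100 to drop the last two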

keep if varname==1
keep if varname1==varname2
keep if varname=="good"

keep in 1/n
use in 1/n using filename
drop in 1/n

// to delete observations manually, generate a variable called "delete", then edit the data in the Data Editor
// set it to "y" for each observation to be deleted
gen delete="" // string variable

//variable name & variable labels export
sysuse auto, clear
describe, replace
export excel name varlab using test.xlsx, firstrow(variables) replace

order varlist1 varlist2 // move these variables to the front of the dataset
move var1 var2 // move var1 to the position of var2 (older command; order var1, before(var2) is the current form)

sort varname
by varname: keep if _n==1
// this is equivalent to
bysort varname: keep if _n==1
// this is equivalent to
sort varname
by varname: gen count=_n
keep if count==1 // by without the sort option requires the data to be sorted by varlist;
// by and bysort are really the same command; bysort is just by with the sort option.
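
A short sketch of _n and _N within groups, using the auto dataset that ships with Stata. The new variable names are illustrative; putting price in parentheses sorts within each group without making it part of the by-group:

```stata
sysuse auto, clear
bysort rep78 (price): gen rank_in_group = _n   // 1 = cheapest car within each rep78 group
by rep78: gen group_size = _N                  // number of cars in the group
by rep78: keep if _n == 1                      // keep the cheapest car per group
list rep78 make price group_size
```

Missing rep78 forms its own group, so the result has one row per repair-record value plus one row for the missing group.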

egen group = group(code)

by varname: g count=_N

by varlist: g varname=_n+1974

merge m:1 gvkey fyear using /Users/steven/Google_Drive/FemaleCEO/US_Sample/NA_highest_edu
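// 1:1 m:1 1:m m:m (m:m is rarely used)
// the leading m means the key may repeat in the master file; the trailing 1 means the key must uniquely identify observations in the using file

merge 1:m companyid using "/Users/steven/Google_Drive/Thesis/Rumored Deals/rumored_deal_data/companyid_gvkey.dta"
// without the double quotes this returns an error, because the folder name Rumored Deals contains a space
// so any path through the Rumored Deals folder must be quoted: " … /Rumored Deals/"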
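
A minimal m:1 merge on made-up data, to show what _merge reports. The variables sales and ceo_edu and all values are hypothetical; tempfiles stand in for the real .dta paths:

```stata
* master: firm-years (gvkey repeats)
clear
input gvkey fyear sales
1001 2019 10
1001 2020 12
1002 2020 7
1003 2020 3
end
tempfile master
save `master'

* using: one row per firm (gvkey is unique)
clear
input gvkey str10 ceo_edu
1001 "MBA"
1002 "PhD"
1004 "BA"
end
tempfile firmlevel
save `firmlevel'

use `master', clear
merge m:1 gvkey using `firmlevel'
tab _merge              // 1 = master only, 2 = using only, 3 = matched
keep if _merge == 3
drop _merge
```

Tabulating _merge before dropping anything is a cheap check that the match rate is what you expect.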

tsset time, monthly
tsfill

gen g=revt/revt[_n-1]-1 //growth
gen gg=log(revt/revt[_n-1]) // log growth rate
hist g
hist gg
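
The [_n-1] subscript refers to the previous row, not the previous period. A small sketch on made-up annual data with a gap (no 2018 row) shows how it differs from the L. time-series operator after tsset; the names g_row and g_ts are illustrative:

```stata
clear
input year revt
2015 100
2016 110
2017 121
2019 150
end
tsset year
gen g_row = revt/revt[_n-1] - 1   // previous row, even across the 2018 gap
gen g_ts  = revt/L.revt - 1       // previous year; missing when that year is absent
list
```

With panel data, tsset id year makes L. respect panel boundaries as well, whereas [_n-1] needs an explicit by id: prefix.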

twoway (tsline residuals) // tsrline is the range plot and needs two variables

////// breaking a command line into multiple lines // loop //
reg price /// there must be a space between price and ///
mpg

global varlist price ///
mpg
foreach x in $varlist {
    tab `x'
    reg price `x'
    egen mean`x'=mean(`x')
}
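
A runnable variant of the loop on the auto dataset, as a sketch. It uses a local macro instead of a global (an assumption about preference: locals cannot leak into other do-files), and the _dm suffix is illustrative:

```stata
sysuse auto, clear
local vars mpg weight length
foreach x of local vars {
    quietly summarize `x'
    gen `x'_dm = `x' - r(mean)     // demeaned copy of each variable
    regress price `x'
}

forvalues i = 1/3 {
    display "iteration `i'"
}
```

Note the quoting: a macro reference opens with a backtick and closes with a straight single quote.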

//////////event study syntax///////////
*step 1: prepare funda or stock return data
use samples
sort gvkey eventdate
by gvkey: gen eventcount=_N
bysort gvkey: keep if _n==1
keep gvkey eventcount
save eventcount

use funda // or the ret (stock return) file
sort gvkey
merge m:1 gvkey using eventcount
keep if _merge==3
drop _merge
expand eventcount
sort gvkey date
by gvkey date: gen set=_n
save funda2 // or ret2
*step 2: prepare event files & Merge!
use samples
sort gvkey eventdate
by gvkey: gen set=_n
save sample2

use funda2 // or ret2
merge m:1 gvkey set using sample2
save funda3 // or ret3
keep if _merge==3
sort gvkey set date
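
The two steps above can be sketched end to end on synthetic data. Everything here is made up for illustration: two firms, integer "dates" 1 to 30, three events, and the variable diff for event time. The last lines add the [-5, +5] window used with predict earlier in these notes:

```stata
* hypothetical event file: one row per (gvkey, eventdate)
clear
input gvkey eventdate
1 10
1 20
2 15
end
bysort gvkey (eventdate): gen set = _n
tempfile events
save `events'

* number of events per firm
collapse (count) eventcount = set, by(gvkey)
tempfile counts
save `counts'

* hypothetical return file: 30 trading days for each firm
clear
set obs 60
gen gvkey = cond(_n <= 30, 1, 2)
bysort gvkey: gen date = _n
set seed 12345
gen ret = rnormal(0, 0.02)

* step 1: one copy of each firm-day per event
merge m:1 gvkey using `counts', keep(match) nogenerate
expand eventcount
bysort gvkey date: gen set = _n

* step 2: attach the event date, then keep the event window
merge m:1 gvkey set using `events', keep(match) nogenerate
gen diff = date - eventdate
keep if diff >= -5 & diff <= 5
sort gvkey set date
```

Each (gvkey, set) pair is now its own event with 11 days of returns, so estimation can loop over or group by those two variables.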

stack var1-var540, into(v1-v18) clear
stack var1-var540, group(18)
//When you want the new variables to have the same names as the variables in the first group

append using "C:\FemaleCEO\US_Sample\US5.1.dta" // append more than two databases into one single file

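////program//monte carlo simulation//
clear
capture program drop MC_NPV // capture: no error if the program is not defined yet

program define MC_NPV
drop _all // start each replication from an empty dataset
set obs 100000
gen sale_unit=rnormal(10000,700)
gen vc=rnormal(14.75, 1)
gen NPV=-46000+((sale_unit*(22.50-vc)-11500)*0.66+11500)*(3.169865446)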
end

simulate NPV=NPV, reps(10000): MC_NPV
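
A common variant, sketched here under assumptions, has the program return summary statistics through r() so that simulate collects one row of results per replication. The program name, the 1,000 draws per replication, the 200 replications, and the p_loss statistic are all illustrative choices, not part of the original setup:

```stata
capture program drop mc_npv_mean
program define mc_npv_mean, rclass
    drop _all
    set obs 1000
    gen sale_unit = rnormal(10000, 700)
    gen vc = rnormal(14.75, 1)
    gen npv = -46000 + ((sale_unit*(22.50 - vc) - 11500)*0.66 + 11500)*3.169865446
    summarize npv, meanonly
    return scalar mean_npv = r(mean)   // average NPV in this replication
    count if npv < 0
    return scalar p_loss = r(N)/_N     // share of draws with a negative NPV
end

clear
set seed 20
simulate mean_npv = r(mean_npv) p_loss = r(p_loss), reps(200) nodots: mc_npv_mean
summarize mean_npv p_loss
```

Setting the seed before simulate makes the run reproducible.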

kdensity NPV
kdensity var, kernel(gaussian) normal // other options: student(#) overlays a t density, bwidth(1) sets the bandwidth

sum NPV
sum var, d

graph hbox var1 var2
//////////////////////////////////////////
clear
set seed 20
set obs 500
gen normal_sample = invnormal(uniform())
label var normal_sample "normal sample"
gen exp_sample = -ln(uniform())
label var exp_sample "exponential sample"

graph hbox normal_sample
kdensity normal_sample
kdensity normal_sample, k(gau)
kdensity normal_sample, nor
jb normal_sample
///////////////////////////////////////////////////
arithmetic operators

+ - * / ^   - negation   + string concatenation

tsset id date, daily
gen ret=prc/l.prc-1

sort id date
gen varname = var1/var1[_n-1]-1 // use gen for row-by-row calculations (linear operators)
egen varname = mean() sd() median() rowmean() // use egen for calculations across observations or variables: mean, sd, median, rowmean
bysort varname: gen // operators in math have three types: linear, differential, integral
bysort varname: egen
egen price_dummy=xtile(price), n(4) // create a quartile indicator: 1 if price is in the first quartile, ..., 4 if in the fourth (xtile() comes from the egenmore package on SSC)
egen price_dummy=xtile(price), by(varname) n(4) // the same indicator, computed within each group of varname
egen rank=rank(price) // rank it, or rank in reverse: rank(-price)
sort price
gen rank=_n // this rank differs from egen rank(): tied values get distinct ranks here

gen Var2 = "A" + Var1 //If Var1 is a string variable with numeric characters
gen Var2 = "A" + string(Var1, "%03.0f") //If Var1 is a numeric variable
gen cusip= target + "10" //6-digit cusip to 8-digit

replace company_name = lower(company_name) // gen would fail here because company_name already exists
replace company_name = upper(company_name)


collapse (sum) varname=varlist (mean) varname=varlist (count) count=varlist, by(varlist varlist etc.)
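
A concrete instance of the collapse template on the auto dataset, as a sketch; the new variable names are illustrative:

```stata
sysuse auto, clear
collapse (mean) mean_price = price (sd) sd_price = price (count) n = price, by(foreign)
list
```

collapse replaces the data in memory with one row per by-group, so wrap it in preserve and restore if the original data are still needed.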

logical operators (connect conditions)
& and   | or   ! not   ~ not
replace var1=1 if var2==1 | var2==2
replace var1=1 if var2==4 & var3==5
replace var1=3 if var2==1 | 2 // wrong! "| 2" is always true, so every observation is replaced

relational operators (numeric and string)

> < >= <= == != ~=

generate himpg = mpg > 30 // himpg==1 if mpg>30, himpg==0 if mpg<=30, himpg==1 if mpg==. (a numeric missing value is higher than any other numeric value)
generate himpg = mpg > 30 if mpg < . // himpg==1 if mpg>30, himpg==0 if mpg<=30, himpg==. if mpg==.
generate himpg = mpg > 30 if foreign == 0
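
The missing-value trap is easy to see on rep78 in the auto dataset, which has a few missing values (mpg has none). A sketch with illustrative variable names:

```stata
sysuse auto, clear
gen byte good_naive = rep78 > 3                // cars with missing rep78 are coded 1
gen byte good_safe  = rep78 > 3 if rep78 < .   // cars with missing rep78 stay missing
tab good_naive good_safe, missing
```

The cross-tabulation puts the affected cars in the cell where good_naive is 1 and good_safe is missing.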

[ ] in a syntax diagram means the bracketed part is optional

if missing() !missing(varlist) // gen x=1 if missing(var1) // no space between missing and (

changing directory
cd c:\

changing the default directory: put the cd command in a do-file and save it under c:\ado as profile.do
cd c:\
///////////////////////////////////////////////////

*F test*
db test // after estimation: F test on coefficients, or test of linear hypotheses
test headroom==0
test headroom-weight==0
test headroom+weight==0
// ... and so on

t test*
db ttest// mean-comparison tests: one sample, two samples, paired
ttest price==0 //mean-comparison tests: one sample

reg var1, robust // same as ttest var1==0, but with the robust option
reg car5 if _merge==2, robust
// by hand: #1 sd = {sum((x-xbar)^2)/(n-1)}^0.5  #2 se = sd/n^0.5  #3 t for mu=0: xbar/se
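
The hand calculation can be checked against ttest. This sketch uses mpg from the auto dataset and a null of mu = 20 (an arbitrary illustrative value, so the numerator becomes xbar - 20); the scalar names are illustrative:

```stata
sysuse auto, clear
quietly summarize mpg
scalar xbar = r(mean)
scalar sd_x = r(sd)                  // summarize already uses the n-1 denominator
scalar se_x = sd_x/sqrt(r(N))
scalar t_manual = (xbar - 20)/se_x

ttest mpg == 20
scalar t_stata = r(t)
display "manual t = " t_manual "   ttest t = " t_stata
```

The same numbers can be fed to the calculator form as ttesti with N, mean, sd, and the hypothesized value.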

db ttesti
ttesti N mean sd mu //t test calculator
///////// db opens the dialog box for any command, e.g. db reg
multicollinearity diagnostics***
the classic symptom of severe multicollinearity is a good overall fit (high R-squared)
while the individual t statistics are not significant (large p-values)

reg y x1 x2 x3 x4

vif // variance inflation factor for each regressor (current syntax: estat vif)
// common rules of thumb flag a problem when the VIF exceeds 4, 5, or 10
// one remedy: drop the offending variables

vce // variance-covariance matrix of the coefficient estimates (current syntax: estat vce)
// it is not scale free

vce, corr // correlation matrix of the coefficient estimates (look for absolute values above 0.5)

*variance covariance matrix of the raw data

Heteroskedasticity
estat hettest // Breusch-Pagan test
estat imtest // information matrix test
estat imtest, white // White's test for homoskedasticity

serial correlation/autocorrelation/lagged correlation**
ac varname //Correlograms
ac d.varname // correlogram of the first difference
pac varname // partial correlograms
pac d.varname // partial correlogram of the first difference

tsset t
estat dwatson //Durbin-Watson test
estat durbinalt //Durbin’s alternative test for serial correlation

estat bgodfrey //Breusch–Godfrey test for higher-order serial correlation
// Breusch-Godfrey Lagrange multiplier (LM) test
//test for higher-order serial correlation in the disturbance
// This test does not require that all the regressors be strictly exogenous.

time series*
tsset t
tsset t, quarterly

twoway (tsline x1)
gen ln_x1=log(x1)
twoway (tsline ln_x1)
twoway (tsline d.ln_x1)

*postestimation – time series*
gen timevar=_n
tsset timevar // to create a time variable if needed

estat archlm //test for ARCH effects in the residuals

estat dwatson //Durbin–Watson d statistic to test for first-order serial correlation
estat durbinalt //Durbin’s alternative test for serial correlation
estat bgodfrey //Breusch–Godfrey test for higher-order serial correlation

estat sbknown //perform tests for a structural break with a known break date
estat sbsingle //perform tests for a structural break with an unknown break date

postestimation*

estat hettest //tests for heteroskedasticity
estat imtest // information matrix test; estat imtest, white gives White's test for homoskedasticity
estat szroeter //Szroeter’s rank test for heteroskedasticity

dfbeta DFBETA //influence statistics
estat vif //variance inflation factors for the independent variables

estat ovtest //Ramsey regression specification-error test for omitted variables
estat esize // eta^2 and omega^2 effect sizes

skewness/kurtosis tests for normality***
sktest var
sum var1 var2
sum var1 var2, detail // adds skewness, kurtosis, and the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 99th percentiles
hist var // default: continuous variable, density scale
hist var, d freq // discrete variable, frequency scale

ssc install jb
jb var

OLS regression***
reg y x1 x2

sort id
by id: reg y x1 x2

predict yhat
predict res, residual
twoway (lfit y x1)(scatter y x1)
twoway (lfitci y x1)(scatter y x1)

interaction
reg price mpg weight c.mpg#c.weight
reg price c.mpg##c.weight

reg price weight i.foreign

semi-log model (log-lin) or linear-log (lin-log)*
log(y)=b0+b1*x // log-lin: b1 is the proportional change in y for a one-unit change in x
y=b0+b1*log(x) // lin-log: b1/100 is the change in y for a 1% change in x

double-log or log-linear
log(y)=b0+b1*log(x) // b1 is the elasticity: the % change in y with respect to a 1% change in x

Reciprocal Model
Y = b1 + b2*(1/X) + e // e is the error term; the model is linear in the parameters, but nonlinear in the variables

polynomial model: quadratic (second-order) polynomial model**
LRAC = b1 + b2*Q + b3*Q^2
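
Each functional form above maps to a regress call once the transformed variables exist. A sketch on the auto dataset; price and weight are stand-ins for y and x, chosen only because the data ship with Stata:

```stata
sysuse auto, clear
gen ln_price  = ln(price)
gen ln_weight = ln(weight)
gen inv_weight = 1/weight

regress ln_price weight            // log-lin (semi-log)
regress price ln_weight            // lin-log
regress ln_price ln_weight         // double-log: the slope is an elasticity
regress price inv_weight           // reciprocal model
regress price c.weight##c.weight   // quadratic: weight and weight squared
```

Using c.weight##c.weight instead of a hand-made squared term lets margins handle the nonlinearity correctly afterwards.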

*logit/probit
webuse lbw, clear
// (Hosmer & Lemeshow data)

estat gof // Pearson goodness-of-fit test; add group(10) for the Hosmer-Lemeshow test

logit // reports the coefficients
logistic // reports the odds ratios

wald test
quietly: logit y x1 x2 x3
test x1 x2 // H0: the coefficients on x1 and x2 are jointly zero

ordered probit
wald (chi2) test
quietly: oprobit y x1 x2 x3
test x1 x2 // H0: the coefficients on x1 and x2 are jointly zero
test x1 x2 x3 // H0: all three coefficients are zero; this matches the Wald chi2 at the top of the output when the model is fit with vce(robust) (by default the header reports the LR chi2)
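
A runnable version of the Wald test on the auto dataset, as a sketch; the choice of foreign as the outcome and price, mpg, weight as regressors is illustrative:

```stata
sysuse auto, clear
quietly logit foreign price mpg weight
test price mpg                              // Wald test, H0: both coefficients are zero
display "chi2 = " r(chi2) "  df = " r(df) "  p = " r(p)
test price mpg weight                       // H0: all slope coefficients are zero
```

test leaves its results in r(), so they can be stored or exported right after the call.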

ivregress 2sls/liml/gmm: endogeneity
y=X*beta+epsilon
If the model contains endogenous regressors, the OLS estimator is inconsistent and an IV estimator is needed.
ivregress 2sls y exogenous_X (endogenous_X = Instruments) // endogenous_X is instrumented
ivregress liml y exogenous_X (endogenous_X = Instruments)
ivregress gmm y exogenous_X (endogenous_X = Instruments)

reg gc ri pg // pg is endogenous, so the estimator is inconsistent
ivregress 2sls gc ri (pg = rpt rpn rpu) // gc=gas consumption, pg=gas price (endogenous),
// ri=real income (exogenous; included exogenous regressors serve as instruments automatically), rpt rpn rpu=prices of public transport, new cars, and used cars
// gmm produces relatively large p-values
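
A sketch of ivregress with its postestimation tests on the auto dataset. The instruments here (weight, displacement) are chosen only so the example runs; they are not an economically motivated identification strategy:

```stata
sysuse auto, clear
ivregress 2sls price foreign (mpg = weight displacement)
estat firststage     // first-stage statistics: instrument strength
estat endogenous     // Durbin and Wu-Hausman tests, H0: mpg is exogenous
estat overid         // overidentification tests (needs more instruments than endogenous regressors)
```

estat endogenous is the endogeneity test itself: a small p-value rejects exogeneity and favors the IV estimates over OLS.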

ivprobit: probit with an endogenous regressor
sysuse auto
oprobit rep mpg disp, nolog
gsem (rep <- mpg disp, oprobit) //these two lines produce the same result for ordered probit

probit foreign price mpg
ivprobit foreign price (mpg = weight)

file management*

*display all example datasets installed with Stata
sysuse dir
sysuse auto
sysuse auto, clear

cd
pwd // both display the current working directory

*save a dataset in Stata 14,Stata 15,Stata 16 so that it can be used in Stata 13 12 11
saveold autoold, version(13) // or version(12), version(11)

ssc installed packages****

****display all installed packages
ado dir

****outreg2 // very similar to esttab; I prefer outreg2
outreg2 using filename, replace excel dec()
outreg2 using filename, append excel dec()
outreg2 using filename, append excel dec() st(coef se N corr covar pcorr etc.)
//sideway display
//marginal effects
//summary statistics

import & export
export excel var1 var2 var3 var4 using "stats.xls" if diff_an==0, firstrow(variables) sheet("car5_3")
export excel var1 var2 var3 var4 using "stats.xls" if diff_an==0, firstrow(variables) sheetmodify

****estout
estpost
esttab, cells(“

****asdoc
net install asdoc, from(http://fintechprofessor.com) replace
asdoc ttest var==0, replace title(T-test results: H1: Mean=0)
asdoc ttest var2==0, rowappend //one sample ttest

asdoc ttest mpg==price, replace //paired ttest on two variables
bysort foreign: asdoc ttest mpg==price, replace //paired ttest over groups

asdoc ttest mpg==price, replace cnames(Treatment, Control)

****mkcorr
mkcorr varname1 varname2 varname3-varname9, log(filename) replace

****jb Jarque-Bera Normality test
ssc install jb
jb var

****winsor & winsor2
*winsor by group****
******-egen, group()- will happily work with combinations of two or more variables
sysuse auto, clear
egen group = group(rep78)
gen winsorised = .
su group, meanonly
forval i = 1/`r(max)' {
    capture {
        winsor price if group == `i', gen(work) p(0.04)
        replace winsorised = work if group == `i'
        drop work
    }
}
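
If the winsor package is not installed, a similar result can be sketched with built-in commands by clamping at group-level percentiles. This is a percentile-based variant, not a reproduction of what winsor computes; the 5th/95th cutoffs, the grouping by foreign, and the variable names are assumptions for illustration:

```stata
sysuse auto, clear
egen p_lo = pctile(price), p(5)  by(foreign)
egen p_hi = pctile(price), p(95) by(foreign)
gen price_w = price
replace price_w = p_lo if price < p_lo
replace price_w = p_hi if price > p_hi & !missing(price)
drop p_lo p_hi
summarize price price_w
```

The !missing() guard matters on the upper tail because a missing value compares as larger than any number.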

sort rep78 price
**mapping sic to FF industry
ffind // the package installer is on my hard drive

matchit / freqindex: fuzzy matching*
matchit requires freqindex to be installed. You can get it from SSC

matchit TargetName companyname, g(simil_1)
// This batch considers only the smaller string when computing the score
// which I suspect it can be more useful in this case (see scores simil_1 and simil_1b for first two obs)
matchit investor_name firm1 , g(simil_1b) s(minsimple)

A few ideas to try:
1) I would try to remove all of the “Corp”, “Inc.”, “LLC”, etc from both sets of names before matching.

2) Similarly, you might create a version of investor_name that is limited to the first two words of the name, or the first 14 letters, and then run matchit.

3) Check out some of the options listed in this discussion here

Code:

* Removing the "Corp", "Inc.", "LLC" from the names
foreach name in ", LLC" " LLC" ", Inc." ", Inc" "Corporation" " Corp." " Corp" {
replace investor_name = subinstr(investor_name, "`name'", "",.)
}

replace investor_name = itrim(trim(investor_name))

* Limiting investor_name to first two words
gen investor_2word = word(investor_name,1) + " " + word(investor_name,2)
replace investor_2word = itrim(trim(investor_2word)) // mainly worried about trailing spaces with this one
* Limiting to first 14 letters
gen investor_14letters = substr(investor_name, 1, 14)
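
The cleaning steps can be tried on a few made-up names before running them on real data. The firm names, the variable clean, and the shortened suffix list below are all hypothetical:

```stata
clear
input str40 investor_name
"Acme Capital, LLC"
"Blue  River Partners, Inc."
"Delta Holdings Corporation"
end

gen clean = investor_name
foreach suffix in ", LLC" ", Inc." " Corporation" {
    replace clean = subinstr(clean, "`suffix'", "", .)
}
replace clean = lower(itrim(trim(clean)))      // collapse repeated blanks, trim, lowercase
gen clean_2word = word(clean, 1) + " " + word(clean, 2)
list
```

Working on a copy (clean) keeps the original name available for checking matches by eye afterwards.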

omodel / spost: for ordinal models (ologit / oprobit), test of the proportional odds assumption, also called the parallel regression assumption**
***if the parallel regression assumption is violated, multinomial logit/probit or generalized ordered logit should be considered.
omodel probit y x1 x2 //can be used in oprobit/ologit

brant, detail //can only be used in ologit

Tips
In the GUI, press TAB to bring up the variable list, then press a letter to jump to the variables that start with that letter