Thomas Mahaffey, Jr. Business Library
L001D Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556
This tutorial shows you how you can use Stata to produce Table 1 and Figure 1 below.
It covers the following topics:
This tutorial is most beneficial if you fire up Stata and follow along, step by step.
Our goal is to produce Table 1 and Figure 1 below. Figure 1 is a visualization of the data contained in Table 1.
Figure 1
|
We will basically be replicating a small part of this paper: http://www.jstor.org/stable/30033632?origin=JSTOR-pdf
Start a new do-file and save it in a folder of your choice.
On the very first line of the do-file, type the following:
cd Your_Folder
clear all
set more off
where Your_Folder is the full path to the folder of your choice. If this path contains whitespace, you will have to enclose the entire path in double quotes.
You will be working from this do-file for the rest of this exercise. (You do not have to use a do-file, but there are advantages to using one.)
Load the following dataset into Stata using the use command.
use "http://www3.nd.edu/~jng2/libguide/gss_happy.dta", clear
Alternatively, you could download the file onto your hard drive and save it in your current working directory. You would then load the file by typing
use gss_happy
So that you do not lose your work, enter the command in the do-file you created in Step 1, and then execute it.
To see a list of all the variables, variable types, and sample size, type
To obtain summary statistics of all the variables in the dataset, type
summarize
To browse the raw data in spreadsheet form, type
browse
To obtain frequency counts of the values of the variable happy, type
Compare this to
An aside: The dataset has 57,061 observations, but tabulate happy only counted 52,321.
Not everyone reported a rating of their own happiness, so some happy observations are missing. Missing values were coded as Stata default missing values by the data providers. To get tabulate to count missing values, specify the "missing" option.
tabulate happy, nolabel missing
Not all data providers choose to code missing values as Stata default! Some will code missing values as very large numbers. For example, person's age is sometimes coded 999 for missing. If this is the case, Stata has no way of knowing that 999 is meant to be a missing value and will include 999 when you ask for, say, average age. Lesson: know your data before you start doing things with it (read any provided documentation).
Notice that the values for happy are the opposite of what's been coded in Figure 1. We need to recode this variable to match what we see in Figure 1. We can do so by generating a new variable called happynew with the desired values.
gen happynew = happy
replace happynew = 3 if happy == 1
replace happynew = 1 if happy == 3
To confirm that happynew was coded correctly, cross-tabulate happy and happynew:
You can label the variable happynew to make it more descriptive for yourself and others.
label var happynew "Happiness: 3-very, 2-pretty, 1-not too"
Data for the happiness variable are at the person level, but Table 1 and Figure 1 show annual average happiness across all persons at the country level.
To obtain average happiness across all persons for each year, use the collapse command.
This collapses the data and generates a variable meanhappy containing average happiness at the country level.
As usual, you can label the variable to make it more descriptive for yourself and others.
label var meanhappy "Mean happiness: 3-very, 2-pretty, 1-not too"
We need only the observations from 1975 through 1997, so we will keep only observations from that range of years.
keep if year>=1975 & year<=1997
IMPORTANT: You will need to reuse this data, so save the current data in a new file.
save gsshappy, replace
This file will be saved in your current working directory as gsshappy.dta. The saved file will be in Stata format by default and take on the .dta extension.
Stata only handles one dataset at a time. To use variables stored in two separate datasets, you must combine the datasets.
In this tutorial, happiness and GDP data are stored in separate datasets, so we need to combine them.
We merge two datasets across observations using the merge command. Think of it as adding new columns to an existing spreadsheet.
To merge two datasets, you need a variable (or set of variables) that is common to the two datasets. This variable is the "identifier"; it identifies each observation (row).
Then, you will need to figure out whether each observation of the identifier variable appears only once, or whether it repeats. You do so using the duplicates report command.
In the present context, our goal is to merge annual GDP data to annual happiness data. Therefore the identifier variable is year.
In the happiness dataset that we have open, it is obvious that year uniquely identifies each observation when we browse the raw data.
To confirm that the observations are uniquely identified by year, type
duplicates report year
This returns the output below, which tells us that 18 observations (rows) in the data have unique year values, and that there isn't a year value that repeats. Conclusion: in gsshappy.dta, the variable year uniquely identifies each observation.
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 18 0
--------------------------------------
As noted in the previous step, Stata can only load one dataset at a time. We will now work with GDP data, and then merge the GDP data with the happiness data that we have saved, gsshappy.dta.
First, let's read the GDP data.
use "http://www3.nd.edu/~jng2/libguide/pwt_gdp.dta", clear
Note: In the above command, the clear option instructs Stata to clear the existing data from memory to make way for pwt_gdp.dta. Therefore, it is important that you save in a permanent file any existing data that you wish to keep before loading a new dataset.
Next, let's keep only observations from 1975 to 1997, inclusive.
keep if year>=1975 & year<=1997
Then, create a new variable for real GDP per capita and label it.
gen rgdppc = rgdpna/pop
label var rgdppc "Real GDP per capita"
Verify that the observations are uniquely identified by the year variable (see Step 7).
duplicates report year
We are finally in a position to merge the GDP data to the happiness data stored in gsshappy.dta. The command is
merge 1:1 year using gsshappy
Stata refers to the dataset that you currently have open as the "master dataset".
Stata refers to the dataset that you add to the master dataset in a merge operation as the "using dataset".
Here, the master dataset is the GDP dataset that we currently have open, pwt_gdp.dta. The using dataset is gsshappy.dta.
We just performed a "one-to-one" merge. We knew that a one-to-one merge was required because the identifier variable used to merge the two datasets, year, uniquely identifies observations in both the master and using datasets.
Note: Other merge types are one-to-many (1:m), many-to-one (m:1), or many-to-many (m:m). Merging m:m is best avoided; usually if you think you need a m:m merge, chances are either your observations can be further aggregated to enable a 1:m merge, or joinby is more appropriate.
You will notice that a new variable was created from the merge. This variable, _merge, describes the outcome of the merge, as follows:
The merge command automatically tabulates the value frequencies of _merge:
Result # of obs.
-----------------------------------------
not matched 5
from master 5 (_merge==1)
from using 0 (_merge==2)
matched 18 (_merge==3)
-----------------------------------------
This table tells us that 18 years of observations were merged across both datasets. Five years were found only in pwt_gdp.dta but not gsshappy.dta, and so observations from these years were not merged because there was no match (for these five years, we had data for GDP but not happiness).
To see which years were matched and which were not, browse the data or type bysort _merge: tabulate year.
If you browse the resulting merged data, you will find that it contains exactly the information in Table 1.
To sort the data in ascending order of year, type
sort year
To output the data to a comma-delimited file that can be read by spreadsheet software like Excel, use the outsheet command.
outsheet year rgdppc meanhappy using table1.csv, comma replace
In general, the command to plot a graph is twoway followed by a subcommand that specifies the type of graph.
To plot a connected line graph of GDP over time, type
twoway connected rgdppc year
To plot happiness over time, type
twoway connected meanhappy year
We need these two plots in the same graph, each with its own y-axis. To do so, the basic command is (in one line):
twoway ( connected rgdppc year, yaxis(1) ) ( connected meanhappy year, yaxis(2) )
Alternatively,
twoway connected rgdppc year, yaxis(1) || connected meanhappy year, yaxis(2)
To save the graph in Stata's .gph format, type
graph save happygdp, replace
To export the graph to an external format such as PNG or PDF, type
graph export happygdp.png, replace
Complete commands are given in the do-file, happygdp.do.