Library Guides: Data Analysis with Stata: Manipulation

Happiness and GDP

This tutorial shows you how you can use Stata to produce Table 1 and Figure 1 below.

It covers the following topics:

How to merge two datasets correctly
How to generate and label new variables
When and how to collapse observations
How to plot a graph

This tutorial is most beneficial if you fire up Stata and follow along, step by step.

Goal

Our goal is to produce Table 1 and Figure 1 below. Figure 1 is a visualization of the data contained in Table 1.

Table 1

Figure 1

We will basically be replicating a small part of this paper: http://www.jstor.org/stable/30033632?origin=JSTOR-pdf

Step 1

Start a new do-file and save it in a folder of your choice.

On the very first line of the do-file, type the following:

cd Your_Folder
clear all
set more off

where Your_Folder is the full path to the folder of your choice. If this path contains whitespace, you will have to enclose the entire path in double quotes.

The cd command stands for "change directory"; it instructs Stata to treat Your_Folder as the current working directory. This folder is where Stata will look for files to read and where Stata will save files that are produced in the present session.
To find out your current working directory, use the pwd command.
To get Stata to execute a single line or successive lines of code from a do-file, highlight the line/s and click the "Do" button at the top of the do-file editor.

You will be working from this do-file for the rest of this exercise. (You do not have to use a do-file, but there are advantages to using one.)

Step 2

Load the following dataset into Stata using the use command.

use "http://www3.nd.edu/~jng2/libguide/gss_happy.dta", clear

Alternatively, you could download the file onto your hard drive and save it in your current working directory. You would then load the file by typing

use gss_happy

So that you do not lose your work, enter the command in the do-file you created in Step 1, and then execute it.

Step 3

To see a list of all the variables, variable types, and sample size, type

describe

To obtain summary statistics of all the variables in the dataset, type

summarize

To browse the raw data in spreadsheet form, type

browse

To obtain frequency counts of the values of the variable happy, type

tabulate happy

Compare this to

tabulate happy, nolabel

An aside: The dataset has 57,061 observations, but tabulate happy only counted 52,321.

What's going on?

Not everyone reported a rating of their own happiness, so some happy observations are missing. Missing values were coded as Stata default missing values by the data providers. To get tabulate to count missing values, specify the "missing" option.

tabulate happy, nolabel missing

Not all data providers choose to code missing values as Stata default! Some will code missing values as very large numbers. For example, person's age is sometimes coded 999 for missing. If this is the case, Stata has no way of knowing that 999 is meant to be a missing value and will include 999 when you ask for, say, average age. Lesson: know your data before you start doing things with it (read any provided documentation).

Step 4

Notice that the values for happy are the opposite of what's been coded in Figure 1. We need to recode this variable to match what we see in Figure 1. We can do so by generating a new variable called happynew with the desired values.

gen happynew = happy

replace happynew = 3 if happy == 1

replace happynew = 1 if happy == 3

To confirm that happynew was coded correctly, cross-tabulate happy and happynew:

tabulate happy happynew, nolabel

You can label the variable happynew to make it more descriptive for yourself and others.

label var happynew "Happiness: 3-very, 2-pretty, 1-not too"

Step 5

Data for the happiness variable are at the person level, but Table 1 and Figure 1 show annual average happiness across all persons at the country level.

To obtain average happiness across all persons for each year, use the collapse command.

collapse (mean) meanhappy=happynew, by(year)

This collapses the data and generates a variable meanhappy containing average happiness at the country level.

As usual, you can label the variable to make it more descriptive for yourself and others.

label var meanhappy "Mean happiness: 3-very, 2-pretty, 1-not too"

Step 6

We need only the observations from 1975 through 1997, so we will keep only observations from that range of years.

keep if year>=1975 & year<=1997

IMPORTANT: You will need to reuse this data, so save the current data in a new file.

save gsshappy, replace

This file will be saved in your current working directory as gsshappy.dta. The saved file will be in Stata format by default and take on the .dta extension.

Step 7

Stata only handles one dataset at a time. To use variables stored in two separate datasets, you must combine the datasets.

In this tutorial, happiness and GDP data are stored in separate datasets, so we need to combine them.

We merge two datasets across observations using the merge command. Think of it as adding new columns to an existing spreadsheet.

To merge two datasets, you need a variable (or set of variables) that is common to the two datasets. This variable is the "identifier"; it identifies each observation (row).

Then, you will need to figure out whether each observation of the identifier variable appears only once, or whether it repeats. You do so using the duplicates report command.

In the present context, our goal is to merge annual GDP data to annual happiness data. Therefore the identifier variable is year.

In the happiness dataset that we have open, it is obvious that year uniquely identifies each observation when we browse the raw data.

To confirm that the observations are uniquely identified by year, type

duplicates report year

This returns the output below, which tells us that 18 observations (rows) in the data have unique year values, and that there isn't a year value that repeats. Conclusion: in gsshappy.dta, the variable year uniquely identifies each observation.

--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 18 0
--------------------------------------

Step 8

As noted in the previous step, Stata can only load one dataset at a time. We will now work with GDP data, and then merge the GDP data with the happiness data that we have saved, gsshappy.dta.

First, let's read the GDP data.

use "http://www3.nd.edu/~jng2/libguide/pwt_gdp.dta", clear

Note: In the above command, the clear option instructs Stata to clear the existing data from memory to make way for pwt_gdp.dta. Therefore, it is important that you save in a permanent file any existing data that you wish to keep before loading a new dataset.

Next, let's keep only observations from 1975 to 1997, inclusive.

keep if year>=1975 & year<=1997

Then, create a new variable for real GDP per capita and label it.

gen rgdppc = rgdpna/pop

label var rgdppc "Real GDP per capita"

Step 9

Verify that the observations are uniquely identified by the year variable (see Step 7).

duplicates report year

Step 10

We are finally in a position to merge the GDP data to the happiness data stored in gsshappy.dta. The command is

merge 1:1 year using gsshappy

More about merging...

Stata refers to the dataset that you currently have open as the "master dataset".

Stata refers to the dataset that you add to the master dataset in a merge operation as the "using dataset".

Here, the master dataset is the GDP dataset that we currently have open, pwt_gdp.dta. The using dataset is gsshappy.dta.

We just performed a "one-to-one" merge. We knew that a one-to-one merge was required because the identifier variable used to merge the two datasets, year, uniquely identifies observations in both the master and using datasets.

Note: Other merge types are one-to-many (1:m), many-to-one (m:1), or many-to-many (m:m). Merging m:m is best avoided; usually if you think you need a m:m merge, chances are either your observations can be further aggregated to enable a 1:m merge, or joinby is more appropriate.

You will notice that a new variable was created from the merge. This variable, _merge, describes the outcome of the merge, as follows:

Observations for which _merge == 1 existed only in the master dataset but not the using dataset.
Observations for which _merge == 2 existed only in the using dataset but not the master dataset.
Observations for which _merge == 3 existed in both the using dataset and master dataset.

The merge command automatically tabulates the value frequencies of _merge:

    Result                           # of obs.
    -----------------------------------------
    not matched                             5
        from master                         5 (_merge==1)
        from using                          0 (_merge==2)
    matched                                18 (_merge==3)
    -----------------------------------------

This table tells us that 18 years of observations were merged across both datasets. Five years were found only in pwt_gdp.dta but not gsshappy.dta, and so observations from these years were not merged because there was no match (for these five years, we had data for GDP but not happiness).

To see which years were matched and which were not, browse the data or type bysort _merge: tabulate year.

Step 11

If you browse the resulting merged data, you will find that it contains exactly the information in Table 1.

To sort the data in ascending order of year, type

sort year

To output the data to a comma-delimited file that can be read by spreadsheet software like Excel, use the outsheet command.

outsheet year rgdppc meanhappy using table1.csv, comma replace

Step 12

In general, the command to plot a graph is twoway followed by a subcommand that specifies the type of graph.

To plot a connected line graph of GDP over time, type

twoway connected rgdppc year

To plot happiness over time, type

twoway connected meanhappy year

We need these two plots in the same graph, each with its own y-axis. To do so, the basic command is (in one line):

twoway ( connected rgdppc year, yaxis(1) ) ( connected meanhappy year, yaxis(2) )

Alternatively,

twoway connected rgdppc year, yaxis(1) || connected meanhappy year, yaxis(2)

To save the graph in Stata's .gph format, type

graph save happygdp, replace

To export the graph to an external format such as PNG or PDF, type

graph export happygdp.png, replace

Complete commands are given in the do-file, happygdp.do.