Skip to Main Content

Data Analysis with Stata

Combining Datasets

Information to be analyzed is usually scattered across multiple datasets. Stata opens only one dataset at a time. Therefore, you will quickly find that you need to combine datasets in order to analyze the information contained within.

The two most frequently used commands for combining datasets are merge and append.

See also: joinby

merge

Think of any tabular dataset as a spreadsheet. A picture is worth a thousand words, so consider Datasat 1 below.

Dataset 1
person age
A 21
B 57
C 35
D 23

This dataset has four observations. An observation is essentially a row in a spreadsheet.

This dataset has two variables, person and age. Each variable occupies a column in the spreadsheet.

Merging basically amounts to adding variables -- columns of new information -- from another dataset to the existing dataset.

Suppose information on sex for the four persons in Dataset 1 are in a separate file, Dataset 2. If you need a person's age and sex for your analysis, you will have to merge the two files.

Dataset 2
person sex
A male
B female
C male
D female

Here's what you must know about the two datasets you are about to merge.

  • What is the identifier variable on which the files should be combined?
  • Is each observation (row) of the identifier variable unique? In other words, does each row value for the identifier variable occur only once? The answer to this question matters for how you would merge the two datasets, as you will see.

Let's evaluate the two items above in turn.

  • Since we wish to combine data on a person's age and data on a person's sex, the identifier variable is person.
  • In Dataset 1, each person appears only once, so person uniquely identifies each person in the dataset. Likewise for Dataset 2. This means that we should perform a one-to-one merge of the two datasets based on person.

Before merging, it is good practice to verify whether or not your identifier variable/s is/are unique across observations with duplicates report. Here you would type duplicates report person.


Here's what we want to do:

Dataset 1
Dataset 2
Merged data
person age
A 21
B 57
C 35
D 23

+

person sex
A male
B female
C male
D female

=

person age sex
A 21 male
B 57 female
C 35 male
D 23 female

And here's how to do it (assuming Dataset 1 and Dataset 2 are stored as dataset1.dta and dataset2.dta respectively):

use dataset1, clear
merge 1:1 person using dataset2

In Stata parlance, the dataset that is currently open is called the master dataset. The dataset that you would like to add to the currently open dataset is the using dataset. In this example dataset1 is the master dataset while dataset2 is the using dataset.

When you're done executing the merge command, you will notice that a new variable called _merge has been generated. This variable describes the outcome of the merge, as follows:

  • Observations for which _merge == 1 existed only in the master dataset but not the using dataset. These observations were therefore not merged.
  • Observations for which _merge == 2 existed only in the using dataset but not the master dataset. These observations were therefore not merged.
  • Observations for which _merge == 3 existed in both the using dataset and master dataset. These observations were therefore merged.

Stata automatically displays the frequency distribution of the values of _merge immediately following a merge command. For this example, you would see something like this:

     Result                           # of obs.
     -----------------------------------------
     not matched                             0
     matched                                 4  (_merge==3)
     -----------------------------------------

Alternatively, just type tabulate _merge.


What if persons are NOT unique across observations? This situation arises, for example, in a longitudinal survey where people are tracked over time. Say you have two years' worth of data on people's state of residence (Dataset 3) to which you want to add data on sex.

Dataset 3
Dataset 2
Merged data
person year state
A 2010 CA
A 2011 TX
B 2010 OR
B 2011 OR
C 2010 IN
C 2011 CA
D 2010 MI
D 2011 OR

+

person sex
A male
B female
C male
D female

=

person year state sex
A 2010 CA male
A 2011 TX male
B 2010 OR female
B 2011 OR female
C 2010 IN male
C 2011 CA male
D 2010 MI female
D 2011 OR female

Since persons are not unique in Dataset 3 but are unique in Dataset 2, this is a many-to-one merge.

In Stata, the command for this is (assuming Dataset 3 and Dataset 2 are stored as dataset3.dta and dataset2.dta respectively):

use dataset3, clear
merge m:1 person using dataset2

If Dataset 3 and Dataset 2 switched places, then you would perform a one-to-many (1:m) merge.

For more information, see the Manipulating Data and Date/Time tutorials.

Note: The examples shown here are of 1:1 and m:1 merges (and its mirror image 1:m). It is also possible to do a many-to-many (m:m) merge, but this is best avoided. Usually if you think you need a m:m merge, chances are either your observations can be further aggregated to enable a 1:m/m:1/1:1 merge, or joinby is more appropriate.


Screencast produced by the Office of Digital Learning

append

Think of appending as stacking a dataset on top of another, or adding new rows to an existing dataset. For example,

Dataset 1
Dataset 4
Appended data
person age
A 21
B 57
C 35
D 23

+

person age
E 44
F 29
G 32
H 55

=

person age
A 21
B 57
C 35
D 23
E 44
F 29
G 32
H 55

The code to do this is

use dataset1, clear
append using dataset4