Thomas Mahaffey, Jr. Business Library
L001D Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556
Information to be analyzed is usually scattered across multiple datasets. Stata opens only one dataset at a time. Therefore, you will quickly find that you need to combine datasets in order to analyze the information contained within.
The two most frequently used commands for combining datasets are merge and append.
See also: joinby
Think of any tabular dataset as a spreadsheet. A picture is worth a thousand words, so consider Datasat 1 below.
|
This dataset has four observations. An observation is essentially a row in a spreadsheet.
This dataset has two variables, person and age. Each variable occupies a column in the spreadsheet.
Merging basically amounts to adding variables -- columns of new information -- from another dataset to the existing dataset.
Suppose information on sex for the four persons in Dataset 1 are in a separate file, Dataset 2. If you need a person's age and sex for your analysis, you will have to merge the two files.
|
Here's what you must know about the two datasets you are about to merge.
Let's evaluate the two items above in turn.
Before merging, it is good practice to verify whether or not your identifier variable/s is/are unique across observations with duplicates report. Here you would type duplicates report person.
Here's what we want to do:
|
+ |
|
= |
|
And here's how to do it (assuming Dataset 1 and Dataset 2 are stored as dataset1.dta and dataset2.dta respectively):
use dataset1, clear
merge 1:1 person using dataset2
In Stata parlance, the dataset that is currently open is called the master dataset. The dataset that you would like to add to the currently open dataset is the using dataset. In this example dataset1 is the master dataset while dataset2 is the using dataset.
When you're done executing the merge command, you will notice that a new variable called _merge has been generated. This variable describes the outcome of the merge, as follows:
Stata automatically displays the frequency distribution of the values of _merge immediately following a merge command. For this example, you would see something like this:
Result # of obs.
-----------------------------------------
not matched 0
matched 4 (_merge==3)
-----------------------------------------
Alternatively, just type tabulate _merge.
What if persons are NOT unique across observations? This situation arises, for example, in a longitudinal survey where people are tracked over time. Say you have two years' worth of data on people's state of residence (Dataset 3) to which you want to add data on sex.
|
+ |
|
= |
|
Since persons are not unique in Dataset 3 but are unique in Dataset 2, this is a many-to-one merge.
In Stata, the command for this is (assuming Dataset 3 and Dataset 2 are stored as dataset3.dta and dataset2.dta respectively):
use dataset3, clear
merge m:1 person using dataset2
If Dataset 3 and Dataset 2 switched places, then you would perform a one-to-many (1:m) merge.
For more information, see the Manipulating Data and Date/Time tutorials.
Note: The examples shown here are of 1:1 and m:1 merges (and its mirror image 1:m). It is also possible to do a many-to-many (m:m) merge, but this is best avoided. Usually if you think you need a m:m merge, chances are either your observations can be further aggregated to enable a 1:m/m:1/1:1 merge, or joinby is more appropriate.
Think of appending as stacking a dataset on top of another, or adding new rows to an existing dataset. For example,
|
+ |
|
= |
|
The code to do this is
use dataset1, clear
append using dataset4