Thomas Mahaffey, Jr. Business Library
L001D Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556
Suppose you want to investigate whether daily calorie intake is correlated with temperature (you suspect people eat more when it's cold outside). You have a (fictional) dataset on daily calorie intake for a sample of Indiana residents. This dataset also identifies the weather station closest to each person. In order to answer your research question, you will need to merge this dataset to another dataset on daily temperatures measured at the weather stations.
First, let's open the calorie intake data, dailycalories.dta.
use "https://www3.nd.edu/~jng2/libguide/dailycalories.dta", clear
The date variable, day, is a string variable. There are a few ways to confirm that it's a string:
The correct way to handle dates in Stata is to convert them to a number measured in days elapsed since January 1, 1960. For dates stored as strings, the date() function does the conversion flawlessly. Here's how to do it:
gen statadate = date(day, "DMY")
label var statadate "date in days elapsed since 1-Jan-1960"
This creates a variable called statadate that stores the dates in terms of days elapsed since January 1, 1960. If you browse the data now you'll see that statadate is simply a number. You could get Stata to display it as a date by entering format %td statadate.
What's the point of storing dates in "days elapsed since January 1, 1960"?
One, sticking to a common date format enables error-free merging on dates across multiple datasets. Two, calculations involving dates are straightforward and error-free since a date is nothing but a number; you work with the numbers and let Stata keep track of translating the numbers into dates.
Later, we will merge this dataset to temperature data. Each weather station measures the temperature each day, so the merging variables should be stn and statadate.
It just so happens that each person in this dataset lived in a different location, so each person has a different weather station. This means that each observation identified by the combination of stn and statadate should be unique, which we can verify using
duplicates report stn statadate
This command returns
copies | observations surplus
1 | 1095 0
Save the calories data on your hard drive. We will merge it to daily temperature data in Step 3.
There are at least four ways to get at this.
There are 1095 observations (rows) of person.
These person observations occur in copies of 365.
1092 person observations are "surplus" in that they are the second (third, ...) copy of the first of each group of duplicates.
From these pieces of information, we can surmise that there are (1095/365 = 3) different persons in this dataset.
Now open the daily temperature data, dailytemp.dta. Temperature data came from NOAA, specifically GSOD.
use "https://www3.nd.edu/~jng2/libguide/dailytemp.dta", clear
Dates in this file are stored by the variable yearmoda, which is a number where the first four digits contain the year, the next two contain the month, and the last two contain the day. We want to convert it to days elapsed since January 1, 1960, but to do so we first need to split it into its constituent components. An easy way to split it is to use the add-on command nsplit.
nsplit yearmoda, digits(4 2 2) generate(yyyy mm dd)
We can then generate statadate to store dates in days elapsed since January 1, 1960, using the mdy() function:
gen statadate = mdy(mm, dd, yyyy)
Each station-date combination should be unique, which we verify using
duplicates report stn statadate
Picking up from where we left off in Step 2, remember that the dataset that's currently open is the daily temperature data. Now let's merge the calories data to it.
We previously determined in Step 1 that the combination of stn and statadate uniquely identifies each observation in the calories data. Therefore we must perform a one-to-one merge.
merge 1:1 stn statadate using caloriesdata
We can now explore the relationship between calorie intake and temperature.
A simple correlation coefficient suggests that this relationship is negative:
corr calories temp
| calories temp
calories | 1.0000
temp | -0.2789 1.0000
We can visualize this relationship through a scatter plot:
twoway scatter calories temp || lfit calories temp
graph export caltempscatter.png
Suppose you want visualize how each person's calorie intake changed over the course of the year:
format %td statadate
twoway line calories statadate, xlabel(,labsize(tiny)) by(person)
graph export caldaybyperson.png
Now let's take it up a notch and plot daily calories and temperature on the same graph:
twoway ( line calories statadate, yaxis(1) ylabel(,labsize(vsmall) axis(1)) ) ( line temp statadate, yaxis(2) ylabel(,labsize(vsmall) axis(2)) ), xlabel(,labsize(vsmall)) xtitle("") by(person, rows(3) l1title("calories") r1title("temperature (degF)"))