Skip to Main Content

Data Analysis with Stata

Daily Temperature and Calories

Suppose you want to investigate whether daily calorie intake is correlated with temperature (you suspect people eat more when it's cold outside). You have a (fictional) dataset on daily calorie intake for a sample of Indiana residents. This dataset also identifies the weather station closest to each person. In order to answer your research question, you will need to merge this dataset to another dataset on daily temperatures measured at the weather stations.

Step 1

First, let's open the calorie intake data, dailycalories.dta.

use "https://www3.nd.edu/~jng2/libguide/dailycalories.dta", clear

The date variable, day, is a string variable. There are a few ways to confirm that it's a string:

  • Type describe day. The results window tells us that day is of type str9 (a string that's 9 characters in length).
  • Click on day in the "Variables" window. The "Properties" window gives us the same information as the above.
  • Browse the raw data (type browse). Strings appear in red.

The correct way to handle dates in Stata is to convert them to a number measured in days elapsed since January 1, 1960. For dates stored as strings, the date() function does the conversion flawlessly. Here's how to do it:

gen statadate = date(day, "DMY")

label var statadate "date in days elapsed since 1-Jan-1960"

This creates a variable called statadate that stores the dates in terms of days elapsed since January 1, 1960. If you browse the data now you'll see that statadate is simply a number. You could get Stata to display it as a date by entering format %td statadate.

What's the point of storing dates in "days elapsed since January 1, 1960"?
One, sticking to a common date format enables error-free merging on dates across multiple datasets. Two, calculations involving dates are straightforward and error-free since a date is nothing but a number; you work with the numbers and let Stata keep track of translating the numbers into dates.

Later, we will merge this dataset to temperature data. Each weather station measures the temperature each day, so the merging variables should be stn and statadate.

It just so happens that each person in this dataset lived in a different location, so each person has a different weather station. This means that each observation identified by the combination of stn and statadate should be unique, which we can verify using

duplicates report stn statadate

This command returns

--------------------------------------
   copies | observations       surplus
----------+---------------------------
      1   |         1095             0
--------------------------------------

Save the calories data on your hard drive. We will merge it to daily temperature data in Step 3.

save caloriesdata

There are at least four ways to get at this.

  • tabulate person
    There are three persons, each appearing 365 times (which is as expected since the dataset contains each person's daily caloric intake).
  • bysort person: count
  • egen uniqperson = tag(person)
    tabulate uniqperson
  • duplicates report person
    This returns the following table:
--------------------------------------
   copies | observations       surplus
----------+---------------------------
      365 |         1095          1092
--------------------------------------

There are 1095 observations (rows) of person.
These person observations occur in copies of 365.
1092 person observations are "surplus" in that they are the second (third, ...) copy of the first of each group of duplicates.
From these pieces of information, we can surmise that there are (1095/365 = 3) different persons in this dataset.

Step 2

Now open the daily temperature data, dailytemp.dta. Temperature data came from NOAA, specifically GSOD.

use "https://www3.nd.edu/~jng2/libguide/dailytemp.dta", clear

Dates in this file are stored by the variable yearmoda, which is a number where the first four digits contain the year, the next two contain the month, and the last two contain the day. We want to convert it to days elapsed since January 1, 1960, but to do so we first need to split it into its constituent components. An easy way to split it is to use the add-on command nsplit.

nsplit yearmoda, digits(4 2 2) generate(yyyy mm dd)

We can then generate statadate to store dates in days elapsed since January 1, 1960, using the mdy() function:

gen statadate = mdy(mm, dd, yyyy)

Each station-date combination should be unique, which we verify using

duplicates report stn statadate

Step 3

Picking up from where we left off in Step 2, remember that the dataset that's currently open is the daily temperature data. Now let's merge the calories data to it.

We previously determined in Step 1 that the combination of stn and statadate uniquely identifies each observation in the calories data. Therefore we must perform a one-to-one merge.

merge 1:1 stn statadate using caloriesdata


We can now explore the relationship between calorie intake and temperature.

A simple correlation coefficient suggests that this relationship is negative:

corr calories temp

             | calories     temp
-------------+------------------
    calories |   1.0000
        temp |  -0.2789   1.0000

We can visualize this relationship through a scatter plot:

twoway scatter calories temp || lfit calories temp
graph export caltempscatter.png

Suppose you want visualize how each person's calorie intake changed over the course of the year:

format %td statadate

twoway line calories statadate, xlabel(,labsize(tiny)) by(person)
graph export caldaybyperson.png

Now let's take it up a notch and plot daily calories and temperature on the same graph:

Code:

twoway ( line calories statadate, yaxis(1) ylabel(,labsize(vsmall) axis(1)) ) ( line temp statadate, yaxis(2) ylabel(,labsize(vsmall) axis(2)) ), xlabel(,labsize(vsmall)) xtitle("") by(person, rows(3) l1title("calories") r1title("temperature (degF)"))