Skip to Main Content

Data Analysis with Stata

Cleaning a Stock Portfolio

Stata has two system variables that always exist as long as data is loaded, _n and _N. _n basically indexes observations (rows): _n = 1 is the first row, _n = 2 is the second, and so on. _N denotes the total number of rows.

To illustrate, let’s use stocks.dta. This portfolio contains 32 observations.

use "", clear


The data aren't sorted in any particular order. Before we do anything else, it makes sense to first sort the data:

sort stockid year

This sorts observations first by stockid then by year, in ascending order.

To get a sense of what _n and _N can be used for, enter these two commands:

gen obsnum = _n
gen totnum = _N

_n and _N can also be used within subgroups. To generate counters like the above for each stockid, do this:

use "", clear
sort stockid year
by stockid: gen obsnum = _n
by stockid: gen totnum = _N

Equivalently, instead of sorting unsorted data prior to by, use bysort:

bysort stockid (year): gen obsnum = _n
bysort stockid (year): gen totnum = _N

Now that you have a sense of what _n and _N do, let's use _n in combination with by to perform a concrete task. We will fill in the blanks in the ticker variable (this assumes that the ticker symbols for these stocks did not change over time).

First, make sure that the data are sorted by stockid and year.

use "", clear
sort stockid year

To fill in row 6 of ticker with the appropriate value ("AMZN"), you could do this:

replace ticker = "AMZN" in 6

To fill in rows 6 through 10 with "AMZN", you could do this:

replace ticker = "AMZN" in 6/10

Clearly, with multiple observations to replace, replacing values line-by-line this way becomes cumbersome and prone to human error. We can automate the process using by and _n.

bysort stockid (year): replace ticker=ticker[1] if _n>1

The above command states that within each stockid group (that has been sorted by ascending order of year), all ticker observations except the first are to be replaced with the value from the first year.

Notice that we messed up since the tickers for stockid==5 and ==6 are now all blanks. We need to modify our algorithm to avoid overwriting ticker values with blanks. Reload the original data and start from scratch.

use "", clear

The years don't really matter for this task, so instead of sorting by stockid and year, let's sort in ascending order of stockid and then descending order of ticker:

gsort + stockid - ticker

As you can see, the first observation within each stockid now contains the correct ticker symbol, so filling in the blanks is now a simple matter of

by stockid: replace ticker=ticker[1] if _n>1