Wednesday, July 24, 2013

Casting and Melting with Paired Data

Today's post is not about economics, rather it's a note from an R programming struggle that may be helpful for fellow undergraduate researchers.

I'm often testing forecasting models, and what this ends up creating is a bunch of "forecasted" variables that are paired with the "actual" values. R has fabulous faceting capabilities, and I have often wanted to reshape the data in a way where the category of forecasted variable as an identifier, and then two columns that list the forecasted and actual variables. In other words, if the code starts from something like

       aAct      aPred       bAct      bPred id
1 1.2076384 -0.6735547  1.4994464 -1.0691975  1
2 0.4999706 -0.7188215 -0.3601551  0.7224729  2
3 1.0340859 -0.1108304 -0.5941295  0.5027085  3

And I want to convert it where one column has an id, another one identifies whether I'm forecasting a or b, and a third column that has the forecasted value, and then a fourth column with the actual value.

The procedure in R involves "melting" the data frame and then "casting" it. Melting is rather simple -- you provide a set of identifiers, and then the data frame is melted down to only that identifier, the values, and another indicator variable that tells you what the value is supposed to represent. In the above example, if we let df be the data frame described above, I would run:

df.m = melt(df, id.vars = 'id')

   id variable      value
1   1     aAct  1.2076384
2   2     aAct  0.4999706
3   3     aAct  1.0340859
4   1    aPred -0.6735547
5   2    aPred -0.7188215
6   3    aPred -0.1108304
7   1     bAct  1.4994464
8   2     bAct -0.3601551
9   3     bAct -0.5941295
10  1    bPred -1.0691975
11  2    bPred  0.7224729
12  3    bPred  0.5027085

Now I need to "unmelt" part of the data frame to get the forecast/actual pairings. In R, this is known as casting and I know that I personally had a pretty hard time decoding the documentation. The function goes along as

cast(df.m, <IDENTIFIERS> ~ <VALUES>)

The second part is known as the casting formula and is the part that I have struggled with. But in its most simplest form, the casted frame will look like something with all the identifiers added together as uniquely identifying units ,and then the <VALUES> variables being the labels for the actual value column. If that sounded confusing, I apologize. Perhaps solving the example would help.

First, I need to find a way to identify whether a row is looking at a or b, and whether it is a forecast or an actual variable. So I first create these variables:

df.m$var = substring(df.m$variable, 1, 1)
df.m$type = substring(df.m$variable, 2) 

Which gives me the data frame:

> df.m
   id variable      value type var
1   1     aAct  1.2076384  Act   a
2   2     aAct  0.4999706  Act   a
3   3     aAct  1.0340859  Act   a
4   1    aPred -0.6735547 Pred   a
5   2    aPred -0.7188215 Pred   a
6   3    aPred -0.1108304 Pred   a
7   1     bAct  1.4994464  Act   b
8   2     bAct -0.3601551  Act   b
9   3     bAct -0.5941295  Act   b
10  1    bPred -1.0691975 Pred   b
11  2    bPred  0.7224729 Pred   b
12  3    bPred  0.5027085 Pred   b

Now I can cast the frame. In this case, I would use the formula = cast(df.m, id + var ~ type)

This is how you interpret the formula. Id + var means that every observation is uniquely identified by it's id code and the variable we're forecasting -- a or b. Then "type" on the right side represents the new variable names that will be filled by the values.

Hope this is useful to others so they don't end up spending hours agonizing over the issue as did I.

1 comment:

  1. If you are going to be doing a lot of forecasting like this, I recommend taking a look at the pandas library for python. It is much faster than R for data frame work.

    You also get the benefit of beautiful pythonic code over R soup.