Although a mature ML practitioner should be comfortable with both Python and R, it is always entertaining to tune your ears whenever there is ‘friendly’ banter between R and Python enthusiasts.
Table of Contents
The power of R
Python was my first data science language, and I must confess that I usually have the TidyVerse cheatsheet open (amongst others) whenever I use R to do exploratory analysis.
R can be very useful for time series analysis. The statsmodels
library in Python whilst useful, has documentation that is vague at best. It assumes that you already know what is going on.
To the best of my knowledge there is no function in Python that automatically fits an ARIMA model in R. As a result, we have people writing 7 inner for
loops to try and find an optimal SARIMAX model via grid search.
Sometimes it’s just better to use a convenient function like auto.arima()
, which only exists in R.
We begin with how to set up rpy2
and a simple example. Despite being so simple, one must tread carefully because when converting objects between 2 languages, the behaviour can be unexpected.
Setting up Rpy2
Let’s try running some basic R by porting it into our Python environment. We will need 2 components:
import rpy2.robjects as robjects
from rpy2.rojects.packages import importr
We need the robjects
module in order to work with R objects and functions, whilst importr
will allow us to import R packages into our environment to use.
Alright, let’s test it out!
pnorm = robjects.r['pnorm']
q = pnorm(1.96)
As you might know, pnorm
will return a Gaussian CDF given the quantile, which is 0.975.
Now let’s say you wanted to manipulate this with some arithmetic.
q_prime = q + 1
print(q_prime)
You might find it surprising for the output to be
[1] 0.9750021 1.0000000
It turns out that pnorm
is a rpy2.robjects.vectors.FloatVector
object, and thus by adding something you are instead appending to the vector.
To make this work you would need to index the object
q_prime = q[0] + 1
which returns
1.9750021
Some R objects can be much more complicated to deal with. To make your life slightly easier, rpy2
has a convenient function that allows (somewhat) easier translation of objects across languages.
from rpy2.robjects import pandas2ri
from rpy2.robjects import numpy2ri
pandas2ri.activate()
numpy2ri.activate()
pandas2ri
and numpy2ri
allow conversion from Python pandas and numpy objects to an appropriate R object and vice versa.
If you try running the code above, the returned object will be a np.ndarray
[1.9750021]
and you still will have to be careful because the ‘+ 1’ will be element-wise addition!
As you can imagine, this can be ‘fun’ to debug if you assume that returned R objects will behave like Python objects.
You have been warned.
Now let’s try do something that is a little more complicated like porting R functions into Python.
Writing R functions
You’re really not going to be bothered with using R if you just wanted to compute the CDF of a Gaussian, so let’s look at how you can utilise rpy2
to call auto.arima()
.
If you are using this locally, make sure you have installed the forecast
package using an R interface.
If not you may need to install it using Python.
utils = importr('utils')
utils.install_packages("forecast")
However, when I tried this I ran into problems so I would recommend to install packages using R.
We write a R function, store it as a string and save it to a Python variable.
auto_arima_R = """
auto_arima_R <- function(ts_R, steps, freq){
library(forecast)
ts_obj = ts(ts_R$sales, freq=freq)
model <- auto.arima(ts_obj, seasonal=TRUE)
predict <- forecast(model, h=steps)
list(pred=predict[4], pred=model$fitted)
}
"""
auto_arima_R = robjects.r(auto_arima_R)
Alright, so now we have a python function that will call the auto.arima()
function from the forecast
package and return a n-steps ahead prediction.
Let’s generate some data and use it for forecasting.
import numpy as np
import pandas as pd
import random
a = 2; b = 3
mu = 1; sd = 2
n = 200
def rand_points(a, b, mu, sd, n):
arr = []
for i in range(n):
arr.append(a*i + b + random.gauss(mu, sd))
return arr
index = np.arange(0, n)
sales = rand_points(a, b, mu, sd, n)
df = pd.DataFrame({
'date': index,
'sales': sales
})
print(df.head())
result = auto_arima_R(df, steps=8, freq=4)
You should be able to extract the result by indexing
print(result[0][0]) # predictions
print(result[1]) # fitted values
So there you go! No more writing for
loops to fit ARIMA models to your time series!
Conclusion
Read the docs and test the behaviour VERY carefully - you never know what’s going to happen.
The best way to utilise this wonderful library is to return the simplest object type you can e.g. a vector, and mold the vector back into a DataFrame or matrix using Python so you don’t run into unexpected behaviours.
With great power comes great responsibility =)