Benchmarking AWS Instances with MNIST classification
In a previous post I showed you how to set up an AWS instance running the newest RStudio, R, Python, Julia and so forth, where the configuration of the instance can be freely chosen. However, there are quite a lot of possible instance configurations out there: There are different instance classes (General Purpose, Compute Optimized, Memory Optimized, …) and different instance sizes within these classes. For the General Purpose family t2, for example, there are t2.nano, t2.micro, t2.small, t2.medium, t2.large, t2.xlarge and t2.2xlarge.¹
These instances differ in two dimensions: price and performance. Obviously, these dimensions are highly correlated, since a higher price means (or should mean, at least) higher performance. Now, price is easily measured, yet performance is a bit trickier: For example, it is not entirely straightforward to assess the impact of more RAM, more CPUs or even a GPU directly across many different configurations. But we’re doing data science, right? So why not create a programmatic test in order to gauge the performance empirically? Well, let’s do it!
For this benchmark test I chose a classical machine learning task: the classification of the MNIST dataset of handwritten digits, to be categorized as 0-9. This data set is very commonly used as an example set for machine learning algorithms.
I borrowed a nice script by Kory Becker, which trains a Support Vector Machine (SVM) on the problem, using only the first 1000 observations of the dataset, each with 784 attributes (one per pixel of a 28 × 28 image). I altered the code ever so slightly so that each run of the script returns the following measurements:
- Elapsed Time: The time elapsed since starting the script (excluding the time to install the libraries and download the data),
- Accuracy: The percentage of predictions that classified the digits correctly.
Additionally, I included the following information:
- RAM in Gigabytes,
- Number of CPUs in use, and finally
- Price in Dollars per Hour.
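To make the two measurements concrete, here is a minimal sketch of how a run can be wrapped to record elapsed time and accuracy. It is written in Python rather than R, and it is not the script used for this benchmark: the `benchmark` helper, the majority-label stand-in model and the synthetic rows are all assumptions made purely for illustration.

```python
import random
import time


def benchmark(fit, predict, train, test):
    """Time one fit-and-predict run and score it on held-out rows."""
    start = time.perf_counter()
    model = fit(train)
    predictions = [predict(model, features) for features, _ in test]
    elapsed = time.perf_counter() - start
    correct = sum(p == label for p, (_, label) in zip(predictions, test))
    return {"elapsed_seconds": elapsed, "accuracy": correct / len(test)}


# Hypothetical stand-in for the SVM: always predict the most common label.
def fit_majority(train):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)


def predict_majority(model, features):
    return model


# Synthetic placeholder rows of (features, label); this is NOT MNIST.
rng = random.Random(42)
rows = [([rng.random()], rng.randrange(10)) for _ in range(1000)]
train, test = rows[:800], rows[800:]

result = benchmark(fit_majority, predict_majority, train, test)
print(result)
```

The timer starts only after the data is in memory, which mirrors the idea of excluding library installation and the download from the elapsed time.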
AWS provides a large number of different configurations, and I will not discuss all of these in this post. Rather, let me focus on three classes of computing resource demands and choose distinctive representative families for each:
- General Purpose: t2, m4
- Compute Optimized: c4
- Memory Optimized: r4
For each of these classes, I had planned to test the sizes small, medium, large, xlarge and 2xlarge. The sizes micro, small and medium are actually only available for t2 (oh, no!), so that I ended up only testing 14 configurations.
I started with the candidate t2.micro, which is free of charge. Unfortunately, the script never successfully completed the training of the model, presumably because a mere 1 GB of RAM is not sufficient. Still, a “not possible” result is a useful result for choosing the right infrastructure.
Let’s have a first look at the results, starting with the plain numbers:
At a quick glance, the accuracy of the models looks quite uniform. This is hardly surprising, as the algorithm itself is not changed by the hardware, and the apparent fluctuations can be explained by the stochastic nature of the train-test sampling in the script.
A core assumption is that more computing power yields faster results. A second core assumption is that the higher the computing power, the higher the cost. Combining these assumptions leads us to expect that a higher cost comes with a lower elapsed time. A quick visualization of the data shows that the results support this notion:
The measurement of the instance “m4.16xlarge” doesn’t quite fit into the pattern, and I am frankly unsure of the reasons. The measurement was taken twice, so a one-off error in the measurement can be ruled out.
Let us look a little more closely at the data, in order to establish the most influential factors determining the speed of the analysis. We use the wonderful ggpairs visualization of the GGally package and omit the observation of the instance “m4.16xlarge” from the analysis:
This plot contains a number of results at once. First off, and unsurprisingly, the price per hour correlates very strongly with the number of virtual CPUs and the size of the RAM, indicating that “the higher the computing power, the higher the cost” was a correct core assumption.
Second, the correlation between Elapsed Time and the numeric indicators of performance (RAM, vCPUs and Price per Hour) is clearly negative across the board, and the strongest correlation is attained by RAM. This provides yet another indication that the notion “performance of R hinges on RAM” is true.
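For readers who want to see what such a correlation figure boils down to, here is a small Python sketch of the Pearson coefficient computed by hand. The `pearson` helper and both value lists are assumptions for illustration only; the numbers are placeholders and not the measurements from this benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT the data from this post.
ram_gb = [2, 4, 8, 16, 32]
elapsed_s = [50, 44, 37, 30, 21]

r = pearson(ram_gb, elapsed_s)
print(f"correlation(RAM, elapsed time) = {r:.2f}")
```

A value close to -1 means that more RAM goes hand in hand with a shorter elapsed time, which is the pattern described above.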
One last question to consider: which instance type is optimal for R purposes? Optimality is defined here as providing the quickest results for the least money. Compare the fits of a standard linear model:
This shows that the entry price is cheapest for instances of the class “t2”, as the y-intercept is the lowest in this case. However, in cases of higher Price per Hour, i.e. higher necessary computing power, “r4” is the better choice: The time decreases quickest with the increase in power. Both lines meet at a price of roughly 55 cents per hour, corresponding to an instance r4.2xlarge with 61 GB RAM.
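The meeting point of two fitted lines can be worked out directly from their intercepts and slopes. The following Python sketch shows the idea under stated assumptions: `fit_line` and `crossover` are hypothetical helpers, and the price and time points are placeholders chosen for illustration, not the values measured in this benchmark.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def crossover(line_a, line_b):
    """x value at which two (intercept, slope) lines intersect."""
    (a0, a1), (b0, b1) = line_a, line_b
    if a1 == b1:
        raise ValueError("parallel lines never cross")
    return (b0 - a0) / (a1 - b1)


# Placeholder (price per hour in USD, elapsed seconds) points; NOT the post's data.
t2 = fit_line([0.1, 0.2, 0.4], [40.0, 38.0, 34.0])
r4 = fit_line([0.2, 0.4, 0.8], [48.0, 40.0, 24.0])

price = crossover(t2, r4)
print(f"lines meet at about {price:.2f} USD per hour")
```

Below the crossover price the line with the lower intercept gives the shorter time, above it the steeper line wins, which is the reasoning applied to “t2” and “r4” here.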
To conclude this article, let me summarize the key findings:
- The most important hardware feature for increasing the computing speed of an “R” analysis is RAM.
- For analyses with a small to medium scope of performance (RAM less than 60 GB), the instance class “t2” is the best choice in AWS.
- For larger scale projects, the instance class “r4”, optimized for RAM usage, is the optimal choice.
¹ …why is it “nano”, “micro” but then “large”, “extra large”? Be consistent, dangit!