In a previous post I have shown you how to setup an AWS instance running the newest RStudio, R, Python, Julia and so forth, where the configuration of the instance can be freely chosen. However, there is quite a lot of possibilities of instance configurations out there: There are different instance classes (General Purpose, Compute Optimized, RAM Optimized, … ) and different instance sizes within these classes. For General Purpose, or t2, there are, e.g. t2.nano, t2.micro, t2.small, t2.medium, t2.large, t2.xlarge and t2.2xlarge¹.

These instances differ in two dimensions: price and performance. Obviously, these dimensions are highly correlated, since higher price means (or should mean, at least) higher performance. Now, price is easily measured, yet performance is a bit trickier: For example, it is not entirely straightforward to assess the impact of higher RAM, CPU or even GPU directly across many different configurations. But we’re doing data science, right? So why not create a programmatic test in order to gauge the performance empirically? Well, let‘s do it!

The Test

For this benchmark test I chose a classical machine learning task: the classification of the MNIST dataset of handwritten digits, to be categorized as 0-9. This data set is very commonly used as an example set for machine learning algorithms.

For this benchmark test, I borrowed a nice skript by Kory Becker written here, which trains a Support Vector Machine (SVM) on the problem, using only the first 1000 observations of the dataset, each with 768 attributes. I altered the code ever so slightly to that each run of the script returns the following measurements:

Elapsed Time: The time elapsed since starting the script (excluding the time to install the libraries and download of the data),
Accuracy of model, i.e. the percentage of predictions that classified the digits correctly.

Additionally, I included the following information:

RAM in Gigabytes,
Number of CPUs in use, and finally
Price in Dollars per Hour.

The Candidates

AWS provides a large number of different configurations, and I will not discuss all of these in this post. Rather, let me focus on four different specifications of computing resource demands and chose a distinctive representative:

General Purpose: t2, m4
Compute Optimized: c4
Memory Optimized: r4

For each of these classes, I had planned to test the sizes small, medium, large, xlarge and 2xlarge. The sizes micro, small and medium are actually only available for t2 (oh, no!), so that I ended up only testing 14 configurations.

The Results

I started with the candidate t2.micro, which is free of charge. Unfortunately, the script never succesfully ran the training of the model, presumably because the dimension of merely 1 GB of RAM is not sufficient. Still, a “not possible” result is still a useful result for choosing the right infrastructure.

Let’s have a first look at the results, first in plain numbers:

instance_class	instance_size	ram	vcpus	ecu	price_per_hour	elapsed_time	accuracy
t2	micro	1.00	1.0	NA	0.0134	NA	NA
t2	small	2.00	1.0	NA	0.0268	68.624	0.917
t2	large	8.00	2.0	NA	0.1072	65.335	0.918
t2	xlarge	16.00	4.0	NA	0.2144	55.611	0.918
t2	2xlarge	32.00	8.0	NA	0.4288	56.284	0.919
t2	medium	4.00	2.0	NA	0.0536	63.961	0.919
m4	large	8.00	2.0	6.5	0.1200	82.823	0.933
m4	xlarge	15.00	4.0	13.0	0.2400	80.749	0.928
m4	2xlarge	32.00	8.0	26.0	0.4800	65.728	0.912
m4	4xlarge	64.00	16.0	53.5	0.9600	64.573	0.927
m4	16xlarge	256.00	64.0	188.0	3.8400	93.310	0.915
r4	large	15.25	2.0	7.0	0.1600	80.749	0.928
r4	xlarge	30.50	13.5	4.0	0.3200	68.372	0.920
c4	large	3.75	2.0	8.0	0.1140	121.004	0.915

At a quick glance, the accuracy of the models looks quite uniform. This is hardly surprising, as the algorithm itselg is unchanged by hardware limitation, and the apparent fluctuations can be explained by the stochastic nature of the train-test-data set sampling in the script.

A core assumption is that more computing power yields faster results. A second core assumption is that the higher the computing power, the higher the cost. Combining these assumptions leads us to assume that higher cost leads to a lower time elapsed. A quick visualization of the data demonstrates that the results support this notion:

The measurement of the instance “m4.16xlarge” doesn’t quite fit into the pattern, and I am frankly unsure of the reasons. The measurement was taken twice, so that circumstantial errors leading to this measurement can be rejected.

Let us look a little more precisely at the data, in order to establish the most influential factors determining the speed of the analysis. We use the wonderful ggpairs visualization of the GGally package and omit the observation of the instance “m4.16xlarge” in the analysis:

This plot contains a number of results at once. First off, and unsurprisingly, the price per hour correlates vey strongly with the number of virtual CPUs and the size of the RAM, indicating that “the higher the computing power, the higher the cost” was a correct core assumption.

Second, the correlation between Elapsed Eime and the numeric indicators of performance RAM, vCPUs and Price per Hour is clearly negative across the board, but the highest correlation is clearly attained by the dimension RAM. This provides yet another indication that the notion Performance of R hinges on RAM is true.

One last question to consider: which instance type is optimal for R purposes? Optimality will be defined by provide the quickest results for the least money. Compare the fits of a standard linear model:

instance_class	(Intercept)	price_per_hour
m4	83.66913	-22.66862
r4	93.12600	-77.35625
t2	66.86029	-29.47335

This shows that the entry price is cheapest for instances of the class “t2”, as the y-intercept is the lowest in this case. However, in cases of higher Price per Hour, i.e. higher necessary computing power, “r4” is the better choice: The time decreases quickest with the increase in power. Both lines meet at a price per hour of roughly 55 Cents per hour, corresponding to an instance r4.2xlarge with 61 GB RAM.

Takeaway Messages

To conclude this article, let me summarize the key findings:

The most important hardware feature for the increasing computing speed of “R” analysis is RAM
For analysis with a small to medium scope of performance (RAM less than 60 GB), the instance class “t2” is the best choice in AWS.
For larger scale projects, the instance class “r4”, optimzed for RAM usage, is the optimal choice.

…why is it ‚nano‘, ‚micro‘ but then ‚large‘ ‚extra large‘? Be consistent, dangit!↩

Benchmarking AWS Instances with MNIST classification

The Test

The Candidates

The Results

Takeaway Messages

Sebastian Schweer