September 24th, 2021
Several EC2 instances are benchmarked on CASFS on AWS. Statistics such as runtime, memory usage, and cost are compared across varying dataset sizes to help users determine which EC2 instance is best for their specific use case.
In our previous benchmarking analysis, we compared the performance of various Python dataframe libraries on an r5.24xlarge (96 cores, 768 GB memory) machine on CASFS+. In that testing, we found that Pandas parallelized over several cores with Ray was the fastest option, since the workload is trivially parallelizable.
In this paper, we run the same process on several EC2 instances of various types and sizes to make recommendations on which instance is best suited to a user's needs. First, we compared eight-core machines, as these are commonly used for daily work. Then we compared 96-core machines, since these can be used for larger workloads.
All instance descriptions were taken from https://aws.amazon.com/ec2/instance-types/
General purpose instances provide a balance of compute, memory and networking resources, and can be used for a variety of diverse workloads. These instances are ideal for applications that use these resources in equal proportions, such as web servers and code repositories.
Compute Optimized instances are ideal for compute-bound applications that benefit from high-performance processors. Instances belonging to this family are well suited for batch-processing workloads, media transcoding, high-performance web servers, high-performance computing (HPC), scientific modeling, dedicated gaming servers and ad server engines, machine learning inference, and other compute intensive applications.
Memory-optimized instances are designed to deliver fast performance for workloads that process large data sets in memory.
Accelerated computing instances use hardware accelerators, or co-processors, to perform functions, such as floating-point number calculations, graphics processing, or data pattern matching, more efficiently than is possible in software running on CPUs.
The instances were tested using daily financial datasets. The dataset sizes are as follows:
To process the data, we used this generalized algorithm, with each file being parallelized out to a single core:
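The benchmark itself used Pandas for the per-file work and Ray to ship each file to its own core. The fan-out shape can be sketched with the standard library alone; here `process_file` is a placeholder for the real per-file logic, and a thread pool stands in for Ray's per-core remote tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> int:
    # Placeholder for the real per-file work (e.g. read the file with
    # Pandas, compute the daily statistics, write the result back out).
    return len(path)

def run_all(paths, workers=8):
    # Fan each file out to its own worker, mirroring the benchmark's
    # one-file-per-core layout (done there with Ray remote tasks).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```

With Ray, `process_file` would instead be decorated with `@ray.remote` and dispatched with `.remote()`, so each file lands on a separate core rather than a thread.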
This process does not use GPUs or databases, so it does not benefit from the specialized hardware of the P3, G4dn, or X1e machines. Further testing would be needed to analyze the benefits of using these machines.
First, we tested the process on a single file across the eight-core machines to get a baseline of which performed best relative to cost for this particular process.
| EC2 Instance | RAM (GiB) | GPU | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| c5.2xlarge | 16 | 0 | 6.38 | 0.014 | 0.09 | 0.00016 |
| c5n.2xlarge | 21 | 0 | 6.69 | 0.074 | 0.08 | 0.00015 |
| g4dn.2xlarge | 32 | 1 | 6.99 | 0.053 | 0.23 | 0.00045 |
| r5.2xlarge | 64 | 0 | 6.99 | 0.042 | 0.10 | 0.00019 |
| c4.2xlarge | 15 | 0 | 7.07 | 0.162 | 0.07 | 0.00014 |
| r5n.2xlarge | 64 | 0 | 7.34 | 0.043 | 0.09 | 0.00018 |
| m5n.2xlarge | 32 | 0 | 7.38 | 0.044 | 0.08 | 0.00016 |
| m5.2xlarge | 32 | 0 | 7.57 | 0.038 | 0.08 | 0.00017 |
| m5a.2xlarge | 32 | 0 | 7.79 | 0.032 | 0.08 | 0.00017 |
| r5a.2xlarge | 64 | 0 | 7.84 | 0.073 | 0.10 | 0.00022 |
| r4.2xlarge | 61 | 0 | 8.09 | 0.031 | 0.08 | 0.00018 |
| m4.2xlarge | 32 | 0 | 8.43 | 0.061 | 0.08 | 0.00019 |
| p3.2xlarge | 61 | 1 | 8.45 | 0.071 | 0.92 | 0.00216 |
| x1e.2xlarge | 244 | 0 | 8.53 | 0.051 | 0.50 | 0.00118 |
| t3.2xlarge | 32 | 0 | 9.23 | 0.395 | 0.10 | 0.00026 |
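The Total Cost column is derived from the other two: the mean runtime converted to hours, multiplied by the hourly rate. A minimal sketch (the function name `total_cost` is ours, not part of the benchmark code):

```python
def total_cost(mean_time_s: float, hourly_rate: float) -> float:
    """Cost of one run: runtime in hours times the instance's $/hr rate."""
    return mean_time_s / 3600 * hourly_rate

# e.g. the c5.2xlarge row above: 6.38 s at $0.09/hr
print(round(total_cost(6.38, 0.09), 5))  # → 0.00016
```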
Next, we tested 96 core machines for the C, M, and R EC2 instances. Since this process doesn't use GPUs or databases, we left out the other EC2 instance types.
| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| c5.metal | 192 | 1 | 6.28 | 0.312 | 0.91 | 0.00159 |
| c5.24xlarge | 192 | 1 | 6.45 | 0.207 | 1.03 | 0.00185 |
| r5.metal | 768 | 1 | 6.85 | 0.148 | 1.03 | 0.00196 |
| m5.metal | 384 | 1 | 6.87 | 0.100 | 0.96 | 0.00183 |
| r5.24xlarge | 768 | 1 | 7.07 | 0.282 | 1.21 | 0.00238 |
| r5n.24xlarge | 768 | 1 | 7.08 | 0.287 | 1.00 | 0.00197 |
| m5.24xlarge | 384 | 1 | 7.14 | 0.288 | 0.96 | 0.00190 |
| m5n.24xlarge | 384 | 1 | 7.20 | 0.345 | 0.96 | 0.00192 |
| m5a.24xlarge | 384 | 1 | 7.93 | 0.090 | 0.96 | 0.00211 |
| r5a.24xlarge | 768 | 1 | 8.00 | 0.187 | 1.00 | 0.00222 |
| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| c5.24xlarge | 192 | 10 | 7.02 | 0.075 | 1.03 | 0.00201 |
| c5.metal | 192 | 10 | 7.06 | 0.247 | 0.91 | 0.00178 |
| m5.metal | 384 | 10 | 7.71 | 0.197 | 0.96 | 0.00206 |
| r5.24xlarge | 768 | 10 | 7.76 | 0.148 | 1.21 | 0.00261 |
| r5.metal | 768 | 10 | 7.80 | 0.286 | 1.03 | 0.00223 |
| r5n.24xlarge | 768 | 10 | 7.86 | 0.305 | 1.00 | 0.00218 |
| m5n.24xlarge | 384 | 10 | 7.97 | 0.350 | 0.96 | 0.00213 |
| m5.24xlarge | 384 | 10 | 8.07 | 0.353 | 0.96 | 0.00215 |
| r5a.24xlarge | 768 | 10 | 9.54 | 0.443 | 1.00 | 0.00265 |
| m5a.24xlarge | 384 | 10 | 9.56 | 0.448 | 0.96 | 0.00255 |
| EC2 Instance | RAM (GiB) | N Files | Mean Time (s) | STD Time (s) | Cost ($/hr) | Total Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| c5.24xlarge | 192 | 100 | 25.50 | 0.416 | 1.03 | 0.00730 |
| c5.metal | 192 | 100 | 25.50 | 0.704 | 0.91 | 0.00645 |
| r5.metal | 768 | 100 | 26.20 | 0.342 | 1.03 | 0.00750 |
| r5.24xlarge | 768 | 100 | 26.60 | 0.453 | 1.21 | 0.00894 |
| m5n.24xlarge | 384 | 100 | 26.90 | 0.529 | 0.96 | 0.00717 |
| m5.24xlarge | 384 | 100 | 27.00 | 0.445 | 0.96 | 0.00720 |
| m5.metal | 384 | 100 | 27.20 | 0.797 | 0.96 | 0.00725 |
| r5n.24xlarge | 768 | 100 | 27.20 | 0.498 | 1.00 | 0.00756 |
| m5a.24xlarge | 384 | 100 | 31.30 | 1.230 | 0.96 | 0.00835 |
| r5a.24xlarge | 768 | 100 | 33.90 | 0.772 | 1.00 | 0.00942 |
C instances are the fastest, but have the least RAM.
R instances have the most RAM and perform slightly better than M machines, but at a higher cost.
M instances offer higher performance than other general purpose machines while remaining cost-effective.
Newer instance generations (higher numbers) outperform machines with the same prefix and a lower number. For example, the m5 performs better than the m4, with only a minimal cost increase.
CASFS+ is already network optimized, so instances with the -n suffix (which are network optimized) provide no additional network performance benefit. However, these are cheaper on average than the base machines and can provide cost savings.
Intel machines perform better than AMD machines (instances with the -a suffix). Since AMD machines also offer little cost benefit, we recommend using Intel machines.
Metal machines provide a performance boost when the dataset size is smaller, but perform on par with 24xlarge machines as dataset size increases. These are generally cheaper than the 24xlarge instances and can provide cost savings.
For workloads that are not GPU or database dependent, we recommend using either a C, M, or R machine depending on your RAM needs and budget. If the dataset will fit into memory, the C machines will always provide the best performance. After that, choosing between an M or an R machine will depend on memory and budget. If budget is not an issue, the R instances will provide slightly better performance. However, if budget is an issue, M machines can perform comparably.
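That decision can be illustrated with a small helper that filters the benchmarked instances by required memory and picks the cheapest fit. `pick_instance` is a hypothetical helper, and the candidate list is a hand-copied subset of the 96-core tables above:

```python
# (instance, RAM in GiB, on-demand $/hr) — values from the tables above.
CANDIDATES = [
    ("c5.24xlarge", 192, 1.03),
    ("m5.24xlarge", 384, 0.96),
    ("r5.24xlarge", 768, 1.21),
]

def pick_instance(ram_needed_gib: float):
    """Cheapest benchmarked 96-core instance whose RAM fits the dataset."""
    fitting = [c for c in CANDIDATES if c[1] >= ram_needed_gib]
    return min(fitting, key=lambda c: c[2])[0] if fitting else None

print(pick_instance(500))   # only the R instance has enough RAM: r5.24xlarge
print(pick_instance(150))   # all fit, so the cheapest wins: m5.24xlarge
```

When budget is secondary and the data fits, one would instead sort the fitting candidates by mean runtime, which tends to favor the C instances per the results above.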