CASFS + TimescaleDB Upload Speed Research

April 26, 2021


The CASFS + Timescale Research Project was conducted to explore the speed and efficiency of inserting large amounts of data into a TimescaleDB database hosted on CASFS. TimescaleDB is an open-source relational database for time-series data that runs on top of PostgreSQL. The data used in this research consisted of daily files of time-sequential financial data, each containing 2,229,120 rows and 19 columns. Two uploading methods were tested: upserting the data with the Pangres package, and copying the data with Timescale-parallel-copy. Pangres relies on PostgreSQL's INSERT statement, while Timescale-parallel-copy takes advantage of PostgreSQL's built-in COPY command for bulk inserting data. Other variables tested during the research were parallel vs. serial processing, number of cores, local vs. non-local server-data relationship, number of workers, upload origin, and upload method.

About the Research

As previously mentioned, the size of the financial data files was kept consistent throughout the term of the project. Pandas was used to process the data before inserting it into the database.

The Pangres package allows for easy insertion of Pandas DataFrames into PostgreSQL databases via its upsert method. The upsert method takes advantage of Postgres' INSERT statement and handles issues like duplicates, missing columns, and data types. The results of using the Pangres upsert method are below:



Upload Origin   | Cores | RAM (GB) | Workers | Upload Method | Parallel Processing | Server-data relationship | Time (minutes)
Python Notebook | 8     | 64       | 1       | Pangres       | No                  | None                     | 14.23


Referencing the chart above, you can see that Pangres took on average 14.23 minutes (853.8 seconds). This is quite a long time given the computing power used, so we decided to explore other methods on other pieces of hardware to see whether we could reduce this upsert time.
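For reference, a Pangres upsert along the lines used in these tests can be sketched as follows. The connection string, table name, and column names here are hypothetical stand-ins (the real files have 2,229,120 rows and 19 columns), and the exact `upsert` keyword arguments may differ slightly between Pangres versions:

```python
# Sketch of the Pangres upsert path; connection details and schema
# below are hypothetical.
import pandas as pd

def upsert_dataframe(df, connection_string, table_name):
    """Upsert a DataFrame into Postgres/TimescaleDB via Pangres.

    Dependencies are imported locally so the sketch can be read and
    run without a database available.
    """
    from pangres import upsert          # pip install pangres
    from sqlalchemy import create_engine

    engine = create_engine(connection_string)
    # Pangres requires a named, unique index; rows whose index values
    # already exist in the table are updated rather than duplicated.
    upsert(con=engine, df=df, table_name=table_name, if_row_exists="update")

# A tiny stand-in for one daily file.
df = pd.DataFrame(
    {"ts": pd.date_range("2021-04-26 09:30", periods=3, freq="1min"),
     "symbol": ["AAA", "BBB", "CCC"],
     "price": [10.0, 20.0, 30.0]},
).set_index("ts")

# upsert_dataframe(df, "postgresql://user:pass@host:5432/marketdata", "ticks")
```

The DataFrame's named index (`ts` here) is what Pangres uses to decide between inserting and updating a row.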


New Methods

One new method we looked into was Timescale-parallel-copy: "a command line program for parallelizing PostgreSQL's built-in copy functionality for bulk inserting data into TimescaleDB." First, we tested a local server-data relationship, meaning the TimescaleDB database runs in a Docker container on the same server as the Python script. Inserting the exact same set of data took on average 60 seconds (1 minute), a drastic increase in efficiency compared to using the Pangres package. The variables are stated in the chart below:


Upload Origin | Cores | RAM (GB) | Workers | Upload Method           | Parallel Processing | Server-data relationship | Time (minutes)
Command Line  | 8     | 64       | 1       | Timescale-parallel-copy | No                  | Local                    | 1.00
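A run of this kind could be driven from Python roughly as sketched below. The flag names follow the Timescale-parallel-copy command-line interface, but the connection details, database, table, and file names are hypothetical:

```python
# Sketch: assembling and (optionally) running one Timescale-parallel-copy
# invocation. Connection string, database, table, and file names are
# placeholders, not the ones used in the research.
import shlex
import subprocess

def build_copy_command(csv_path, workers=1,
                       connection="host=localhost user=postgres sslmode=disable",
                       db_name="marketdata", table="ticks"):
    """Assemble the argv list for one bulk-copy run."""
    return [
        "timescaledb-parallel-copy",
        "--connection", connection,
        "--db-name", db_name,
        "--table", table,
        "--file", csv_path,
        "--workers", str(workers),
        "--skip-header",          # assumes the daily file carries a header row
    ]

cmd = build_copy_command("daily_file.csv", workers=1)
print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment when the CLI and DB are available
```

Building the argv list separately from running it makes it easy to sweep parameters like the worker count, which is exactly what the later experiments do.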


Taking note of this progress, we expected that parallel processing would further decrease the run time. We tried Ray, an open-source framework for parallel and distributed computing in Python. Keeping everything else equal, we enabled Ray and increased the number of workers from 1 to 2, then ran the test. It finished in 21 seconds (0.35 minutes), another drastic increase in efficiency. The results are as follows:


Upload Origin | Cores | RAM (GB) | Workers | Upload Method           | Parallel Processing | Server-data relationship | Time (minutes)
Command Line  | 8     | 64       | 2       | Timescale-parallel-copy | Yes                 | Local                    | 0.35
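The research used Ray for this fan-out; as a library-free stand-in, the same pattern can be sketched with the standard library. Since the heavy lifting happens in an external copy process, threads are sufficient to fan the work out; the chunk file names and the `copy_chunk` placeholder below are hypothetical:

```python
# Stdlib stand-in for the Ray fan-out: each worker handles one chunk of
# the daily file. In the real setup, copy_chunk would launch an external
# bulk-copy process (so threads, not processes, are enough here).
from concurrent.futures import ThreadPoolExecutor

def copy_chunk(chunk_path):
    """Placeholder for one bulk-copy invocation on this chunk."""
    return f"copied {chunk_path}"

chunks = ["chunk_0.csv", "chunk_1.csv"]   # hypothetical pre-split files

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(copy_chunk, chunks))

print(results)
```

With Ray itself, `copy_chunk` would instead be decorated with `@ray.remote` and dispatched via `.remote()`, but the shape of the fan-out is the same.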


We also tested increasing the number of workers from 2 to 4, and then to 6, while keeping everything else the same. Going from 2 to 4 workers improved the time from 21 seconds to 14 seconds. However, going from 4 to 6 workers yielded the same result, signifying a plateau in efficiency.
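The scaling behavior above can be checked with a quick calculation on the measured times:

```python
# Worker-scaling numbers from the local-server runs: seconds per run
# at 2, 4, and 6 workers.
times = {2: 21, 4: 14, 6: 14}

# Speedup of each step relative to the previous worker count.
speedup_2_to_4 = times[2] / times[4]   # 1.5x faster
speedup_4_to_6 = times[4] / times[6]   # 1.0x: no gain, i.e. the plateau

print(round(speedup_2_to_4, 2), round(speedup_4_to_6, 2))
```

Doubling from 2 to 4 workers gives a 1.5x speedup rather than the ideal 2x, and adding workers beyond 4 gives nothing, which is consistent with the insert becoming bound by something other than worker count.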


Changing Server-Data Relationship

The last change we made was setting up a TimescaleDB instance on a separate server and running it persistently. This is described by the "Server-data relationship" variable, which is now "not local." With this change in place, we kept everything else the same and varied only the number of workers. After trying 1, 2, 4, 6, and 8 workers, we found 8 workers to be the most efficient: the test ran in 8 seconds (0.13 minutes). The results are as follows:


Upload Origin | Cores | RAM (GB) | Workers | Upload Method           | Parallel Processing | Server-data relationship | Time (minutes)
Command Line  | 8     | 64       | 8       | Timescale-parallel-copy | Yes                 | Not local                | 0.13



Conclusion

By conducting this research, we were able to improve on Pangres' upsert time by 99.1%, going from 14.23 minutes to 0.13 minutes, by using TimescaleDB's parallel copy functionality and implementing Ray parallel processing. In other terms, we were able to copy 278,640 rows per second. This falls within TimescaleDB's self-reported insert performance range of 100K to 400K rows per second, demonstrating that CASFS can match TimescaleDB's reported speeds and showing its overall potential to integrate any database into our service.

Overall, the first thing we determined is that when inserting large amounts of data, using the built-in COPY command is far more efficient than row-by-row upserts; in our case, we used the Timescale-parallel-copy package. We also determined that parallel processing with a large number of workers can drastically decrease the insert time of time-sequential data into a PostgreSQL database, and more specifically, TimescaleDB.
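The headline numbers above can be verified directly from the figures reported in this document:

```python
# Verifying the conclusion's arithmetic from the reported measurements.
rows_per_file = 2_229_120            # rows in one daily file
best_time_s = 8                      # 8 workers, non-local server
baseline_min, best_min = 14.23, 0.13 # Pangres vs. final parallel copy

throughput = rows_per_file / best_time_s
improvement = (1 - best_min / baseline_min) * 100

print(f"{throughput:,.0f} rows/s, {improvement:.1f}% improvement")
# -> 278,640 rows/s, 99.1% improvement
```

Both the 278,640 rows-per-second throughput and the 99.1% improvement follow directly from the measured times.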