Using PySpark On CASFS+

May 28th, 2021



CASFS+

CASFS+ is a cloud platform file system and resource manager offering numerous features designed to run jobs quickly and efficiently. The file system uses Amazon S3 buckets as its backend storage, allowing users to access their data quickly from any location.


PySpark

From PySpark's official documentation:


“PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.”


We have developed a user guide to help new users of CASFS+ set up PySpark in their Analytics environment.


Benefits of Using PySpark on CASFS+

The biggest benefits of using PySpark in the CASFS+ environment are cost efficiency, speed improvements, and querying of datasets stored in the CASFS+ File System.



Before we used PySpark, running a data quality check on one of our datasets meant spawning four machines with 96 cores each. Each machine cost us $1.00 an hour, and the check took 30 minutes to complete. The total cost to run the check was (4 * 1.00 * 0.5) = $2.00.



Using PySpark, we can run the same check faster and at lower cost. We now spin up 10 eight-core machines to create a Spark cluster. Each machine costs us $0.08 an hour, and the job completes in 5 minutes (about 0.083 hours). With PySpark, the total cost to run the check is (10 * 0.08 * 0.083) = $0.07, a savings of $1.93 per job.
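The comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the helper name `job_cost` is ours for illustration, and the rates and run times are the ones quoted in this post.

```python
def job_cost(machines: int, hourly_rate: float, minutes: float) -> float:
    """Cost in dollars of running `machines` machines for `minutes` minutes."""
    return machines * hourly_rate * (minutes / 60)


# Figures quoted in the post: 4 machines at $1.00/hour for 30 minutes,
# versus a 10-machine Spark cluster at $0.08/hour for 5 minutes.
cost_before = job_cost(machines=4, hourly_rate=1.00, minutes=30)
cost_with_spark = job_cost(machines=10, hourly_rate=0.08, minutes=5)
savings = cost_before - cost_with_spark

print(f"before: ${cost_before:.2f}")
print(f"with PySpark: ${cost_with_spark:.2f}")
print(f"savings per job: ${savings:.2f}")
```

Rounded to the cent, this gives the same $2.00, $0.07, and $1.93 figures as the text.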



Beyond the cost savings from faster jobs, PySpark can save CASFS+ users money in other ways. Hosting a database on a cloud server normally requires a persistent machine, which can cost up to hundreds of dollars a week. To avoid this, we store our data as Parquet files on the CASFS+ File System. PySpark can read multiple Parquet files and, through the PySpark SQL library, query them the same way you would query a SQL database. In effect, this gives us a serverless SQL database and saves the cost of hosting a database on AWS or any other cloud provider.



Conclusion

By using PySpark on CASFS+, we have cut costs both through performance gains and by removing the need to pay for a server to host databases. PySpark integrates seamlessly into the CASFS+ ecosystem, and we have developed guidelines and libraries to help users set up and use PySpark as soon as they start using CASFS+.


References:

PySpark