May 4th, 2021
The CASFS Audit Log Research was conducted to demonstrate the CASFS File System, the depth of the CASFS audit logs, how we classify time-sequential data, and how we implemented parallel processing with the “Ray” package. Along the way, we analyzed the audit logs themselves using the Code Willing Analytics Environment, which runs on an AWS EC2 instance. The CASFS audit logs contain time-sequential data, with fields such as the date and time, uid, user, and IP address of each access.
One of our main goals in this research was to demonstrate our product, the CASFS File System. CASFS can be used for many types of analysis, particularly on big data, which is what this cloud-storage-backed file system was built to handle. Another main goal was to demonstrate the depth of the audit logs that CASFS produces. These logs provide extensive data on accessed files, permissions, denied access attempts, and usage. This data is not available on standard UNIX systems. Lastly, we wanted to test parallel processing with the “Ray” package in several use cases. We tried it in a few places and found that it worked best for pre-processing the data.
The CASFS File System is a cloud platform file system and resource manager offering numerous features designed to run jobs quickly and efficiently. This network file system runs on Amazon's S3 storage buckets and allows the system to be blazingly fast no matter where the user is.
The file system also includes some special features, such as separating System Administrators from Data Administrators. Typically, a network file system has only one type of administrator, the System Admin. CASFS has two types of admins to further separate duties and powers. It also supports time-range permissions on files and directories. For example, if you want to grant User1 access to a file or directory for a certain period of time, a Data Admin can set that period through the file system. Too much data can sometimes be counterproductive, so keeping analysts on a need-to-know basis can be useful.
This is why Data Admins also have the option to hide directories and make them visible only to certain users. Hiding differs from restricting access: a user who is restricted from a directory can still see it and knows it is there, while a user from whom a directory is hidden does not know it exists. We accomplish this with Access Control List (ACL) permissions, which set permissions on individual files or folders.
The CASFS Audit Logs contain detailed information about which files users are accessing and when, including the date and time, uid, user, IP address, and more. This is an additional feature of the CASFS File System that most Unix-based file systems do not have. It lets System Admins easily verify that the correct security permissions are in place. If the security permissions turn out to be incorrect, the System Admin can analyze the audit logs to see what data has been compromised.
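To make the difference between time-range permissions, restricted directories, and hidden directories concrete, here is a minimal sketch. It is not how CASFS implements its ACLs; the `Grant` class, `can_access`, `visible_entries`, and the example paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical ACL entry: one user, one path, an optional validity window."""
    user: str
    path: str
    start: Optional[datetime] = None  # None means no lower bound
    end: Optional[datetime] = None    # None means no upper bound

    def active_at(self, when: datetime) -> bool:
        after_start = self.start is None or when >= self.start
        before_end = self.end is None or when < self.end
        return after_start and before_end


def can_access(grants, user, path, when):
    """True if any grant for this user and path is active at `when`."""
    return any(
        g.user == user and g.path == path and g.active_at(when) for g in grants
    )


def visible_entries(entries, hidden, grants, user, when):
    """List entries, dropping hidden ones the user has no active grant for.

    A restricted directory would still appear here (and fail on open);
    a hidden one is removed from the listing altogether.
    """
    return [
        e for e in entries
        if e not in hidden or can_access(grants, user, e, when)
    ]
```

With a single grant for `user1` on `/data/core` covering January 2021, `user1` sees and can open the directory during January, while any other user listing the parent directory does not see `/data/core` at all.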
Analyses that can be run on the CASFS Audit Logs include finding the most-used datasets within a specific directory or subdirectory and how frequently each file was accessed. Other examples include finding the datasets to which users were denied access and how many times that happened per user, the datasets accessed by a given user, and the peak usage times for specific files.
This audit log research gave us more information for developing our predictive file-caching technology, which learns the habits of users and caches frequently used files on local machines. For example, if user1 accesses the same file every morning at 9 am, the CASFS file-caching technology would learn this habit and have the file downloaded to their local machine by 9 am. This feature would save the end user a substantial amount of time because they would rarely have to wait for files to download.
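The analyses listed above are simple aggregations over log records. The sketch below uses only the Python standard library and a handful of hand-made records; the field names (`time`, `user`, `dataset`, `denied`) and the values are assumptions for illustration, not the actual CASFS log schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, hand-made records; real CASFS audit logs carry more fields.
records = [
    {"time": datetime(2021, 3, 1, 9, 5), "user": "user1", "dataset": "prices", "denied": False},
    {"time": datetime(2021, 3, 1, 9, 40), "user": "user2", "dataset": "prices", "denied": False},
    {"time": datetime(2021, 3, 1, 14, 10), "user": "user1", "dataset": "trades", "denied": False},
    {"time": datetime(2021, 3, 2, 9, 15), "user": "user1", "dataset": "prices", "denied": False},
    {"time": datetime(2021, 3, 2, 10, 0), "user": "user3", "dataset": "trades", "denied": True},
    {"time": datetime(2021, 3, 2, 10, 1), "user": "user3", "dataset": "trades", "denied": True},
]


def most_accessed(records, top=3):
    """Datasets ranked by number of successful accesses."""
    counts = Counter(r["dataset"] for r in records if not r["denied"])
    return counts.most_common(top)


def denials_per_user(records):
    """Number of denied attempts per (user, dataset) pair."""
    return Counter((r["user"], r["dataset"]) for r in records if r["denied"])


def peak_hour(records, dataset):
    """Hour of day with the most successful accesses to a dataset, or None."""
    hours = Counter(
        r["time"].hour
        for r in records
        if r["dataset"] == dataset and not r["denied"]
    )
    return hours.most_common(1)[0][0] if hours else None
```

The same groupings map directly onto `groupby` calls in Pandas once the records are loaded into a DataFrame.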
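As a rough illustration of the kind of habit such a cache could learn from audit logs, the sketch below finds files that a user opens during the same hour on several distinct days. It is a toy heuristic, not the CASFS caching algorithm; the function name, the `min_days` threshold, and the event format are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def habitual_accesses(events, min_days=3):
    """Return {(user, path): hour} for files opened in the same hour
    of the day on at least `min_days` distinct days.

    `events` is an iterable of (user, path, timestamp) tuples.
    """
    days_seen = defaultdict(set)
    for user, path, ts in events:
        days_seen[(user, path, ts.hour)].add(ts.date())

    best = {}  # (user, path) -> (hour, number of distinct days)
    for (user, path, hour), days in days_seen.items():
        if len(days) < min_days:
            continue
        current = best.get((user, path))
        if current is None or len(days) > current[1]:
            best[(user, path)] = (hour, len(days))
    return {key: hour for key, (hour, _) in best.items()}
```

A prefetcher could then schedule a download of each returned file shortly before its habitual hour.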
While conducting this research, we parsed the data and separated it into two subcategories by applying regular expressions to the files' paths. We used Python's “re” module to implement this regex parser. The parser finds all files in certain subdirectories or in groups of files with similar file paths. We focused specifically on our client's core data subdirectory, where most shared files are located. We also arranged the data by dataset name, file access, user, region, and time. After organizing the data, we used the Pandas Python package to clean it up. Finally, we wrote the data to parquet files.
While processing the data, we realized that a single thread was not efficient enough for the volume of data we had, so we turned to parallel processing. We used the Ray package to process chunks of data in parallel, which made better use of the computing resources we had.
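A path-based classifier of this kind can be sketched with a single compiled pattern and named groups. The directory layout below (`/data/core/<dataset>/<region>/<version>/<file>`) is a made-up example, since the client's real layout is not shown here, and `classify` is a hypothetical helper name.

```python
import re

# Hypothetical layout: /data/core/<dataset>/<region>/<version>/<file>
CORE_PATH = re.compile(
    r"^/data/core/"
    r"(?P<dataset>[^/]+)/"
    r"(?P<region>[^/]+)/"
    r"(?P<version>[^/]+)/"
    r"(?P<file>[^/]+)$"
)


def classify(path):
    """Return ('core', fields) for paths under the core data directory,
    and ('other', {}) for everything else."""
    match = CORE_PATH.match(path)
    if match:
        return "core", match.groupdict()
    return "other", {}
```

The dictionaries returned for core paths can be collected into a Pandas DataFrame for cleaning and then written out with `DataFrame.to_parquet`.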
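The general pattern of processing chunks in parallel with Ray looks roughly like the sketch below. The cleaning function is a trivial stand-in, and the function names and chunk size are assumptions; only `ray.init`, `ray.remote`, and `ray.get` are Ray's own API.

```python
def split_into_chunks(lines, chunk_size):
    """Yield consecutive slices of at most `chunk_size` lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def clean_chunk(lines):
    """Stand-in for the real cleaning step: strip whitespace, drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


def clean_in_parallel(lines, chunk_size=100_000):
    """Clean each chunk as a separate Ray task and return results in input order."""
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(ignore_reinit_error=True)
    remote_clean = ray.remote(clean_chunk)  # wrap the plain function as a Ray task
    futures = [
        remote_clean.remote(chunk)
        for chunk in split_into_chunks(lines, chunk_size)
    ]
    cleaned = []
    for part in ray.get(futures):  # blocks until every task has finished
        cleaned.extend(part)
    return cleaned
```

Because each chunk is cleaned independently, the tasks need no coordination, which is the situation where this kind of parallelism tends to pay off.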
In total, we processed 110 files totaling 149GB, an average of 1.35GB per file. Total processing time was 43 hours, an average of 23.6 minutes per file. For 5 selected files averaging 2.46GB each, the average database upsert time was about 13 minutes.
We then developed the TimescaleDB tables to store the newly processed data. We chose TimescaleDB because it is a high-performance relational database built specifically for time-series data, which is exactly what the audit logs are.
We tried Ray in two areas. The first use case, as mentioned previously, was cleaning and processing the data; the second was uploading the data into the TimescaleDB database. In the first use case, Ray significantly sped up the cleaning process. In the second, the time to upload the data actually increased. We concluded that Ray is best used to process the chunks of data in parallel before the upload, not to perform the upload itself.
We also wrote queries to answer some of our research questions: Who was accessing which datasets, and when? Who was denied access to datasets, how many times, and when? Which files were accessed, by region and version, in 1-hour time bins? After optimization, these queries ran in about 90% less time overall.
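The per-file averages follow directly from the totals, as the quick check below shows. Computing from the rounded 43-hour total gives about 23.5 minutes per file, slightly under the reported 23.6, presumably because the total is itself rounded.

```python
files = 110
total_gb = 149
total_hours = 43

avg_gb_per_file = total_gb / files               # about 1.35 GB
avg_minutes_per_file = total_hours * 60 / files  # about 23.5 minutes
gb_per_hour = total_gb / total_hours             # about 3.5 GB processed per hour
```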
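A table and queries of the kind described above might look like the following sketch, written as SQL strings in Python. The table name, column names, and the `set_up` helper are assumptions for illustration; `create_hypertable` and `time_bucket` are TimescaleDB functions, and the rest is plain PostgreSQL.

```python
# Hypothetical schema; the actual tables used in the research are not shown.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS audit_events (
    ts       TIMESTAMPTZ NOT NULL,
    username TEXT        NOT NULL,
    dataset  TEXT        NOT NULL,
    region   TEXT,
    version  TEXT,
    path     TEXT        NOT NULL,
    denied   BOOLEAN     NOT NULL DEFAULT FALSE
);
"""

# Turn the plain table into a hypertable partitioned on the time column.
MAKE_HYPERTABLE = "SELECT create_hypertable('audit_events', 'ts');"

# Accesses by region and version in 1-hour time bins.
HOURLY_BY_REGION_VERSION = """
SELECT time_bucket('1 hour', ts) AS bucket, region, version, count(*) AS accesses
FROM audit_events
WHERE NOT denied
GROUP BY bucket, region, version
ORDER BY bucket;
"""

# Who was denied access to which dataset, how often, and when.
DENIALS_PER_USER = """
SELECT username, dataset, count(*) AS denials,
       min(ts) AS first_denied, max(ts) AS last_denied
FROM audit_events
WHERE denied
GROUP BY username, dataset
ORDER BY denials DESC;
"""


def set_up(conn):
    """Create the table and hypertable on a psycopg2-style connection."""
    with conn.cursor() as cur:
        cur.execute(CREATE_TABLE)
        cur.execute(MAKE_HYPERTABLE)
    conn.commit()
```

Bucketing in the database with `time_bucket` keeps the hourly aggregation close to the data instead of pulling raw rows into Python first.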
By improving the integration of Ray and TimescaleDB in the log processing script, we were able to increase log coverage by including more log files, reduce data loss from parsing errors, and reduce data loss from database insert conflicts caused by our previous parallel processing methods. We were also able to compress the DB footprint from 10GB to 3.8GB and reduce query times by 90%. For example, the query for datasets accessed by user and date previously took 5 minutes to run and now takes 0.47 minutes. These are drastic improvements, especially when running these queries on a large volume of log data.
We initially expected that using Ray to insert the data into TimescaleDB in parallel would make things much faster. This research showed that it actually slows things down. Ray worked best when used to process the chunks of data in the preprocessing step.
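One common way to keep parallel or repeated loads from losing rows to insert conflicts is an upsert keyed on the columns that identify an event. The source does not say how the conflicts were resolved, so the sketch below is only an illustration. It uses SQLite from the Python standard library so that it runs anywhere; PostgreSQL and TimescaleDB accept the same `INSERT ... ON CONFLICT ... DO UPDATE` form, with `%s` placeholders in psycopg2 instead of `?`. The table and column names are hypothetical.

```python
import sqlite3


def upsert_events(conn, rows):
    """Insert rows; on a key collision keep one row and refresh its payload."""
    conn.executemany(
        """
        INSERT INTO audit_events (ts, username, path, denied)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (ts, username, path) DO UPDATE SET denied = excluded.denied
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_events (
        ts       TEXT    NOT NULL,
        username TEXT    NOT NULL,
        path     TEXT    NOT NULL,
        denied   INTEGER NOT NULL,
        PRIMARY KEY (ts, username, path)
    )
    """
)

batch = [("2021-03-01T09:05:00", "user1", "/data/core/prices", 0)]
upsert_events(conn, batch)
upsert_events(conn, batch)  # replaying the same batch neither fails nor duplicates

row_count = conn.execute("SELECT count(*) FROM audit_events").fetchone()[0]
```

Making the load idempotent in this way means a chunk that is retried or processed twice leaves the table unchanged instead of raising an error.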