Power of Parquet in Data Analytic Querying
In modern applications, there is a constant management of data containing variety, volume, and velocity that require advanced digital solutions. As the volume of our data is gradually increasing, we find worth in taking the steps to analyze and optimize the performance of our systems, ensuring scalability. With our cloud-based architecture, we look to use the scalability of cloud computing to shorten development cycles and processing time. To reduce overall data storage costs and data processing times; we’ve turned to the uncommon, open source, non-proprietary file type Parquet, and read these files using AWS Athena to enhance our system’s performance.
CSV vs Parquet
When storing data in text files, most preferred file formats are XML, JSON, and most popular CSV. A CSV (comma-separated values) file is a row-based file format that is a delimited text file which separates each data record line-by-line, and separates its variables with commas in the file. Parquet, aswell, is a data file format designed to store, retrieve and handle complex data in bulk. Unlike the common file formats like CSV, Parquet is a column-oriented file format; where each row group has a header and footer for each column, then stored together in row groups. This causes the values to be stored in a column together rather than in a row, with metadata included in the data schema to optimize locating data. This format permits columns that are not accessed in a query to be skipped -enabling much faster processing of data- whereas queried CSV files go through all the rows and columns unnecessarily.
With this design, Parquet files can begin with a simple schema, adding more columns, even with different but mutually compatible schemas. Thus, it is able to support reading of different files consisting of shared columns at the same time and not run into errors. Sometimes the nested data can also be another JSON file stored inside a struct!
Why is schema evolution easier with parquet?
Parquet also supports easier and more efficient schema evolution. With parquet one can begin with a very basic schema and keep adding more columns. In CSV files the columns are grouped together in rows, so it becomes very tedious to add more data whereas with parquet as the columns are independent, more data can be added in an easier way and support more complex data. In Parquet files, as the data evolves, older files with fewer columns and newer files with additional columns can be scanned together with no issues whereas such abilities are lacking in other file types like CSV.
How do Parquet files reduce storage requirements significantly?
As data is stored in a columnar manner, there is only one type of data in a column like only strings, only integers etc. Each column is compressed individually in Parquet, it uses different encoding algorithms for different columns depending on the type of data a column contains, these compressed files take up much less space than other file formats, hence, also saving costs on storage requirements to a great extent.
How does this benefit querying?: OLTP and OLAP
Online transaction processing (OLTP) and online analytical processing (OLAP) are primary processing systems used in data processing and analytics, both with varying objectives. OLTP handles a large number of small transactions and simple queries, as its primary objective is to capture and maintain transaction data, not analyze it like OLAP. This requires OLTP to handle large volumes of data in a row-by-row storage file system; scanning all of the data available that is suitable in cases where a row of data needs to be simultaneously accessed or processed. This allows OLTP to display required business analytics in a matter of seconds to make fast updates initiated by a user; as OLTP is heavily loaded with interactive transactions because of its row-oriented storage design. Unfortunately, along with less efficient compression, standard row-by-row file systems end up being much more expensive for storage. On the other hand, OLAP’s primary objective is data analysis and retrieval in column-oriented format, making it designed to quickly answer analytics queries involving multiple dimensions.
Utilizing the Parquet file’s column-oriented data file format, Parquet files provide efficient data compression and encoding schemes to handle complex data in bulk. Using Parquet for OLAP makes up for the poor compression and expensive storage brought on by OLTP. The benefit of Parquet is that if you want multiple services to use the same data, it is friendly to data that needs to be written back into a data pipeline’s lake. Writing in new data is efficiently done, replacing old column data without having to modify other columns, just as the intended design of OLAP. Unexpectedly increasing data throughput and performance, no matter the complexity of the data type.
How does West Loop Strategy use Parquet?
WLS uses parquet files to make analytic queries using Athena on the output obtained from Amazon Connect. Amazon Connect analytics outputs log files from the contact center and these log files consist of nested data which is in turn converted to parquet format using Amazon Kinesis. The parquet files are stored in an S3 bucket and Athena is used to make SQL queries on these files where the results from these queries are then visualized on a quicksight dashboard. This is also an example of the OLAP system where data is just fetched from the parquet files to get an analysis of the nested data and display it on a dashboard. Parquet was considered an ideal choice for our stack as data processing and data fetching is “2x faster to unload, [consuming] up to 6x less storage in Amazon S3 compared to text formats” as announced by Amazon’s Managed Apache Cassandra Service.