[Avg. reading time: 20 minutes]
Parquet
Parquet is a columnar storage file format optimized for Apache Hadoop and related big data processing frameworks. Originally developed by Twitter and Cloudera, Parquet provides a compact and efficient way of storing large datasets.
Best suited for WORM (Write Once, Read Many) workloads.
Row Storage
For a query like "give me the total T-Shirts sold" or "all customers from the UK", a row store scans the entire dataset, reading every column of every row.
Columnar Storage
A columnar store answers the same query by reading only the columns it needs, skipping the rest.
Terms to Know
Projection: the columns needed by the query.
select product, country, salesamount from sales;
Here the projection is product, country & salesamount.
Predicate: a filter condition that selects rows.
select product, country, salesamount from sales where country='UK';
Here the predicate is where country = 'UK'.
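These two ideas map directly onto how Parquet files are read. A minimal sketch, assuming a local sales.parquet file with the columns above and pandas with the pyarrow engine installed (the file name is illustrative):

import pandas as pd

# Projection: read only the columns the query needs.
# Predicate: the filter is pushed down so whole row groups can be skipped
# using the min/max statistics stored in the file metadata.
df = pd.read_parquet(
    "sales.parquet",                                    # hypothetical file
    columns=["product", "country", "salesamount"],      # projection
    filters=[("country", "==", "UK")],                  # predicate
)
print(df)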
Row Groups in Parquet
- Parquet divides data into row groups, each containing column chunks for all columns.
- A row group is a horizontal partition: each row group can be processed independently.
- Row groups enable parallel processing and make it possible to skip unnecessary data using metadata (see the sketch below).
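A rough sketch of row-group-level access with pyarrow, assuming the sample.parquet file written in the Python example later in this section:

import pyarrow.parquet as pq

pf = pq.ParquetFile("sample.parquet")

print(pf.metadata.num_row_groups)    # number of row groups in the file
print(pf.metadata.num_rows)          # total rows across all row groups

# Each row group can be read on its own, which is what allows engines
# to process row groups in parallel or skip them entirely.
first_group = pf.read_row_group(0)   # pyarrow Table holding just row group 0
print(first_group.num_rows)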
Parquet - Columnar Storage + Row Groups
Parquet File format
Parquet File Format Layout {{footnote: https://parquet.apache.org/docs/file-format/}}
Sample Data
Product | Customer | Country | Date | Sales Amount |
---|---|---|---|---|
Ball | John Doe | USA | 2023-01-01 | 100 |
T-Shirt | John Doe | USA | 2023-01-02 | 200 |
Socks | Jane Doe | UK | 2023-01-03 | 150 |
Socks | Jane Doe | UK | 2023-01-04 | 180 |
T-Shirt | Alex | USA | 2023-01-05 | 120 |
Socks | Alex | USA | 2023-01-06 | 220 |
Data stored inside Parquet
┌──────────────────────────────────────────────┐
│ File Header │
│ ┌────────────────────────────────────────┐ │
│ │ Magic Number: "PAR1" │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ Row Group 1 │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Product │ │
│ │ ├─ Page 1: Ball, T-Shirt, Socks │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Customer │ │
│ │ ├─ Page 1: John Doe, John Doe, Jane Doe│ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Country │ │
│ │ ├─ Page 1: USA, USA, UK │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Date │ │
│ │ ├─ Page 1: 2023-01-01, 2023-01-02, │ │
│ │ 2023-01-03 │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Sales Amount │ │
│ │ ├─ Page 1: 100, 200, 150 │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Row Group Metadata │ │
│ │ ├─ Num Rows: 3 │ │
│ │ ├─ Min/Max per Column: │ │
│ │ • Product: Ball/T-Shirt │ │
│ │ • Customer: Jane Doe/John Doe │ │
│ │ • Country: UK/USA │ │
│ │ • Date: 2023-01-01 to 2023-01-03 │ │
│ │ • Sales Amount: 100 to 200 │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ Row Group 2 │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Product │ │
│ │ ├─ Page 1: Socks, T-Shirt, Socks │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Customer │ │
│ │ ├─ Page 1: Jane Doe, Alex, Alex │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Country │ │
│ │ ├─ Page 1: UK, USA, USA │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Date │ │
│ │ ├─ Page 1: 2023-01-04, 2023-01-05, │ │
│ │ 2023-01-06 │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Column Chunk: Sales Amount │ │
│ │ ├─ Page 1: 180, 120, 220 │ │
│ └────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────┐ │
│ │ Row Group Metadata │ │
│ │ ├─ Num Rows: 3 │ │
│ │ ├─ Min/Max per Column: │ │
│ │ • Product: Socks/T-Shirt │ │
│ │ • Customer: Alex/Jane Doe │ │
│ │ • Country: UK/USA │ │
│ │ • Date: 2023-01-04 to 2023-01-06 │ │
│ │ • Sales Amount: 120 to 220 │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ File Metadata │
│ ┌────────────────────────────────────────┐ │
│ │ Schema: │ │
│ │ • Product: string │ │
│ │ • Customer: string │ │
│ │ • Country: string │ │
│ │ • Date: date │ │
│ │ • Sales Amount: double │ │
│ ├────────────────────────────────────────┤ │
│ │ Compression Codec: Snappy │ │
│ ├────────────────────────────────────────┤ │
│ │ Num Row Groups: 2 │ │
│ ├────────────────────────────────────────┤ │
│ │ Offsets to Row Groups │ │
│ │ • Row Group 1: offset 128 │ │
│ │ • Row Group 2: offset 1024 │ │
│ └────────────────────────────────────────┘ │
├──────────────────────────────────────────────┤
│ File Footer │
│ ┌────────────────────────────────────────┐ │
│ │ Offset to File Metadata: 2048 │ │
│ │ Magic Number: "PAR1" │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
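The row group metadata shown above (row counts and per-column min/max values) can be inspected programmatically. A sketch with pyarrow, again assuming the sample.parquet file from the Python example below:

import pyarrow.parquet as pq

meta = pq.ParquetFile("sample.parquet").metadata

# Walk every column chunk of every row group and print its statistics.
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        stats = chunk.statistics
        if stats is not None:
            print(rg, chunk.path_in_schema, stats.min, stats.max)

A query engine compares a predicate such as country = 'UK' against these min/max values and skips any row group that cannot contain matching rows.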
PAR1 - a 4-byte magic number written at both the beginning and the end of the file, identifying it as a Parquet file.
Compression Codec - the compression algorithm applied to the column data (e.g., Snappy or Gzip).
Snappy
- Low CPU utilization
- Lower compression ratio
- Splittable
- Use case: hot layer, compute-intensive workloads
GZip
- High CPU utilization
- Higher compression ratio
- Splittable
- Use case: cold layer, storage-intensive workloads
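To make the trade-off concrete, here is a minimal sketch (reusing the sales_100.csv file from the Python example below; assumes pandas and pyarrow are installed) that writes the same data with both codecs and compares file sizes:

import os
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv")

df.to_parquet("sales_snappy.parquet", compression="snappy")   # fast, larger file
df.to_parquet("sales_gzip.parquet", compression="gzip")       # slower, smaller file

for path in ("sales_snappy.parquet", "sales_gzip.parquet"):
    print(path, os.path.getsize(path), "bytes")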
Encoding
Encoding is the process of converting data into a different format to:
- Save space (compression)
- Enable efficient processing
- Support interoperability between systems
Think of it as packing clothes and necessities loosely into a suitcase versus organizing them into separate compartments for easier retrieval.
Plain Encoding
- Stores raw values as-is, one after another within each column chunk.
- Default for columns that don't compress well or have high cardinality (too many unique values, e.g., IDs or email addresses). Example: Sales Amount.
Dictionary Encoding
- Stores a dictionary of unique values and then stores references (indexes) to those values in the data pages.
- Great for columns with repeated values.
Example:
- 0: Ball
- 1: T-Shirt
- 2: Socks
- Data Page: [0,1,2,2,1,2]
Reduces storage for repetitive values like "Socks".
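A toy illustration of the idea in plain Python (not Parquet's actual on-disk layout):

# Dictionary-encode the Product column from the sample data.
products = ["Ball", "T-Shirt", "Socks", "Socks", "T-Shirt", "Socks"]

dictionary = []               # unique values, in order of first appearance
indexes = []                  # one small integer per row instead of a full string
for value in products:
    if value not in dictionary:
        dictionary.append(value)
    indexes.append(dictionary.index(value))

print(dictionary)             # ['Ball', 'T-Shirt', 'Socks']
print(indexes)                # [0, 1, 2, 2, 1, 2]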
Run-Length Encoding (RLE)
- Compresses consecutive repeated values into a count + value pair.
- Ideal when the data is sorted or has runs of the same value.
Example:
If Country column was sorted: [USA, USA, USA, UK, UK, UK]
RLE: [(3, USA), (3, UK)]
- Efficient storage for sorted or grouped data.
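The same idea as a short plain-Python sketch:

from itertools import groupby

# Run-length encode a sorted Country column.
countries = ["USA", "USA", "USA", "UK", "UK", "UK"]

runs = [(len(list(group)), value) for value, group in groupby(countries)]
print(runs)                   # [(3, 'USA'), (3, 'UK')]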
Delta Encoding
- Stores the difference between consecutive values.
- Best for numeric columns with increasing or sorted values (like dates).
Example:
Date column: [2023-01-01, 2023-01-02, 2023-01-03, ...]
Delta Encoding: [2023-01-01, +1, +1, +1, ...]
- Very compact for sequential data.
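A plain-Python sketch of the same idea, storing the first date and then only the day-to-day differences:

from datetime import date

dates = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3), date(2023, 1, 4)]

# Keep the first value, then store each value as a delta from its predecessor.
deltas = [(b - a).days for a, b in zip(dates, dates[1:])]
print(dates[0], deltas)       # 2023-01-01 [1, 1, 1]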
Bit Packing
- Packs small integers using only the bits needed rather than a full byte.
- Often used with dictionary-encoded indexes.
Example:
Dictionary indexes for Product: [0,1,2,2,1,2]
Needs only 2 bits to represent values (00, 01, 10).
Saves space vs. storing full integers.
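A plain-Python sketch of packing those 2-bit indexes into bytes:

# Pack the dictionary indexes [0, 1, 2, 2, 1, 2] using 2 bits per value,
# so four values fit into a single byte instead of one byte (or more) each.
indexes = [0, 1, 2, 2, 1, 2]
bits_per_value = 2

packed = bytearray()
current, filled = 0, 0
for value in indexes:
    current |= value << filled           # place the next 2-bit value
    filled += bits_per_value
    if filled == 8:                      # the byte is full, flush it
        packed.append(current)
        current, filled = 0, 0
if filled:                               # flush the final partial byte
    packed.append(current)

print(len(indexes), "values packed into", len(packed), "bytes")   # 6 values -> 2 bytes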
Key Features of Parquet
Columnar Storage
Schema Evolution
- Supports complex nested data structures (arrays, maps, structs).
- Allows the schema to evolve over time, making it highly flexible for changing data models.
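A minimal sketch of schema evolution with pyarrow, using two hypothetical files where the newer one adds a Discount column; rows from the older file come back with nulls for the new column:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two versions of the same table; v2 adds a Discount column.
pq.write_table(pa.table({"Product": ["Ball"], "SalesAmount": [100.0]}), "sales_v1.parquet")
pq.write_table(pa.table({"Product": ["Socks"], "SalesAmount": [150.0], "Discount": [0.1]}), "sales_v2.parquet")

# Unify the two schemas and read both files as one dataset.
schema = pa.unify_schemas([pq.read_schema("sales_v1.parquet"), pq.read_schema("sales_v2.parquet")])
table = ds.dataset(["sales_v1.parquet", "sales_v2.parquet"], schema=schema).to_table()
print(table)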
Compression
- Parquet allows the use of highly efficient compression algorithms like Snappy and Gzip.
- The columnar layout improves compression by grouping similar data together, leading to significant storage savings.
Various Encodings
Language Agnostic
- Parquet is built from the ground up for cross-language compatibility.
- Official libraries exist for Java, C++, Python, and many other languages—making it easy to integrate with diverse tech stacks.
Seamless Integration
Designed to integrate smoothly with a wide range of big data frameworks, including:
- Apache Hadoop
- Apache Spark
- AWS Glue / Amazon Athena
- ClickHouse
- DuckDB
- Snowflake
- and many more.
Python Example
import pandas as pd
file_path = 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'
# Read the CSV file
df = pd.read_csv(file_path)
# Display the first few rows of the DataFrame
print(df.head())
# Write DataFrame to a Parquet file
df.to_parquet('sample.parquet')
Some utilities to inspect Parquet files
Windows / macOS
https://aloneguid.github.io/parquet-dotnet/parquet-floor.html#installing
macOS
https://github.com/hangxie/parquet-tools
parquet-tools row-count sample.parquet
parquet-tools schema sample.parquet
parquet-tools cat sample.parquet
parquet-tools meta sample.parquet
Remote Files
parquet-tools row-count https://github.com/gchandra10/filestorage/raw/refs/heads/main/sales_onemillion.parquet