[Avg. reading time: 19 minutes]

Data Frames

DataFrames are the core abstraction for tabular data in modern data processing — used across analytics, ML, and ETL workflows.

They provide:

  • Rows and columns like a database table or Excel sheet.
  • Rich APIs to filter, aggregate, join, and transform data.
  • Interoperability with CSV, Parquet, JSON, and Arrow.

Pandas

Pandas is a popular Python library for data manipulation and analysis. A DataFrame in Pandas is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).

Eager Evaluation: Pandas performs operations eagerly, meaning that each operation is executed immediately when called.

In-Memory Copy - Full DataFrame in RAM, single copy

Sequential Processing - Single threaded, one operation at at time.

Pros

  • Easy to use and intuitive syntax.
  • Rich functionality for data manipulation, including filtering, grouping, and merging.
  • Large ecosystem and community support.

Cons

  • Performance issues with very large datasets (limited by memory).
  • Single-threaded operations, making it slower for big data tasks.

Example

import pandas as pd

# Load the CSV file using Pandas
df = pd.read_csv('data/sales_100.csv')

# Display the first few rows
print(df.head())

Polars

Polars is a fast, multi-threaded DataFrame library in Rust and Python, designed for performance and scalability. It is known for its efficient handling of larger-than-memory datasets.

Supports both eager and lazy evaluation.

Lazy Evaluation: Instead of loading the entire CSV file into memory right away, a Lazy DataFrame builds a blueprint or execution plan describing how the data should be read and processed. The actual data is loaded only when the computation is triggered (for example, when you call a collect or execute command).

Optimizations: Using scan_csv allows Polars to optimize the entire query pipeline before loading any data. This approach is beneficial for large datasets because it minimizes memory usage and improves execution efficiency.

  • pl.read_csv() or pl.read_parquet() - eager evaluation
  • pl.scan_csv() or pl.scan_parquet() - lazy evaluation

Parallel Execution: Multi-threaded compute.

Columnar efficiency: Uses Arrow columnar memory format under the hood.

Pros

  • High performance due to multi-threading and memory-efficient execution.
  • Lazy evaluation, optimizing the execution of queries.
  • Handles larger datasets effectively.

Cons

  • Smaller community and ecosystem compared to Pandas.
  • Less mature with fewer third-party integrations.

Example

import polars as pl

# Load the CSV file using Polars
df = pl.scan_csv('data/sales_100.csv')

print(df.head())

# Display the first few rows
print(df.collect())

df1 = pl.read_csv('data/sales_100.csv')
print(df1.head())

Dask

Dask is a parallel computing library that scales Python libraries like Pandas for large, distributed datasets.

Client (Python Code)
   │
   ▼
Scheduler (builds + manages task graph)
   │
   ▼
Workers (execute tasks in parallel)
   │
   ▼
Results gathered back to client

Open Source https://docs.dask.org/en/stable/install.html

Dask Cloud Coiled Cloud

Lazy Reading: Dask builds a task graph instead of executing immediately — computations run only when triggered (similar to Polars lazy execution).

Partitioning: A Dask DataFrame is split into many smaller Pandas DataFrames (partitions) that can be processed in parallel.

Task Graph: Dask represents your workflow as a directed acyclic graph (DAG) showing the sequence and dependencies of tasks.

Distributed Compute: Dask executes tasks across multiple cores or machines, enabling scalable, parallel data processing.


import dask.dataframe as dd

ddf = dd.read_csv(
    "data/sales_*.csv",
    dtype={"category": "string", "value": "float64"},
    blocksize="64MB"
)

# 2) Lazy transform: per-partition groupby + sum, then global combine
agg = ddf.groupby("category")["value"].sum().sort_values(ascending=False)

# 3) Trigger execution and bring small result to driver
result = agg.compute()

print(result.head(10))

blocksize determines the parition. If omitted dask automatically uses 64MB

flowchart LR
  A1[CSV part 1] --> P1[parse p1]
  A2[CSV part 2] --> P2[parse p2]
  A3[CSV part 3] --> P3[parse p3]

  P1 --> G1[local groupby-sum p1]
  P2 --> G2[local groupby-sum p2]
  P3 --> G3[local groupby-sum p3]

  G1 --> C[combine-aggregate]
  G2 --> C
  G3 --> C

  C --> S[sort values]
  S --> R[collect to Pandas]

Pros

  • Can handle datasets that don’t fit into memory by processing in parallel.
  • Scales to multiple cores and clusters, making it suitable for big data tasks.
  • Integrates well with Pandas and other Python libraries.

Cons

  • Slightly more complex API compared to Pandas.
  • Performance tuning can be more challenging.

Where to start?

  • Start with Pandas for learning and small datasets.
  • Switch to Polars when performance matters.
  • Use Dask when data exceeds single-machine memory or needs cluster execution.
git clone https://github.com/gchandra10/python_dataframe_examples.git

Pandas vs Polars vs Dask

FeaturePandasPolarsDask
LanguagePythonRust with Python bindingsPython
Execution ModelSingle-threadedMulti-threadedMulti-threaded, distributed
Data HandlingIn-memoryIn-memory, Arrow-basedIn-memory, out-of-core
ScalabilityLimited by memoryLimited to single machineScales across clusters
PerformanceGood for small to medium dataHigh performance for single machineGood for large datasets
API FamiliarityWidely known, matureSimilar to PandasSimilar to Pandas
Ease of UseVery easy, large ecosystemEasy, but smaller ecosystemModerate, requires understanding of parallelism
Fault ToleranceNoneLimitedHigh, with task retries and rescheduling
Machine LearningIntegration with Python ML libsPreprocessing onlyIntegration with Dask-ML and other libs
Lazy EvaluationNoYesYes, with task graphs
Best ForData analysis, small datasetsFast preprocessing on single machineLarge-scale data processing
Cluster ManagementN/AN/ASupports Kubernetes, YARN, etc.
Use CasesData manipulation, analysisFast data manipulationLarge data, ETL, scaling Python code

#pandas #polars #daskVer 5.5.2

Last change: 2025-10-08