[Avg. reading time: 11 minutes]

Serialization-Deserialization

Serialization converts a data structure or object state into a format that can be stored or transmitted (e.g., file, message, or network).

Deserialization is the reverse process, reconstructing the original object from the serialized form.

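For example, objects in Python, Scala, or Rust are serialized to JSON, and the JSON is later deserialized back into objects in Python, Scala, or Rust.

An analogy: translating from Spanish into English (the universal language), and then from English into German.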
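The round trip described above can be sketched with the standard library. The `Person` class and the `serialize`/`deserialize` helpers are illustrative names, not part of any library.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Person:
    name: str
    age: int
    city: str


def serialize(person: Person) -> str:
    """Turn the in-memory object into a JSON string."""
    return json.dumps(asdict(person))


def deserialize(payload: str) -> Person:
    """Rebuild a Person object from the JSON string."""
    return Person(**json.loads(payload))


original = Person(name="Alice", age=25, city="New York")
wire_format = serialize(original)     # language-neutral text
restored = deserialize(wire_format)   # back to a Python object

print(wire_format)
print(restored == original)
```

The JSON string in the middle plays the role of the universal language: any other program that understands JSON could read it, whatever language it is written in.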

JSON

JavaScript Object Notation (JSON) is a lightweight, human-readable, and machine-parsable text format.

Pros

  • Easy to read and debug.
  • Supported by almost all programming languages.
  • Ideal for APIs and configuration files.

Cons

  • Text-based → larger size on disk.
  • No native schema enforcement.

import json

# Serialization
data = {"name": "Alice", "age": 25, "city": "New York"}
json_str = json.dumps(data)
print(json_str)

# Deserialization
obj = json.loads(json_str)
print(obj["name"])
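The "no native schema enforcement" point can be seen in a short sketch. `validate_person` is a hypothetical hand-written check, standing in for the schema that JSON itself does not provide.

```python
import json

# Python-specific types are flattened to the closest JSON type.
record = {"id": 1, "tags": ("a", "b"), 42: "int key"}
round_tripped = json.loads(json.dumps(record))

# The tuple comes back as a list and the integer key as a string.
print(round_tripped)

# Nothing stops a producer from sending the wrong type.
bad_payload = '{"name": "Alice", "age": "twenty-five"}'
person = json.loads(bad_payload)  # parses without complaint


def validate_person(obj: dict) -> list:
    """Hypothetical manual check; JSON has no built-in equivalent."""
    errors = []
    if not isinstance(obj.get("name"), str):
        errors.append("name must be a string")
    if not isinstance(obj.get("age"), int):
        errors.append("age must be an integer")
    return errors


print(validate_person(person))
```

Any type checking has to be written by hand like this or delegated to an external validation tool, because the parser only checks that the text is well-formed JSON.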

AVRO

Apache Avro is a binary serialization format designed for efficiency, compactness, and schema evolution.

  • Compact & Efficient: Binary encoding → smaller and faster than JSON.
  • Schema Evolution: Supports backward/forward compatibility.
  • Rich Data Types: Handles nested, array, map, union types.
  • Language Independent: Works across Python, Java, Scala, Rust, etc.
  • Big Data Integration: Works seamlessly with Hadoop, Kafka, Spark.
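  • Self-Describing: Schema travels with the data. Avro data files store the writer's schema in the file header (systems such as Kafka often keep it in a schema registry instead).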
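Schema evolution can be illustrated with a simplified sketch. It works on already-decoded records (plain dicts) and mimics the idea behind Avro's schema resolution: a field that exists only in the reader's schema is filled from its default, and a field that exists only in the writer's schema is ignored. The `resolve` function and both schemas are illustrative assumptions, not the Avro library API.

```python
# Schema used when the data was written (older version).
WRITER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# Schema used by the reader (newer version):
# adds "email" with a default and no longer reads "age".
READER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = record[name]       # present in both schemas
        elif "default" in field:
            result[name] = field["default"]   # missing in old data: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result                             # writer-only fields are dropped


old_record = {"firstName": "Alice", "age": 25}
print(resolve(old_record, WRITER_SCHEMA, READER_SCHEMA))
```

This is why new fields are usually added with a default value: without one, a reader on the new schema cannot make sense of data written with the old schema.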

Schemas

An Avro schema defines the structure of the Avro data format. It’s a JSON document that describes your data types and protocols, ensuring that even complex data structures are adequately represented. The schema is crucial for data serialization and deserialization, allowing systems to interpret the data correctly.

Example of Avro Schema

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
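To see why the schema matters for both size and interpretation, here is a hand-rolled sketch of how a `Person` record matching the schema above is laid out in Avro's binary encoding: integers are zig-zag varints, strings are a length followed by UTF-8 bytes, and a union is the branch index followed by the value. It is written for illustration only and handles just this one schema; in practice an Avro library would do this. The function names are assumptions.

```python
import json


def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode an integer (Avro int/long)."""
    z = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes give small codes
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length prefix (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(p: dict) -> bytes:
    """Encode the fields in schema order: firstName, lastName, age, email."""
    out = encode_string(p["firstName"])
    out += encode_string(p["lastName"])
    out += encode_long(p["age"])
    if p["email"] is None:
        out += encode_long(0)                              # union branch 0: null
    else:
        out += encode_long(1) + encode_string(p["email"])  # union branch 1: string
    return out


person = {"firstName": "Alice", "lastName": "Smith", "age": 25, "email": None}
avro_bytes = encode_person(person)
json_bytes = json.dumps(person).encode("utf-8")

print(avro_bytes)
print(len(avro_bytes), len(json_bytes))
```

The binary payload carries no field names or type markers, only values in schema order. That keeps it compact, and it is also the reason a reader cannot interpret the bytes without the schema.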

Avro supports the following primitive and complex data types:

  • Primitive: null, boolean, int, long, float, double, bytes, string
  • Complex: records, enums, arrays, maps, unions, fixed

JSON vs Avro

| Feature | JSON | Avro |
|---|---|---|
| Format Type | Text-based (human-readable) | Binary (machine-efficient) |
| Size | Larger (verbose) | Smaller (compact) |
| Speed | Slower to serialize/deserialize | Much faster (binary encoding) |
| Schema | Optional / loosely defined | Mandatory and embedded with data |
| Schema Evolution | Not supported | Fully supported (backward & forward compatible) |
| Data Types | Basic (string, number, bool, array, object) | Rich (records, enums, arrays, maps, unions, fixed) |
| Readability | Human-friendly | Not human-readable |
| Integration | Common in APIs, configs | Common in Big Data (Kafka, Spark) |
| Use Case | Simple data exchange (REST APIs) | High-performance data pipelines, streaming systems |

In short,

  • Use JSON when simplicity & readability matter.
  • Use Avro when performance, compactness, and schema evolution matter (especially in Big Data systems).

Example code:

git clone https://github.com/gchandra10/python_serialization_deserialization_examples.git

Parquet vs Avro

| Feature | Avro | Parquet |
|---|---|---|
| Format Type | Row-based binary format | Columnar binary format |
| Best For | Streaming, message passing, row-oriented reads/writes | Analytics, queries, column-oriented reads |
| Compression | Moderate (row blocks) | Very high (per column) |
| Read Pattern | Reads entire rows | Reads only required columns → faster for queries |
| Write Pattern | Fast row inserts / appends | Best for batch writes (not streaming-friendly) |
| Schema | Embedded JSON schema, supports evolution | Embedded schema, supports evolution (with constraints) |
| Data Evolution | Flexible backward/forward compatibility | Supported, but limited (column addition/removal) |
| Use Case | Kafka, Spark streaming, data ingestion pipelines | Data warehouses, lakehouse tables, analytics queries |
| Integration | Hadoop, Kafka, Spark, Hive | Spark, Hive, Trino, Databricks, Snowflake |
| Readability | Not human-readable | Not human-readable |
| Typical File Extension | .avro | .parquet |
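The difference between row-based and columnar layouts can be sketched with plain Python structures. This only models how the values are grouped; it does not read or write real Avro or Parquet files, and the sample data is made up.

```python
# The same three records, used to build both layouts.
rows = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 31, "city": "Boston"},
    {"name": "Cara", "age": 47, "city": "Chicago"},
]

# Row-oriented (Avro-like): one complete record after another.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented (Parquet-like): all values of a field stored together.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

# Appending a record is a single write in the row layout...
new_record = ("Dan", 52, "Denver")
row_layout.append(new_record)
# ...but touches every column in the columnar layout.
for key, value in zip(column_layout, new_record):
    column_layout[key].append(value)

# An analytic query over one field reads a single list in the columnar
# layout, while the row layout has to visit every full record.
avg_from_columns = sum(column_layout["age"]) / len(column_layout["age"])
avg_from_rows = sum(rec[1] for rec in row_layout) / len(row_layout)

print(avg_from_columns, avg_from_rows)
```

Both layouts give the same answer. What differs is how much data has to be touched: the columnar query never looks at names or cities, while the row append never has to update more than one place.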

#serialization #deserialization #avro

Ver 5.5.3

Last change: 2025-10-15