Variety
Variety refers to the different types, formats, and sources of data collected — one of the 5 Vs of Big Data.
Types of Data: By Source
- Social Media: YouTube, Facebook, LinkedIn, Twitter, Instagram
- IoT Devices: Sensors, Cameras, Smart Meters, Wearables
- Finance/Markets: Stock Market, Cryptocurrency, Financial APIs
- Smart Systems: Smart Cars, Smart TVs, Home Automation
- Enterprise Systems: ERP, CRM, SCM Logs
- Public Data: Government Open Data, Weather Stations
Types of Data: By Data Format
- Structured Data – Organized in rows and columns (e.g., CSV, Excel, RDBMS)
- Semi-Structured Data – Self-describing but irregular (e.g., JSON, XML, Avro, YAML)
- Unstructured Data – No fixed schema (e.g., images, audio, video, emails)
- Binary Data – Encoded, compressed, or serialized data (e.g., Parquet, Protocol Buffers, images, MP3)
Unstructured data files are generally stored in binary format. Examples: images, video, audio.
However, not all binary files contain unstructured data. Examples: Parquet files, executables.
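One way to see that "binary" and "unstructured" are separate ideas is to look at the first few bytes of a file. Many binary formats begin with a fixed signature ("magic bytes"). The sketch below is only illustrative: the table `MAGIC_SIGNATURES` and the function `sniff_format` are hypothetical names, the list covers just four formats, and the unstructured/not-unstructured labels simply follow the distinction drawn above.

```python
# A few well-known leading byte signatures.
# The third field says whether the payload is usually treated as unstructured data.
MAGIC_SIGNATURES = [
    (b"PAR1", "Parquet", False),               # columnar, tabular data
    (b"\x7fELF", "ELF executable", False),     # program code, not free-form content
    (b"\x89PNG\r\n\x1a\n", "PNG image", True),
    (b"\xff\xd8\xff", "JPEG image", True),
]


def sniff_format(header: bytes):
    """Return (format_name, is_unstructured) for a known signature, else None."""
    for magic, name, is_unstructured in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return name, is_unstructured
    return None


# Usage: pass the first bytes of a file, e.g. open(path, "rb").read(16)
print(sniff_format(b"PAR1" + b"\x00" * 12))
print(sniff_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
```

Both inputs are binary, yet only the PNG header is reported as unstructured.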
Structured Data
Tabular data from databases and spreadsheets.
Examples:
- Relational Table
- Excel
ID | Name | Join Date |
---|---|---|
101 | Rachel Green | 2020-05-01 |
201 | Joey Tribbiani | 1998-07-05 |
301 | Monica Geller | 1999-12-14 |
401 | Cosmo Kramer | 2001-06-05 |
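Because every row has the same columns, structured data can be loaded with a fixed type per column. The following is a minimal sketch using Python's standard `csv` module on two rows from the table above, written out as CSV. The names `CSV_TEXT` and `load_rows` are assumptions made for illustration.

```python
import csv
import io
from datetime import date

# Two rows from the table above, expressed as CSV text.
CSV_TEXT = """\
ID,Name,Join Date
101,Rachel Green,2020-05-01
301,Monica Geller,1999-12-14
"""


def load_rows(text):
    """Parse CSV text and convert each column to its expected type."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "id": int(row["ID"]),
            "name": row["Name"],
            "join_date": date.fromisoformat(row["Join Date"]),
        }
        for row in reader
    ]


for row in load_rows(CSV_TEXT):
    print(row["id"], row["name"], row["join_date"].year)
```

The conversion works without any per-row checks precisely because the schema is fixed in advance.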
Semi-Structured Data
Data with tags or markers but not strictly tabular.
JSON
[
{
"id":1,
"name":"Rachel Green",
"gender":"F",
"series":"Friends"
},
{
"id":2,
"name":"Sheldon Cooper",
"gender":"M",
"series":"BBT"
}
]
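Semi-structured records describe themselves through their keys, but nothing forces every record to carry the same keys. The sketch below assumes a hypothetical variant of the JSON above in which the second record has no "gender" field, and flattens the records into rows with Python's standard `json` module. `RAW`, `COLUMNS`, and `to_rows` are illustrative names.

```python
import json

# Hypothetical variant of the sample: the second record omits "gender".
RAW = """
[
  {"id": 1, "name": "Rachel Green", "gender": "F", "series": "Friends"},
  {"id": 2, "name": "Sheldon Cooper", "series": "BBT"}
]
"""

COLUMNS = ["id", "name", "gender", "series"]


def to_rows(raw, columns):
    """Flatten JSON records into rows, using None for any missing key."""
    records = json.loads(raw)
    return [[record.get(col) for col in columns] for record in records]


for row in to_rows(RAW, COLUMNS):
    print(row)
```

Using `dict.get` rather than indexing is the design choice that tolerates the irregularity: a missing key becomes `None` instead of raising `KeyError`.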
XML
<?xml version="1.0" encoding="UTF-8"?>
<actors>
<actor>
<id>1</id>
<name>Rachel Green</name>
<gender>F</gender>
<series>Friends</series>
</actor>
<actor>
<id>2</id>
<name>Sheldon Cooper</name>
<gender>M</gender>
<series>BBT</series>
</actor>
</actors>
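The same records can be read from XML by walking the tags. This is a small sketch using the standard library's `xml.etree.ElementTree`. The function name `parse_actors` is an assumption, and the approach only suits flat elements like the ones shown here.

```python
import xml.etree.ElementTree as ET

# The XML document from above, as bytes so the encoding declaration is honored.
XML_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<actors>
  <actor>
    <id>1</id>
    <name>Rachel Green</name>
    <gender>F</gender>
    <series>Friends</series>
  </actor>
  <actor>
    <id>2</id>
    <name>Sheldon Cooper</name>
    <gender>M</gender>
    <series>BBT</series>
  </actor>
</actors>
"""


def parse_actors(xml_bytes):
    """Turn each <actor> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_bytes)
    return [
        {child.tag: child.text for child in actor}
        for actor in root.findall("actor")
    ]


for actor in parse_actors(XML_BYTES):
    print(actor)
```

Note that every value comes back as a string, including `id`: XML text carries no type information, so any conversion to numbers or dates has to be done by the reader.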
Unstructured Data
Media files, free text, documents, logs – no predefined structure.
Rachel Green acted in the Friends series. Her role is very popular.
Similarly, Sheldon Cooper acted in BBT. He played a nerdy physicist.
Types:
- Images (JPG, PNG)
- Video (MP4, AVI)
- Audio (MP3, WAV)
- Documents (PDF, DOCX)
- Emails
- Logs (system logs, server logs)
- Web scraping content (HTML, raw text)
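With free text, any structure has to be imposed by whoever reads the data. As a rough illustration, the sketch below pulls name and series pairs out of sentences like the sample above with a regular expression. The pattern is a hypothetical one that only recognizes the phrasing "First Last acted in Series", so real text would need far more robust handling.

```python
import re

# Free text similar to the sample above.
TEXT = (
    "Rachel Green acted in Friends series. Her role is very popular. "
    "Similarly Sheldon Cooper acted in BBT. He acted as nerd physicist."
)

# Hypothetical pattern: "<First Last> acted in [the ]<Series>".
PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) acted in (?:the )?([A-Z]\w*)")


def extract_rows(text):
    """Return one dict per sentence fragment that matches the pattern."""
    return [{"name": name, "series": series} for name, series in PATTERN.findall(text)]


for row in extract_rows(TEXT):
    print(row)
```

Everything the pattern does not anticipate (the popularity remark, the physicist detail) is silently dropped, which is the core difficulty of working with unstructured data.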
Note: Many LLM-based AI tools are now available that can help parse unstructured data into tabular data quickly.