Veracity
Veracity refers to the trustworthiness, quality, and accuracy of data. In the world of Big Data, not all data is created equal: some of it may be incomplete, inconsistent, outdated, or even deliberately false. The challenge is not just collecting data, but ensuring it is reliable enough to support sound decisions.
Why Veracity Matters
- Poor data quality can lead to wrong insights, flawed models, and bad business decisions.
- With increasing sources (social media, sensors, web scraping), there’s more noise than ever.
- Real-world data often comes with missing values, duplicates, biases, or outliers.
Key Dimensions of Veracity in Big Data
Dimension | Description | Example |
---|---|---|
Trustworthiness | Confidence in the accuracy and authenticity of data. | Verifying customer feedback vs. bot reviews |
Origin | The source of the data and its lineage or traceability. | Knowing whether weather data comes from a reliable source |
Completeness | Whether the dataset has all required fields and values. | Missing values in patient health records |
Integrity | Ensuring the data hasn’t been altered, corrupted, or tampered with during storage or transfer. | Using checksums to validate data blocks |
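The checksum example in the Integrity row can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 from Python's standard `hashlib` module; the data block, the function names `sha256_checksum` and `is_intact`, and the simulated corruption are all hypothetical.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return the hex SHA-256 digest of a block of bytes."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_checksum: str) -> bool:
    """True when the block still matches the checksum recorded at the source."""
    return sha256_checksum(data) == expected_checksum


# Hypothetical data block, and the checksum recorded when it was written.
original_block = b"patient_id,heart_rate\n101,72\n102,88\n"
recorded = sha256_checksum(original_block)

# Simulate corruption during transfer: a single digit changes (88 -> 98).
received_block = b"patient_id,heart_rate\n101,72\n102,98\n"

print(is_intact(original_block, recorded))  # True
print(is_intact(received_block, recorded))  # False
```

A checksum only detects that a block changed; it says nothing about whether the original values were accurate, which is why integrity is just one dimension of veracity.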
How to Tackle Veracity Issues
- Data Cleaning: Remove duplicates, correct errors, fill missing values.
- Validation & Verification: Check consistency across sources.
- Data Provenance: Track where the data came from and how it was transformed.
- Bias Detection: Identify and reduce systemic bias in training datasets.
- Robust Models: Build models that can tolerate and adapt to noisy inputs.
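The cleaning and validation steps above can be sketched with plain Python. This is a simplified illustration, not a recommended pipeline: the sensor readings, the plausible temperature range, and the choice to fill gaps with the median are all assumptions made for the example.

```python
from statistics import median

# Hypothetical sensor readings with a duplicate, a missing value, and an outlier.
raw_readings = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": None},   # missing value
    {"sensor_id": "s1", "temp_c": 21.5},   # exact duplicate of the first row
    {"sensor_id": "s3", "temp_c": 22.5},
    {"sensor_id": "s4", "temp_c": 400.0},  # physically implausible
]


def drop_duplicates(rows):
    """Data cleaning: keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def blank_out_of_range(rows, low=-50.0, high=60.0):
    """Validation: treat readings outside an assumed plausible range as missing."""
    checked = []
    for row in rows:
        value = row["temp_c"]
        valid = value is not None and low <= value <= high
        checked.append({**row, "temp_c": value if valid else None})
    return checked


def fill_missing(rows):
    """Data cleaning: replace missing readings with the median of known ones."""
    known = [row["temp_c"] for row in rows if row["temp_c"] is not None]
    fill_value = median(known)
    return [
        {**row, "temp_c": fill_value if row["temp_c"] is None else row["temp_c"]}
        for row in rows
    ]


cleaned = fill_missing(blank_out_of_range(drop_duplicates(raw_readings)))
for row in cleaned:
    print(row)
```

Silently replacing values is itself a veracity decision. In practice it helps to record which values were imputed so that downstream users can tell measured data from filled-in data.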
Websites & Tools to Generate Sample Data
- Mockaroo: Highly customizable fake data generator; supports exporting as CSV, JSON, and SQL. https://mockaroo.com
- Online Data Generator: Easy UI to create datasets with custom fields like names, dates, and numbers. https://www.onlinedatagenerator.com
Apart from these, there are a few data-generation libraries:
- Faker (Python): https://faker.readthedocs.io/en/master/
- dbldatagen (Databricks Labs): https://github.com/databrickslabs/dbldatagen
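The tools and libraries listed above provide far richer generators, but the basic idea can be shown without any of them. The sketch below uses only Python's standard library; the field names, value lists, and `missing_rate` parameter are hypothetical choices for illustration. It deliberately leaves some ages blank so the sample data has a veracity problem to practice on.

```python
import csv
import io
import random


def generate_customers(n, missing_rate=0.1, seed=42):
    """Generate n fake customer rows; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    first_names = ["Ana", "Ben", "Chen", "Dara", "Eli"]
    cities = ["Austin", "Boston", "Chicago", "Denver"]
    rows = []
    for customer_id in range(1, n + 1):
        age = rng.randint(18, 90)
        rows.append({
            "customer_id": customer_id,
            "name": rng.choice(first_names),
            "city": rng.choice(cities),
            # Leave some ages blank to imitate incomplete real-world records.
            "age": "" if rng.random() < missing_rate else age,
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


customers = generate_customers(20)
# Show the header and the first three rows.
print("\n".join(to_csv(customers).splitlines()[:4]))
```

Because the generator is seeded, every run produces the same dataset, which makes it convenient for testing cleaning code against known inputs.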
Questions
Is generating fake data good or bad?
When we have real data, why generate fake data?