[Avg. reading time: 7 minutes]

Veracity

Veracity refers to the trustworthiness, quality, and accuracy of data. In the world of Big Data, not all data is created equal: some may be incomplete, inconsistent, outdated, or even deliberately false. The challenge is not just collecting data, but ensuring it is reliable enough to support sound decisions.

Why Veracity Matters

  • Poor data quality can lead to wrong insights, flawed models, and bad business decisions.

  • With a growing number of sources (social media, sensors, web scraping), there is more noise than ever.

  • Real-world data often comes with missing values, duplicates, biases, or outliers.

Key Dimensions of Veracity in Big Data

| Dimension | Description | Example |
|---|---|---|
| Trustworthiness | Confidence in the accuracy and authenticity of data. | Verifying customer feedback vs. bot reviews |
| Origin | The source of the data and its lineage or traceability. | Knowing whether weather data comes from a reliable source |
| Completeness | Whether the dataset has all required fields and values. | Missing values in patient health records |
| Integrity | Ensuring the data hasn’t been altered, corrupted, or tampered with during storage or transfer. | Using checksums to validate data blocks |
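
The integrity example (checksums on data blocks) can be sketched in a few lines of Python. This is a minimal illustration using the standard library's `hashlib`; the helper names `sha256_of` and `is_intact` and the sample block are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def sha256_of(block: bytes) -> str:
    """Return the SHA-256 hex digest of a block of bytes."""
    return hashlib.sha256(block).hexdigest()


def is_intact(block: bytes, expected_digest: str) -> bool:
    """Check a received block against the digest recorded when it was written."""
    return sha256_of(block) == expected_digest


# Hypothetical data block and the digest stored alongside it at write time.
original = b"patient_id,heart_rate\n1001,72\n1002,88\n"
recorded_digest = sha256_of(original)

# Simulate a value being corrupted during transfer (72 becomes 12).
corrupted = original.replace(b"72", b"12")

print(is_intact(original, recorded_digest))   # True
print(is_intact(corrupted, recorded_digest))  # False
```

A checksum like this detects accidental corruption. Protecting against deliberate tampering would also require keeping the recorded digest somewhere the attacker cannot modify.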

How to Tackle Veracity Issues

  • Data Cleaning: Remove duplicates, correct errors, fill missing values.
  • Validation & Verification: Check consistency across sources.
  • Data Provenance: Track where the data came from and how it was transformed.
  • Bias Detection: Identify and reduce systemic bias in training datasets.
  • Robust Models: Build models that can tolerate and adapt to noisy inputs.
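
The data cleaning step above might look roughly like the following sketch. It uses only the Python standard library. The sensor readings, the `clean` function, and the plausible temperature range are all assumptions made for illustration, and filling gaps with the median is only one of several possible strategies.

```python
from statistics import median

# Hypothetical sensor readings: (sensor_id, timestamp, temperature_c).
raw = [
    ("s1", "2025-01-01T00:00", 21.5),
    ("s1", "2025-01-01T00:00", 21.5),   # exact duplicate
    ("s2", "2025-01-01T00:00", None),   # missing value
    ("s3", "2025-01-01T00:00", 22.1),
    ("s4", "2025-01-01T00:00", 250.0),  # implausible reading
    ("s5", "2025-01-01T00:00", 20.9),
]


def clean(rows, low=-50.0, high=60.0):
    """Deduplicate, treat out-of-range values as missing, fill gaps with the median."""
    # 1. Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))

    # 2. Treat values outside the assumed plausible range as missing.
    checked = [
        (sid, ts, t if t is not None and low <= t <= high else None)
        for sid, ts, t in unique
    ]

    # 3. Fill missing values with the median of the remaining valid readings.
    #    (median() raises StatisticsError if no valid reading is left.)
    valid = [t for _, _, t in checked if t is not None]
    fill = median(valid)
    return [(sid, ts, t if t is not None else fill) for sid, ts, t in checked]


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Whether to overwrite an outlier, drop the row, or only flag it depends on the use case. Silent replacement is convenient for a demo but hides information that provenance tracking would normally preserve.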

Websites & Tools to Generate Sample Data

Mockaroo: highly customizable fake data generator; supports exporting as CSV, JSON, and SQL. https://mockaroo.com

Online Data Generator: easy UI for creating datasets with custom fields such as names, dates, and numbers. https://www.onlinedatagenerator.com

Apart from these, there are a few data-generation libraries, such as Faker (Python) and dbldatagen (Databricks Labs):

https://faker.readthedocs.io/en/master/

https://github.com/databrickslabs/dbldatagen
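
To give a feel for what such generators do, here is a small standard-library sketch that produces fake customer rows and deliberately injects veracity problems (missing values and duplicates). It does not use the Faker or dbldatagen APIs; the function `make_customers`, the field names, and the rates are all hypothetical.

```python
import random

FIRST_NAMES = ["Ana", "Ben", "Chen", "Dina", "Eli"]  # hypothetical values
CITIES = ["Austin", "Boston", "Chicago"]             # hypothetical values


def make_customers(n, missing_rate=0.2, duplicate_rate=0.1, seed=42):
    """Generate n fake customer rows, injecting missing ages and duplicate rows."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for i in range(n):
        row = {
            "id": i + 1,
            "name": rng.choice(FIRST_NAMES),
            "city": rng.choice(CITIES),
            "age": rng.randint(18, 90),
        }
        # Blank out the age for a fraction of rows to mimic incomplete records.
        if rng.random() < missing_rate:
            row["age"] = None
        rows.append(row)
        # Occasionally repeat a row to mimic duplicate ingestion.
        if rng.random() < duplicate_rate:
            rows.append(dict(row))
    return rows


customers = make_customers(100)
missing = sum(1 for r in customers if r["age"] is None)
print(len(customers), "rows,", missing, "with missing age")
```

Because the defects are injected on purpose, a dataset like this can be used to check whether a cleaning pipeline actually catches them.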

Questions

Is generating fake data good or bad?

When we have real data, why generate fake data?

#bigv #veracity #bigdata

Ver 5.5.3

Last change: 2025-10-15