Machine learning has been made possible partly by the accumulation of data, and an important step in working with that data is data validation. Whether it is a data warehouse, database, or data lake migration, all require data validation. It mainly encompasses comparing structured and semi-structured data from the source to the target and verifying that they match correctly after every step in the process.
The Data Validation Tool by Google
Recognizing the importance of data validation, Google recently released the Data Validation Tool (DVT), an open-source Python CLI tool that provides an automated and repeatable solution for data validation. Its developers claim the tool works consistently across different environments. DVT is built on the Ibis framework, which acts as an intermediary between numerous data sources such as BigQuery, Cloud Spanner, and so forth.
Working of the Tool
Data validation is a tiresome task in itself, and cross-platform validation adds to the burden. Many teams have to build and maintain custom solutions to perform cross-platform validation. If DVT succeeds in this role, it can give consumers a standardized way to validate recently migrated data, and it works with data stored in Google Cloud. DVT is also advantageous because it can be integrated into pre-existing systems, saving considerable effort: it plugs into ETL pipelines to provide consumers with seamless, automated validation. The tool works with third-party databases and file systems as well. BigQuery, Cloud SQL, FileSystem (GCS, S3, or local files), Hive, Impala, MySQL, Oracle, Postgres, Snowflake, Spanner, SQL Server, and Teradata can all be connected through DVT.
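The core idea behind this kind of cross-platform check can be sketched in plain Python. The example below uses two in-memory SQLite databases as hypothetical stand-ins for a source and a target system (DVT itself routes such queries through Ibis rather than raw connections):

```python
import sqlite3

def row_count(conn, table):
    """Run a COUNT(*) against one system and return the result."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def validate_counts(source, target, table):
    """Compare the source row count against the target row count."""
    src, tgt = row_count(source, table), row_count(target, table)
    return {"table": table, "source": src, "target": tgt, "match": src == tgt}

# Two in-memory databases stand in for the migrated systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])

result = validate_counts(source, target, "orders")
print(result)  # "match" is True when both sides hold 100 rows
```

The value of a tool like DVT is that this same comparison runs unchanged whether the two ends are SQLite, BigQuery, or Teradata.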
Functions Performed by the Tool
DVT is claimed to perform data validation at multiple levels. These include the following:
- Table level
In this mode, the tool checks table row counts, group-by row counts, column aggregations, and filters.
- Column Level
This validates the schema and the column types.
- Row Level
This currently works only for BigQuery, where rows are compared via a hash.
- Raw SQL Exploration
In this, raw custom queries are run against the different data sources, giving the user full control over the validation.
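The row-level hash comparison can be illustrated with a short sketch. This is a hypothetical simplification (DVT computes the hashes inside BigQuery itself rather than in client-side Python): each row's column values are concatenated and hashed, and mismatching hashes flag drifted rows.

```python
import hashlib

def row_hash(row):
    """Hash the concatenated column values of one row."""
    joined = "|".join(str(v) for v in row)
    return hashlib.sha256(joined.encode()).hexdigest()

def compare_rows(source_rows, target_rows, key_index=0):
    """Return the keys whose row hashes differ between source and target."""
    src = {row[key_index]: row_hash(row) for row in source_rows}
    tgt = {row[key_index]: row_hash(row) for row in target_rows}
    return [k for k in src if src[k] != tgt.get(k)]

source = [(1, "alice", 10.0), (2, "bob", 20.0)]
target = [(1, "alice", 10.0), (2, "bob", 25.0)]  # one drifted value
print(compare_rows(source, target))  # → [2]
```

Comparing fixed-size hashes instead of full rows keeps the amount of data moved between systems small, which is why hash comparison is the usual technique for row-level validation.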
Usage of the DVT
The DVT doesn’t have complex working and is user friendly. The first thing that the user needs to get started with is creating a good connection. This can be done through any of the data sources that have been listed prior to the validation. Now, by default, if the user does not provide an aggregation, it automatically goes to court, which says that the tool would count the number of columns in the source table, then verify and match the count as provided in the target table. The DVT is highly customizable, and different tables and labels can be added to the validations. Moreover, the validation can then be saved to a YAML configuration file. This allows the user to modify the file later as and when required. BigQuery is the result handler for this tool.
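As a rough illustration, a saved validation might look something like the fragment below. The field names and structure here are assumptions for illustration, not a verbatim copy of DVT's YAML schema; connection names and tables are hypothetical, and the exact keys may differ between DVT versions:

```yaml
# Illustrative only: field names are assumptions, not DVT's exact schema.
result_handler:
  type: BigQuery          # results written back to a BigQuery table
source: my_source_conn    # hypothetical connection names
target: my_target_conn
validations:
  - type: Column
    schema_name: my_dataset
    table_name: orders
    aggregates:
      - field_alias: count
        type: count       # the default COUNT validation
```

Keeping validations in version-controlled YAML files is what makes the process repeatable: the same check can be rerun after every migration step.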
The tool has been open-sourced for public use, and its developers are actively improving it and adding new features for even better validation.