Data Testing Beta
Data Testing is a platform for testing data from various sources. Users can compare two different sources or perform validations on a single source.
There are different types of jobs listed below.
Compare means there are two sources of data. One will be the source and the other will be the target.
Supported data types : boolean/bool, str (string), int, float, double
Note : Users can map the columns and data types in the target. For example, if the target is MySQL with a smallint column and the source is Redshift with a boolean column, the column can be mapped as boolean (a minimal sketch of such a cast is shown below).
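As a minimal sketch of that type mapping (assuming the smallint column stores 0/1), the hypothetical helper below casts a MySQL smallint value so it compares against the Redshift boolean:
    def smallint_to_bool(value):
        # 0 -> False, 1 -> True, so the target smallint compares as a boolean
        return bool(int(value))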
Tests:
Duplicates In Source – Based on the unique key passed it will check for duplicates in source.
Duplicates In Target – Based on the unique key passed it will check for duplicates in target.
Missing In Source – Uses the unique key to find records that are present in the target but absent from the source.
Missing In Target – Uses the unique key to find records that are present in the source but absent from the target (a sketch of this check follows the notes below).
Full Comparison – Joins the two datasets on the unique key and reports every discrepancy between source and target.
Users can refer to the list of columns, or to the table of data types shown beside the mapping, for help.
Note : This feature is not supported for file connectors (S3, SFTP).
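The following minimal sketch illustrates the Missing In Target check using pandas; the DataFrames, column names, and key are illustrative assumptions, not the platform's implementation.
    import pandas as pd

    # Illustrative source and target datasets keyed on tx_id (assumed unique key)
    source = pd.DataFrame({"tx_id": [1, 2, 3], "amount": [10, 20, 30]})
    target = pd.DataFrame({"tx_id": [1, 3], "amount": [10, 30]})

    # Missing In Target: keys present in the source but absent from the target
    missing_in_target = source[~source["tx_id"].isin(target["tx_id"])]
    print(missing_in_target)   # the row with tx_id 2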
Evaluate means performing validations on a single source.
Inputs:
Row count verification – Matches the number of rows present in the data source with the provided value.
Column type verification – Matches the column's data type with the expected data type.
Non null columns verification – Checks the provided columns for null values in any row.
Duplicate columns verification – Checks for duplicate rows across the provided columns.
Regex verification – Checks every row against the provided regular expression (see the sketch after this list).
Examples :
Email Address Validation :
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
US Phone Number Validation : ^\(\d{3}\) \d{3}-\d{4}$
Credit Card Number Validation : ^\d{4} \d{4} \d{4} \d{4}$
User-Function verification – Checks whether custom business logic passes for every row. Both lambda functions and user-defined functions are supported; the function should return True or False (see the sketch after this list).
Note : Only the Python language and built-in packages are supported.
Examples :
1 : Custom function
import re

def validate_complex_data(data):
    # Check if data is a string
    if not isinstance(data, str):
        return False
    # Check length (assuming a hash length of 64 characters)
    if len(data) != 64:
        return False
    # Check if all characters are valid hexadecimal digits
    hex_pattern = re.compile(r'^[0-9a-fA-F]{64}$')
    if not hex_pattern.match(data):
        return False
    return True
2 : Lambda function
lambda x : x < 3000 and x > 1500
Users can get help from a list of columns.
Note : This feature is not supported for file connectors (S3, SFTP).
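The following minimal sketch shows how the Regex and User-Function verifications above could be applied row by row; the sample values and the all() check are illustrative assumptions, not the platform's internal implementation.
    import re

    # Regex verification: every value must match the email pattern from the example above
    email_pattern = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
    emails = ["alice@example.com", "bob@example", "carol@test.org"]   # sample column values
    regex_passed = all(email_pattern.match(value) for value in emails)

    # User-Function verification: every row must return True from the lambda above
    amount_check = lambda x: x < 3000 and x > 1500
    amounts = [1750, 2900, 2100]                                      # sample column values
    function_passed = all(amount_check(value) for value in amounts)

    print(regex_passed, function_passed)   # False True ("bob@example" fails the regex)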
Depending on the connection type, the user needs to provide details such as Host, Database Name, Port, Username, Password, Query, Access Key, Secret Key, and Primary Key.
Source Name : A unique identifier for the API connection.
Request Type : Select the type of request (e.g., GET, POST, PUT, DELETE).
URL : The endpoint to which the request is sent.
Parameters : Users can add key-value pairs for query parameters.
Headers : Additional metadata sent with the request.
Body : JSON or other formatted data required for POST/PUT requests.
Call API : Executes the request and fetches data.
Extract Data : Parses the response using JSONPath or other extractors.
Configuration Steps
Step 1: Selecting API as Source Connection
Navigate to the Start Job section.
Under Source Connection Type, select API.
Enter a unique Source Name (e.g., API1).
Click on the ✅ button to confirm the source name.
Step 2: Configuring API Request
Enter the API Endpoint URL in the provided field.
Choose the appropriate Request Type (GET, POST, PUT, DELETE).
If required, add Parameters as key-value pairs.
Provide necessary Headers for authentication or content type.
For POST/PUT requests, include a JSON Body.
Step 3: Calling the API
Click the Call API button to execute the request.
Wait for the response to be displayed in the result section.
Step 4: Extracting Data from API Response
Use the Response Extractor to define extraction rules.
Select the Response Type (JSON, XML, etc.).
Apply the appropriate JSONPath/XPath to extract relevant values.
Click Extract to preview extracted data.
Step 5: Verifying Extracted Data
Click the Verify button to validate the extracted values.
If verification is successful, proceed to the next step.
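For reference, the JSONPath extraction in Step 4 can be sketched as below; the jsonpath-ng package, the sample response, and the path $[*].Calories are illustrative assumptions rather than the platform's internal extractor.
    from jsonpath_ng import parse

    # A small flat JSON response, as returned by Call API
    response = [
        {"UserId": "U001", "Calories": 269},
        {"UserId": "U002", "Calories": 365},
    ]

    # Extract every Calories value from the response
    expression = parse("$[*].Calories")
    values = [match.value for match in expression.find(response)]
    print(values)   # [269, 365]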
Limitations & Constraints
🚨 Temporary Restriction: Currently, the system supports only Flat JSON responses. In the future, support for nested JSON will be added.
📏 Response Size: JSON responses larger than 6 MB are allowed during comparison, but when configuring the API or using the response as a table in a function, outputs larger than 6 MB are not allowed.
✅ Allowed: Flat JSON (Correct)
[
    {
        "UserId": "U001",
        "Food_Item": "Chicken Tenders",
        "Serving_Size_g": 574,
        "Calories": 269
    },
    {
        "UserId": "U002",
        "Food_Item": "French Fries",
        "Serving_Size_g": 450,
        "Calories": 365
    }
]
❌ Not Allowed: Nested JSON (Incorrect)
{
    "Users": [
        {
            "UserId": "U001",
            "Details": {
                "Food_Item": "Chicken Tenders",
                "Serving_Size_g": 574,
                "Calories": 269
            }
        },
        {
            "UserId": "U002",
            "Details": {
                "Food_Item": "French Fries",
                "Serving_Size_g": 450,
                "Calories": 365
            }
        }
    ]
}
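If a nested response like the one above must be tested today, it can be flattened outside the platform first; the following minimal sketch (not a platform feature) converts the nested example into the allowed flat shape.
    nested = {
        "Users": [
            {"UserId": "U001", "Details": {"Food_Item": "Chicken Tenders", "Serving_Size_g": 574, "Calories": 269}},
            {"UserId": "U002", "Details": {"Food_Item": "French Fries", "Serving_Size_g": 450, "Calories": 365}},
        ]
    }

    # Merge each user's Details into the top level to match the "Allowed" example
    flat = [{"UserId": user["UserId"], **user["Details"]} for user in nested["Users"]]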
Host URL : A unique identifier URL that makes the database connection available over the network.
Database Name : The name of the database that stores the tables needed for testing.
Port : The port number on which the database connection is open.
Username : A user identity assigned for access control and identification within the database.
Password : A secret string of characters used to authenticate a user's identity along with their username.
Query : A request for data retrieval, update, or manipulation issued to the database management system.
Primary Key : A unique identifier for each record in a database table, ensuring data integrity and enabling swift access.
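As a minimal sketch of how these fields map onto an actual connection, the example below uses SQLAlchemy and pandas; the driver string, host, credentials, and query are illustrative assumptions.
    import pandas as pd
    from sqlalchemy import create_engine

    host, port = "db.example.com", 3306            # Host URL and Port
    database = "sales"                              # Database Name
    username, password = "tester", "secret"         # Username and Password
    query = "SELECT * FROM transactions"            # Query
    primary_key = "tx_id"                           # Primary Key used to match records

    # Build the connection and fetch the rows that the job will test
    engine = create_engine(f"mysql+pymysql://{username}:{password}@{host}:{port}/{database}")
    df = pd.read_sql(query, engine)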
Host URL : This is the address of the SFTP server which clients use to establish a connection.
Port : The network port that the SFTP service listens on for incoming connections.
Username : The account name used to log in to the SFTP server. It is part of the credentials required for authentication.
Password : The secure password associated with the username for authorization on the SFTP server.
File Path : The location within the SFTP server where files can be uploaded, downloaded, or manipulated.
Rules File Path : The specific location on the SFTP server where the file containing rules or configurations is stored.
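A minimal sketch of an SFTP connection built from these fields, using the paramiko package; the host, credentials, and paths are illustrative assumptions.
    import paramiko

    # Host URL and Port
    transport = paramiko.Transport(("sftp.example.com", 22))
    # Username and Password
    transport.connect(username="tester", password="secret")
    sftp = paramiko.SFTPClient.from_transport(transport)

    # File Path and Rules File Path: download the data and rules files
    sftp.get("/data/transactions.csv", "transactions.csv")
    sftp.get("/rules/rules.json", "rules.json")

    sftp.close()
    transport.close()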
Access Key : This is part of the credentials required to programmatically access Amazon S3 services. It is used with a secret key.
Secret Key : This key is used in conjunction with the access key to sign programmatic requests to the AWS S3 service for secure access.
Bucket Name : The name of the storage container on Amazon S3 where files and folders are stored.
Role : The set of permissions associated with a user or service when interacting with AWS S3, which determines what actions are allowed.
File Key : The unique identifier for a specific file within an S3 bucket, essentially its path within the bucket.
Rules File Key : The specific key that points to where the file containing rules or configurations for an application or service is stored in an S3 bucket.
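A minimal sketch of reading a file from S3 with these fields, using boto3; the access keys, bucket name, and file key are illustrative placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        aws_access_key_id="AKIA-PLACEHOLDER",            # Access Key (placeholder)
        aws_secret_access_key="SECRET-PLACEHOLDER",      # Secret Key (placeholder)
    )

    # Bucket Name and File Key locate the object to test
    obj = s3.get_object(Bucket="data-testing-bucket", Key="inputs/transactions.csv")
    body = obj["Body"].read()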
Step 1 : Selecting a Job
Identify the relevant job from the list.
Click on the job name (e.g., MYSQL-Redshift Compare with Function) to open and configure the comparison settings.
Step 2 : Fetching Data
Click on the Fetch Data button to retrieve data from the selected source.
The fetched data will be displayed in the table for further processing and analysis.
Step 3 : Viewing Fetched Data
After fetching data, the retrieved records are displayed in a tabular format.
The table contains multiple columns, including tx_id, acc_code, cust_code, and more.
Users can add new rows or clear the existing data using the options below the table.
Step 4 : Generating a Function
Enter the function logic in the text area, specifying the required transformation.
Choose whether the function applies to a Single Column or Multiple Columns.
Provide a Function Name for easy identification.
Click the Generate button to create the function logic.
Example: Creating a Single Column Function
Define the transformation logic for a Single Column function.
In this example, we are replacing tx_currency values with their respective currency symbols.
The Python code defines a mapping (currency_map) to replace USD with $, GBP with £, and so on.
Select Single Column mode and choose the column (tx_currency).
Enter a function name (e.g., tx_currency replace).
Click Generate to create the function logic.
Click Run to apply the function to the data.
Save the function for future use.
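A minimal sketch of the single-column function described in this example; the currency_map entries beyond USD and GBP are assumed.
    currency_map = {"USD": "$", "GBP": "£", "EUR": "€"}

    def tx_currency_replace(value):
        # Return the mapped currency symbol, or the original value if it is unmapped
        return currency_map.get(value, value)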
Example: Creating a Multiple Columns Function
Define the transformation logic for a Multiple Columns function.
In this example, we are:
Converting tx_datetime to dd/mm/yyyy format.
Rounding off tx_amount to the nearest whole number.
Select Multiple Columns mode.
Enter a function name (e.g., tx_datetime dd/mm/yyyy).
Click Generate to create the function logic.
Click Run to apply the function to the data.
Save the function for future use.
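A minimal sketch of the multiple-columns function described in this example; it assumes tx_datetime arrives in an ISO-style format and that each record is a dict-like row.
    from datetime import datetime

    def tx_datetime_and_amount(row):
        # Convert tx_datetime to dd/mm/yyyy (assumes e.g. "2024-05-01 13:45:00" as input)
        dt = datetime.fromisoformat(str(row["tx_datetime"]))
        row["tx_datetime"] = dt.strftime("%d/%m/%Y")
        # Round tx_amount to the nearest whole number
        row["tx_amount"] = round(float(row["tx_amount"]))
        return row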
Step 5 : Running the Function
After generating the function logic, review the generated code in the Code section.
Click the Run button to apply the function to the dataset.
The function will process the data and update the respective columns accordingly.
Step 6 : Viewing Transformed Data
After applying the function, the updated data is displayed in the table.
Check if the transformations were applied correctly (e.g., tx_currency values replaced, tx_datetime formatted, tx_amount rounded).
If any modifications are required, return to the function section and reapply the necessary changes.
Step 7 : Saving the Function
After running and verifying the function, click the Save button to store it for future use.
A success message confirms that the function has been saved.
Saved functions can be accessed and reused from the Saved Functions section.
Step 8 : Accessing Saved Functions
Navigate to the Saved Functions section from the left menu.
All previously created functions are listed along with their details:
Function Name – The name of the function.
Function Type – Whether it applies to a single or multiple columns.
Created By – The email of the user who created the function.
Created At – The timestamp of function creation.
Click on a function name to open and modify it if needed.
Step 9 : Viewing Function Details
Click on a function from the Saved Functions list to view its details.
The function name, column mode (Single or Multiple Columns), and transformation logic are displayed.
If modifications are needed, update the function logic and save it again.
Step 1: Selecting Job Type
Navigate to the Start Job page.
Choose the job type you want to perform (e.g., comparing two data sources).
Enter a Job Name and Purpose for clarity.
Click Next to proceed.
Step 2: Configuring Source Connection
Select the Source Connection Type (e.g., Redshift, MySQL).
Provide connection details such as Host URL, Database Name, Username, and Password.
Click Verify to ensure the connection is valid.
Once verified, click Next.
Step 3: Configuring Target Connection
Select the Target Connection Type similar to the source.
Enter the necessary connection details.
Click Verify to confirm the connection.
Once verified, click Next.
Step 4: Mapping Columns
Scroll up to find the Column Code Transformation section.
If source and target columns have different names, map them accordingly.
Use the Primary Key field to ensure proper comparison.
Click Save or Save And Start.
Step 5: Fetching Functions
Navigate to the Column Code Transformation section.
Click on Fetch Functions to retrieve saved functions.
Choose the desired function for transformation.
Step 6: Adding the Function
Select the function you want to apply.
Choose Single Column or Multiple Columns mode.
Click Add to include the function in the transformation.
Click on a function name to modify the function.
Click on the red button to delete a function.
Step 7: Finalizing the Job
Review the added function to ensure it’s correctly applied.
Click Save And Start to execute the job.
The transformation will now be applied during data comparison.
Users can click on any test to see its result.
If any validation fails, the result contains details such as which column's validation failed, as shown below.
Here is a video showing the configuration for database-to-database transformations.
Results for database-to-database transformations.
If a job fails, the error message is visible via the info icon.