Blob Storage & File Connectors
Configure blob or file storage connectors to validate file-based datasets stored in cloud platforms or on-premises systems.
Overview
Blob storage and file systems provide scalable, cost-effective storage for large datasets. Data Testing supports major cloud storage providers and file transfer protocols for comprehensive file-based data validation.
File-based connectors support various file formats (CSV, JSON, Parquet, etc.) and enable validation of data stored in cloud storage or on file servers.
Available Storage Connectors

Cloud-native object storage services:

Amazon S3 (AWS): AWS data lake storage
Azure Data Lake Storage Gen2 (Microsoft Azure): Azure cloud storage

Ideal for:
✅ Scalable file storage
✅ Data lake architectures
✅ Cost-effective storage
✅ Multi-region replication

Network file transfer and on-premises systems:

SFTP (SSH File Transfer): remote file systems

Ideal for:
✅ On-premises file servers
✅ Legacy system integration
✅ Secure file transfer
✅ Password & key authentication
Common File Connector Configuration
Connection Parameters
| Parameter | Description | Required |
| --- | --- | --- |
| Host/Endpoint | Storage endpoint or SFTP server address | ✅ Yes |
| Port | Service port (S3: 443, SFTP: 22) | ✅ Yes |
| Authentication | API keys, credentials, or certificates | ✅ Yes |
| Bucket/Path | Storage location or directory path | ✅ Yes |
File Configuration
| Setting | Description |
| --- | --- |
| File Path/Prefix | Location of data files |
| File Format | CSV, JSON, Parquet, XML, etc. |
| Delimiter | Field separator for structured formats |
| Header Row | Whether the file includes headers |
| Encoding | Character encoding (UTF-8, etc.) |
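To make the delimiter/header/encoding options concrete, here is a minimal stdlib sketch of how a connector might apply them to raw file bytes. The function name is illustrative; a real connector would stream files from storage rather than take bytes directly:

```python
import csv
import io

def parse_delimited(raw: bytes, delimiter: str = ",",
                    has_header: bool = True, encoding: str = "utf-8"):
    """Apply file-configuration options (delimiter, header row, encoding)
    to raw file bytes and return (header, data_rows)."""
    reader = csv.reader(io.StringIO(raw.decode(encoding)), delimiter=delimiter)
    rows = list(reader)
    if has_header and rows:
        return rows[0], rows[1:]
    return [], rows

# A semicolon-delimited file with a header row:
header, data = parse_delimited(b"id;name\n1;Ada\n2;Grace\n", delimiter=";")
```

Here `header` is `['id', 'name']` and `data` is the remaining rows as lists of strings.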
Amazon S3
Cloud-native object storage from AWS:
Amazon S3 is ideal for:
AWS cloud data lakes
Large-scale file storage
Multi-region data distribution
Integration with AWS analytics services
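S3 locations are commonly written as `s3://bucket/prefix` URIs. A small helper like the one below (hypothetical, stdlib-only) splits such a URI into the bucket and key prefix a connector would need:

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str):
    """Split an s3://bucket/prefix URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

print(split_s3_uri("s3://sales-data/2024/orders.parquet"))
# ('sales-data', '2024/orders.parquet')
```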
Azure Data Lake Storage Gen2
Enterprise data lake on Azure:
ADLS Gen2 is ideal for:
Azure cloud deployments
Enterprise data lakes
Hadoop file system compatibility
Integration with Azure Synapse and Power BI
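ADLS Gen2 paths use the `abfss://<container>@<account>.dfs.core.windows.net/<path>` URI scheme. This hypothetical helper extracts the pieces a connector configuration would need:

```python
from urllib.parse import urlparse

def split_abfss_uri(uri: str):
    """Split an ADLS Gen2 abfss:// URI into (account, container, path).
    URI shape: abfss://<container>@<account>.dfs.core.windows.net/<path>"""
    parsed = urlparse(uri)
    if parsed.scheme != "abfss":
        raise ValueError(f"not an ABFSS URI: {uri}")
    container, host = parsed.netloc.split("@", 1)
    account = host.split(".", 1)[0]
    return account, container, parsed.path.lstrip("/")

print(split_abfss_uri("abfss://raw@contosolake.dfs.core.windows.net/sales/2024"))
# ('contosolake', 'raw', 'sales/2024')
```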
SFTP
Secure file transfer protocol:
SFTP is ideal for:
On-premises file servers
Legacy system integration
Secure file transfer
Remote team collaboration
Security Best Practices
Essential Security Practices:
✅ Use IAM roles instead of static credentials
✅ Enable bucket policies for least-privilege access
✅ Use SSH keys for SFTP instead of passwords
✅ Enable encryption at rest and in transit
✅ Enable versioning for data protection
✅ Configure MFA delete protection
✅ Audit access logs regularly
✅ Implement network isolation
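As one concrete illustration of least-privilege access, a read-only AWS IAM policy scoped to a single bucket might look like the fragment below. The bucket name `my-data-bucket` is hypothetical; adapt the actions and resources to your environment:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    }
  ]
}
```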
Supported File Formats
| Format | Extension | Data Type | Support |
| --- | --- | --- | --- |
| CSV | .csv | Tabular data | ✅ Full |
| JSON | .json | Semi-structured data | ✅ Full |
| Parquet | .parquet | Columnar storage | ✅ Full |
| XML | .xml | Hierarchical data | ⚠️ Limited |
| Text | .txt | Plain text | ✅ Full |
| Excel | .xlsx/.xls | Spreadsheets | ⚠️ Limited |
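A connector typically infers the format from the file extension. The mapping below is a sketch built from the table above (the function name and error handling are illustrative):

```python
from pathlib import Path

# Extension-to-format mapping from the supported-formats table.
FORMAT_BY_EXT = {
    ".csv": "csv", ".json": "json", ".parquet": "parquet",
    ".xml": "xml", ".txt": "text", ".xlsx": "excel", ".xls": "excel",
}

def detect_format(path: str) -> str:
    """Infer a file format from its extension, case-insensitively."""
    ext = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXT[ext]
    except KeyError:
        raise ValueError(f"unsupported file extension: {ext!r}") from None

print(detect_format("exports/2024/orders.PARQUET"))  # parquet
```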
Performance Considerations
| Factor | Impact | Mitigation |
| --- | --- | --- |
| File Size | Memory usage | Use compression; split large files |
| File Count | Processing time | Batch files; use prefixes |
| Network | Transfer speed | Use regional endpoints |
| Format | Parse time | Use efficient formats (Parquet) |
| Encoding | Processing | Ensure consistent encoding |
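Batching files by prefix, as the table suggests, can be sketched as follows: group object keys by their leading path segments so each batch can be listed and processed with a single prefix. The helper name and `depth` parameter are illustrative:

```python
from itertools import groupby

def batch_by_prefix(keys, depth=1):
    """Group object keys by their first `depth` path segments, so each
    batch maps to one storage prefix (e.g. one prefix-scoped listing)."""
    def prefix(key):
        return "/".join(key.split("/")[:depth])
    ordered = sorted(keys, key=prefix)
    return {p: list(g) for p, g in groupby(ordered, key=prefix)}

keys = ["2024/01/a.csv", "2024/02/b.csv", "2023/12/c.csv"]
print(batch_by_prefix(keys))
# {'2023': ['2023/12/c.csv'], '2024': ['2024/01/a.csv', '2024/02/b.csv']}
```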
Storage Connector Comparison
| Feature | Amazon S3 | ADLS Gen2 | SFTP |
| --- | --- | --- | --- |
| Cloud Provider | AWS | Azure | Any |
| Authentication | IAM/Access Keys | AD/SAS | SSH Keys/Password |
| Scalability | Unlimited | Unlimited | Server-dependent |
| Encryption | ✅ Yes | ✅ Yes | ✅ Yes (TLS) |
| Versioning | ✅ Yes | ✅ Yes | Manual |
| Cost | Low | Low | Varies |
| Setup | Cloud-native | Cloud-native | Quick |
Quick Start

1. Choose your storage platform (S3, ADLS Gen2, or SFTP)
2. Configure access credentials with minimal required permissions
3. Enable encryption for data security
4. Identify target files and their location/path
5. Configure file format and parsing options
6. Test the connection before creating jobs
7. Create validation jobs for data quality checks
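A cheap pre-flight version of the connection test in the steps above is to check that the storage endpoint accepts TCP connections on the configured port. This is a stdlib sketch only; a real connection test would also authenticate and list the target path:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within
    the timeout; False on DNS failure, refusal, or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("files.example.com", 22)` would confirm an SFTP server is reachable before you create validation jobs against it.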