Amazon S3

Cloud object storage for large-scale data validation, well suited to AWS-based data lakes and analytics workloads.

Overview

Amazon Simple Storage Service (S3) is a scalable object storage service that enables you to store and retrieve any amount of data. The Data Testing connector integrates with S3 to validate file-based data stored in AWS.

Configuration Parameters

| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| Access Key | AWS access key ID | Yes | AKIAIOSFODNN7EXAMPLE |
| Secret Key | AWS secret access key | Yes | wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY |
| Bucket Name | S3 bucket name | Yes | my-data-lake |
| Role | IAM role ARN (alternative to keys) | Optional | arn:aws:iam::123456789012:role/S3Access |
| File Key | S3 object path/key | Yes | data/exports/customers.csv |
| Rules File Key | Validation rules file location | Optional | rules/validation_rules.json |

Configuration Screenshot


Authentication Methods

Programmatic credentials for API access

  1. Create an IAM user, or obtain credentials for an existing one

  2. Generate an Access Key ID and Secret Access Key

  3. Enter both keys in the connector configuration

  4. Click Test Connection

Most common method for applications and scripts.
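When the same keys are also used from scripts, it is safer to read them from the environment than to hardcode them. The following is a minimal sketch of that idea, not part of the connector. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS environment variable names; the function name and the returned dictionary keys are hypothetical.

```python
import os


def credentials_from_env(env=os.environ) -> dict:
    """Read the access key pair from environment variables.

    Raises RuntimeError if either variable is missing, so a script
    fails early instead of sending an unsigned request.
    """
    try:
        return {
            "access_key": env["AWS_ACCESS_KEY_ID"],
            "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as missing:
        raise RuntimeError(
            f"missing environment variable: {missing.args[0]}"
        ) from None
```

The `env` parameter defaults to the real environment but accepts any mapping, which makes the function easy to exercise without touching real credentials.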


Getting Started

Step 1: Create AWS Credentials

Create an IAM user with S3 access:
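As an illustration, a read-only policy for a single bucket might look like the sketch below. It is an assumption about what the connector needs, not an official policy: the bucket name `my-data-lake` is taken from the parameter table, and pairing `s3:ListBucket` with `s3:GetObject` is a common read-only setup rather than a documented requirement.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an IAM identity policy granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the JSON document that would be attached to the IAM user.
    print(json.dumps(read_only_policy("my-data-lake"), indent=2))
```

The printed JSON can be pasted into the IAM console as an inline or managed policy for the user created in this step.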

Step 2: Set Up S3 Bucket Access

Ensure your S3 bucket is accessible and contains your data files:
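One way to check this from a script is to list the keys under the prefix that holds your data. The sketch below assumes a boto3-style S3 client is passed in (for example `boto3.client("s3")`, if boto3 is installed); the function name `list_keys` is hypothetical. It uses the `list_objects_v2` call and follows continuation tokens, since a single response holds at most 1,000 keys.

```python
def list_keys(s3_client, bucket: str, prefix: str = "") -> list:
    """Return every object key in the bucket that starts with prefix."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # Ask for the next page, starting where this one stopped.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

A call such as `list_keys(client, "my-data-lake", "data/exports/")` should include `data/exports/customers.csv` if the file from the parameter table is in place. An access error raised here usually points to missing `s3:ListBucket` permission.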

Step 3: Configure the Connector

  1. Navigate to your job configuration

  2. Select Amazon S3 as the data source

  3. Enter AWS credentials or IAM role

  4. Specify bucket name

  5. Enter file path/key

  6. Click Test Connection

  7. Verify file format and preview data


Supported File Formats

| Format | Extension | Use Case | Support |
| --- | --- | --- | --- |
| CSV | .csv | Tabular data export | Full |
| JSON | .json | Semi-structured data | Full |
| Parquet | .parquet | Columnar analytics data | Full |
| XML | .xml | Hierarchical data | Full |
| Text | .txt | Plain text files | Full |

CSV Format Example

JSON Format Example
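To make the two formats concrete, the sketch below parses the same two records from a CSV sample and a JSON sample using only the Python standard library. The column names and values are invented for illustration and are not a schema the connector requires.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in each format.
CSV_SAMPLE = """customer_id,name,signup_date
1,Ada,2024-01-15
2,Grace,2024-02-03
"""

JSON_SAMPLE = """[
  {"customer_id": 1, "name": "Ada", "signup_date": "2024-01-15"},
  {"customer_id": 2, "name": "Grace", "signup_date": "2024-02-03"}
]"""

# The first CSV line is the header; each later line becomes one dict.
csv_rows = list(csv.DictReader(io.StringIO(CSV_SAMPLE)))

# A JSON array of objects maps directly to a list of dicts.
json_rows = json.loads(JSON_SAMPLE)

# CSV fields are always read as strings, while JSON keeps numeric types.
print(type(csv_rows[0]["customer_id"]).__name__)
print(type(json_rows[0]["customer_id"]).__name__)
```

The type difference matters when writing validation rules: a numeric check that passes on the JSON file may need an explicit conversion for the CSV file.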


S3-Specific Features

S3 Object Paths

S3 uses a key-based naming system:
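In other words, a bucket holds a flat set of objects, and the slashes in a key such as `data/exports/customers.csv` are part of its name rather than real folders. The helper below is a small sketch of how an `s3://` URI breaks down into the Bucket Name and File Key parameters; the function name is hypothetical.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key


bucket, key = split_s3_uri("s3://my-data-lake/data/exports/customers.csv")
print(bucket)
print(key)
```

Note that the key has no leading slash. Entering `/data/exports/customers.csv` would refer to a different object than `data/exports/customers.csv`.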

Cross-Region Validation

Validate data across AWS regions:
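How the connector performs cross-region validation is not described here. As a rough sketch of the general idea, the code below compares the same key in two buckets by SHA-256 digest. It assumes two boto3-style clients are passed in, one per region (for example `boto3.client("s3", region_name="us-east-1")`); the function names are hypothetical.

```python
import hashlib


def object_sha256(s3_client, bucket: str, key: str) -> str:
    """Stream an object and return its SHA-256 hex digest."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    digest = hashlib.sha256()
    # Read in 1 MiB chunks so large objects are never held in memory whole.
    for chunk in iter(lambda: body.read(1024 * 1024), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copies_match(client_a, bucket_a: str, client_b, bucket_b: str, key: str) -> bool:
    """Return True if the key has identical content in both buckets."""
    return object_sha256(client_a, bucket_a, key) == object_sha256(client_b, bucket_b, key)
```

Downloading from another region incurs data transfer charges, so for large files it may be cheaper to run the comparison from within one of the two regions.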

Versioning

S3 can track file versions for audit and recovery:
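When versioning is enabled on a bucket, every overwrite of a key creates a new version with its own version ID. The sketch below picks the current version ID for one key out of a response shaped like the one returned by boto3's `list_object_versions` call; the function name is hypothetical and the response is passed in as a plain dictionary.

```python
from typing import Optional


def latest_version_id(response: dict, key: str) -> Optional[str]:
    """Return the VersionId flagged as latest for key, or None.

    None is returned when the key has no versions in the response, or
    when its latest entry is a delete marker (those are listed under
    "DeleteMarkers", not "Versions").
    """
    for version in response.get("Versions", []):
        if version["Key"] == key and version["IsLatest"]:
            return version["VersionId"]
    return None
```

With boto3 installed, the response could come from `client.list_object_versions(Bucket="my-data-lake", Prefix="data/exports/customers.csv")`. Recording the version ID alongside a validation result makes it possible to tell later exactly which copy of the file was checked.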


File Size Limitations

Handling Large Files
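The troubleshooting table on this page gives a maximum file size of 5 GB. A size check before running a job can catch oversized files early. The sketch below assumes a boto3-style client and reads the size from `head_object`, which returns metadata without downloading the file. Treating "5 GB" as 5 GiB is an assumption, and the names are hypothetical.

```python
# Assumption: the documented "5 GB" limit is read as 5 * 1024**3 bytes.
MAX_FILE_BYTES = 5 * 1024**3


def within_size_limit(s3_client, bucket: str, key: str,
                      limit: int = MAX_FILE_BYTES) -> bool:
    """Return True if the object's size does not exceed limit bytes."""
    size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    return size <= limit
```

Files over the limit can often be split into several smaller objects under a common prefix, or stored compressed, before validation.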


Security Best Practices

S3 Bucket Policy Example
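As one possible example, the sketch below builds a bucket policy that lets a single IAM role read objects and denies any request not sent over TLS. It is an illustration under assumptions, not a recommended policy for every setup: the bucket name and role ARN are the placeholder values from the parameter table, and the statement IDs are hypothetical.

```python
import json


def bucket_policy(bucket: str, reader_role_arn: str) -> dict:
    """Build a bucket policy: one read-only role, TLS required for all."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow the named role to read objects, nothing more.
                "Sid": "AllowReaderRole",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
            {
                # Reject every request that does not use HTTPS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


if __name__ == "__main__":
    policy = bucket_policy(
        "my-data-lake", "arn:aws:iam::123456789012:role/S3Access"
    )
    print(json.dumps(policy, indent=2))
```

Unlike the identity policy attached to an IAM user, a bucket policy is attached to the bucket itself, which is why each statement names a `Principal`.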


Cost Optimization

| Strategy | Benefit |
| --- | --- |
| Use storage classes | Store infrequently accessed data in Glacier/Deep Archive |
| Enable versioning cleanup | Use lifecycle policies to delete old versions |
| Compress data | Use gzip or Parquet for smaller files |
| Batch operations | Process multiple files in a single job |
| Archive | Move old data to cheaper storage |
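To give a feel for the "Compress data" row, the sketch below gzips a synthetic CSV payload with the Python standard library and reports the size ratio. The data is invented and highly repetitive, so real files will compress differently. Whether the connector reads gzip-compressed objects directly is not stated on this page.

```python
import gzip


def gzip_ratio(data: bytes) -> float:
    """Return compressed size divided by original size."""
    if not data:
        raise ValueError("data must not be empty")
    return len(gzip.compress(data)) / len(data)


# Hypothetical payload: a header plus 10,000 similar rows.
rows = b"customer_id,name,signup_date\n" + b"".join(
    b"%d,customer,2024-01-15\n" % i for i in range(10_000)
)

print(f"compressed size is {gzip_ratio(rows):.1%} of the original")
```

Smaller objects reduce both storage and transfer costs, at the price of some CPU time to compress and decompress.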


πŸ› Troubleshooting

| Issue | Solution |
| --- | --- |
| Access denied | Check IAM permissions, verify credentials |
| File not found | Verify bucket name and file path/key |
| File too large | Check file size (max 5 GB) |
| Invalid credentials | Regenerate access key/secret |
| Connection timeout | Check network/firewall, verify bucket exists |
| Permission denied | Ensure IAM user has s3:GetObject permission |

