Azure Data Lake Storage Gen2

Enterprise data lake on Microsoft Azure with a hierarchical namespace, built for large-scale analytics and big data workloads.

Overview

Azure Data Lake Storage (ADLS) Gen2 is a cloud-scale data lake that combines the capabilities of Azure Blob Storage with distributed file system semantics. It suits organizations that run analytics at scale with Hadoop, Spark, and other big data frameworks. The Data Testing connector reads files stored in ADLS Gen2 and applies file-based data validation to them.


⚙️ Configuration Parameters

| Parameter | Description | Required | Example |
|---|---|---|---|
| Account Name | Storage account name | ✅ Yes | mydatalake |
| Account Key | Primary or secondary key | ⚠️ Optional* | (Base64 key copied from Access Keys) |
| Connection String | Complete connection string (semicolon-separated key=value pairs) | ⚠️ Optional* | DefaultEndpointsProtocol=https... |
| SAS Token | Shared Access Signature | ⚠️ Optional* | sv=2020-08-04&... |
| Tenant ID | Azure AD tenant ID | ⚠️ Optional* | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
| Client ID | Azure AD app client ID | ⚠️ Optional* | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
| Client Secret | Azure AD app secret | ⚠️ Optional* | •••••••••• |
| Filesystem/Container | Container name | ✅ Yes | raw-data |
| File Path | Path to data file | ✅ Yes | exports/customers.csv |
| Rules File Path | Validation rules location | ⚠️ Optional | rules/validation.json |

*Use one authentication method. Tenant ID, Client ID, and Client Secret together form a single method (service principal).
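The "one authentication method" rule can be checked before a job runs. The sketch below is only an illustration: the dictionary keys are hypothetical names that mirror the parameter table, not the connector's real configuration schema, and grouping Tenant ID, Client ID, and Client Secret into one service principal method is an assumption.

```python
# Hypothetical field names mirroring the parameter table above; the
# connector's real configuration keys may differ.
AUTH_METHODS = {
    "account_key": ("account_key",),
    "connection_string": ("connection_string",),
    "sas_token": ("sas_token",),
    "service_principal": ("tenant_id", "client_id", "client_secret"),
}
REQUIRED = ("account_name", "container", "file_path")


def detect_auth_method(config: dict) -> str:
    """Return the single configured auth method, or raise ValueError."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    configured = []
    for method, fields in AUTH_METHODS.items():
        present = [field for field in fields if config.get(field)]
        if present and len(present) < len(fields):
            raise ValueError(f"{method} is incomplete; it needs {fields}")
        if present:
            configured.append(method)

    if len(configured) != 1:
        raise ValueError(f"expected exactly one auth method, got {configured}")
    return configured[0]


example = {
    "account_name": "mydatalake",
    "container": "raw-data",
    "file_path": "exports/customers.csv",
    "sas_token": "sv=2020-08-04&...",
}
print(detect_auth_method(example))
```

Supplying both an account key and a SAS token, or only part of the service principal triple, raises a `ValueError` instead of silently picking one.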

Configuration Screenshot

[Screenshot: ADLS Configuration]

🔐 Authentication Methods

Account Key: connect to the storage account using its primary or secondary key.

  1. Go to Azure Portal → Storage Account

  2. Navigate to Access Keys

  3. Copy the account name and key

  4. Enter them in the connector configuration

  5. Click Test Connection

Simple and direct; best suited to development and test environments.
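To check an account name and key outside the connector, the same values can be passed to the Azure SDK for Python. This is a minimal sketch that assumes the `azure-storage-file-datalake` package and the public Azure cloud endpoint suffix; the helper names are hypothetical and the connector itself may connect differently.

```python
def dfs_account_url(account_name: str) -> str:
    """Build the ADLS Gen2 (dfs) endpoint for a storage account."""
    # Assumes the public Azure cloud; sovereign clouds use other suffixes.
    return f"https://{account_name}.dfs.core.windows.net"


def connect_with_account_key(account_name: str, account_key: str):
    """Return a DataLakeServiceClient authenticated with an account key."""
    # Imported lazily so the URL helper works without the SDK installed.
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient(
        account_url=dfs_account_url(account_name),
        credential=account_key,
    )


print(dfs_account_url("mydatalake"))
```

Calling `list_file_systems()` on the returned client and iterating over the result is a quick way to confirm that the key is accepted.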


📖 Getting Started

Step 1: Set Up Azure Storage Account

Create a storage account with ADLS Gen2 enabled:
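One way to do this programmatically is with the Azure management SDK for Python. The sketch assumes the `azure-identity` and `azure-mgmt-storage` packages, an identity that is allowed to create resources, and an illustrative `Standard_LRS` SKU; the function names and arguments are placeholders, not values from this guide.

```python
import re


def is_valid_account_name(name: str) -> bool:
    """Storage account names are 3-24 characters: lowercase letters and digits."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


def create_adls_gen2_account(subscription_id, resource_group, name, location):
    """Create a StorageV2 account with the hierarchical namespace enabled."""
    if not is_valid_account_name(name):
        raise ValueError(f"invalid storage account name: {name!r}")

    # Lazy imports: these packages are only needed for the actual call.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient
    from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    poller = client.storage_accounts.begin_create(
        resource_group,
        name,
        StorageAccountCreateParameters(
            sku=Sku(name="Standard_LRS"),  # illustrative redundancy choice
            kind="StorageV2",
            location=location,
            is_hns_enabled=True,  # hierarchical namespace, i.e. ADLS Gen2
        ),
    )
    return poller.result()  # blocks until provisioning finishes
```

The `is_hns_enabled=True` flag is the part that matters here: without it the account is a regular Blob Storage account.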

Step 2: Create Container and Upload Data
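A minimal sketch of this step, assuming the `azure-storage-file-datalake` package and an already authenticated `DataLakeServiceClient`. The helper name `upload_file` is hypothetical; the container and path in the usage note are the example values from the parameter table.

```python
def upload_file(service, container: str, remote_path: str, data: bytes) -> None:
    """Create the container if it is missing, then upload data to remote_path.

    `service` is expected to be a DataLakeServiceClient.
    """
    file_system = service.get_file_system_client(container)
    if not file_system.exists():
        file_system.create_file_system()

    file_client = file_system.get_file_client(remote_path)
    # overwrite=True replaces any existing file at the same path.
    file_client.upload_data(data, overwrite=True)
```

For example, `upload_file(service, "raw-data", "exports/customers.csv", open("customers.csv", "rb").read())` would place a local file at the path used elsewhere in this guide.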

Step 3: Configure Authentication

Choose one authentication method (see Authentication Methods above).

Step 4: Configure the Connector

  1. Navigate to your job configuration

  2. Select Azure ADLS Gen2 as the data source

  3. Enter storage account and authentication details

  4. Specify container name

  5. Enter file path

  6. Click Test Connection

  7. Verify file format and preview


📊 Supported File Formats

| Format | Extension | Use Case | Support |
|---|---|---|---|
| CSV | .csv | Structured tabular data | ✅ Full |
| Parquet | .parquet | Columnar analytics format | ✅ Full |
| JSON | .json | Semi-structured data | ✅ Full |
| Avro | .avro | Data serialization | ✅ Full |
| ORC | .orc | Columnar format | ✅ Full |
| Text | .txt | Plain text files | ✅ Full |

File Format Examples

CSV Format:
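As an illustration, the sketch below parses a small CSV document with a header row using Python's standard library. The column names and rows are invented sample data, not a schema the connector requires.

```python
import csv
import io

# Hypothetical sample content for a file such as exports/customers.csv.
SAMPLE_CSV = """customer_id,name,email,signup_date
1,Ada Lovelace,ada@example.com,2024-01-15
2,Alan Turing,alan@example.com,2024-02-03
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)  # each row becomes a dict keyed by the header names

print(reader.fieldnames)
print(len(rows), "rows; first customer:", rows[0]["name"])
```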

Parquet Format (Columnar):
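Parquet stores values column by column rather than row by row. The sketch below shows that pivot in plain Python and, assuming the `pyarrow` package is available, writes the result to a Parquet file. The sample rows are invented, and the helper assumes every row has the same keys.

```python
def rows_to_columns(rows: list) -> dict:
    """Pivot row dicts into one list per column (the columnar layout)."""
    columns: dict = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def write_parquet(rows: list, path: str) -> None:
    """Write rows to a Parquet file; requires the pyarrow package."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(rows_to_columns(rows)), path)


sample_rows = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "US"},
]
print(rows_to_columns(sample_rows))
```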


🚀 ADLS Gen2-Specific Features

Hierarchical Namespace

ADLS Gen2 provides a true directory structure:
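A short sketch of how such paths are addressed. The `abfss://` URI form is the one Hadoop and Spark use for ADLS Gen2; the `2024` folder level is a hypothetical addition to this guide's example path, and the public cloud endpoint suffix is assumed.

```python
from pathlib import PurePosixPath


def abfss_uri(account: str, container: str, path: str) -> str:
    """Build the abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def parent_directories(path: str) -> list:
    """List every directory above a file path, outermost first."""
    parents = [str(p) for p in PurePosixPath(path).parents if str(p) != "."]
    return list(reversed(parents))


print(abfss_uri("mydatalake", "raw-data", "exports/2024/customers.csv"))
print(parent_directories("exports/2024/customers.csv"))
```

With the hierarchical namespace enabled, `exports` and `exports/2024` are real directory objects rather than prefixes shared by blob names.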

Directory-Level Operations
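Because directories are first-class objects, a whole directory can be renamed or moved with a single call. The sketch assumes the `azure-storage-file-datalake` package and an existing `FileSystemClient`; `archive_directory` and `rename_target` are hypothetical helper names.

```python
def rename_target(file_system_name: str, new_path: str) -> str:
    """rename_directory expects the new name as '{filesystem}/{directory}'."""
    return f"{file_system_name}/{new_path.strip('/')}"


def archive_directory(file_system, source: str, destination: str) -> None:
    """Move a whole directory in one operation.

    `file_system` is expected to be a FileSystemClient.
    """
    directory = file_system.get_directory_client(source)
    directory.rename_directory(
        new_name=rename_target(file_system.file_system_name, destination)
    )
```

The same directory client also offers `create_directory()` and `delete_directory()` for the other common directory-level operations.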

Access Control

Set permissions at file and directory level:
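ADLS Gen2 access control lists use POSIX-style entries. The sketch below builds a basic owner/group/other ACL string and applies it through the Azure SDK's `set_access_control` method. It assumes the `azure-storage-file-datalake` package; the helper names are hypothetical, and named-user entries are left out for brevity.

```python
import re

_TRIPLET = re.compile(r"[r-][w-][x-]")


def build_acl(owner: str, group: str, other: str) -> str:
    """Build a POSIX-style ACL string such as 'user::rwx,group::r-x,other::---'."""
    for perms in (owner, group, other):
        if not _TRIPLET.fullmatch(perms):
            raise ValueError(f"bad permission triplet: {perms!r}")
    return f"user::{owner},group::{group},other::{other}"


def apply_acl(path_client, acl: str) -> None:
    """Apply an ACL to one path.

    `path_client` is expected to be a DataLakeDirectoryClient or
    DataLakeFileClient, so the same call works at both levels.
    """
    path_client.set_access_control(acl=acl)


print(build_acl("rwx", "r-x", "---"))
```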


⚠️ File Size Limitations


Partitioning Strategy
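One common way to keep individual files small is to split a dataset by date into Hive-style `key=value` directories. The layout below is an assumption made for illustration, not a structure the connector is documented to require, and the base path and file name are hypothetical.

```python
from datetime import date


def partition_path(base: str, day: date, filename: str) -> str:
    """Build a Hive-style year=/month=/day= path under `base`."""
    return (
        f"{base.strip('/')}/year={day.year:04d}/month={day.month:02d}"
        f"/day={day.day:02d}/{filename}"
    )


print(partition_path("exports/customers", date(2024, 1, 15), "part-0000.parquet"))
```

Each partition file can then be pointed to individually through the File Path parameter.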


🔐 Security Best Practices


Register Azure AD Application
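Registration itself happens in the Azure portal or CLI and is not shown here. Once an application exists, its Tenant ID, Client ID, and Client Secret can be kept out of source code and read from the environment. The sketch assumes the `azure-identity` and `azure-storage-file-datalake` packages; the variable names are the standard ones read by azure-identity, while the helper names are hypothetical.

```python
import os

# Standard variable names also read by azure-identity's EnvironmentCredential.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def load_service_principal(env=os.environ) -> dict:
    """Read service principal settings from the environment, not from code."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "tenant_id": env["AZURE_TENANT_ID"],
        "client_id": env["AZURE_CLIENT_ID"],
        "client_secret": env["AZURE_CLIENT_SECRET"],
    }


def connect_with_service_principal(account_name: str, env=os.environ):
    """Return a DataLakeServiceClient authenticated as the registered app."""
    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    credential = ClientSecretCredential(**load_service_principal(env))
    return DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=credential,
    )
```

The application also needs a data-plane role on the storage account, such as Storage Blob Data Reader, before reads will succeed.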

Storage Account Firewall
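A hedged sketch of restricting a storage account to known public IP ranges, assuming the `azure-identity` and `azure-mgmt-storage` packages. Storage firewall IP rules accept only public IPv4 addresses or ranges, so private (RFC 1918) ranges are rejected up front. The function names are hypothetical, and the range in the example is a documentation placeholder.

```python
import ipaddress

_PRIVATE = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def validate_ip_rules(ranges: list) -> list:
    """Return the ranges unchanged, or raise if one is not a public IPv4 range."""
    for value in ranges:
        network = ipaddress.ip_network(value, strict=False)
        if network.version != 4:
            raise ValueError(f"only IPv4 is accepted: {value}")
        if any(network.overlaps(private) for private in _PRIVATE):
            raise ValueError(f"private ranges are not accepted: {value}")
    return list(ranges)


def restrict_to_ip_ranges(subscription_id, resource_group, account_name, ranges):
    """Deny traffic by default and allow only the given public IP ranges."""
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient
    from azure.mgmt.storage.models import (
        IPRule,
        NetworkRuleSet,
        StorageAccountUpdateParameters,
    )

    rules = [IPRule(ip_address_or_range=r) for r in validate_ip_rules(ranges)]
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    return client.storage_accounts.update(
        resource_group,
        account_name,
        StorageAccountUpdateParameters(
            network_rule_set=NetworkRuleSet(default_action="Deny", ip_rules=rules)
        ),
    )


print(validate_ip_rules(["203.0.113.0/24"]))
```

The machine that runs the connector must fall inside one of the allowed ranges, otherwise Test Connection will fail.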


💡 Integration Examples

Use with Azure Synapse Analytics
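In a Synapse Spark notebook, a file in ADLS Gen2 can be loaded through its `abfss://` URI. This sketch assumes the notebook's built-in `spark` session and a workspace identity that already has access to the storage account; the function name is hypothetical.

```python
def read_csv_from_adls(spark, account: str, container: str, path: str):
    """Load a CSV file from ADLS Gen2 into a Spark DataFrame.

    `spark` is the SparkSession that a Synapse notebook provides.
    """
    uri = f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    # Treat the first line as column names.
    return spark.read.option("header", "true").csv(uri)
```

For example, `read_csv_from_adls(spark, "mydatalake", "raw-data", "exports/customers.csv").show(5)` previews the file from the parameter table.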

Use with Azure Databricks
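On Databricks, a cluster can reach ADLS Gen2 with a service principal by setting the Hadoop ABFS OAuth properties on the Spark session. The property names below are the ABFS driver's; the function names are hypothetical, and in practice the secret would usually come from a Databricks secret scope rather than a literal.

```python
def oauth_spark_conf(account: str, tenant_id: str, client_id: str,
                     client_secret: str) -> dict:
    """Build the ABFS OAuth settings for one storage account."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def configure_spark(spark, conf: dict) -> None:
    """Apply the settings to a SparkSession (e.g. the notebook's `spark`)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

After `configure_spark(spark, oauth_spark_conf(...))`, paths of the form `abfss://raw-data@mydatalake.dfs.core.windows.net/...` can be read with the usual `spark.read` calls.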


💰 Cost Optimization

| Strategy | Benefit |
|---|---|
| Storage tiers | Use hot/cool/archive based on access patterns |
| Lifecycle policies | Automatically move old data to cheaper tiers |
| Compression | Use Parquet/ORC for smaller storage footprint |
| Deduplication | Remove redundant data |
| Archive | Move cold data to Archive tier |
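A lifecycle policy is a JSON document attached to the storage account. The sketch below builds one rule that moves block blobs under a prefix to the cool tier and later to the archive tier. The rule name, prefix, and day thresholds are illustrative assumptions; the JSON structure follows the Azure Storage lifecycle management schema.

```python
import json


def lifecycle_policy(prefix: str, cool_after_days: int, archive_after_days: int) -> dict:
    """Build a one-rule lifecycle policy that tiers down ageing data."""
    if not 0 <= cool_after_days < archive_after_days:
        raise ValueError("cool_after_days must be smaller than archive_after_days")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-down-old-data",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    # prefixMatch values start with the container name.
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


policy = lifecycle_policy("raw-data/exports/", 30, 180)
print(json.dumps(policy, indent=2))
```

Files moved to the archive tier must be rehydrated before they can be read, so archive only data that validation jobs no longer need.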


🐛 Troubleshooting

| Issue | Solution |
|---|---|
| Authentication failed | Verify account name, key, or service principal |
| Container not found | Check container name (case-sensitive) |
| File not found | Verify file path and container access |
| Permission denied | Check role assignments and SAS permissions |
| File too large | Verify file size is under 5 GB limit |
| Connection timeout | Check firewall rules, try using private endpoint |
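The "File too large" case can be caught before a job runs by checking the remote file's size. The sketch assumes the `azure-storage-file-datalake` package and an existing `FileSystemClient`; it also assumes that "5 GB" means 5 GiB and that a file exactly at the limit is accepted, neither of which this guide specifies.

```python
SIZE_LIMIT_BYTES = 5 * 1024**3  # assumption: "5 GB" interpreted as 5 GiB


def is_within_size_limit(size_bytes: int, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """True when the size is non-negative and does not exceed the limit."""
    return 0 <= size_bytes <= limit


def check_remote_file(file_system, path: str) -> int:
    """Return the file size in bytes, raising if it exceeds the limit.

    `file_system` is expected to be a FileSystemClient.
    """
    size = file_system.get_file_client(path).get_file_properties().size
    if not is_within_size_limit(size):
        raise ValueError(f"{path} is {size} bytes, over the {SIZE_LIMIT_BYTES} byte limit")
    return size


print(is_within_size_limit(2 * 1024**3))
```

A missing file or container surfaces as an exception from `get_file_properties()`, which also helps separate the "not found" rows from the size check.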

