Enterprise data lake on Microsoft Azure with hierarchical namespace, perfect for large-scale analytics and big data workloads.
## Overview
Azure Data Lake Storage (ADLS) Gen2 is a cloud-scale data lake solution that combines the capabilities of Azure Blob Storage with distributed file system semantics. It's ideal for organizations running analytics at scale with Hadoop, Spark, and other big data frameworks. The Data Testing connector integrates seamlessly with ADLS Gen2 for comprehensive file-based data validation.
Perfect for:

- ✅ Azure cloud data lake validation
- ✅ Enterprise analytics platforms
- ✅ Synapse Analytics integration
- ✅ Multi-tenant data storage
## ⚙️ Configuration Parameters

| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| Account Name | Storage account name | ✅ Yes | `mydatalake` |
| Account Key | Primary or secondary access key | ⚠️ Optional* | `••••••••••` |
| Connection String | Complete connection string | ⚠️ Optional* | `DefaultEndpointsProtocol=https...` |
| SAS Token | Shared Access Signature | ⚠️ Optional* | `sv=2020-08-04&...` |
| Tenant ID | Azure AD tenant ID | ⚠️ Optional* | `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` |
| Client ID | Azure AD app client ID | ⚠️ Optional* | `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` |
| Client Secret | Azure AD app secret | ⚠️ Optional* | `••••••••••` |
| Filesystem/Container | Container name | ✅ Yes | `raw-data` |
| File Path | Path to data file | ✅ Yes | `exports/customers.csv` |
| Rules File Path | Validation rules location | ⚠️ Optional | `rules/validation.json` |

\*Use exactly one authentication method: account key, connection string, SAS token, or a service principal (Tenant ID + Client ID + Client Secret).
### Configuration Screenshot

*(ADLS configuration screenshot)*
## Authentication Methods

### Account Key

Storage account connection using a primary or secondary key:

1. Go to Azure Portal → Storage Account
2. Navigate to **Access Keys**
3. Copy the account name and key
4. Enter them in the configuration
5. Click **Test Connection**

Simple and direct for development/test environments.
### Connection String

A complete connection string carrying all parameters in a single value.

Convenient for copying entire connection details at once.
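A connection string is just a semicolon-separated list of `key=value` pairs, so it can be pulled apart with a few lines of standard-library Python. A minimal sketch (the account name and key below are made-up placeholders):

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split an Azure connection string into its key=value parts."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        key, _, value = segment.partition("=")  # keys never contain '='
        parts[key] = value
    return parts

conn = ("DefaultEndpointsProtocol=https;"
        "AccountName=mydatalake;"
        "AccountKey=abc123==;"
        "EndpointSuffix=core.windows.net")
print(parse_connection_string(conn)["AccountName"])  # mydatalake
```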
### SAS Token

Time-limited access using a Shared Access Signature:

1. Generate a SAS token in the Azure Portal
2. Copy the full SAS token (including the `?` prefix)
3. Use it for temporary or limited access
4. Tokens expire automatically; access can also be cut off early by rotating the signing key or removing the stored access policy

Perfect for temporary access without exposing account keys.
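A SAS token is just a query string, so attaching it to a file URL is plain string work. A sketch with placeholder values (the token below is not a real signature):

```python
def build_sas_url(account: str, filesystem: str, path: str, sas_token: str) -> str:
    """Append a SAS token to an ADLS Gen2 file URL."""
    token = sas_token.lstrip("?")  # tolerate tokens copied with the '?' prefix
    return f"https://{account}.dfs.core.windows.net/{filesystem}/{path}?{token}"

url = build_sas_url("mydatalake", "raw-data", "exports/customers.csv",
                    "?sv=2020-08-04&ss=b&sig=FAKE")
print(url)
```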
### Service Principal (Azure AD)

Service principal authentication, recommended for production:

1. Register an Azure AD application
2. Grant the application ADLS permissions (for example, the **Storage Blob Data Reader** role)
3. Enter the Tenant ID, Client ID, and Client Secret
4. Click **Test Connection**

The most secure method for production environments.
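Under the hood, service-principal auth is the OAuth 2.0 client-credentials flow: the connector posts the client ID and secret to the tenant's Azure AD token endpoint and requests a token scoped to Azure Storage. A sketch of the request that gets built (the IDs are placeholders, and no network call is made here):

```python
from urllib.parse import urlencode

def token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the Azure AD client-credentials token request for Azure Storage."""
    endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Azure Storage data-plane APIs share this resource scope.
        "scope": "https://storage.azure.com/.default",
    })
    return endpoint, body

endpoint, body = token_request("11111111-2222-3333-4444-555555555555",
                               "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
                               "placeholder-secret")
print(endpoint)
```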
## Getting Started

### Step 1: Set Up Azure Storage Account

Create a storage account with ADLS Gen2 enabled by selecting the **Hierarchical namespace** option during account creation.
### Step 2: Create Container and Upload Data

Create a container (filesystem) in the account and upload the files you want to validate.
### Step 3: Configure Authentication

Choose your preferred authentication method (see Authentication Methods above).
### Step 4: Configure the Connector

1. Navigate to your job configuration
2. Select **Azure ADLS Gen2** as the data source
3. Enter the storage account and authentication details
4. Specify the container name
5. Enter the file path
6. Click **Test Connection**
7. Verify the file format and preview the data
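The values entered above combine into two equivalent addresses for the same file: an `abfss://` URI (used by Spark) and an HTTPS URL on the `dfs.core.windows.net` endpoint. A sketch of how they are assembled from the example values in the configuration table:

```python
def adls_paths(account: str, filesystem: str, file_path: str):
    """Return the abfss URI and HTTPS URL for a file in ADLS Gen2."""
    abfss = f"abfss://{filesystem}@{account}.dfs.core.windows.net/{file_path}"
    https = f"https://{account}.dfs.core.windows.net/{filesystem}/{file_path}"
    return abfss, https

abfss, https = adls_paths("mydatalake", "raw-data", "exports/customers.csv")
print(abfss)  # abfss://raw-data@mydatalake.dfs.core.windows.net/exports/customers.csv
```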
## Supported File Formats

| Format | Extension | Use Case | Support |
| --- | --- | --- | --- |
| CSV | `.csv` | Structured tabular data | ✅ Full |
| Parquet | `.parquet` | Columnar analytics format | ✅ Full |
| JSON | `.json` | Semi-structured data | ✅ Full |
| Avro | `.avro` | Data serialization | ✅ Full |
| ORC | `.orc` | Columnar format | ✅ Full |
| Text | `.txt` | Plain text files | ✅ Full |
### File Format Examples

CSV stores records as delimited rows of text, while Parquet is a columnar format that embeds the schema and compresses each column independently.
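Both formats hold the same logical table. The CSV side can be produced with only the standard library (writing the Parquet side would need a library such as pyarrow); the column names below are illustrative, not the connector's required schema:

```python
import csv, io

# Illustrative rows; the real customers.csv schema is not specified here.
rows = [
    {"customer_id": "C001", "name": "Ada", "country": "NL"},
    {"customer_id": "C002", "name": "Grace", "country": "US"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer_id", "name", "country"])
writer.writeheader()
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # customer_id,name,country
```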
## ADLS Gen2-Specific Features

### Hierarchical Namespace

Unlike flat blob storage, ADLS Gen2 provides a true directory structure: a path such as `raw-data/exports/customers.csv` is a real directory tree, not just a name prefix on a blob.
### Directory-Level Operations

Because directories are first-class objects, renaming or deleting a directory is a single atomic operation rather than a per-blob copy.
### Access Control

ADLS Gen2 supports POSIX-style access control lists (ACLs), so permissions can be set at both the file and directory level.
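ADLS Gen2 expresses these ACLs as comma-separated `scope:qualifier:permissions` entries (an empty qualifier means the owning user or group, and a leading `default:` marks entries inherited by new children). A small parser sketch, with a made-up object ID as the named-user qualifier:

```python
def parse_acl(acl: str):
    """Parse an ADLS Gen2 POSIX-style ACL string into entry dicts."""
    entries = []
    for entry in acl.split(","):
        fields = entry.split(":")
        is_default = fields[0] == "default"  # inherited-by-children entry
        if is_default:
            fields = fields[1:]
        scope, qualifier, perms = fields
        entries.append({"default": is_default, "scope": scope,
                        "qualifier": qualifier, "permissions": perms})
    return entries

acl = "user::rwx,group::r-x,other::---,user:1234-abcd:r--"
print(parse_acl(acl)[3]["qualifier"])  # 1234-abcd
```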
## ⚠️ File Size Limitations

**Maximum file size: 5 GB per file.**

Strategies for larger files:

- Split files by date/partition
- Use compressed columnar formats (Parquet, ORC)
- Archive old data separately
- Use the cool or archive blob access tier for cold storage
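The first strategy, splitting by date, can be sketched with the standard library; the column names below are illustrative, and on ADLS each bucket would become its own dated file:

```python
import csv, io
from collections import defaultdict

source = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2024-03-07,10.5\n"
    "2,2024-03-08,7.0\n"
    "3,2024-03-07,3.2\n"
)

# Group rows into one in-memory "file" per day; each bucket would be
# uploaded under its own dated directory instead of one oversized file.
buckets = defaultdict(list)
for row in csv.DictReader(source):
    buckets[row["order_date"]].append(row)

print(sorted(buckets))  # ['2024-03-07', '2024-03-08']
```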
### Partitioning Strategy

Organize files into date-based directories (for example, Hive-style `year=/month=/day=` folders) so each validation run reads only one partition.
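Date-based partition paths like these can be generated mechanically. A minimal sketch, assuming Hive-style `key=value` directory names and an illustrative file name:

```python
from datetime import date

def partition_path(base: str, day: date) -> str:
    """Build a Hive-style year=/month=/day= path for one day's data."""
    return (f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"
            f"/data.parquet")

print(partition_path("raw-data/exports", date(2024, 3, 7)))
# raw-data/exports/year=2024/month=03/day=07/data.parquet
```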
## Security Best Practices

Security essentials:

- ✅ Use Azure AD authentication (OAuth) for production
- ✅ Enable encryption at rest (Microsoft-managed or customer-managed keys)
Query ADLS data directly from a Synapse serverless SQL pool over the `dfs` endpoint:

```sql
-- Query ADLS data from Synapse
SELECT
    customer_id,
    COUNT(*) AS orders
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/customers.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE  -- take column names from the first row
) AS data
GROUP BY customer_id;
```
The same file can be read from Databricks (or any Spark runtime) through the `abfss://` endpoint:

```python
# Read from ADLS in Databricks
df = spark.read.csv(
    "abfss://[email protected]/customers.csv",
    header=True
)
df.show()
```