Enterprise data lake on Microsoft Azure with hierarchical namespace, perfect for large-scale analytics and big data workloads.
## Overview
Azure Data Lake Storage (ADLS) Gen2 is a cloud-scale data lake solution that combines the capabilities of Azure Blob Storage with distributed file system semantics. It's ideal for organizations running analytics at scale with Hadoop, Spark, and other big data frameworks. The Data Testing connector integrates seamlessly with ADLS Gen2 for comprehensive file-based data validation.
Perfect for:

- ✅ Azure cloud data lake validation
- ✅ Enterprise analytics platforms
- ✅ Synapse Analytics integration
- ✅ Multi-tenant data storage
## ⚙️ Configuration Parameters

| Parameter | Description | Required | Example |
| --- | --- | --- | --- |
| Account Name | Storage account name | ✅ Yes | `mydatalake` |
| Account Key | Primary or secondary access key | ⚠️ Optional* | `••••••••••` |
| Connection String | Complete connection string | ⚠️ Optional* | `DefaultEndpointsProtocol=https...` |
| SAS Token | Shared Access Signature | ⚠️ Optional* | `sv=2020-08-04&...` |
| Tenant ID | Azure AD tenant ID | ⚠️ Optional* | `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` |
| Client ID | Azure AD app client ID | ⚠️ Optional* | `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` |
| Client Secret | Azure AD app secret | ⚠️ Optional* | `••••••••••` |
| Filesystem/Container | Container name | ✅ Yes | `raw-data` |
| File Path | Path to data file | ✅ Yes | `exports/customers.csv` |
| Rules File Path | Validation rules location | ⚠️ Optional | `rules/validation.json` |

\*Use exactly one authentication method: account key, connection string, SAS token, or a service principal (Tenant ID + Client ID + Client Secret).
### Configuration Screenshot

*(ADLS configuration screenshot)*
## Authentication Methods

### Account Key

Storage account connection using a primary or secondary key:

1. Go to Azure Portal → Storage Account
2. Navigate to **Access Keys**
3. Copy the account name and key
4. Enter them in the configuration
5. Click **Test Connection**

Simple and direct for development/test environments.
### Connection String

A complete connection string carrying all parameters in a single value.

Convenient for copying entire connection details at once.
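A connection string is just a semicolon-separated list of `key=value` pairs, so it can be pulled apart with a few lines of standard-library Python. A minimal sketch (the account name and key below are made-up placeholders):

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split an Azure connection string into its key=value parts."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        key, _, value = segment.partition("=")  # keys never contain '='
        parts[key] = value
    return parts

conn = ("DefaultEndpointsProtocol=https;"
        "AccountName=mydatalake;"
        "AccountKey=abc123==;"
        "EndpointSuffix=core.windows.net")
print(parse_connection_string(conn)["AccountName"])  # mydatalake
```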
### SAS Token

Time-limited access using a Shared Access Signature:

1. Generate a SAS token in the Azure Portal
2. Copy the full SAS token (including the `?` prefix)
3. Use it for temporary or limited access
4. Tokens expire automatically; access can also be cut off early by rotating the signing key or removing the stored access policy

Perfect for temporary access without exposing account keys.
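A SAS token is just a query string, so attaching it to a file URL is plain string work. A sketch with placeholder values (the token below is not a real signature):

```python
def build_sas_url(account: str, filesystem: str, path: str, sas_token: str) -> str:
    """Append a SAS token to an ADLS Gen2 file URL."""
    token = sas_token.lstrip("?")  # tolerate tokens copied with the '?' prefix
    return f"https://{account}.dfs.core.windows.net/{filesystem}/{path}?{token}"

url = build_sas_url("mydatalake", "raw-data", "exports/customers.csv",
                    "?sv=2020-08-04&ss=b&sig=FAKE")
print(url)
```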
### Service Principal (Azure AD)

Service principal authentication, recommended for production:

1. Register an Azure AD application
2. Grant the application ADLS permissions (for example, the **Storage Blob Data Reader** role)
3. Enter the Tenant ID, Client ID, and Client Secret
4. Click **Test Connection**

The most secure method for production environments.
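Under the hood, service-principal auth is the OAuth 2.0 client-credentials flow: the connector posts the client ID and secret to the tenant's Azure AD token endpoint and requests a token scoped to Azure Storage. A sketch of the request that gets built (the IDs are placeholders, and no network call is made here):

```python
from urllib.parse import urlencode

def token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the Azure AD client-credentials token request for Azure Storage."""
    endpoint = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Azure Storage data-plane APIs share this resource scope.
        "scope": "https://storage.azure.com/.default",
    })
    return endpoint, body

endpoint, body = token_request("11111111-2222-3333-4444-555555555555",
                               "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
                               "placeholder-secret")
print(endpoint)
```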
## Getting Started

### Step 1: Set Up Azure Storage Account

Create a storage account with ADLS Gen2 enabled by selecting the **Hierarchical namespace** option during account creation.
### Step 2: Create Container and Upload Data

Create a container (filesystem) in the account and upload the files you want to validate.
### Step 3: Configure Authentication

Choose your preferred authentication method (see Authentication Methods above).
### Step 4: Configure the Connector

1. Navigate to your job configuration
2. Select **Azure ADLS Gen2** as the data source
3. Enter the storage account and authentication details
4. Specify the container name
5. Enter the file path
6. Click **Test Connection**
7. Verify the file format and preview the data
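The values entered above combine into two equivalent addresses for the same file: an `abfss://` URI (used by Spark) and an HTTPS URL on the `dfs.core.windows.net` endpoint. A sketch of how they are assembled from the example values in the configuration table:

```python
def adls_paths(account: str, filesystem: str, file_path: str):
    """Return the abfss URI and HTTPS URL for a file in ADLS Gen2."""
    abfss = f"abfss://{filesystem}@{account}.dfs.core.windows.net/{file_path}"
    https = f"https://{account}.dfs.core.windows.net/{filesystem}/{file_path}"
    return abfss, https

abfss, https = adls_paths("mydatalake", "raw-data", "exports/customers.csv")
print(abfss)  # abfss://raw-data@mydatalake.dfs.core.windows.net/exports/customers.csv
```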
## Supported File Formats

| Format | Extension | Use Case | Support |
| --- | --- | --- | --- |
| CSV | `.csv` | Structured tabular data | ✅ Full |
| Parquet | `.parquet` | Columnar analytics format | ✅ Full |
| JSON | `.json` | Semi-structured data | ✅ Full |
| Avro | `.avro` | Data serialization | ✅ Full |
| ORC | `.orc` | Columnar format | ✅ Full |
| Text | `.txt` | Plain text files | ✅ Full |
### File Format Examples

CSV stores records as delimited rows of text, while Parquet is a columnar format that embeds the schema and compresses each column independently.
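Both formats hold the same logical table. The CSV side can be produced with only the standard library (writing the Parquet side would need a library such as pyarrow); the column names below are illustrative, not the connector's required schema:

```python
import csv, io

# Illustrative rows; the real customers.csv schema is not specified here.
rows = [
    {"customer_id": "C001", "name": "Ada", "country": "NL"},
    {"customer_id": "C002", "name": "Grace", "country": "US"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer_id", "name", "country"])
writer.writeheader()
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text.splitlines()[0])  # customer_id,name,country
```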
## ADLS Gen2-Specific Features

### Hierarchical Namespace

Unlike flat blob storage, ADLS Gen2 provides a true directory structure: a path such as `raw-data/exports/customers.csv` is a real directory tree, not just a name prefix on a blob.
### Directory-Level Operations

Because directories are first-class objects, renaming or deleting a directory is a single atomic operation rather than a per-blob copy.
### Access Control

ADLS Gen2 supports POSIX-style access control lists (ACLs), so permissions can be set at both the file and directory level.
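ADLS Gen2 expresses these ACLs as comma-separated `scope:qualifier:permissions` entries (an empty qualifier means the owning user or group, and a leading `default:` marks entries inherited by new children). A small parser sketch, with a made-up object ID as the named-user qualifier:

```python
def parse_acl(acl: str):
    """Parse an ADLS Gen2 POSIX-style ACL string into entry dicts."""
    entries = []
    for entry in acl.split(","):
        fields = entry.split(":")
        is_default = fields[0] == "default"  # inherited-by-children entry
        if is_default:
            fields = fields[1:]
        scope, qualifier, perms = fields
        entries.append({"default": is_default, "scope": scope,
                        "qualifier": qualifier, "permissions": perms})
    return entries

acl = "user::rwx,group::r-x,other::---,user:1234-abcd:r--"
print(parse_acl(acl)[3]["qualifier"])  # 1234-abcd
```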
## ⚠️ File Size Limitations

**Maximum file size: 5 GB per file.**

Strategies for larger files:

- Split files by date/partition
- Use compressed columnar formats (Parquet, ORC)
- Archive old data separately
- Use the cool or archive blob access tier for cold storage
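The first strategy, splitting by date, can be sketched with the standard library; the column names below are illustrative, and on ADLS each bucket would become its own dated file:

```python
import csv, io
from collections import defaultdict

source = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2024-03-07,10.5\n"
    "2,2024-03-08,7.0\n"
    "3,2024-03-07,3.2\n"
)

# Group rows into one in-memory "file" per day; each bucket would be
# uploaded under its own dated directory instead of one oversized file.
buckets = defaultdict(list)
for row in csv.DictReader(source):
    buckets[row["order_date"]].append(row)

print(sorted(buckets))  # ['2024-03-07', '2024-03-08']
```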
### Partitioning Strategy

Organize files into date-based directories (for example, Hive-style `year=/month=/day=` folders) so each validation run reads only one partition.
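Date-based partition paths like these can be generated mechanically. A minimal sketch, assuming Hive-style `key=value` directory names and an illustrative file name:

```python
from datetime import date

def partition_path(base: str, day: date) -> str:
    """Build a Hive-style year=/month=/day= path for one day's data."""
    return (f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}"
            f"/data.parquet")

print(partition_path("raw-data/exports", date(2024, 3, 7)))
# raw-data/exports/year=2024/month=03/day=07/data.parquet
```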
## Security Best Practices

Security essentials:

- ✅ Use Azure AD authentication (OAuth) for production
- ✅ Enable encryption at rest (Microsoft-managed or customer-managed keys)
Query ADLS data directly from a Synapse serverless SQL pool over the `dfs` endpoint:

```sql
-- Query ADLS data from Synapse
SELECT
    customer_id,
    COUNT(*) AS orders
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/customers.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE  -- take column names from the first row
) AS data
GROUP BY customer_id;
```
The same file can be read from Databricks (or any Spark runtime) through the `abfss://` endpoint:

```python
# Read from ADLS in Databricks
df = spark.read.csv(
    "abfss://[email protected]/customers.csv",
    header=True
)
df.show()
```