Azure Data Lake¶
Azure Data Lake can be used to store any type of data, whether it is:
Unstructured data: json, txt, csv, you name it
Semi-structured data: json or xml, along with schema definition that is read at run-time
Structured data: databases
Azure Data Lake lets users take the ELT (Extract, Load, Transform) approach instead of the ETL (Extract, Transform, Load) approach: raw data is loaded first and transformed later, when it is needed.
The SQL language used with Azure Data Lake is U-SQL, a hybrid of T-SQL and C#. It is extensible, and modules can be written in C#.
Azure Data Lake Store exposes an HDFS-compatible file system.
Azure Data Lake databases are not relational. Relationships are virtual.
Data skew happens when data is partitioned in a way that leaves one or more vertices with significantly more work to do than the other vertices.
Extents¶
When data is uploaded, the file is split into extents:
Usually 250 MB each
Normally replicated 3 times
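As a rough illustration of the figures above, the number of extents and stored copies for a file can be estimated with a little arithmetic. The 250 MB extent size and the replication factor of 3 come from these notes; the function name `estimate_extents` and the return shape are made up for this sketch and are not part of any Azure API.

```python
import math

EXTENT_SIZE_MB = 250      # typical extent size (from the notes)
REPLICATION_FACTOR = 3    # typical number of replicas (from the notes)

def estimate_extents(file_size_mb: float) -> dict:
    """Estimate how many extents a file is split into and how many copies are stored."""
    if file_size_mb < 0:
        raise ValueError("file_size_mb must not be negative")
    # A partially filled extent still counts as one extent, hence the ceiling.
    extents = math.ceil(file_size_mb / EXTENT_SIZE_MB)
    return {
        "extents": extents,
        "stored_extent_copies": extents * REPLICATION_FACTOR,
    }

# A 1000 MB file, used here only as an example input.
print(estimate_extents(1000))
```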
Vertices¶
U-SQL jobs are processed by vertices:
A vertex is a virtual machine (dual-core processor and 6 GB RAM at the time of writing)
Each vertex performs a particular task
Each vertex works on around 1 GB of data, which amounts to about 4 extents
A job normally uses 5 vertices to execute the task; this number is configurable
A vertex must complete within 5 hours
Maximum memory: 5 GB
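To get a feel for these numbers, here is a back-of-the-envelope sketch of how much vertex work a job of a given size implies. It assumes that the "5 vertices" figure means five vertices running at the same time, and that work is processed in simple rounds; the real scheduler works in stages and is more involved. `plan_job` and its return keys are hypothetical names for this illustration only.

```python
import math

DATA_PER_VERTEX_GB = 1.0   # approximate data handled by one vertex (from the notes)
DEFAULT_PARALLELISM = 5    # vertices assumed to run concurrently (from the notes)

def plan_job(data_size_gb: float, parallelism: int = DEFAULT_PARALLELISM) -> dict:
    """Estimate the vertex count and the number of rounds needed to run them."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    vertices = math.ceil(data_size_gb / DATA_PER_VERTEX_GB)
    # With a fixed number of concurrent vertices, the rest have to wait their turn.
    rounds = math.ceil(vertices / parallelism)
    return {"vertices": vertices, "rounds": rounds}

print(plan_job(12))                   # default parallelism
print(plan_job(12, parallelism=20))   # more parallelism, fewer rounds
```

Raising the parallelism only helps while there are more vertices than concurrent slots; beyond that point the extra capacity sits idle.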
Azure Data Lake Accounts¶
Consists of:
Store account
Analytics account
Azure Data Lake Analytics can also query data in other stores e.g. Azure Blob Storage, SQL Server in Azure etc.
Partitioning¶
Horizontal
Also known as:
sharding
fine grained partitioning
automatic partitioning
Metropolitan
North Wales
Manchester
Vertical
Also known as:
coarse grained partitioning
manual partitioning
Metropolitan | North Wales | Manchester
Metropolitan | | Manchester
| | Manchester
Normalization is one way to achieve vertical partitioning, but vertical partitioning goes one step further: it can also be used to split data that is already normalized (for reasons such as performance or availability).
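The difference between the two approaches can be sketched on a tiny in-memory data set: horizontal partitioning splits by rows, vertical partitioning splits by columns. Only the region names come from these notes; the other columns (`id`, `incidents`, `report`) and both function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical records; only the region names appear in the notes.
rows = [
    {"id": 1, "region": "Metropolitan", "incidents": 12, "report": "long text A"},
    {"id": 2, "region": "North Wales", "incidents": 3, "report": "long text B"},
    {"id": 3, "region": "Manchester", "incidents": 7, "report": "long text C"},
    {"id": 4, "region": "Manchester", "incidents": 5, "report": "long text D"},
]

def horizontal_partition(rows, key):
    """Split by rows: each partition keeps every column, but only rows sharing a key value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

def vertical_partition(rows, id_key, column_groups):
    """Split by columns: each partition keeps every row, but only some columns plus the id."""
    return [
        [{id_key: row[id_key], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]

by_region = horizontal_partition(rows, "region")
hot, cold = vertical_partition(rows, "id", [["region", "incidents"], ["report"]])

print(sorted(by_region))   # one partition per region
print(hot[0], cold[0])     # same row, split across two column groups
```

Keeping the `id` in every vertical partition is what allows the column groups to be joined back together later.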
Functional
Data is grouped according to how it is used by each bounded context, e.g. in an e-commerce system, invoice data is partitioned separately from product inventory data
Microsoft recommends a partition size of about 1 GB for maximum performance
Distribution Scheme¶
Partitioning determines how the data is stored. Distribution determines where the data is stored and you can control that when working with Databases/U-SQL.
In Azure Data Lake, distribution means how horizontal partitioning separates the data.
There are four ways to horizontally partition the data:
Hash Keys: Data Lake hashes the key values and distributes the data based on the hash value
Direct Hash: You supply a numeric value yourself, typically generated with an analytic function such as RANK or DENSE_RANK, and it is used in place of a computed hash. This gives you more control.
Range Keys: Data is distributed according to ranges of the key values
Round Robin: No keys are needed. Data is distributed evenly.
Data skew can happen with hash keys, direct hash, and range keys. Round robin avoids it but can lead to more data reads, because related rows might not be stored together.
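The trade-off can be seen in a small simulation. This is a sketch of the general techniques, not of how Data Lake implements them: the CRC32 hash, the bucket counts, the sample keys, and the `skew_ratio` measure (largest bucket divided by the average bucket size) are all assumptions chosen for the illustration.

```python
import zlib
from bisect import bisect_right
from collections import Counter

def hash_distribute(keys, buckets):
    """Hash keys: identical key values always land in the same bucket."""
    return [zlib.crc32(str(k).encode("utf-8")) % buckets for k in keys]

def range_distribute(keys, boundaries):
    """Range keys: bucket i holds keys below boundaries[i]; the last bucket holds the rest."""
    return [bisect_right(boundaries, k) for k in keys]

def round_robin_distribute(keys, buckets):
    """Round robin: ignore the key and deal rows out in turn."""
    return [i % buckets for i, _ in enumerate(keys)]

def skew_ratio(assignments, buckets):
    """Largest bucket size divided by the average bucket size (1.0 means perfectly even)."""
    counts = Counter(assignments)
    return max(counts.values()) / (len(assignments) / buckets)

# A deliberately skewed sample: most rows share one key value.
keys = ["Manchester"] * 8 + ["Metropolitan"] * 2 + ["North Wales"] * 2
BUCKETS = 4

hashed = hash_distribute(keys, BUCKETS)
dealt = round_robin_distribute(keys, BUCKETS)

print("hash skew:       ", skew_ratio(hashed, BUCKETS))
print("round robin skew:", skew_ratio(dealt, BUCKETS))
print("range buckets:   ", range_distribute([1, 5, 10, 15, 25], [10, 20]))
```

With hashing, all eight "Manchester" rows end up in one bucket, so one vertex gets most of the work. Round robin spreads the rows evenly, but a query for "Manchester" then has to read every bucket.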
Data Lake Gen1 vs Gen2¶
Gen1 is essentially HDFS in the cloud
Gen2 is built on Azure Blob Storage, with a hierarchical namespace and HDFS-compatible access added on top