Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 takes core capabilities from Azure Data Lake Storage Gen1, such as a Hadoop-compatible file system, Azure Active Directory and POSIX-based ACLs, and integrates them into Azure Blob Storage. This combination enables best-in-class analytics performance along with Blob Storage’s tiering and data lifecycle management capabilities and the fundamental availability, security and durability capabilities of Azure Storage.
The simplest way to enable Hadoop applications to work with
cloud storage is to build a Hadoop File System driver which runs client-side
within the Hadoop applications. This driver emulates a file system, converting
Hadoop file system operations into operations on the backend of the respective
platforms. This is often inefficient, and correctness is hard to
achieve. For instance, when implemented on an object store, a request
by a Hadoop client to rename a folder (a common operation in Hadoop jobs) can
result in many REST requests. This is because object stores use a flat
namespace and don’t have the notion of a folder. For example, renaming a folder
with 5,000 items in it can result in 10,000 REST calls from the client:
5,000 to copy the child objects to a new destination and 5,000 to delete
the originals. This approach performs poorly, often affecting the overall
time a job takes to complete. It is also error-prone because the operation is
not atomic, and a failure at any point will result in the job failing with data
in an inconsistent state.
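The arithmetic behind the 10,000 calls can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: FlatObjectStore and rename_folder are hypothetical names, not part of any Azure or Hadoop API, and the request needed to list the children is not counted.

```python
class FlatObjectStore:
    """Toy in-memory object store with a flat namespace (hypothetical, not an Azure API)."""

    def __init__(self):
        self.objects = {}    # full key -> bytes; "folders" exist only as key prefixes
        self.rest_calls = 0  # counts simulated copy and delete requests only

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        self.rest_calls += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.rest_calls += 1
        del self.objects[key]


def rename_folder(store, old_prefix, new_prefix):
    """Emulate a folder rename client-side: copy every child, then delete the originals."""
    children = [k for k in store.objects if k.startswith(old_prefix)]
    for key in children:
        store.copy(key, new_prefix + key[len(old_prefix):])
    # A failure anywhere in either loop leaves data under both prefixes.
    for key in children:
        store.delete(key)


store = FlatObjectStore()
for i in range(5000):
    store.put(f"job/tmp/part-{i:05d}", b"data")

rename_folder(store, "job/tmp/", "job/out/")
print(store.rest_calls)  # 10000
```

The count grows linearly with the number of children, which is why a rename at the end of a large job can dominate its running time.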
Azure Data Lake Storage Gen2 moves this file system logic
server-side, alongside our Blob APIs, enabling the same data to be accessed via
our Blob REST APIs or the new Azure Data Lake Storage Gen2 file system APIs.
This enables a file system operation like a rename to be performed in a single
operation. Server-side, we still map these requests to our underlying
Blob Storage and its flat namespace. This makes the approach more efficient than the
first model and offers higher fidelity with Hadoop for some operations,
but it doesn’t make these operations atomic.
In addition to moving the file system support server-side,
we have also designed a cloud-scale hierarchical namespace that integrates
directly into Azure Blob Storage. Diagram 3 shows how, once enabled, the
hierarchical namespace provides first-class support for files and folders,
including support for atomic operations such as copy and delete on files and
folders. Namespace functionality is available to both Azure Data Lake Storage
Gen2 and Blob APIs, allowing for consistent usage across both sets of APIs. By
general availability, the same data will be accessible using both Blob and Azure
Data Lake Storage Gen2 APIs with full coherence.
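To see why a real hierarchy changes the cost of folder operations, here is a minimal sketch of a directory tree in which a folder is a first-class node. It is a toy model, not a description of how Azure implements the hierarchical namespace; Directory, resolve and rename are hypothetical names chosen for illustration.

```python
class Directory:
    """Toy directory node: maps child names to files (bytes) or sub-directories."""

    def __init__(self):
        self.children = {}


def resolve(root, path):
    """Walk from root to the directory at path, e.g. 'job/tmp'."""
    node = root
    for part in [p for p in path.split("/") if p]:
        node = node.children[part]
    return node


def rename(root, parent_path, old_name, new_name):
    """Rename a child by moving one entry in its parent; descendants are untouched."""
    parent = resolve(root, parent_path)
    parent.children[new_name] = parent.children.pop(old_name)


root = Directory()
job = root.children["job"] = Directory()
tmp = job.children["tmp"] = Directory()
for i in range(5000):
    tmp.children[f"part-{i:05d}"] = b"data"

# One metadata update, regardless of how many files the folder holds.
rename(root, "job", "tmp", "out")
print(len(resolve(root, "job/out").children))  # 5000
```

Because the folder is a single entry in its parent, an operation on it either happens or does not, with no partially moved state for a failed job to leave behind. A real service needs far more than this (durability, concurrency control, distribution across servers), but the contrast with a prefix-based flat namespace is the same.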
Azure Data Lake Storage Gen2 builds on Blob
Storage’s Azure Active Directory integration (in preview) and
RBAC-based access controls. We are extending these capabilities with the aid of
the hierarchical namespace to enable fine-grained POSIX-based ACL support on
files and folders. Azure Active Directory integration and POSIX-based ACLs will
be delivered during the preview.
For further queries & assistance
Prometix has expertise in implementing Data Lakes and integrating them with Analytics. Our certified consultants in Sydney &
Canberra have worked on numerous MS Teams deployments & app development projects.
As a Microsoft Gold certified partner, we have extensive experience in delivering
Office 365-based records management solutions. We have Office 365
consultants in Melbourne & Sydney. For more information, please contact us
at enquiries@prometix.com.au