Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 takes core capabilities from Azure Data Lake Storage Gen1, such as a Hadoop-compatible file system, Azure Active Directory integration, and POSIX-based ACLs, and integrates them into Azure Blob Storage. This combination enables best-in-class analytics performance along with Blob Storage’s tiering and data lifecycle management capabilities and the fundamental availability, security, and durability capabilities of Azure Storage.

The simplest way to enable Hadoop applications to work with cloud storage is to build a Hadoop File System driver that runs client-side within the Hadoop application. This driver emulates a file system, converting Hadoop file system operations into operations on the backend of the respective platform. This is often inefficient, and correctness is hard to achieve. For instance, when the driver is implemented on an object store, a request by a Hadoop client to rename a folder (a common operation in Hadoop jobs) can result in many REST requests. This is because object stores use a flat namespace and have no notion of a folder. For example, renaming a folder with 5,000 items in it can result in 10,000 REST calls from the client: 5,000 to copy the child objects to the new destination and 5,000 to delete the originals. This approach performs poorly, often affecting the overall time a job takes to complete. It is also error-prone because the operation is not atomic, so a failure at any point will cause the job to fail with data left in an inconsistent state.
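The arithmetic in the rename example can be reproduced with a small in-memory model. The sketch below is only an illustration: `FlatObjectStore` and `rename_folder` are hypothetical names, not part of any Hadoop driver or Azure SDK, and listing requests are left out of the count for simplicity.

```python
class FlatObjectStore:
    """Toy in-memory object store with a flat namespace (keys are plain strings)."""

    def __init__(self):
        self.objects = {}
        self.rest_calls = 0  # counts simulated copy and delete requests only

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        self.rest_calls += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.rest_calls += 1
        del self.objects[key]

    def list_prefix(self, prefix):
        # A "folder" is nothing more than a shared key prefix.
        return [k for k in self.objects if k.startswith(prefix)]


def rename_folder(store, old_prefix, new_prefix):
    """Emulate a folder rename client-side: copy every child, then delete the originals."""
    children = store.list_prefix(old_prefix)
    for key in children:
        store.copy(key, new_prefix + key[len(old_prefix):])
    # A failure anywhere in these loops leaves data under both prefixes.
    for key in children:
        store.delete(key)


store = FlatObjectStore()
for i in range(5000):
    store.put(f"raw/part-{i:05d}", b"")

rename_folder(store, "raw/", "curated/")
print(store.rest_calls)  # 10000
```

Because the copy and delete phases are separate loops, an interruption partway through leaves some objects under both prefixes or only one, which is the inconsistent state described above.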

Azure Data Lake Storage Gen2 moves this file system logic server-side, alongside our Blob APIs, enabling the same data to be accessed via our Blob REST APIs or the new Azure Data Lake Storage Gen2 file system APIs. This enables a file system operation such as a rename to be performed in a single operation from the client. Server-side, we are still mapping these requests to our underlying Blob Storage and its flat namespace. This makes the approach more efficient than the first model and offers higher fidelity with Hadoop for some operations, but it doesn’t make these operations atomic.

In addition to moving the file system support server-side, we have also designed a cloud-scale hierarchical namespace that integrates directly into Azure Blob Storage. Diagram 3 shows how, once enabled, the hierarchical namespace provides first-class support for files and folders, including support for atomic operations such as copy and delete on files and folders. Namespace functionality is available to both the Azure Data Lake Storage Gen2 and Blob APIs, allowing for consistent usage across both sets of APIs. By general availability, the same data will be accessible using both the Blob and Azure Data Lake Storage Gen2 APIs with full coherence.
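To see why a hierarchical namespace changes the cost of folder operations, consider a toy model in which directories are real nodes instead of key prefixes. This is a sketch under stated assumptions, not a description of how Azure implements its namespace; `HierarchicalStore` and its methods are hypothetical names.

```python
class HierarchicalStore:
    """Toy namespace in which each directory is a dict of its children."""

    def __init__(self):
        self.root = {}
        self.metadata_ops = 0  # counts simulated rename operations

    def _dir(self, path, create=False):
        """Walk from the root to the directory at `path`, optionally creating it."""
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.setdefault(part, {}) if create else node[part]
        return node

    def put(self, path, data):
        parent, _, name = path.rpartition("/")
        self._dir(parent, create=True)[name] = data

    def rename(self, old_path, new_path):
        """Move one directory entry; the children are never touched."""
        old_parent, _, old_name = old_path.rpartition("/")
        new_parent, _, new_name = new_path.rpartition("/")
        src = self._dir(old_parent)
        if old_name not in src:
            raise FileNotFoundError(old_path)
        dst = self._dir(new_parent, create=True)
        dst[new_name] = src.pop(old_name)
        self.metadata_ops += 1


store = HierarchicalStore()
for i in range(5000):
    store.put(f"raw/part-{i:05d}", b"")

store.rename("raw", "curated")
print(store.metadata_ops)  # 1
```

In this model the rename is a single update to the parent directory, so it either happens or it does not, regardless of how many files the folder contains.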

Azure Data Lake Storage Gen2 builds on Blob Storage’s Azure Active Directory integration (in preview) and RBAC-based access controls. We are extending these capabilities, with the aid of the hierarchical namespace, to enable fine-grained POSIX-based ACL support on files and folders. Azure Active Directory integration and POSIX-based ACLs will be delivered during the preview.
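As a rough illustration of what a POSIX-style ACL check on a file or folder involves, the sketch below evaluates a short-form ACL string in the usual order: owner, named user, groups, then other, with the mask limiting named-user and group entries. The helper names and the sample ACL are hypothetical, and whether the service follows this evaluation order in every detail is an assumption, not something stated above.

```python
def parse_acl(acl):
    """Parse comma-separated 'scope:qualifier:perms' entries, e.g. 'user:alice:rw-'."""
    entries = []
    for item in acl.split(","):
        scope, qualifier, perms = item.strip().split(":")
        entries.append((scope, qualifier, set(perms.replace("-", ""))))
    return entries


def is_allowed(acl, owner, owning_group, user, groups, wanted):
    """Return True if `user` (a member of `groups`) holds every permission in `wanted`."""
    entries = parse_acl(acl)
    lookup = {(scope, qualifier): perms for scope, qualifier, perms in entries}
    mask = lookup.get(("mask", ""), set("rwx"))
    wanted = set(wanted)

    if user == owner:  # the owner entry is not limited by the mask
        return wanted <= lookup[("user", "")]
    if ("user", user) in lookup:  # named-user entry, limited by the mask
        return wanted <= (lookup[("user", user)] & mask)

    matches = []
    if owning_group in groups:
        matches.append(lookup[("group", "")])
    matches += [p for s, q, p in entries if s == "group" and q and q in groups]
    if matches:  # any matching group entry may grant access, limited by the mask
        return any(wanted <= (p & mask) for p in matches)

    return wanted <= lookup[("other", "")]


# Hypothetical ACL for a folder owned by user "bob" and group "eng".
acl = "user::rwx,user:alice:rw-,group::r-x,group:analysts:r--,mask::r-x,other::---"

print(is_allowed(acl, "bob", "eng", "bob", [], "rwx"))             # True
print(is_allowed(acl, "bob", "eng", "alice", [], "w"))             # False
print(is_allowed(acl, "bob", "eng", "carol", ["analysts"], "r"))   # True
print(is_allowed(acl, "bob", "eng", "dave", [], "r"))              # False
```

Note that alice's entry lists `rw-`, but the mask `r-x` removes write, which is why her write request is denied.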

For further queries & assistance

Prometix has expertise in implementing data lakes and integrating them with analytics. Our certified consultants in Sydney & Canberra have worked on numerous MS Teams deployments and app development projects.

As a Microsoft Gold certified partner, we have extensive experience in delivering Office 365-based records management solutions. We have Office 365 consultants in Melbourne & Sydney. For more information, please contact us at enquiries@prometix.com.au

 
