Overview of Azure Data Lake

Introduction

Data with 4 Vs – variety, volume, velocity and veracity is known as Big Data. As we know, there are lots of tools and techniques available to handle big data. Big data is shifting the data paradigm and increasing the expectations with data analytics. We should consider how to manage and store this massive data from its source and apply data analytic techniques to meet expectations. On the cloud, there are various ways to store the data. Microsoft Azure supports various types of data storage like Blob storage, File storage, Queue storage, Table storage and Azure Data Lake. Although, the use cases are relatively different to use any of these storages. However, Azure Data Lake is widely accepted to store massive size i.e. petabytes and zettabytes of data.

Azure Data Lake is a storage to store data as is, in its native form in the cloud. Data can be stored without introducing any change regardless of its size, structure, or how fast data is ingested. Azure Data Lake not only supports data storage but can also be used to apply analytical intelligence on stored data. Data Lake can store any type of data including massive datasets like high-resolution video, genomic and seismic datasets, IoT data, and data in structured, semi structured and unstructured format from a wide variety of industries. The supportive APIs can help to apply data analytical operations. This functionality stands first to Azure Data Lake in the list of other types of cloud data storage. Azure Data Lake gives you an opportunity to concentrate on achieving all your business goals without worrying about the handling of the data store and processing big datasets.

Azure Data Lake

Microsoft’s Azure Data Lake is as a hyper-scale repository for big data analytics and is compatible with the Hadoop distributed file system (HDFS). Azure Data Lake store has enterprise level capabilities like scalability, high availability, high throughput, security, manageability and durability etc., and can be accessed using WebHDFS (Rest APIs).   

Azure Data Lake can store data from all disparate sources, including cloud and on-premise sources of any size and structure. Any existing IT system can easily interface with data stored in Data Lake and refer to this data for further use.

Azure Data Lake

Figure 1 – Azure Data Lake

Azure Data Lake is broadly categorized into two areas: 1) data storage 2) data consumption and analytics.

It has all of the enterprise level capabilities for successful implementation and migration of applications. We will discuss some of the key capabilities:

HDFS compatibility. Azure Data Lake is compatible with Hadoop Distributed File System. With HDFS compatibility, organizations can easily migrate their existing application’s data to the cloud without introducing any change in HDFS directory structure.

Developers, data scientists, and data analysts can leverage WebHDFS APIs to process data and apply analytics to explore the data patterns in a single place without any constraints. Data Lake Store supports massive number of files storage where a single file can be petabyte in size, which is 200 times bigger than other cloud stores.

High throughput: Azure Data Lake supports massive throughput with parallel execution of massive queries so that results of queries (data patterns) can be returned quickly. Data Lake helps to run advance analytics on all stored large files/data under a single account. Azure Data Lake is built for small writes to manage high volume at low latency making the process optimized for near real-time data analytics on disparate data stored from various sensors, website logs, IoT and others.

Optimized data processing. Data Lake is capable of self-optimize the query execution by moving real query execution near to the source data without any data transfer. This helps in virtualizing the analytics with maximum the query performance and minimum the latency.

Enterprise ready. Azure Data Lake is a bundle of enterprise level capabilities, which are not limited to solution scalability, data replication to support high durability and encryption to secure the stored data. The major benefit for an enterprise is that Data Lake can be part of the existing data platform or solution as an important component, without introducing any change.

Scalable. Azure Data Lake supports high and instant scalable data processing.  The processing power required to execute analytic jobs concurrently can be available within seconds without expending effort to manage and tune any infrastructure. The Out of the box instant scalability is the key to getting massive throughput to support massive amounts of analytical workload. Data processing of various workloads, like data extract and load, machine learning and translation and image processing can be done using existing libraries available in .Net, R and Python languages. Also, U-SQL is an expressive and extensive language that supports scaling your code automatically by parallelizing the execution of code for the scale of processing needs.  

Secure. We know data security on the cloud is a concern for most of the cloud users and enterprises. Azure Data Lake has robust security provided to secure their users data. Azure Data Lake store protects data and extends on premise security at various levels to meet all security and regulatory compliance:

  1. SSL based data encryption while data in motion
  2. HSM-backed keys in Azure Key vault to secure data at rest
  3. Single sign-on and multi-factor authentication
  4. Seamless management of identities is built-in through Azure Active Directory
  5. Role based access control by using POSIX-based ACLs

Auditing and Support. For any successful program debugging of failures and errors, auditing and post production support are essential features. Microsoft’s Azure Data Lake provides all the required features for end to end successful implementation. Debugging of errors and operational support in distributed cloud environments is as easy as on-premise environments. Also, background services actively analyze all involved programs as they execute. Based on performance data captured, it gives recommendations to optimize the performance and help users to reduce cost.

Overall, Azure Data Lake store  maximizes the value of structure, semi structure and unstructured data  for enterprises. Also, Azure Data Lake is a secured, scalable cloud environment that supports  running massively-parallel analytics on top of stored data.

See all articles by Anoop Kumar

Anoop Kumar
Anoop Kumar
Anoop has 15+ years of IT experience, mostly in design and development of Enterprise Data warehouse and Business Intelligence solutions. Currently, Anoop is working on various Big Data and NoSQL based solution implementations. Anoop has written many online technical articles on Big Data, Hadoop, SQL Server and SSIS. On an education front, he has a Post Graduate degree in Computer Science from Birla Institute of Technology, Mesra, India. Disclaimer : I help people and businesses make better use of technology to realize their full potential. The opinions mentioned herein are solely mine and do not reflect those of my current employer or previous employers.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles