Azure Data Lake Gen1 & Gen2
What is Azure Data Lake?
Azure Data Lake is Microsoft's managed cloud platform for storing and analyzing large volumes of data. Azure Data Lake Gen2 supports upload, ingestion, and processing of large datasets in the cloud without requiring you to provision or manage the storage infrastructure yourself. With Data Lake you can build a single place for all your data to live, where it can be accessed by many tools and analyzed by different kinds of analytic applications.
Azure Data Lake is a proprietary, managed Azure service (it is not open source) that lets users store massive amounts of data in the cloud. Analytics engines that connect to the store can process both structured and unstructured data in parallel. Storage is billed according to the volume of data stored and the operations performed on it.
Three names come up in this comparison: Azure Data Lake Analytics, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. Data Lake Analytics is a job service for processing data, while Gen1 and Gen2 are two generations of the storage service. Comparing their features and capabilities helps you evaluate your use cases and determine which one best fits your needs.
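Azure Data Lake Gen 1 and Gen 2 are two generations of the same Azure storage service. Gen 1 is a standalone storage service for data lakes, while Gen 2 is built on top of Azure Blob Storage and adds a hierarchical namespace to it. Neither one is a data processing framework in itself; processing is done by analytics engines that read from the store.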
Azure Data Lake Gen 1
Azure Data Lake Gen 1 is a petabyte-scale data lake in the cloud that brings together structured and unstructured data in disparate formats on a single, secure, highly available storage platform.
Azure Data Lake Gen 1 helps you embrace the concept of “data gravity” by enabling you to combine all of your data into one location, instead of separating it into different silos.
Azure Data Lake Gen 1 is a fully managed, Hadoop-compatible store for big data analytics in the cloud. With Azure Data Lake Gen 1, customers can process massive volumes of data with Apache Hadoop ecosystem technologies such as Hive, Pig, and Spark, using languages like R, Python, and Scala.
Data Lake Gen 1 is a fully managed service that provides an on-demand storage pool within Azure. It is designed to ingest massive amounts of data from any source and to keep that data in one place where analytics services can process it.
Supporting standard interfaces such as a WebHDFS-compatible REST API, the Data Lake Gen1 data store can be written to from a wide range of data sources and scales to billions of files.
Azure Data Lake is a cloud-native data platform that extends beyond simple Hadoop-compatible storage. It lets you build and deploy highly scalable data services, and it integrates with Azure services for deep analytics, machine learning, and security.
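One of the interfaces Gen 1 exposes is a WebHDFS-compatible REST API, addressed at `<account>.azuredatalakestore.net`. As a minimal sketch of what such a request URL looks like, the helper below builds one with the standard library only. The account name `contosolake` and the folder are hypothetical, and a real request would also need an Azure Active Directory bearer token, which is not shown here.

```python
from urllib.parse import quote, urlencode


def gen1_webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style request URL for a Data Lake Gen 1 account."""
    # Gen 1 accounts are addressed as <account>.azuredatalakestore.net.
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    # Percent-encode the path but keep "/" as the folder separator.
    encoded_path = quote(path.lstrip("/"), safe="/")
    # "op" selects the WebHDFS operation, for example LISTSTATUS or OPEN.
    query = urlencode({"op": op, **params})
    return f"{base}{encoded_path}?{query}"


# Hypothetical account and folder, for illustration only.
print(gen1_webhdfs_url("contosolake", "/raw/sales 2023", "LISTSTATUS"))
```

The function only formats the URL; it does not contact the service.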
Azure Data Lake Gen 2
Azure Data Lake Gen 2 is Microsoft's newer cloud-based big data storage solution that enables you to combine structured, semi-structured, and unstructured data in one place.
It is format-agnostic, so you can store files in formats such as Apache Parquet and Apache ORC and query them with SQL-based or other engines that connect to the store. You can also use Azure Data Lake Gen 2 with analytical engines like Apache Spark.
Azure Data Lake Gen 2 is a solution for large-scale, data-intensive workloads that require massive parallelism and the ability to process data at rest. It supports a large variety of data formats, including text files like CSV and JSON, columnar files like Parquet, and binary content such as images and videos.
Azure Data Lake Gen 2 is a fully managed data lake that allows users to store and process massive amounts of data. It is built on Azure Blob Storage, with a hierarchical namespace added on top, and it can serve as the storage layer for services such as Azure HDInsight. Capacity is bounded by the scale targets of the underlying storage account rather than by a small fixed quota.
Azure Data Lake Gen 2 is the next generation of the Azure Data Lake Store platform. It brings significant enhancements to the service, including a hierarchical namespace on top of Blob Storage, lower storage cost through access tiers, SDKs for several programming languages, and built-in security features. This article describes the differences between Azure Data Lake Gen 1 and Gen 2.
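Gen 2 layers a hierarchical namespace over the flat object namespace of Blob Storage. The toy model below is a sketch of why that matters, not a description of the service's internals: in a flat namespace a "directory" is only a shared key prefix, so renaming it touches every object, while in a hierarchical namespace the directory is a real node that can be re-linked in one step. All names and data structures here are hypothetical.

```python
from typing import Dict, Set, Tuple


def rename_flat(keys: Set[str], old_prefix: str, new_prefix: str) -> Tuple[Set[str], int]:
    """Flat namespace: rewrite every key that starts with the old prefix."""
    renamed = set()
    touched = 0
    for key in keys:
        if key.startswith(old_prefix):
            renamed.add(new_prefix + key[len(old_prefix):])
            touched += 1
        else:
            renamed.add(key)
    return renamed, touched


def rename_hierarchical(tree: Dict[str, dict], old_name: str, new_name: str) -> int:
    """Hierarchical namespace: re-link one directory node, whatever it contains."""
    tree[new_name] = tree.pop(old_name)
    return 1


flat_keys = {"logs/2024/a.csv", "logs/2024/b.csv", "raw/c.json"}
new_keys, flat_ops = rename_flat(flat_keys, "logs/", "archive/")

tree = {"logs": {"2024": {"a.csv": {}, "b.csv": {}}}, "raw": {"c.json": {}}}
tree_ops = rename_hierarchical(tree, "logs", "archive")

print(flat_ops, tree_ops)  # objects touched in each model
```

In this model the flat rename cost grows with the number of files under the prefix, while the hierarchical rename stays at a single operation.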
Difference Between Gen 1 vs Gen 2
|Area|ADLS GEN 1|ADLS GEN 2|
|---|---|---|
|Data types|Stores files of any type, structured or unstructured, in a hierarchical file system.|Also stores files of any type (text, JSON, CSV, Parquet, binary), on top of Azure Blob Storage with a hierarchical namespace.|
|Processing|Processed with Hadoop ecosystem engines (MapReduce, Hive, Spark) on HDInsight, or with U-SQL jobs in Azure Data Lake Analytics.|Processed with Apache Spark, Apache Hive, and other engines that read the data through the ABFS driver.|
|Retention|No account-level lifecycle policy; data is kept until you delete it or set an expiry on individual files.|Lifecycle management policies can automatically delete data, or move it to a cooler tier, after a set period.|
|Pricing|You pay per gigabyte stored, plus transaction charges.|You also pay for storage used and for transactions, at Blob Storage rates, with hot, cool, and archive tiers to lower cost. You are not billed by the amount of data your job scripts process.|
|Capacity|No fixed limit on account size or file size.|Bounded by the scale targets of the storage account, which are large enough for most data lakes.|
|Security|Azure Active Directory authentication, POSIX-style ACLs, and encryption at rest.|Azure Active Directory authentication, role-based access control, POSIX-style ACLs, shared access signatures, and encryption at rest.|
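The time-to-live behavior described for Gen 2 is configured through a lifecycle management policy on the storage account. As a rough sketch, the helper below builds such a policy document as JSON. The rule name and the `raw/landing/` prefix are hypothetical, and the layout follows the lifecycle policy schema as I understand it, so verify it against the current Azure Storage documentation before applying it.

```python
import json


def delete_after_days_policy(rule_name: str, prefix: str, days: int) -> dict:
    """Return a lifecycle policy that deletes blobs `days` after their last modification."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "rules": [
            {
                "enabled": True,
                "name": rule_name,
                "type": "Lifecycle",
                "definition": {
                    # Limit the rule to block blobs under one container/folder prefix.
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    # Delete the current version once it has been unmodified for `days`.
                    "actions": {
                        "baseBlob": {
                            "delete": {"daysAfterModificationGreaterThan": days}
                        }
                    },
                },
            }
        ]
    }


# Hypothetical rule: expire landing-zone files after 90 days.
policy = delete_after_days_policy("expire-landing-zone", "raw/landing/", 90)
print(json.dumps(policy, indent=2))
```

The resulting JSON is only the policy body; applying it to an account is done separately through the portal, CLI, or an SDK.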
How to Create Azure Data Lake Storage Gen 1?
Process to create Azure Data Lake Storage Gen 1 (Microsoft retired Gen 1 on February 29, 2024, so these steps are kept for reference):
Step 1: Open the Azure portal at https://portal.azure.com and sign in with your Microsoft account.
Step 2: Click Create a resource (labeled New in older versions of the portal), then search for and select Data Lake Storage Gen1, formerly called Data Lake Store.
Step 3: Enter a name for your Data Lake Store, choose the subscription and resource group, and select the Azure region where you want to create it.
Step 4: Click Create.
Step 5: Once the Data Lake Store has been created, click on its name in the Azure portal.
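Step 6: Gen 1 accounts do not have account access keys; access is granted to Azure Active Directory identities. Open Access control (IAM) to assign roles on the account, and use Data Explorer to set permissions (ACLs) on folders and files.
Step 7: To use the account from an application, register the application in Azure Active Directory and grant its service principal access to the folders it needs.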
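Access to folders and files in Data Lake Storage is expressed as POSIX-style ACL entries assigned to Azure Active Directory identities, commonly written in a short form such as `user::rwx,group::r-x,other::---`. The sketch below parses that short form with the standard library. The object ID `analyst-object-id` is a placeholder, and the parser only reads entries; it does not evaluate effective permissions against the mask.

```python
from typing import List, NamedTuple, Optional


class AclEntry(NamedTuple):
    scope: str        # "access" or "default"
    kind: str         # "user", "group", "mask", or "other"
    principal: str    # empty string means the owning user or owning group
    permissions: str  # three characters, for example "r-x"


def parse_acl(acl: str) -> List[AclEntry]:
    """Parse a comma-separated POSIX-style ACL string into entries."""
    entries = []
    for raw in acl.split(","):
        parts = raw.strip().split(":")
        scope = "access"
        if parts[0] == "default":
            scope, parts = "default", parts[1:]
        if len(parts) != 3:
            raise ValueError(f"malformed ACL entry: {raw!r}")
        kind, principal, perms = parts
        if kind not in {"user", "group", "mask", "other"}:
            raise ValueError(f"unknown ACL entry type: {kind!r}")
        # Each position must be its flag letter (r, w, x) or "-".
        if len(perms) != 3 or any(ch not in (flag, "-") for ch, flag in zip(perms, "rwx")):
            raise ValueError(f"invalid permissions: {perms!r}")
        entries.append(AclEntry(scope, kind, principal, perms))
    return entries


def permissions_for(entries: List[AclEntry], kind: str, principal: str = "") -> Optional[str]:
    """Return the access-scope permissions recorded for one named entry, if present."""
    for entry in entries:
        if entry.scope == "access" and entry.kind == kind and entry.principal == principal:
            return entry.permissions
    return None


acl = "user::rwx,user:analyst-object-id:r-x,group::r-x,mask::r-x,other::---"
entries = parse_acl(acl)
print(permissions_for(entries, "user", "analyst-object-id"))
```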
How to Create Azure Data Lake Storage Gen 2?
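Gen 2 is not a setting you switch on inside a Gen 1 account. A Gen 2 account is an Azure Storage account with the hierarchical namespace enabled, and data from an existing Gen 1 account has to be migrated into it. To create the Gen 2 account, follow the steps below:
Step 1: Log into your Azure portal.
Step 2: Click Create a resource and select Storage account.
Step 3: Choose the subscription and resource group, enter a name for the storage account, and select the region.
Step 4: On the Advanced tab, enable the hierarchical namespace option. This is the setting that makes the account a Data Lake Storage Gen 2 account.
Step 5: Click Review + create, then Create. Once the account is deployed, create a container in it to hold your data.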
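In resource terms, a Gen 2 account is a general-purpose v2 storage account with the hierarchical namespace switched on. As a hedged sketch, the helper below builds the corresponding ARM template resource as a Python dictionary. The account name, region, SKU, and API version are illustrative assumptions; `isHnsEnabled` is the property that enables the hierarchical namespace.

```python
import json


def gen2_storage_account_resource(name: str, location: str) -> dict:
    """Return an ARM template resource for a storage account with hierarchical namespace."""
    # Storage account names are 3-24 characters, lowercase letters and digits only.
    valid = 3 <= len(name) <= 24 and name.isascii() and name.isalnum() and name == name.lower()
    if not valid:
        raise ValueError(f"invalid storage account name: {name!r}")
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": "2022-09-01",          # assumed API version
        "name": name,
        "location": location,
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},      # assumed redundancy option
        "properties": {"isHnsEnabled": True},  # turns the account into a Gen 2 data lake
    }


# Hypothetical account name and region.
print(json.dumps(gen2_storage_account_resource("contosolake2", "westeurope"), indent=2))
```

The dictionary can be placed in the `resources` array of an ARM template; deploying it is outside the scope of this sketch.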
Azure Data Lake Storage BI
Azure Data Lake Storage is a cloud-based storage service for analytics that lets you store data as objects in containers. It is designed for big data workloads and gives analytics and BI tools a single place from which to access your data.
Data Lake Storage Gen 2 is the next generation of this service and includes some new features, like faster processing speed and increased scalability.
Azure Data Lake Storage Architecture
The Azure Data Lake Storage architecture can be viewed as three layers:
The object layer is the Azure Blob Storage service, which stores the data as objects in containers. With this layer, you can store large amounts of data that is not suited to relational databases but still needs to be analyzed.
The file system layer adds a hierarchical namespace on top of the object store, so data is organized into real directories and files. Hadoop-compatible engines reach this layer through the ABFS driver.
The query engine layer consists of engines that run on top of the storage, such as Apache Spark, which lets you run interactive SQL queries on your data lake. These engines are separate services and are not part of the storage account itself.
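Engines such as Spark address data in a Gen 2 account with an ABFS URI of the form `abfss://<container>@<account>.dfs.core.windows.net/<path>`. A small sketch of building and parsing that URI with the standard library follows. The container, account, and file names are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

DFS_SUFFIX = ".dfs.core.windows.net"


class AbfsLocation(NamedTuple):
    container: str
    account: str
    path: str


def build_abfss_uri(container: str, account: str, path: str) -> str:
    """Compose the URI an engine would use to reach a file or folder."""
    return f"abfss://{container}@{account}{DFS_SUFFIX}/{path.lstrip('/')}"


def parse_abfss_uri(uri: str) -> AbfsLocation:
    """Split an abfs:// or abfss:// URI into container, account, and path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    host = parts.hostname or ""
    # The container sits in the user-info position, before the "@".
    if not parts.username or not host.endswith(DFS_SUFFIX):
        raise ValueError(f"unexpected ABFS authority: {parts.netloc!r}")
    return AbfsLocation(parts.username, host[: -len(DFS_SUFFIX)], parts.path.lstrip("/"))


uri = build_abfss_uri("raw", "contosolake", "/sales/2024/data.parquet")
print(uri)
print(parse_abfss_uri(uri))
```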
Azure Data Lake Gen 2 Disaster Recovery Module
Disaster recovery for Azure Data Lake Gen 2 is not a separate module. It relies on the redundancy options of the underlying storage account and on copies of the data that you maintain, so that you can protect existing data lakes and restore them in the event of an outage.
You can choose geo-redundant storage so that data is replicated to a secondary region, and you can also copy your data lake to another storage account in a different region or subscription with tools such as AzCopy or Azure Data Factory. This keeps a copy of your data available if a disaster occurs in one location, although recovery still depends on how recent that copy is.
Azure Backup is a separate Microsoft service that is used to protect workloads such as SQL Server, SharePoint, and Exchange. Whether it can protect a Data Lake Gen 2 account depends on the features it currently supports, so check its documentation before relying on it.
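Backing up a data lake to a second storage account comes down to deciding which files differ between the two. The sketch below compares two manifests that map file paths to content hashes and returns what to copy and what to remove on the secondary side. The manifests, paths, and hash values are hypothetical, and the actual copy would be done by a tool such as AzCopy or by an SDK, which is not shown.

```python
from typing import Dict, List, Tuple


def plan_replication(primary: Dict[str, str], secondary: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare path -> content-hash manifests and return (to_copy, to_delete)."""
    # Copy anything missing from the secondary or stored there with a different hash.
    to_copy = sorted(path for path, digest in primary.items() if secondary.get(path) != digest)
    # Remove anything that no longer exists in the primary.
    to_delete = sorted(path for path in secondary if path not in primary)
    return to_copy, to_delete


primary = {"raw/a.csv": "h1", "raw/b.csv": "h2", "curated/c.parquet": "h3"}
secondary = {"raw/a.csv": "h1", "raw/b.csv": "old", "tmp/x.json": "h9"}

to_copy, to_delete = plan_replication(primary, secondary)
print(to_copy)    # files to upload to the secondary account
print(to_delete)  # files to remove from the secondary account
```

Whether to delete from the secondary at all is a design choice: keeping deleted files there protects against accidental deletion in the primary, at the cost of extra storage.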
To conclude, Azure Data Lake Gen 1 and Gen 2 are two generations of the same data lake storage service. Both operate on the same concept of storing large amounts of data of any kind in one account, and you can query that data with tools like Azure Synapse Analytics (formerly Azure SQL Data Warehouse).
The biggest difference is that Gen 2 is built on Azure Blob Storage, so it inherits Blob Storage features such as access tiers and lifecycle management while adding a hierarchical namespace. Both generations work with engines such as Spark and with Azure Data Factory.
The main reasons for going with Azure Data Lake Gen 2 are its performance, its scaling capabilities, and its lower cost. Because Microsoft retired both Gen 1 and Azure Data Lake Analytics on February 29, 2024, Gen 2 is also the only one of the two available for new workloads.
Azure Data Lake Gen 2 is designed to be faster and cheaper than Azure Data Lake Gen 1. It works with Hadoop ecosystem engines like Spark and Hive through the ABFS driver, which provides HDFS-compatible access, and it comes with managed security and high availability.