What is PolyBase in Azure?

What is PolyBase in Azure Data Factory?

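PolyBase is a data virtualization feature of SQL Server and Azure SQL Data Warehouse (now part of Azure Synapse Analytics), not of Azure Data Factory itself. It lets you query external structured and unstructured data with T-SQL and combine it with relational data for analytics. Azure Data Factory can use PolyBase to load data into the data warehouse.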

PolyBase reads from and writes to external data sources, so you can query and transform your data wherever it’s stored using T-SQL. The external data is exposed as external tables that behave much like ordinary relational tables.

PolyBase also moves data in both directions: it can import external data into database tables and export query results back to external storage.

Azure Data Factory has built-in support for PolyBase in its copy activity, so it is easy to configure and use.

You can enable PolyBase in your pipeline by selecting PolyBase as the copy method on the sink of a copy activity that writes to Azure SQL Data Warehouse. In the pipeline JSON, this corresponds to the allowPolyBase property of the sink.
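
In a pipeline's JSON definition, PolyBase loading is controlled on the sink of a copy activity that writes to the data warehouse. The following is a minimal Python sketch that builds such a sink section and prints it as JSON. The reject thresholds are illustrative values chosen for this example, not recommendations, and a real pipeline needs the rest of the activity definition (source, datasets, and possibly staging settings) around it.

```python
import json

# Sink section of a copy activity that loads Azure SQL Data Warehouse.
# Assumption: the reject thresholds are example values only.
copy_sink = {
    "type": "SqlDWSink",
    "allowPolyBase": True,
    "polyBaseSettings": {
        "rejectType": "percentage",   # interpret rejectValue as a percentage
        "rejectValue": 10.0,          # tolerate up to 10% rejected rows
        "rejectSampleValue": 100,     # rows sampled before the percentage is recalculated
        "useTypeDefault": True,       # fill missing values with type defaults
    },
}


def render_sink(sink: dict) -> str:
    """Serialize the sink settings as they would appear in a pipeline definition."""
    return json.dumps(sink, indent=2)


print(render_sink(copy_sink))
```

Keeping the settings in a dictionary makes it easy to switch PolyBase off for a test run by changing a single key.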

PolyBase enables customers to combine data from relational and nonrelational sources such as SQL Server, Oracle, Apache Hadoop, and, through generic ODBC drivers, databases such as MySQL. Which sources are available depends on the SQL Server version.

Together with Azure Data Factory, this gives you a single integration path for structured and unstructured data in many formats.

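PolyBase is not a separate managed service. It is a feature of the SQL Server database engine (SQL Server 2016 and later) and of Azure SQL Data Warehouse that can read data from Hadoop-compatible storage.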

PolyBase works from the SQL engine outward: T-SQL queries reach data that lives in Hadoop or Azure storage. Spark, Hive, and Presto jobs written in R, Python, or Scala do not run through PolyBase; those engines use their own connectors to reach SQL databases.

PolyBase is a technology for querying external data from SQL Server databases, alongside the relational data stored in them.

You can use PolyBase to create extract, transform, and load (ETL) jobs that combine multiple relational sources, such as SQL Server, Oracle Database, and other ODBC data sources, with data in Azure Blob storage or Azure Data Lake Store.

An ETL process using PolyBase consists of two steps: extract and load.

In the first step, PolyBase extracts data from a relational source with a query.

In the second step, it loads the extracted data into Azure Blob storage or Azure Data Lake Store.
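
On Azure SQL Data Warehouse, both steps can be expressed in one T-SQL statement, CREATE EXTERNAL TABLE AS SELECT (CETAS): the SELECT extracts the rows and the external table definition says where to write them. The Python sketch below only assembles that statement as text. All object names (ext.SalesExport, AzureBlobStore, CsvFormat, dbo.Sales) are hypothetical, and the external data source and file format are assumed to exist already.

```python
def build_cetas(external_table: str, location: str, data_source: str,
                file_format: str, select_sql: str) -> str:
    """Return a CREATE EXTERNAL TABLE AS SELECT statement.

    The arguments are inserted as-is, so pass only trusted, hard-coded names.
    """
    return (
        f"CREATE EXTERNAL TABLE {external_table}\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f")\n"
        f"AS\n"
        f"{select_sql};"
    )


# Hypothetical export of positive sales amounts to a folder in external storage.
statement = build_cetas(
    external_table="ext.SalesExport",
    location="/export/sales/",
    data_source="AzureBlobStore",
    file_format="CsvFormat",
    select_sql="SELECT CustomerId, Amount FROM dbo.Sales WHERE Amount > 0",
)
print(statement)
```

The generated text can be run in SSMS or sent through any SQL Server client library.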

PolyBase jobs are written in one language, T-SQL. The other languages below belong to engines that can work on the same data:

SQL Server Transact-SQL (T-SQL): This is the language PolyBase uses. 

You can use T-SQL statements such as INSERT into an external table or CREATE EXTERNAL TABLE AS SELECT to extract data from a relational source and load it into Azure Blob storage or Azure Data Lake Store.

HiveQL: HiveQL is the query language of Apache Hive, for example on Azure HDInsight Hadoop clusters. PolyBase does not run HiveQL, but it can read files that Hive jobs write in a supported format such as delimited text, ORC, or Parquet.

Presto SQL: Presto is a separate query engine that can run on Hadoop clusters. PolyBase does not run Presto queries either.

On Azure Databricks clusters, ETL jobs are typically written in Spark SQL, Python, or Scala rather than T-SQL.


Advantages of using PolyBase

  1. You can use PolyBase to query external data sources from SQL Server or Azure SQL DW with ordinary T-SQL.
  2. Depending on the product and version, the supported sources include Hadoop and Azure Blob storage, other SQL Server instances, Oracle, Teradata, MongoDB, and, through generic ODBC drivers, other databases such as PostgreSQL.
  3. You write ETL logic in T-SQL, the language you already use for the rest of the database, instead of learning HiveQL or another language.
  4. You do not need to install a separate ODBC driver to query Hadoop or Azure Blob storage.
  5. With Hadoop sources, PolyBase can push part of a query down to the Hadoop cluster, so that work runs where the data lives.
  6. You can combine data stored on Hadoop clusters with tables in SQL Server or Azure SQL DW in the same query, using a single query language.
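
The single-query advantage can be illustrated with a join between a local table and an external table. The Python sketch below only builds the T-SQL text. The table and column names (dbo.Customers, ext.WebClicks, CustomerId, CustomerName, Clicks) are hypothetical, and the external table is assumed to be defined already.

```python
def build_join_query(local_table: str, external_table: str,
                     key: str, label: str, measure: str) -> str:
    """Build a T-SQL query that joins a local table to an external table.

    To the query, the external table looks like any other table; PolyBase
    fetches its rows from external storage at run time.
    """
    return (
        f"SELECT l.{label}, SUM(e.{measure}) AS Total{measure}\n"
        f"FROM {local_table} AS l\n"
        f"JOIN {external_table} AS e ON e.{key} = l.{key}\n"
        f"GROUP BY l.{label};"
    )


# Hypothetical example: customer names live in the database, clicks in Hadoop.
query = build_join_query(
    local_table="dbo.Customers",
    external_table="ext.WebClicks",
    key="CustomerId",
    label="CustomerName",
    measure="Clicks",
)
print(query)
```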

Disadvantages of using PolyBase

  1. PolyBase is not a replacement for Hadoop. It reads files in HDFS through external tables, but it does not query Hive tables through the Hive metastore; you point it at the underlying files instead.
  2. You can only use T-SQL to write PolyBase jobs. HiveQL and other programming languages (such as Python) are not supported inside PolyBase.
  3. There are some limitations on what you can do with external tables. For example, you cannot run UPDATE or DELETE statements against them.

PolyBase External Tables

External tables are table definitions in the database that point to data stored outside it, for example on Hadoop clusters or in Azure Blob storage. 

The data stays in the external location. The external table only describes its location, format, and columns, so you can query it with T-SQL and narrow down the rows returned or reshape the output in the query.

External tables are useful for scenarios where you want to query data in Hadoop or Azure storage without first moving it into a relational database like Azure SQL DW. 

For example, you can filter out rows that don’t match a certain condition or transform the output into another format in the query itself.

To create an external table, you must first create an external data source and an external file format in the database. A linked server is not involved.

You can then run the CREATE EXTERNAL TABLE statement to define your external table. 

Here’s an example of how to create an external table that uses PolyBase to query data from Hadoop.
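
The example can be sketched as follows, assuming a SQL Server instance with PolyBase installed and Hadoop connectivity configured. The Python code only assembles and prints the three T-SQL statements in the order they must run. Every name in it (HadoopCluster, CsvFormat, dbo.SensorReadings, the host namenode.example.com, port 8020, and the folder path) is hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical data source pointing at the name node of a Hadoop cluster.
DATA_SOURCE_SQL = (
    "CREATE EXTERNAL DATA SOURCE HadoopCluster\n"
    "WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode.example.com:8020');"
)

# Hypothetical file format for comma-separated text files.
FILE_FORMAT_SQL = (
    "CREATE EXTERNAL FILE FORMAT CsvFormat\n"
    "WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));"
)


def build_external_table(table: str, columns: Sequence[Tuple[str, str]],
                         location: str, data_source: str, file_format: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement from (column name, type) pairs."""
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f");"
    )


EXTERNAL_TABLE_SQL = build_external_table(
    table="dbo.SensorReadings",
    columns=[
        ("SensorId", "INT NOT NULL"),
        ("ReadingTime", "DATETIME2 NOT NULL"),
        ("Reading", "FLOAT NULL"),
    ],
    location="/data/sensors/",
    data_source="HadoopCluster",
    file_format="CsvFormat",
)

# Order matters: the table refers to the data source and the file format.
for sql in (DATA_SOURCE_SQL, FILE_FORMAT_SQL, EXTERNAL_TABLE_SQL):
    print(sql, end="\n\n")
```

Once the statements have run, a query such as SELECT TOP 10 * FROM dbo.SensorReadings reads the files under the given folder at query time.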

Why is PolyBase so fast?

The speed of PolyBase is due to a combination of factors.

First, it reads files directly from HDFS or Azure storage in the format they are already stored in, so the data does not have to be re-encoded into another format first.

Second, on Azure SQL DW the compute nodes read from storage in parallel, so a load is not funneled through a single node or through a separate service such as Azure Data Factory.

Finally, with Hadoop sources PolyBase can push part of a query down to the cluster, so less data has to be moved before the result is returned.
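
On Azure SQL DW, a common way to benefit from this speed when loading is CREATE TABLE AS SELECT (CTAS) from an external table, which copies the external rows into a distributed table in one statement. The Python sketch below builds such a statement as text. The table names are hypothetical, and round-robin distribution is only one possible choice.

```python
def build_ctas(target_table: str, external_table: str,
               distribution: str = "ROUND_ROBIN") -> str:
    """Build a CTAS statement that loads an external table into a local table."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external_table};"
    )


# Hypothetical load of the external table ext.Sales into dbo.Sales.
ctas = build_ctas("dbo.Sales", "ext.Sales")
print(ctas)
```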

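How to enable PolyBase?

On Azure SQL Data Warehouse, PolyBase is available without a separate installation. On SQL Server, you need to install the PolyBase feature (PolyBase Query Service for External Data) and, for Hadoop sources, any additional components it requires. 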

These can be found in the Microsoft Download Center here:

https://www.microsoft.com/en-us/download/details.aspx?id=47992

First, create or choose the SQL Server instance (or Azure SQL Data Warehouse database) that will run your PolyBase queries.

Second, set up credentials and permissions so that this database can access the Hadoop cluster or storage account.

Finally, enable the PolyBase feature on the instance. Once you have done this, you can define external tables and start querying Hadoop data with T-SQL.
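
On a SQL Server instance, the final step usually comes down to a few T-SQL statements. The Python sketch below lists them in order, assuming SQL Server 2019 or later with the PolyBase feature already installed; older versions and Azure SQL Data Warehouse may not need or accept the sp_configure call.

```python
from typing import List


def polybase_enable_statements() -> List[str]:
    """Return the T-SQL statements that check for and enable PolyBase."""
    return [
        # Returns 1 when the PolyBase feature is installed on the instance.
        "SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;",
        # Turns the feature on at the instance level.
        "EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;",
        # Applies the configuration change.
        "RECONFIGURE;",
    ]


for sql in polybase_enable_statements():
    print(sql)
```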

PolyBase-enabled instance

PolyBase needs no installation on an Azure SQL Data Warehouse instance. To start using it, follow these steps:

  1. Connect to your Azure SQL Data Warehouse database with SSMS.
  2. Create a database master key if the database does not have one yet.
  3. Create a database scoped credential that holds the secret for your storage account.
  4. Create an external data source that points to the storage location.
  5. Create an external file format that describes the files.
  6. Create the external table and run your queries against it in SSMS.
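
Whichever way the instance is set up, PolyBase needs a database master key and a database scoped credential before it can reach secured storage. The Python sketch below builds those two T-SQL statements. The credential name and both secrets are placeholders for this example, and real secrets should come from a secure store rather than source code.

```python
def quote_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def build_credential_setup(master_key_password: str, credential_name: str,
                           storage_key: str) -> str:
    """Build the master key and database scoped credential statements."""
    return (
        f"CREATE MASTER KEY ENCRYPTION BY PASSWORD = {quote_literal(master_key_password)};\n"
        f"CREATE DATABASE SCOPED CREDENTIAL {credential_name}\n"
        # With a storage account key, the IDENTITY value is not used for authentication.
        f"WITH IDENTITY = 'user', SECRET = {quote_literal(storage_key)};"
    )


# Placeholder values only; never hard-code real secrets.
setup_sql = build_credential_setup(
    master_key_password="<strong password>",
    credential_name="AzureStorageCredential",
    storage_key="<storage account key>",
)
print(setup_sql)
```

The external data source created afterwards refers to the credential by name through its CREDENTIAL option.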

Conclusion

In this tutorial we have explained what PolyBase is in Azure and how it relates to Azure Data Factory, and we have discussed its advantages and disadvantages.

We have also looked at how to enable PolyBase and how to prepare an Azure SQL Data Warehouse database to use it. 

We also covered how external tables expose Hadoop files in your SQL Data Warehouse database as tables, which you can then query using standard SQL syntax.

Azure SQL Data Warehouse is a fully managed cloud data warehouse, not a Hadoop service. 

With PolyBase, it can query data produced by Hadoop while that data stays where it is stored, under your control.

Because the warehouse is fully managed, you don’t have to worry about managing its infrastructure. 

This makes it a good fit for organizations that want to analyze Hadoop and data lake files with SQL without running the query infrastructure themselves.