Azure Data Factory Pipeline
Azure Data Factory Pipeline Activities, Creation & Tutorials
What Is an Azure Data Factory Pipeline?
Azure Data Factory (ADF) is a cloud-based data integration service that orchestrates data movement and transformation across a wide range of data stores and compute services in the Microsoft cloud.
ADF provides a graphical user interface (ADF Studio) to create and manage workflows, which are built from components called activities that are grouped into pipelines.
Pipelines and activities can also be created programmatically through the SDKs, the REST API, PowerShell, or ARM templates.
An Azure Data Factory pipeline manages data movement between data stores.
It consists of activities, each of which performs an individual task such as copying data from one store to another, filtering it, or running complex Analysis Services queries.
By configuring multiple activities in one pipeline, you control how data is copied, transformed, and routed to each destination.
An Azure Data Factory pipeline is a logical grouping of activities that together transform source data from one format into another or aggregate data across a set of steps.
Activities in a pipeline can be chained so that the output of activity A becomes the input of activity B, or they can run in parallel when they have no dependency on each other.
A data factory can have multiple pipelines that you can use to process and transform data in different ways.
For example, you might have one pipeline that imports data from a source system into an Azure SQL database and another pipeline that uses the imported data as a basis for running Analysis Services queries against it.
The data factory provides a visual representation of your Azure Data Factory Pipeline.
You can use the visual interface to create, edit, and delete pipelines; view status information; and monitor how long it takes each pipeline to run.
When you’re building your data factory, you can also use the pipeline designer to handle common deployment scenarios, such as staging and production environments, by pointing the same pipeline at different linked services.
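As an illustration, here is a minimal sketch of what a pipeline definition looks like, written as a Python dictionary that mirrors the JSON shown in ADF Studio’s Code view. The pipeline, activity, dataset, and linked service names (CopyOrdersPipeline, OrdersBlobDataset, SqlDatabaseLS, and so on) are hypothetical placeholders, not names from this article.

```python
# A minimal sketch of a pipeline definition as a Python dict mirroring the JSON
# that ADF Studio shows in its "Code" view. All names here are placeholders.
pipeline_definition = {
    "name": "CopyOrdersPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyRawOrders",
                "type": "Copy",
                "inputs": [{"referenceName": "OrdersBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OrdersSqlDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"}
                }
            },
            {
                "name": "RefreshReportingTables",
                "type": "SqlServerStoredProcedure",
                "linkedServiceName": {"referenceName": "SqlDatabaseLS", "type": "LinkedServiceReference"},
                "typeProperties": {"storedProcedureName": "dbo.RefreshReporting"},
                # dependsOn chains this activity to run only after the copy succeeds.
                "dependsOn": [{"activity": "CopyRawOrders", "dependencyConditions": ["Succeeded"]}]
            }
        ]
    }
}
```

The dependsOn property is what chains activities together; activities without a dependency on each other are free to run in parallel.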
Types of Activities Performed by Azure Data Factory Pipeline
Data Movement Activities: These activities are used to move data from one place to another.
For example, you can use the Copy activity to copy data from Azure Blob Storage into an on-premises SQL Server database, or to bring on-premises data up into the cloud.
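Below is a hedged sketch of such a Copy activity, again as a Python dict mirroring the pipeline JSON. The dataset names, the AzureSqlSink sink type, and the column mapping are illustrative assumptions rather than a definitive configuration.

```python
# A Copy activity that moves blob data into an Azure SQL table.
# Dataset names and the column mapping are placeholders.
copy_activity = {
    "name": "CopyCustomersToSql",
    "type": "Copy",
    "inputs": [{"referenceName": "CustomersBlobDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "CustomersSqlDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BlobSource"},
        "sink": {"type": "AzureSqlSink"},
        # Optional explicit column mapping; the exact shape can vary by connector version.
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {"source": {"name": "cust_id"}, "sink": {"name": "CustomerId"}},
                {"source": {"name": "cust_name"}, "sink": {"name": "CustomerName"}}
            ]
        }
    }
}
```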
Data Transformation Activities: These activities are used to transform or enrich data, typically by handing the work off to a compute service such as Mapping Data Flows, Azure Databricks, HDInsight, or a SQL stored procedure.
You can use a transformation activity to make sure the data you’re working with is valid, for example by checking that every row has a corresponding ID column and removing rows that don’t.
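For instance, a transformation step is often expressed as an activity that calls out to a Databricks notebook. The sketch below assumes a Databricks linked service named DatabricksLS, a notebook at /Shared/clean_orders, and a runDate pipeline parameter; all are hypothetical.

```python
# A Databricks Notebook activity that hands the transformation work to a
# notebook in an Azure Databricks workspace. Names and paths are placeholders.
transform_activity = {
    "name": "CleanOrders",
    "type": "DatabricksNotebook",
    "linkedServiceName": {"referenceName": "DatabricksLS", "type": "LinkedServiceReference"},
    "typeProperties": {
        "notebookPath": "/Shared/clean_orders",
        "baseParameters": {"runDate": "@pipeline().parameters.runDate"}
    }
}
```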
Query Activities: These activities are used to query data from various sources and return the results to the pipeline for further processing; in Azure Data Factory this is the job of the Lookup activity.
For example, you could use a Lookup activity to find all customers who haven’t purchased in more than six months (based on the last order date).
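A sketch of that query as a Lookup activity; the dataset name, table, and SQL query are assumptions made for illustration.

```python
# A Lookup activity that runs a query and returns the result set to the pipeline.
lookup_activity = {
    "name": "FindLapsedCustomers",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": (
                "SELECT CustomerId FROM Customers "
                "WHERE LastOrderDate < DATEADD(month, -6, GETDATE())"
            )
        },
        "dataset": {"referenceName": "CustomersSqlDataset", "type": "DatasetReference"},
        "firstRowOnly": False   # return every matching row, not just the first
    }
}
```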
Control Flow Activities: These activities are used to control how the pipeline itself runs, through branching, looping, and variables.
For example, you could use an If Condition activity to run one branch of activities when a source file exists and a different branch when it does not.
Types of Control Flow Activities in Azure Data Factory Pipeline
Append Variable: This activity appends a value to an existing array variable defined in the pipeline.
For example, you could use an Append Variable activity inside a ForEach loop to collect the name of every file that was processed.
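A minimal sketch, assuming an Array variable named processedFiles and a ForEach loop whose current item exposes a name property:

```python
# Pipeline fragment: an Array variable plus an Append Variable activity that
# records each processed file name from inside a ForEach loop.
append_variable_fragment = {
    "variables": {"processedFiles": {"type": "Array"}},
    "activities": [
        {
            "name": "RecordFileName",
            "type": "AppendVariable",
            "typeProperties": {
                "variableName": "processedFiles",
                "value": {"value": "@item().name", "type": "Expression"}
            }
        }
    ]
}
```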
Filter: This activity applies a filter expression to an input array and outputs only the items that satisfy the condition.
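A sketch that keeps only .csv files from the output of a hypothetical Get Metadata activity named ListSourceFiles:

```python
# Filter activity: keep only the .csv items from a Get Metadata child-items list.
filter_activity = {
    "name": "KeepCsvFiles",
    "type": "Filter",
    "typeProperties": {
        "items": {"value": "@activity('ListSourceFiles').output.childItems", "type": "Expression"},
        "condition": {"value": "@endswith(item().name, '.csv')", "type": "Expression"}
    }
}
```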
Execute Pipeline: This activity invokes another pipeline from the current pipeline, optionally passing parameters and waiting for the child pipeline to complete.
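A sketch, assuming a child pipeline named ChildPipeline that accepts a runDate parameter:

```python
# Execute Pipeline activity: call a child pipeline and wait until it finishes.
execute_pipeline_activity = {
    "name": "RunChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "ChildPipeline", "type": "PipelineReference"},
        "parameters": {"runDate": "@pipeline().parameters.runDate"},
        "waitOnCompletion": True
    }
}
```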
If Condition: This activity evaluates an expression to true or false and then runs one set of activities when the expression is true and a different set when it is false.
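A sketch in which simple Wait activities stand in for the real work on each branch; the environment parameter is an assumption:

```python
# If Condition activity: run one branch for 'prod' and another for anything else.
if_condition_activity = {
    "name": "BranchOnEnvironment",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@equals(pipeline().parameters.environment, 'prod')",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {"name": "WaitBeforeProdLoad", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 60}}
        ],
        "ifFalseActivities": [
            {"name": "WaitBeforeTestLoad", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 5}}
        ]
    }
}
```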
Validation Activity: This activity pauses the pipeline until a referenced dataset exists (and, optionally, meets a minimum size), or until a timeout is reached.
For example, you could use a Validation activity to make sure an invoice file has landed in Blob Storage before the rest of the pipeline tries to process it.
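A sketch, assuming a blob dataset named InvoiceBlobDataset; the timeout, polling interval, and minimum size are illustrative values:

```python
# Validation activity: block until the invoice file exists and is non-empty,
# checking every 30 seconds and giving up after one hour.
validation_activity = {
    "name": "WaitForInvoiceFile",
    "type": "Validation",
    "typeProperties": {
        "dataset": {"referenceName": "InvoiceBlobDataset", "type": "DatasetReference"},
        "timeout": "0.01:00:00",
        "sleep": 30,
        "minimumSize": 1
    }
}
```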
Set Variable: This activity assigns a value to a pipeline variable of type String, Boolean, or Array.
You can use variables as placeholders for values that you need later in your workflow, such as when you want to pass a computed value into another activity.
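A sketch, assuming a String variable named runDate declared on the pipeline:

```python
# Pipeline fragment: declare a String variable and set it to today's date.
set_variable_fragment = {
    "variables": {"runDate": {"type": "String"}},
    "activities": [
        {
            "name": "SetRunDate",
            "type": "SetVariable",
            "typeProperties": {
                "variableName": "runDate",
                "value": {"value": "@formatDateTime(utcnow(), 'yyyy-MM-dd')", "type": "Expression"}
            }
        }
    ]
}
```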
Webhook Activity: This activity calls an HTTP endpoint that you specify and waits for the endpoint to call back before the pipeline continues. For example, you could use a Webhook activity to notify an external approval service and pause the pipeline until it responds.
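A sketch, assuming a hypothetical approval endpoint at https://example.com/api/approvals; ADF adds a callback URI to the request body, and the endpoint must invoke that URI for the pipeline to continue.

```python
# WebHook activity: call an external endpoint and wait (up to 10 minutes) for
# it to invoke the callback URI that ADF includes in the request body.
webhook_activity = {
    "name": "RequestApproval",
    "type": "WebHook",
    "typeProperties": {
        "url": "https://example.com/api/approvals",   # hypothetical endpoint
        "method": "POST",
        "body": {"pipelineName": "@pipeline().Pipeline", "runId": "@pipeline().RunId"},
        "timeout": "00:10:00"
    }
}
```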
Types of Azure Data Factory Pipeline Tools
These tools are used alongside pipelines to manage how data flows into, through, and out of Azure Data Factory. For example, they supply the data that pipelines ingest, schedule when pipelines run, and store the results that pipelines produce.
ETL Tool: ETL stands for Extract, Transform and Load. It is a process of extracting data from one place, transforming it into another format, and loading it into an application or database.
Data Warehouse: A data warehouse, such as Azure Synapse Analytics (formerly Azure SQL Data Warehouse), is a database designed to support decision-making by providing quick and easy access to historical data.
It stores the entire history of all your business data so that you can query it easily to formulate reports or create predictive models.
Batch Workflow Schedulers: A batch workflow scheduler is a software program that manages and executes jobs in a queue.
It usually provides an application or web interface that you can use to start, stop, and schedule processes to run at different times.
Real-Time data streaming tools: A data streaming tool is a piece of software that receives, processes, and analyzes real-time data as it comes in.
It can help you monitor your business’s operations and detect any problems or opportunities right away.
Data Lakes: A data lake is a repository for large amounts of raw, unstructured data.
It’s similar to a data warehouse in that it brings together data from many different sources, but a data lake keeps the data in its original, raw form so that you can process and analyze it later.
How to Connect an Azure Data Factory Pipeline?
After you create an Azure Data Factory and connect it to your source and target data stores, you can start building pipelines.
A pipeline is a series of activities that process data between the sources, transformations, and sinks specified in your data factory.
You can use pipelines to move data from one place to another or perform complex analyses on it.
To create a new pipeline, click Create in your Azure data factory and then follow these steps:
- Select the type of pipeline you want to create.
- Enter a name for your pipeline and select an existing data factory or create a new one.
- Choose the Azure Data Factory account and region where you want to create your pipeline.
- Select a data store, such as Azure Blob Storage or Azure Data Lake Storage, as the source for your pipeline.
- Click Create.
- Click Get credentials to get your authentication keys for the data store or data warehouse that you selected.
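For reference, here is a hedged sketch of the same flow done programmatically with the azure-identity and azure-mgmt-datafactory Python packages, following the pattern of Microsoft’s Python quickstart. The subscription, resource group, factory name, storage connection string, and blob paths are placeholders, and exact model names can vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, LinkedServiceReference,
    AzureBlobDataset, DatasetResource, DatasetReference,
    CopyActivity, BlobSource, BlobSink, PipelineResource, SecureString,
)

subscription_id = "<subscription-id>"        # placeholder
rg_name = "<resource-group>"                 # placeholder
df_name = "<data-factory-name>"              # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Linked service: how the factory authenticates to the storage account.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(value="<storage-connection-string>")
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "StorageLS", storage_ls)

# Source and sink datasets pointing at blob folders.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="StorageLS")
ds_in = DatasetResource(properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path="adfdemo/input"))
ds_out = DatasetResource(properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path="adfdemo/output"))
adf_client.datasets.create_or_update(rg_name, df_name, "InputDS", ds_in)
adf_client.datasets.create_or_update(rg_name, df_name, "OutputDS", ds_out)

# Pipeline with a single Copy activity from the input dataset to the output dataset.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDS")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyPipeline", PipelineResource(activities=[copy]))

# Trigger a run of the new pipeline and print its run id.
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})
print("Started pipeline run:", run.run_id)
```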
Advantages of Azure Data Factory Pipeline
- You can run complex data pipelines with a single command.
- Pipelines can be created in minutes.
- You can create automated, reproducible processes that are easy to maintain and update across your organization.
- You can use pipelines to create jobs that run on a schedule or when events occur; a trigger sketch follows this list.
- Pipelines help you manage the lifecycle of your data, from ingestion to transformation and analysis.
- You can also use pipelines with Azure Databricks for interactive analytics on streaming data.
- You can share your pipelines with the rest of your organization.
- You can use built-in functions to create custom data transformations that suit your needs.
- You can use Azure Data Factory’s built-in connectors to reach a wide variety of data sources, including Amazon Redshift, PostgreSQL, and MySQL.
- Pipelines provide a consistent experience whether you author them in the ADF Studio web interface, the SDKs, the REST API, or PowerShell.
- Pipelines are included in every Azure Data Factory instance, with no separate edition required.
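As mentioned in the list above, pipelines can run on a schedule. Here is a sketch of a daily schedule trigger definition, again as a Python dict mirroring the trigger JSON; the trigger name, start time, and pipeline name are placeholders.

```python
# A schedule trigger that runs a pipeline once a day at 02:00 UTC.
schedule_trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {"referenceName": "CopyOrdersPipeline", "type": "PipelineReference"},
                "parameters": {}
            }
        ]
    }
}
```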
Disadvantages of Azure Data Factory Pipeline
- Pipelines orchestrate work rather than execute it, so custom Python or Scala code must be handed off to an external compute service such as Azure Databricks, HDInsight, Azure Batch, or Azure Functions.
- Complex data cleansing, transformation, and ETL logic requires Mapping Data Flows or an external compute service; the built-in pipeline activities on their own offer only limited transformation capability.
- Reaching on-premises data stores requires installing and maintaining a self-hosted integration runtime.
- Debugging and unit testing pipelines is harder than testing ordinary code, because runs execute in the service rather than locally.
- Pipelines are subject to service limits, such as a maximum number of activities and parameters per pipeline.
- Pricing is based on activity runs, data movement, and integration runtime hours, which can make costs harder to predict than with a flat-rate tool.
Conclusion
The pipeline is the heart of Azure Data Factory and it drives the processing of your data.
A pipeline is composed of the activities you want to perform on the data, together with the datasets and linked services they read from and write to.
Source datasets expose the raw data in the source system, activities copy or transform it, and the results are written to an output sink.
Conceptually, most pipelines follow three stages: source, transformation, and output.
Pipelines are an integral part of Azure Data Factory and are used for automating data movement.
Pipelines can be used for simple or complex data movements and can be configured using the Azure portal or by using PowerShell cmdlets.
Overall, we have covered what pipelines in Azure Data Factory are and how they work.
They are a very useful feature that helps you automate your data movements.
Keep in mind that pipelines orchestrate rather than execute custom code, so Python or Scala transformations are handed off to a compute service such as Azure Databricks.