Introduction
Azure Data Factory is a cloud-based data integration service that allows you to move data from disparate cloud and on-premises data sources to various storage destinations. In this respect it is similar to Microsoft's on-premises ETL tool, SQL Server Integration Services (SSIS).
SSIS has long been used to design and develop robust ETL processes that extract, transform, and load data into a data warehouse or another data store. Similar to SSIS, Microsoft launched the Azure Data Factory service to define workflows that carry out data transfer and data transformation activities in the cloud. Azure Data Factory not only supports data transfer but also offers a rich set of transformations, such as deriving columns, sorting data, and combining data. It differs somewhat from on-premises SSIS in how data flows from source to destination: rather than the traditional Extract-Transform-Load flow, Azure Data Factory supports Extract-Load and Transform-Load flows, completing the transformation before the data is published for consumption.
Azure Data Factory Components
Azure Data Factory has four major components for achieving end-to-end data integration in the cloud. They are as follows (a short SDK sketch after this list shows how they fit together):
- Pipeline – A pipeline is a data-driven workflow: a logical grouping of one or more activities to be performed, and a data factory can have one or more pipelines. Grouping activities this way makes them easier to manage as a set.
- Activity – An activity is an action performed on data. There are two types: data movement activities and data transformation activities. Azure Data Factory also provides the flexibility to build custom activities using .NET.
- Datasets – A dataset represents the data an activity works on. An activity references one or more datasets as input and one or more datasets as output.
- Linked Services – A linked service defines the connection to a data source. Once the connection is established, activities can operate on the datasets in that source. A linked service is essentially like a connection string.
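To make these components concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK, modeled on Microsoft's Python quickstart. All resource names, IDs, and the storage connection string are placeholders, and the exact model classes can vary between SDK versions:

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, LinkedServiceReference,
    AzureBlobDataset, DatasetResource, SecureString,
)

# Placeholder identifiers -- substitute your own values.
rg_name, df_name = "myResourceGroup", "myDataFactory"

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<secret>"
)
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Linked service: essentially a connection string to Azure Blob storage.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(value="DefaultEndpointsProtocol=https;...")
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "storageLS", storage_ls)

# Dataset: a named view of the data reachable through the linked service.
input_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="storageLS"
        ),
        folder_path="adf-demo/input",
        file_name="input.txt",
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "inputDS", input_ds)
```

Pipelines and activities build on these two pieces, as the sketches in the next section show.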
Data Orchestration Using Azure Data Factory
To move data from a source to a destination, we need to define a workflow. This data-driven workflow is known as a pipeline, and it consists of three steps: Connect & Collect, Transform & Enrich, and Publish.
Figure 1. Azure Data Factory Data Orchestration
Connect & Collect – This is the Extract-Load part of the flow, in which data is collected from disparate sources into a centralized cloud location. The sources can be cloud storage, on-premises storage, or a combination of both. This step exercises every data factory component: a pipeline is defined with one or more activities that act on the source datasets, and linked services provide the connections to those sources.
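Continuing the sketch above, the collect step might be a Copy activity. The staging dataset "stagingDS" is hypothetical and would be defined the same way as "inputDS":

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink,
)

# Copy activity: the data movement half of Connect & Collect.
# "stagingDS" is a placeholder dataset, created like "inputDS" above.
copy_activity = CopyActivity(
    name="collectFromBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="inputDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="stagingDS")],
    source=BlobSource(),
    sink=BlobSink(),
)
```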
Transform & Enrich – In this step, the data is enriched. Everything collected into cloud storage during the previous step is available for transformation and enrichment, so we can apply business rules to enrich the data and make the transformed output ready to publish.
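As one illustration of the transform step, Azure Data Factory can hand the staged data to a compute service such as HDInsight. This is a sketch, not the only way to transform data in a pipeline; the HDInsight linked service "hdinsightLS" and the Hive script path below are hypothetical:

```python
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity, LinkedServiceReference,
)

# Transformation activity: run a Hive script against the staged data.
# "hdinsightLS" (the cluster's linked service) and the script path are
# placeholders for resources defined separately.
hive_activity = HDInsightHiveActivity(
    name="enrichStagedData",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="hdinsightLS"
    ),
    script_path="adf-demo/scripts/enrich.hql",
    script_linked_service=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="storageLS"
    ),
)
```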
Publish – This is the last step of any defined pipeline, in which the transformed data is published. The final data is written to the destination store, known as the data sink, where it becomes available for consumption. Users and business members can connect to the data sink and use the data as they need.
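Putting the pieces together, here is a sketch of assembling the activities into a pipeline and running it on demand, assuming the client and activities from the earlier sketches (the status check at the end is optional):

```python
from azure.mgmt.datafactory.models import PipelineResource

# Pipeline: the activities above, executed as one data-driven workflow.
pipeline = PipelineResource(activities=[copy_activity, hive_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "publishPipeline", pipeline)

# Start a run; the sink's data store receives the published data.
run = adf_client.pipelines.create_run(
    rg_name, df_name, "publishPipeline", parameters={}
)
print(adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status)
```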
Pipeline Scheduling
Once we create a pipeline, we can schedule its execution. Scheduling publishes the data more frequently, so it is available for consumption in near real time and consumers see fresh data with little delay. The available frequencies for executing a pipeline are Minute, Hour, Day, Week, and Month.
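A schedule can be attached with a trigger; here is a sketch using the same SDK. The hourly frequency and one-week window are arbitrary placeholder values, and the begin_start call name may differ in older SDK versions:

```python
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerResource,
    TriggerPipelineReference, PipelineReference,
)

# Run the pipeline every hour for one week (placeholder window).
recurrence = ScheduleTriggerRecurrence(
    frequency="Hour",
    interval=1,
    start_time=datetime.utcnow(),
    end_time=datetime.utcnow() + timedelta(days=7),
    time_zone="UTC",
)
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="publishPipeline"
                ),
                parameters={},
            )
        ],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "hourlyTrigger", trigger)
adf_client.triggers.begin_start(rg_name, df_name, "hourlyTrigger").result()
```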
Overall, Azure Data Factory is a cloud-based data integration service that extracts data from disparate sources, transforms and enriches it, and publishes the result. It lets you ingest data from disparate data sources; transform and analyze that data with services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning; and publish the output to data sinks such as Azure SQL Data Warehouse, Azure Cosmos DB, and Azure Table Storage for business members and application users to consume.
In future articles, we will learn more about Azure Data Factory's functionality and work through some hands-on exercises.