Posted August 28, 2017


Understanding Azure Data Factory – A Cloud Based Integration Service

By Anoop Kumar

Introduction

Azure Data Factory is a cloud-based integration service that lets you transfer data from disparate cloud and on-premises data sources to various data storage destinations. In this respect it is similar to SQL Server Integration Services (SSIS), Microsoft's on-premises ETL tool.

SSIS has long been used to design and develop robust ETL processes that extract, transform, and load data into a data warehouse or other data store. Similar to SSIS, Microsoft launched the Azure Data Factory service to define workflows that perform data transfer and data transformation activities in the cloud. Azure Data Factory supports not only data transfer but also a rich set of transformations, such as deriving columns, sorting data, and combining data. It differs from on-premises SSIS in how data flows from source to destination: rather than the traditional Extract-Transform-Load flow, Azure Data Factory supports Extract-Load and Transform-Load flows, in which data is transformed after it lands in the cloud and before it is published for consumption.

Azure Data Factory Components

Four major components work together to achieve end-to-end data integration in the cloud with Azure Data Factory; a sketch of how they fit together follows the list. These components are as follows:

  1. Pipeline – A pipeline is a data-driven workflow. A data factory can have one or more pipelines, and each pipeline contains an activity or a group of activities to be performed. Grouping activities this way makes them easier to manage as a set.
  2. Activity – An activity is an action performed on data. There are two types of activities: data movement activities and data transformation activities. Azure Data Factory also provides the flexibility to build custom activities using .NET.
  3. Datasets – A dataset is the data referenced by an activity. An activity references one or more datasets as input and one or more datasets as output.
  4. Linked Services – A linked service defines the connection to a data source, much like a connection string. Once the connection to a data source is established, an activity can operate on the datasets available in that data source.
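
To make the relationships among these components concrete, here is a minimal sketch of a pipeline definition, written as a Python dictionary that mirrors the shape of a Data Factory JSON definition. All names (pipeline, activity, datasets, linked service) are hypothetical placeholders, not part of any real factory.

    # Sketch: how the four components reference each other (hypothetical names).
    # A pipeline groups activities; an activity reads input datasets and writes
    # output datasets; a dataset reaches its store through a linked service.
    pipeline = {
        "name": "CopyCustomerDataPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromBlobToSql",
                    "type": "Copy",  # a data movement activity
                    "inputs": [{"name": "CustomerBlobDataset"}],
                    "outputs": [{"name": "CustomerSqlDataset"}],
                }
            ]
        },
    }

    input_dataset = {
        "name": "CustomerBlobDataset",
        "properties": {
            # the linked service that connects to the underlying store
            "linkedServiceName": "SourceStorageLinkedService",
        },
    }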

Data Orchestration Using Azure Data Factory

To move data from a source to a destination, we define a data-driven workflow, known as a pipeline. A pipeline consists of three steps: Connect & Collect, Transform & Enrich, and Publish.

Figure 1. Azure Data Factory Data Orchestration

Connect & Collect – this is the Extract-Load part of the flow. In this step we collect data from disparate sources into a centralized cloud location. The sources can be cloud storage, on-premises storage, or a combination of both. All of the Data Factory components come into play here: a pipeline is defined with one or more activities that act on the source datasets, and linked services provide the connections to those sources.
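
As a rough illustration of the connection side of this step, assuming an Azure Storage source, the linked service below is essentially a named connection string. The service name and credentials are placeholders.

    # Sketch: a linked service for an Azure Storage source
    # (hypothetical name, placeholder credentials).
    source_linked_service = {
        "name": "SourceStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": "DefaultEndpointsProtocol=https;"
                                    "AccountName=<account>;AccountKey=<key>",
            },
        },
    }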

Transform & Enrich – in this step the collected data is enriched. All of the data gathered into cloud storage in the previous step is available for transformation and enrichment; we can apply business rules to enrich the data and make the transformed data ready to publish.
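
One way such a transformation can be expressed in Data Factory is by running a Hive script on an HDInsight cluster. The sketch below shows a hypothetical activity of that kind; the script path, linked service names, and datasets are all assumptions for illustration.

    # Sketch: a data transformation activity that runs a Hive script on
    # HDInsight to apply business rules (hypothetical names and paths).
    transform_activity = {
        "name": "EnrichCustomerData",
        "type": "HDInsightHive",
        "linkedServiceName": "HDInsightLinkedService",  # compute that runs the script
        "typeProperties": {
            "scriptPath": "scripts/enrich_customers.hql",
            "scriptLinkedService": "SourceStorageLinkedService",
        },
        "inputs": [{"name": "CustomerBlobDataset"}],
        "outputs": [{"name": "EnrichedCustomerDataset"}],
    }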

Publish – this is the last step in any pipeline, where the transformed data is published. The final data is written to the destination store, known as the data sink, where it becomes available for consumption. Users and business members can connect to the data sink and use the data as they need.
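
In the running sketch, the data sink is simply another dataset, here assumed to be an Azure SQL table reached through its own linked service (hypothetical names):

    # Sketch: the data sink as an output dataset over an Azure SQL table
    # (hypothetical names).
    sink_dataset = {
        "name": "CustomerSqlDataset",
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": "SinkSqlLinkedService",
            "typeProperties": {"tableName": "dbo.Customers"},
        },
    }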

Pipeline Scheduling

Once we create a pipeline, we can schedule its execution. Scheduling lets the pipeline run and publish data at regular intervals, so consumers see fresh data with little delay. The available execution frequencies are Minute, Hour, Day, Week, and Month.
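
In Data Factory's scheduling model, a dataset declares how often a new slice of data should be produced, and the pipeline declares the period during which it is active. The sketch below shows an hourly schedule; the dataset name and dates are placeholders.

    # Sketch: an hourly schedule via the dataset's availability section,
    # plus the pipeline's active period (placeholder name and dates).
    scheduled_dataset = {
        "name": "CustomerSqlDataset",
        "properties": {
            "availability": {"frequency": "Hour", "interval": 1},
        },
    }
    pipeline_active_period = {
        "start": "2017-09-01T00:00:00Z",
        "end": "2017-09-30T00:00:00Z",
    }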

Overall, Azure Data Factory is a cloud-based data integration service that transforms, enriches, and publishes data extracted from disparate data sources. It provides the opportunity to ingest data from those sources; transform and analyze the data using other services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning; and publish the output to data sinks such as Azure SQL Data Warehouse, Azure Cosmos DB, and Azure Table Storage for business members and application users to consume.

In future articles, we will learn more about Azure Data Factory's functionality and work through some hands-on exercises.
