Moving data processing to the cloud – part 1 – Azure Data Factory
- Processes, standards and quality
Moving implemented business processes to the cloud-based solutions could be a huge step for a company. But is it really that huge? Processing data in the cloud can surely offer a lot of opportunities to develop new business solutions, but also reimplement the existing ones and optimise the costs. However, moving every application the company runs to the cloud may be a step difficult to make. Fortunately, this process can go like a hot knife through the butter. Business processes can be split into parts and every part can be either moved or redeveloped to be hosted on the cloud separately. Let’s analyse the following case.
An already big online store is experiencing growing popularity among consumers, thanks to its offer. All the products available online are stored in the warehouses across the country. The purchased products are sent from a warehouse which is the closest to the buyer to ensure fast delivery. All sounds good, but the company’s managers are asking for an additional report which would give them relevant information to To make this report valuable from the first day of the application running, some of the data needs to be recalculated from previous years. This task requires the stock availability analysis which is a huge challenge for the in-house system as it requires a lot of computing power.
Currently, to monitor the warehouses’ situation, the company’s data flow is performed in the following way. Every day, one of the company’s process is fetching data from each warehouse. The data is contained in a simple flat text file stored on the FTP server. It means all the files need to be downloaded, processed, and saved in an accessible place.
This process was modelled years ago and worked well when there was not so much data. Currently, it consists of the Windows Service application installed on one of the servers, which is triggered every day to perform all the following steps:
- Download all the csv files from the warehouses.
- Parse the files and insert data to the temporary table in a database.
- Run the stored procedure to calculate reports.
Dataflow in a current solution
The requirements for the new report according to which the data from the previous year is to be processed is presented to the team of programmers. It becomes quickly obvious that it would take hours or even days to load, parse, and recalculate all of the gathered data again from the whole previous year.
The team analyses a business problem and comes with a solution. It is cloud-based, easily scalable and can be quickly implemented. It is presented to the Head of IT and accepted as it won’t affect the company’s current systems and would be a great first step to move the rest of the systems to the cloud.
Dataflow in the solution designed by the development team
Azure Data Factory
The main workflow is orchestrated by the Azure Data Factory (ADF), which allows to model pipelines to fulfil all the business needs. It comes with integrated connectors to multiple sources of data – like SQL servers, warehouses, flat file sources, blob storage or lake. The ADF does not only allows to move the date from one place to another, but it is also able to start a complex data processing on Azure Data Lake Analytics or on Spark – both HD Insight or Databricks. For the process to be fully automated, the ADF needs to have the ability to start the pipeline regularly. To achieve this, the ADF comes with two types of triggers: manual and scheduled. For the described solution, the scheduled trigger fits perfectly. It is worth mentioning that the manual trigger could be fired by the REST API call, PowerShell or .NET application. 
Azure Data Factory can be split into 4 components: pipelines, datasets, connections, triggers. Starting from the end, the triggers are as the name suggests, ways to trigger the pipeline. Connections hold means to connect with the external data sources like a database or Azure Data Lake. Datasets are data resources which are mapped to the specific resource present in configured connection. Finally, a pipeline is where the things are done – data moved, transformed, and finally saved.
In order to deliver all the required reports, the dev team is restricted only by two factors – the source of data and the target.
As it was mentioned before, the source of data is the FTP server. It is not a problem though as it requires only the creation of the FTP connection and FTP file storage dataset (which will map files from the FTP server).
Add FTP Connection
When the dataset is configured, it is required to be assigned to a particular resource (in this case a file) specifying a full path to it. Fortunately, if the schema of the resources is the same, there is no need to create a permanent link with a hardcoded file name. There, in the screenshot, it can be seen that the file name is parametrised so it is passed directly from the pipeline.
Add the Dataset based on the file from FTP
By taking these two easy steps, the dev team managed to implement a way to access the source files. Thanks to following the steps, they were also able to create a target connection with a dataset and a connection to the Azure Data Lake where the raw files will be stored.
As it was mentioned before, the on-premise database is used as a target. There is no difference for the ADF, whether the data is fetched or saved on the Azure data store or on-premise. If the on-premise connection is used for some reasons, the only additional step required is to create a gateway between the ADF and the local environment. It can be done by setting up the self-hosted Integration runtime .
Creating the pipeline in the ADF
The next step is to put everything together – all the pieces are connected in the ADF pipeline. Let’s examine the pipeline created by the development team to fetch, save, and process daily data.
Daily based pipeline to fetch and process data
The pipeline consists of several activities. It starts with one parameter visible in the screenshot, which is a date value. The first activity is a simple lookup to grab configuration which stores the names of the files to be downloaded from the FTP server. The lookup activity simply maps itself to the dataset. It means that it can fetch the data from all available sources. In this case the configuration file saved on the Azure Data Lake was used.
The next step includes two activities are performed concurrently – the first one to set the output file name, which is required as the parameter to the U-SQL script, and the second which is the ForEach loop. The ForEach loop iterates through all the entries which were returned by lookup activity. Inside this activity it performs copy data activity and moves the data from the FTP server into the Azure Data Lake (ADL). Thanks to the @concat function, it is possible to combine all the variables like a base file name and date into final a file name.
ForEach loop to fetch all files
When all the files are available on the ADL, the script can be executed on the Azure Data Lake Analytics service. The script can be also stored on the Azure Data Lake service which makes it easy to connect and maintain. This could potentially be the final step of the process as the ADLA could save the results to the final destination. Unfortunately, at the time of writing in March 2019, the ADLA can only save data to Azure Data Lake (files or databases). In this case, the results are saved back into the Lake and copied to the on-premise database through Copy Data activity.
The challenge faced by the company and the team of programmers will be continued in the second (and final) part of this text, where I present how to implement all the reports required by the business.
Azure Data Factory – summary
The Azure Data Factory service is one of the fastest developing services on the Azure Cloud. At the time of writing this (March 2019), the ADF has over 80 implemented data connectors, it allows to create scalable and hybrid solutions using both cloud and on-premises services. Azure Data Factory gives the development teams the possibility to create pipelines to orchestrate activities without writing lots of code.