Azure Data Factory – Copy Behavior

In this blog post, we will delve into the different behaviors of Azure Data Factory’s Copy activity through practical examples.

What is Copy behavior?

The copy behavior in Azure Data Factory’s Copy activity refers to how the data is transferred and organized during the copy process. There are three main copy behaviors available in the Copy data activity:

Preserve Hierarchy / None (maintains the original folder structure):

  • Source Folder Structure:
    • RootFolder/Subfolder1/File1.csv
    • RootFolder/Subfolder2/File2.csv
  • Output Folder Structure:
    • Subfolder1/File1.csv
    • Subfolder2/File2.csv 

Flatten Hierarchy (auto-generated names for the files, disregards the original folder structure):

  • Source Folder Structure:
    • RootFolder/Subfolder1/File1.csv
    • RootFolder/Subfolder2/File2.csv
  • Output Folder Structure:
    • data_{guid}.csv
    • data_{guid}.csv

Merge files (combines the content of all files from the folder and its subfolders into a single file):

  • Source Folder Structure:
    • RootFolder/Subfolder1/File1.csv
    • RootFolder/Subfolder2/File2.csv
  • Output Folder Structure: MergedFile.csv 
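
In the pipeline definition, the copy behavior is a property of the Copy activity’s sink store settings. A minimal JSON sketch (the DelimitedText/Blob type names follow the ADF schema; the surrounding values are placeholders):

```json
"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings",
        "copyBehavior": "PreserveHierarchy"
    }
}
```

The accepted values are "PreserveHierarchy", "FlattenHierarchy", and "MergeFiles"; omitting the property gives the default "None" behavior.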

Pipeline creation

Start by creating a new pipeline and adding a Copy data activity to it.

Now navigate to the “Source” tab and click the “New” button to start creating a new dataset.

From the list of available source types, select “Azure Blob Storage” as the desired source and click “Continue”.

Choose “Delimited Text” as the file format for your dataset configuration and click “Continue”.

Click on the “New” button to create a new Linked Service. Fill in the required details, such as the storage account from your subscription, as shown in the image below.

Click on the “Test Connection” button to verify the connection between the data factory and the storage account. Once the connection test is successful, proceed to click on the “Create” button to create the Linked Service.
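
For reference, the resulting Linked Service is stored as JSON along these lines; this is a sketch, and the service name, account name, and key are placeholders:

```json
{
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
        }
    }
}
```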

Continue setting the properties by specifying the file path for the Source Dataset and then click “OK”.
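
Behind the scenes, the Source Dataset is saved as JSON similar to the sketch below; the dataset name, container, and folder path are illustrative assumptions:

```json
{
    "name": "SourceDelimitedTextDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "folderPath": "2023/06"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```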

After clicking on “OK,” ensure that you have the following settings properly configured:

  • Verify that the “Wildcard file path” option is enabled, allowing for pattern-based file selection.
  • Fill in the wildcard paths, as depicted in the image below, to specify the desired file patterns for batch processing (see the JSON sketch after this list).
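
In the pipeline JSON, these options map onto the Copy activity’s source store settings, roughly as follows (the wildcard values are illustrative):

```json
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": "2023/06/*",
        "wildcardFileName": "*.csv"
    },
    "formatSettings": {
        "type": "DelimitedTextReadSettings"
    }
}
```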

Let’s now move to the “Sink” section and create a Dataset and Linked Service for the file destination.

  • Click “New” to create a Sink Dataset.
  • Select “Azure Blob Storage” as the destination type and choose “CSV” as the file format.
  • Fill in the necessary details for the connection and file configuration.

For the Linked Service in the Sink section, we can utilize the same Linked Service that was created for the Source Dataset since both the input and output folders reside in the same storage account. 
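
As a sketch, the Sink Dataset simply references that same Linked Service and points its location at the output container (names are the placeholders used earlier):

```json
{
    "name": "SinkDelimitedTextDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "output"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```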

After configuring the Source and Sink settings for the Copy data operation, you need to publish the changes. By publishing the changes, you ensure that the updated configuration and connections are applied to the Azure Data Factory pipeline. 

This allows the pipeline to execute with the newly defined Source and Sink datasets, ensuring the correct data movement and integration processes.

Click on “Publish all” to initiate the publishing process. After clicking “Publish all,” click on the “Publish” button to publish the pipeline and the associated datasets.
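
Putting the pieces together, the published pipeline boils down to a single Copy activity wiring the two datasets to the source and sink settings shown above. A minimal end-to-end sketch, reusing the placeholder names from the earlier snippets:

```json
{
    "name": "CopyBehaviorDemoPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvFiles",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "SourceDelimitedTextDataset", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "SinkDelimitedTextDataset", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "recursive": true,
                            "wildcardFolderPath": "2023/06/*",
                            "wildcardFileName": "*.csv"
                        }
                    },
                    "sink": {
                        "type": "DelimitedTextSink",
                        "storeSettings": {
                            "type": "AzureBlobStorageWriteSettings",
                            "copyBehavior": "FlattenHierarchy"
                        }
                    }
                }
            }
        ]
    }
}
```

Switching between the scenarios below only changes the copyBehavior value in this definition.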

Copy Behavior Scenarios

Let’s consider an example where we have an input folder structured as shown in the image below. The folder hierarchy consists of a “2023” folder (the year), which contains a “06” folder (the month), and within the “06” folder, we have three subfolders: “16,” “17,” and “18” (representing the days). Each of these subfolders contains a CSV file that was generated on that specific day.

Our objective is to use the Copy data activity in Azure Data Factory to move these files to an output folder while leveraging the Copy Behaviors.

Scenario #1 – Flatten Hierarchy

We will modify the Copy behavior to leverage Flatten Hierarchy. After making the necessary changes, we will publish the updated configuration.

To initiate the data transfer, we will execute the pipeline using the “Add Trigger” option -> “Trigger Now” to run the pipeline immediately.

Result: Upon applying the Flatten Hierarchy Copy behavior, the files land directly in the “output” folder with auto-generated names.

Scenario #2 – Preserve Hierarchy

We will modify the Copy behavior to leverage Preserve Hierarchy. After making the necessary changes, we will publish the updated configuration.

Result: Upon applying the Preserve Hierarchy Copy behavior, the resulting structure in the “output” folder preserves the original folder hierarchy, which means that the output folder will contain exactly the structure of the input folder (including the subfolders).

Scenario #3 – None

We will modify the Copy behavior to leverage None. After making the necessary changes, we will publish the updated configuration.

Result: The structure of the output folder is the same as the structure of the input folder.

None & Preserve Hierarchy behave in a similar manner!

Scenario #4 – Merge files

We will modify the Copy behavior to leverage Merge files. After making the necessary changes, we will publish the updated configuration.

Result: Upon applying the Merge files Copy behavior, the resulting structure in the “output” folder will consolidate all 3 files into a single file, regardless of the original folder hierarchy. 

Subcase #4.1 – Merge Files and specify a name for the output file

What would happen if we wanted to obtain a merged file with a specific name instead of using the default autogenerated name when using the Merge behavior in Azure Data Factory’s Copy activity?

Let’s proceed by opening the output dataset and adding a specific file name (export in our case), as illustrated in the image below. Once the file name is configured, we will publish the changes and trigger the pipeline execution.
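
In the sink dataset JSON, this corresponds to setting the fileName property of the location; a sketch matching the example above (shown here with a .csv extension):

```json
"location": {
    "type": "AzureBlobStorageLocation",
    "container": "output",
    "fileName": "export.csv"
}
```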

Result: Upon applying the Merge files Copy behavior and adding a name for the output file in the output dataset, the resulting structure in the “output” folder will consolidate all 3 files into a single file called export, regardless of the original folder hierarchy. 

Subcase #4.2 – Preserve Hierarchy / None / Flatten Hierarchy and specify a name for the output file

When configuring the Copy behavior in Azure Data Factory, if you intend to either flatten or preserve the hierarchy of the output files, it’s important not to specify a name for the output file. Otherwise, you will encounter the error shown below.

In conclusion, Azure Data Factory’s Copy activity provides a versatile set of Copy behaviors that allow you to handle various data transfer scenarios with ease. Throughout this blog post, we explored the different Copy behaviors, such as Merge, None, Flatten Hierarchy, and Preserve Hierarchy, and their implications in data movement.