Bringing your Dataverse data to Azure Synapse

Often when we set up a business application based on a Dataverse database, we are asked to make the data available for reporting purposes, or simply to feed a third-party database that will be consumed later.
Our first instinct might be to build an interface that pushes the data to whatever the target system is, but this brings real architectural and integration complexity: we would have to replicate every action (creation, update and deletion) while making sure to handle errors 🙂
If you followed the latest announcements at MS Build, held on 25-27 May 2021, you may have noticed the new feature: “Azure Synapse Link for Dataverse” (formerly known as Export to Data Lake). In this post we will look at how to use this newly released feature to dump data into an Azure Data Lake Gen2 and exploit it from Azure Synapse.

Why?

The first question to ask is why we would need to replicate the data to a third-party database at all.

  • Business Intelligence & Analytics:

There are now many solutions that provide some degree of analysis, such as Customer Insights, Sales Insights or even AI Builder, which use the Dataverse database directly to provide these services.
When we want to do data analysis, we can use standard features like views, dashboards, reports, charts, etc., but this is often not enough, either because of their limitations or because we want to cross-reference different data sources.
Analysing the data or doing live reporting directly against it is not really a clever idea because, as you know, Dataverse is a transactional database!

  • Own Database:

    Even though the format of the data and the Dataverse Web Services are quite simple to use, some companies want to be able to export this data outside the Dataverse database. This can be the case when they want to run machine learning, implement advanced analytics capabilities, perform some integration using data flows, or simply feed a common database and keep some freedom over this data.

    Another aspect to consider is the scalability of the solution and the cost involved, as Dataverse can quickly become a data generator (especially for customer interactions).

What?

We already know about the ability to export Dynamics 365 (online) data to an Azure SQL Database stored in a customer-owned Azure subscription, using the Data Export Service available on Microsoft AppSource. But as of October 31, 2019, a new feature was announced (you may have already seen it, or noticed that an application user named Athena exists in your environment): Exporting CDS data to Azure Data Lake is Generally Available.
And more recently, we now can set up a seamless integration between Dataverse and Azure Synapse Analytics (Preview).

The solution is amazingly simple to implement and replicates information not only at initialization but also incrementally, propagating CUD (create, update, delete) operations, which is perfectly suited to the various scenarios. The principle still relies on Azure Data Lake Gen2, because this is precisely what Synapse uses for streaming data!

  • Azure Data Lake Gen2 is well known as a solution for storing large amounts of structured or unstructured data. It is based on Azure Blob storage technology, which is why it is scalable and cost-effective. Another positive point is that Data Lake is a reference in the world of big data (it interfaces easily with Apache Hadoop environments) and is increasingly used within companies; connectors are available from many data analysis vendors (MicroStrategy, Oracle) as well as Microsoft, of course :)!
  • Azure Synapse, unlike a data lake, does not actually store data. It is a set of services and technologies for analysing and understanding data, based on data warehouses and big data systems (Spark in particular), and for building data integration pipelines in an ETL mode, and it integrates perfectly with other Microsoft services such as Power BI (and now Dataverse 😉 ).
Azure Synapse architecture

How?

The first thing to do is create a Synapse workspace by provisioning a new Azure Synapse Analytics service! Be careful about the region you choose, as this can cause problems depending on the location of your instance.

Note: We will create the Azure Data Lake Gen2 at the same time, but if you want to use an existing one you can: the Data Lake can be created on its own and simply selected when creating the Azure Synapse workspace.

Searching for the Synapse service

When creating the workspace, make sure you select the same region as your Dataverse instance and the previously created/existing Data Lake Gen2:

Creating the Synapse workspace

You can now enter the password that will be used to access the workspace SQL pools. Note that you can also set up encryption at the workspace level (and directly use a key stored in Azure Key Vault 🙂 ):

Setting the SQL admin and password

Once everything is done, we can go to the maker portal.

Now we can start configuring our link to the data lake:

Creating a new link to data lake

You have certainly seen that the button is still named “New link to data lake”. I do not really know if it will be renamed, because in the end we do create a link to a Data Lake, just with this “Synapse layer” on top :).
Once you click on “New link to data lake”, you will be asked to enter the Azure Synapse workspace information (you must check the appropriate box, otherwise it will only be a link to a data lake 🙂 ):

Creating a link from Dataverse to Azure Synapse

As I said above, you must be careful with the Data Lake Region because it can cause you problems (I had to recreate mine :p ). This is also mentioned by Microsoft, which is an interesting point in case you have customer constraints for the localization of the data!

The storage account must be created in the same region as the Power Apps environment you'll use the feature in.

Now we just need to select the tables we want to export; in our case we will only take the account and contact entities. Note that advanced settings let us choose how the data will be written to the Data Lake. For this we have two options:

  • Append Only: This option appends every CUD operation performed to the corresponding partition in the lake. This means that we will have a history of the actions: in the case of a delete, we will still have a row indicating that this specific record has been deleted. In this case, the partition strategy is set to Year.
  • In Place Update: In this case, an upsert operation (update or create) is performed on the incremental data in the destination. Thus, if the row exists, it will be updated in the corresponding file, otherwise it will be created. This is the default setting to use.

    To see this in practice, we will select the Append Only option for the contact entity (a toy illustration of the two modes follows below); for more information, here is the official link: In-place updates vs. append-only writes.
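
To make the difference concrete, here is a purely illustrative sketch (not the actual sync engine) of how the two write modes treat the same sequence of CUD operations on a single record:

  # Toy illustration only: Append Only vs In Place Update applied to
  # the same sequence of CUD operations on one contact record.
  records_in_place = {}   # In Place Update: destination keyed by record id
  records_appended = []   # Append Only: destination as an ever-growing log

  operations = [
      {"id": "c1", "op": "create", "firstname": "John", "lastname": "Doe"},
      {"id": "c1", "op": "update", "firstname": "Johnny", "lastname": "Doe"},
      {"id": "c1", "op": "delete"},
  ]

  for op in operations:
      records_appended.append(op)                  # every operation adds a row, deletes included
      if op["op"] == "delete":
          records_in_place.pop(op["id"], None)     # delete removes the row
      else:
          records_in_place[op["id"]] = op          # create/update is an upsert

  print(len(records_appended))  # 3 -> full history kept
  print(records_in_place)       # {} -> only the current state kept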

Selecting tables and write mode

This allows you to view the initial synchronisation and its progress: 

Initial sync to the Data Lake

You will notice that you can navigate directly to the Data Lake or the Azure Synapse workspace from the command bar:

Go to Data Lake or Synapse

If we decide to open the Azure Data Lake, we can see the following content: a folder for each table we are synchronising, with the chosen granularity (month or year), the associated snapshots inside and of course the metadata for each table.

Data Lake Gen2 content

As you can see, the files are generated by month according to the “PartitionStrategy” defined, based on the “createdon” field:

Table folder content
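
If you prefer to explore the lake programmatically rather than through the portal, a small script can list the partition files per table. This is only a sketch: the storage account URL and container name below are placeholders (the container generated by the link will have a different name in your environment), and it requires the azure-storage-file-datalake and azure-identity packages:

  from azure.identity import DefaultAzureCredential
  from azure.storage.filedatalake import DataLakeServiceClient

  # Placeholders: replace with your storage account and the container
  # created by the Azure Synapse Link (visible in the Azure portal).
  ACCOUNT_URL = "https://<storageaccount>.dfs.core.windows.net"
  FILE_SYSTEM = "<dataverse-container>"

  service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                  credential=DefaultAzureCredential())
  fs = service.get_file_system_client(FILE_SYSTEM)

  # Each synchronised table has its own folder; with a monthly
  # "PartitionStrategy" the snapshots are split per "createdon" month.
  for path in fs.get_paths(path="account"):
      print(path.name, path.content_length)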

If we open the CSV file we can see our two records without the headers, only the data!

CSV file content

Tables and metadata are well defined in the “model.json” file, where we can also find the dates of the initial synchronization and the different partitions.
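
Since the CSV files have no header row, this “model.json” file is also what gives you the column names if you want to read a snapshot yourself. Here is a minimal sketch with pandas, assuming the files have been downloaded locally and that the manifest follows the usual CDM layout (an “entities” array with “attributes” and “partitions”):

  import json
  import pandas as pd

  # Read the CDM manifest to recover the column names of the contact table.
  with open("model.json", encoding="utf-8") as f:
      model = json.load(f)

  entity = next(e for e in model["entities"] if e["name"] == "contact")
  columns = [attr["name"] for attr in entity["attributes"]]
  print("partitions:", [p["location"] for p in entity.get("partitions", [])])

  # Load one snapshot using the recovered column names (the file name is a
  # placeholder matching the monthly partition layout shown above).
  df = pd.read_csv("contact/2021-05.csv", header=None, names=columns)
  print(df[["firstname", "lastname"]].head())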

Now let’s try the connection to the Azure Synapse Analytics workspace! By clicking on the button, we go directly to the workspace, and we can see our Dataverse tables, which can then be queried very simply using SQL:

Querying Dataverse data with Azure Synapse
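
The same queries can of course be run from outside Synapse Studio. Here is a sketch using pyodbc against the workspace's serverless SQL endpoint; the endpoint, database and user names are placeholders to check in the Azure portal (the serverless endpoint usually follows the <workspace>-ondemand.sql.azuresynapse.net pattern):

  import pyodbc

  # Placeholders: serverless SQL endpoint and the database exposed by the link.
  conn = pyodbc.connect(
      "Driver={ODBC Driver 17 for SQL Server};"
      "Server=<workspace>-ondemand.sql.azuresynapse.net;"
      "Database=<dataverse-database>;"
      "Authentication=ActiveDirectoryInteractive;"  # requires a recent ODBC driver
      "UID=<user>@<tenant>;"
  )

  cursor = conn.cursor()
  cursor.execute("SELECT TOP 10 name, createdon FROM account")
  for row in cursor.fetchall():
      print(row.name, row.createdon)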

Well, now that we have everything set up, I wanted to test it by creating a custom table and generating a bit more data. The continuous sync is quick: I created over 90k records (using the XrmToolBox Mockaroo DataMunger from Carl Cookson), and there is only a slight difference between the creation date within Dataverse (on the left) and the creation date from the sync (on the right):

Synchronising account records from Dataverse to Azure Synapse

Adding a table is quite simple: we add it like the previous ones via the “Manage tables” button, and after a few moments you can see that the synchronisation is already done.

Note: You must ensure that the “Enable Change Tracking” is enabled on the table.

Adding a new table

As mentioned above, I enabled the “Append only” option for the contact table and performed a few updates and then a delete on the same record and you can see that all operations are performed in sequence:

Data from Dataverse tables appended to the file

You can see that the whole record is added, not just the update (here I only updated the firstname, and you can see that the lastname is present on each row). So we have a history of all the actions performed. According to the documentation, there should be an isDeleted column set to True for delete operations, but I have not managed to get my hands on it yet.
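
A practical consequence: to rebuild the current state of a table from an append-only file, you only need to keep the last row per record id (and drop the rows flagged as deleted, if that column shows up). Here is a sketch with pandas; the “SinkModifiedOn”, “contactid” and “isDeleted” column names are assumptions to verify against your model.json:

  import json
  import pandas as pd

  # Recover the column names from the manifest, as before.
  with open("model.json", encoding="utf-8") as f:
      entity = next(e for e in json.load(f)["entities"] if e["name"] == "contact")
  columns = [attr["name"] for attr in entity["attributes"]]

  # Yearly partition, since Append Only sets the partition strategy to Year.
  df = pd.read_csv("contact/2021.csv", header=None, names=columns)

  # Keep the most recent row per record to get the current state.
  latest = df.sort_values("SinkModifiedOn").groupby("contactid", as_index=False).last()

  # If the documented delete marker is present, drop deleted records.
  if "isDeleted" in latest.columns:
      latest = latest[latest["isDeleted"] != True]
  print(latest[["firstname", "lastname"]])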

There is still one crucial point to be addressed: how do I deliver all this configuration to different environments? 🙂

For that, we will first install the “Export to data lake” solution from AppSource, which will later allow us to deploy our “export” configuration, a good point for deployment strategies! We can go directly to the portal to get the solution:

Opening AppSource from the maker portal

Once on the AppSource portal, simply search for the appropriate solution and fill in the form to start the installation. You will be redirected to the administration portal page, where you select the correct environment and click “Install”:

Installing the Data Lake Core solution

If you are wondering what is in this solution: it seems there is only one component, a simple table with no form, so it is not accessible via the user interface:

Data Lake Core solution content

We can then create a new solution and add our link to Azure Synapse:

Adding the Synapse link to a solution
Synapse link included in the solution

After performing a quick export of the solution, we can see that we have an entity “msdyn_exporttodatalakeconfig” and that the field “msdyn_schema” contains all the configuration in JSON format. We notice that this alone will not be enough to deploy, because many values must be updated if we want to point to another Data Lake in another environment :/ (a way to patch them is sketched after the snippet).

  <msdyn_exporttodatalakeconfigs>
    <msdyn_exporttodatalakeconfig msdyn_exporttodatalakeconfigid="5bb9ae2a-ecc2-eb11-bacc-000d3a1fc67a">
      <iscustomizable>1</iscustomizable>
      <msdyn_name>datalakegen2dataverse</msdyn_name>
      <msdyn_schema>{"SubscriptionId":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","ResourceGroupName":"rg-synapse-dataverse","StorageAccountName":"XXXXX","BlobEndpoint":"https://XXXXX.blob.core.windows.net/","QueueEndpoint":"https://XXXXX.queue.core.windows.net/","TableEndpoint":"https://XXXXX.table.core.windows.net/","FileEndpoint":"https://XXXXX.file.core.windows.net/","FileSystemEndpoint":"https://XXXXX.dfs.core.windows.net/","Entities":[{"Type":"account","PartitionStrategy":"Year","RecordCountPerBlock":125,"AppendOnlyMode":false},{"Type":"adc_customentity","PartitionStrategy":"Year","RecordCountPerBlock":125,"AppendOnlyMode":false},{"Type":"contact","PartitionStrategy":"Year","RecordCountPerBlock":125,"AppendOnlyMode":true}],"RetryPolicy":{"MaxRetryCount":12,"IntervalInSeconds":5},"DestinationType":"Adls","WriteDeleteLog":true,"IsOdiEnabled":false,"SchedulerIntervalInMinutes":60}</msdyn_schema>
      <statecode>0</statecode>
      <statuscode>1</statuscode>
    </msdyn_exporttodatalakeconfig>
  </msdyn_exporttodatalakeconfigs>
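
To automate that, a small script can patch the msdyn_schema values in the exported customizations.xml before importing the solution into the target environment. This is just a sketch, assuming the solution zip has been extracted and the target values below are replaced with real ones:

  import json
  import xml.etree.ElementTree as ET

  tree = ET.parse("customizations.xml")  # extracted from the exported solution zip

  # Target environment values (placeholders to replace).
  target = {
      "SubscriptionId": "<target-subscription-id>",
      "ResourceGroupName": "<target-resource-group>",
      "StorageAccountName": "<targetstorageaccount>",
  }

  for node in tree.getroot().iter("msdyn_schema"):
      schema = json.loads(node.text)
      schema.update(target)
      # The endpoint URLs embed the storage account name, so rebuild them too.
      acc = target["StorageAccountName"]
      for key, svc in [("BlobEndpoint", "blob"), ("QueueEndpoint", "queue"),
                       ("TableEndpoint", "table"), ("FileEndpoint", "file"),
                       ("FileSystemEndpoint", "dfs")]:
          schema[key] = f"https://{acc}.{svc}.core.windows.net/"
      node.text = json.dumps(schema)

  tree.write("customizations.xml", encoding="utf-8", xml_declaration=True)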

We can now import the solution like any other basic Dataverse solution, then go back to the Azure Synapse Link for Dataverse configuration screen and click on Import from Solution:

Importing the Data Lake configuration from the solution

If we wait a little bit, we can see that the link has been established and the synchronisation is successful!

I hope you enjoyed this blog post and that it either showed you the usefulness of this feature or saved you some time 🙂
