Kisha Gulley was once kicked out of a Facebook group for mothers with autistic children after a c...Read More
Microsoft announced Azure Synapse Link for SQL. In this article, MVP Dennes Torres explains the new feature.
The Link technology in Synapse is used to synchronize data from different data sources to Synapse Analytics. Synapse Link for CosmoDB and Synapse Link for Dataverse were already released. Now it’s time for a new and very important one: Azure Synapse Link for SQL.
The Azure Synapse Link for SQL can be used with Azure SQL Database and SQL Server 2022 to synchronize tables using these databases as sources and a Synapse Dedicated Pool as a destination.
When using Synapse Link, you are not really creating a Data Warehouse, but you are executing a big step in this direction.
The process between production and the Data Warehouse is called ETL – Extract, Transform and Load. You must execute these three steps: Extract from production, Transform the extracted data for the data warehouse structure, and Load the data into the data warehouse.
Sometimes these steps can be executed in a different sequence and many times it’s more complex with even more steps. The steps can also be expressed as an ELT process, Extract, Load and Transform. As you may notice, the difference is the sequence of the steps. In this case, you first load the data, and, after that, you transform the data.
The Load process is usually done to a staging area, where you will execute the transformation of the data to the format required by a data warehouse.
Synapse Link doesn’t execute any transformation on the data. It extracts the data from the source and loads it into the destination. In summary, using Synapse Link, you create a staging environment using very simple steps to transfer the data.
The staging environment is important because it reduces the workload over the production environment. If you transform the data before loading, the production environment would be affected by the transformation workload. On the other hand, loading the data to a staging environment and leaving the transformation for later, the transformation workload will be executed over the staging environment.
Another benefit of this staging environment is the possibility to use this environment for reporting, instead of run reports over production. In some cases, you may would like to use Synapse Link specially for that, even if you are not planning the further transformations into a data warehouse yet.
Besides relieving the production environment from the transformation workload, the staging environment can also help in scenarios with multiple data sources with similar structures. For example, imagine a company with a main office and many different branches. Each branch has its own database, but the data stored has the same schema, for the same business.
In this scenario, Azure Synapse Link for SQL can be used to synchronize the data from all the branches to a centralized staging environment. This becomes an additional benefit of Azure Synapse Link for SQL and the staging area it can easily create.
The structure of Azure Synapse Link for SQL uses two key elements which are always present in the different scenarios where you can implement it.
Change Feed: The change feed is a new feature included in Azure SQL Database and SQL Server 2022 to provide support to Azure Synapse Link for SQL. You may already know SQL Server has two different features to allow the control of changes in a production environment: the Change Data Capture, also known as CDC and Change Tracking. However, Microsoft decided to implement a third method to ensure the workload over the production environment would be as light as possible. This is the purpose of the Change Feed.
Integration Runtime: The IR is the same one used for pipeline execution. The IR is responsible for the start/stop operations on the source database. It can be used as an autoresolved IR or a self-hosted IR, according to the scenario which will be implemented.
Azure Synapse Link for SQL can be implemented in 3 different scenarios:
Azure SQL Database: The fact both services are online inside Azure makes the implementation easier. Azure SQL Database makes some steps easier.
SQL Server 2022 in an Azure VM: SQL Server 2022 requires some additional steps for the use of Azure Synapse Link for SQL. The fact it’s installed in an Azure VM makes the process easier, allowing use of an Autoresolved integration runtime. This scenario worked well during my tests, but it’s not confirmed yet as a fully supported scenario. Officially, for SQL Server 2022, a self-hosted integration runtime is required.
SQL Server 2022 on premises: This is the scenario requiring the most work. Since the server is on premises, you will need to use a self-hosted integration runtime installed on premises to send the data to the landing zone in the cloud. The services in the cloud will not access the on-premises environment. The self-hosted IR will be the responsible for start/stop operations, and the SQL Server on premises will send the data to the cloud.
The basic architecture is always the same: an integration runtime is used to control the start/stop of the Link on the source database. The source database needs to support the change feed feature, it means it needs to be an Azure SQL Database or SQL Server 2022.
The source database will send all the new or changed records to a landing zone. The Synapse Workspace will load the data from the landing zone to the target database.
If the source database is an Azure SQL Database, the landing zone will be internal to the Synapse Workspace. There will not be access to the landing zone. The Azure SQL Database will need to have a system managed identity which will be used to access and save the data in this internal landing zone.
Internally, Synapse Analytics has an Ingestion Service responsible for loading the data from the landing zone to the Dedicated SQL Pool
If the source database is a SQL Server 2022, the landing zone will be explicitly created as an ADLS Gen 2 Account. During the start of the link the integration runtime (IR) will create a database scoped credential to allow SQL Server to access the ADLS Gen 2 Account. This credential will be using a SAS token to authenticate to the ADLS Gen 2 Account.
The start/stop process is executed by the IR, which will connect to the source database using the authentication provided on the linked service.
The authentication to the source SQL can use SQL Standard Authentication or Managed Identity. If it uses managed identity, the Azure AD user will need to receive access the source SQL.
Finally, if the source SQL is on premises, the IR will need to be a self-hosted IR in the on-premises environment.
The Azure Synapse Link for SQL provisioning is done using Synapse Studio. Some steps are slightly different depending on the provisioning scenario.
Linked Services are used by Synapse Analytics to establish a connection to external services. Every linked service uses an integration runtime (IR) to execute actions on the connected service.
You can select the Azure SQL from a subscription you have access to, or you can manually insert a connection string. The result for the service will be the same. This creates an interesting level of flexibility:
SQL authentication is a very well-know authentication. The other types are the following:
System Assigned Managed Identity: This is the Managed Identity automatically assigned to the Synapse Workspace by Azure
User Assigned Managed Identity: This is the customized managed identity assigned to the workspace by the developer. You can assign custom managed identities on the Identity menu item on the workspace, as you may notice on the images below. You can assign multiple managed identities, because in it’s possible to control the permissions to each linked service in an isolated way. This option is not supported by Synapse Link at the moment
Service Principal: Service principal is a special kind of identity. This identity is often used by applications to connect to azure resources. Create this kind of identity by creating an app registration on the Azure Active Directory. This option is not supported by Synapse Link at the moment
Even if the authentication used is the SQL authentication, the user making the configuration doesn’t need to know the password, the password could be retrieved from Azure Key Vault.
You may notice that although a linked service has 4 authentication options to an Azure SQL Database, only two of them, SQL authentication and System Assigned Managed Identity, are supported at the moment. This once again highlights the fact that Linked Services are used for many purposes in Synapse Analytics, not only for Azure Synapse Link for SQL.
When using a System Assigned Managed Identity, Identity with Azure SQL Database, you will need to add it as db_owner of the Azure SQL Database. The code bellow illustrates this with a System Managed Identity.
CREATE USER maltasynapselink FROM EXTERNAL PROVIDER ;
ALTER ROLE [ db_owner ] ADD MEMBER maltasynapselink ;
The access to Azure SQL Database is controlled by a firewall. Using Azure Synapse Link for SQL, it’s Synapse which will access the Azure SQL Database. In order to allow this access, there are three options:
Enable access to Azure Services: This option opens the firewall to any service inside Azure, either the service is in your tenant or not. It’s the easiest option, used very often, but has the highest risk. I wrote about this in this article https://www.red-gate.com/simple-talk/cloud/azure/eight-azure-sql-configurations-you-may-have-missed/ Open the firewall to the Synapse Ips: Synapse has a specific range of Ips in Azure. You can open the Azure SQL Database firewall to this specific range of Ips. It’s safer than enabling access to all Azure services, but you can’t be sure that only your Synapse Workspace will be able to pass through the firewall.
In the future, the use of Private Endpoints may become available for Azure Synapse Link for SQL, but at the moment it’s not supported.
I delivered a technical session about Azure SQL networking, you can watch on this link https://www.youtube.com/watch?v=RKZy5MkqFDQlist=PLNbt9tnNIlQ5pVwZFRVpoBG8uQTs8aIczindex=20
The image above contains the Source managed identity ID. When using Azure SQL Database as a source, it needs to have a managed identity. Synapse will provide access permissions to this identity to access the internal landing zone, as explained before in this article.
You can read more about Azure SQL Managed Identity here https://www.red-gate.com/simple-talk/blogs/azure-sql-and-managed-identity/
You don’t need to use one single Link Connection for all your tables. You can break down your tables in multiples Link Connections. One reason to do so is the number of transactions each table receives. You can create Link Connections with a higher number of cores for the heaviest tables and a different Link Connection for the tables with a smaller number of transactions.
You can read more about table distribution here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute?WT.mc_id=DP-MVP-4014132
You can read more about table design here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview?WT.mc_id=DP-MVP-4014132
Important: Initially the schema and table name will be the same as in the source system. However, the schema will not be created by Azure Synapse Link for SQL. If the schema doesn’t exist in the target dedicated pool the execution will generate an error. If you are using a schema other than DBO, you need to create the schema before starting the link
There are two additional options you can set before publishing the Link Connection. The image below illustrates these options
When this option is selected, if the target tables already exist, they will be dropped. This means the synchronization will need to start again, transferring all the data from the source to the target.
This is something you would not like to happen. However, it’s important to keep in mind how the Stop button for the Link Connection works. It’s easy: Don’t stop the link. Every time you stop the link, when you start it again, it will be required to drop the destination tables and start from the beginning.
There is no need to stop the link. If you would like to include or exclude tables, you can do this without stopping the link. If you remove a table from the Link Connection and add it again, the table will be re-seeded, but only this single table, not all tables in the Link Connection.
Very Important: You never stop a Link Connection. If you stop a Link Connection, you will need to drop the target tables and start the synchronization again from the beginning and this will be heavy. Interruptions can happen, you can stop the SQL Pool, but there is no need to stop the Link Connection and you shouldn’t do it.
This option will ensure the records will be transaction consistent across the tables in the same Link connection.
Of course, not all tables in your source database will have this transactional relation. You don’t need to limit yourself to a single link connection for each source. You can break down your tables among multiple link connections, grouping the tables according to the transactional consistency needed.
After configuring the Link Connection, you must publish the changes on the workspace before starting the link.
The Publish button is always highlighted when there are pending changes in the workspace. Once you click the Publish button, you can review the changes on the Publish All window and click the new Publish button on this window.
After publishing, you can click the Start button on the Link Connection. Once started, as mentioned before on this article, you don’t stop the Link Connection. If you stop the Link Connection, you will need to start the synchronization again from the beginning.
Once you start the Link Connection, you can monitor it on the Monitor tab, Link Connections menu item.
The monitoring information is displayed in two different levels: The existing link connections and the tables inside each link connection.
There are two levels of possible errors:
Important: Even if all the tables have errors, they will not appear as errors on the Link window. The message on the Link window advises you to check the monitoring even if it appears the Link is running correctly. This is a good advice, as you may notice in the image below, from the Link window, in the Integrate tab.
Once you click one existing link connection inside the Link Connections monitoring window, you can see the tables inside the link connection, like the example below.
There are three possible statuses for the tables inside the Link Connection:
There are very few differences when implementing the Azure Synapse Link for SQL with SQL Server 2022. Let’s analyse them.
As explained before, a database scoped credential will be created to allow SQL Server to access the landing ADLS Gen 2 account. If your database doesn’t have a master key yet, you can create it using the statement below.
create master key encrypt by password = 'YourPassword'
After starting the Link Connection, you can check the database scoped credential by using the query below.
select * from sys . database_credentials
Sometimes, when restoring a database from a backup or after some other activities, the database can end up without a database owner.
This is not a good practice and the lack of the owner in the configuration can cause some other problems as well.
You can use the statement below to set the owner for the database. In the statement below, I’m defining SA as the owner. This is not a good practice. You may need to plan better which login you will set as the database owner.
alter authorization on Database :: AdvWorks2019 to sa
You can read more about the importance of the database owner choice on this article: https://www.red-gate.com/simple-talk/blogs/careful-trustworthy-setting/
When creating a Link Connection using SQL Server as a source, the landing area is provisioned together the Link Connection. As you may notice on the image below, you will need to configure an ADLS Gen 2 Account as a landing area.
Some considerations at this point:
You rotate the key without stopping the Azure Synapse Link for SQL (what is something you should never do, as mentioned before). The image below shows the Rotate Token button on the Link Connection properties window.
The Rotate Token window is slightly different than the window to create the token. When rotating the token, you have the option to create a new SAS key or input one manually. This means that you can decide about using an Azure Storage Account Policy later, after the container is already in use.
Creating the key manually allows you to create the key linked to an azure storage policy, which can make it easier to automate the rotation of the key. You can read more about the benefit of Storage Account policies on this link: https://www.red-gate.com/simple-talk/blogs/azure-storage-policies-control-sas-from-the-server-side/
If you decide to create the SAS key manually, it’s important to mind the SAS key will need the following permissions: RACWDL (Read, Add, Create, Write, Delete and List)
The Azure Synapse Link for SQL is not the entire way from production to a data warehouse, but it makes the ETL/ELT process way easier by synchronizing the data with the Synapse dedicated SQL Pool, relieving the production environment from further ETL workload, and allowing you to proceed from this point with the transformation of the data for the data warehouse.