Loading on-premises data to Azure SQL Database
Talend Open Studio for Data Integration Version: 6.3.1
Java Compiler: 1.7
OS: Windows 8
This article walks through the following steps to load data from an on-premises Oracle instance to a SQL Server instance running on Microsoft Azure.
1. Find the JDBC connection details for SQL Server on Azure
2. Download the Microsoft JDBC driver to configure in Talend
3. Create a generic JDBC connection using the external JAR
4. Update the component properties for the Oracle source and connect it to the target
5. Complete the mapping and execute the job
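For step 1, the connection string you collect from the Azure portal follows a standard shape. As a rough sketch (the server, database and credential names below are placeholders, not my actual instance), the URL can be assembled like this:

```java
public class AzureJdbcUrl {
    // Build a JDBC URL in the shape the Azure portal provides.
    // All arguments here are placeholders for illustration.
    static String buildUrl(String server, String database, String user, String password) {
        return "jdbc:sqlserver://" + server + ".database.windows.net:1433;"
             + "database=" + database + ";"
             + "user=" + user + "@" + server + ";"
             + "password=" + password + ";"
             + "encrypt=true;trustServerCertificate=false;"
             + "hostNameInCertificate=*.database.windows.net;loginTimeout=30;";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("myserver", "customerdb", "talend", "secret"));
    }
}
```

Note that Azure SQL requires the `user@server` login form and encrypted connections, which is why `encrypt=true` is part of the string.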
This article will not cover the installation and setup of Talend Open Studio. The assumption is that Talend is already installed and working correctly. For details on how to install and configure Talend Open Studio, see this post.
In the previous post we saw how to load data from Salesforce to an Oracle database using tSalesforceInput, tMap, tOracleOutput and tLogRow. Today we will connect to a SQL Server instance running on Microsoft Azure and load a large chunk of data from on-premises to the cloud instance. If you do not have an account, sign up here. Once it is successfully created you will see $150 credited to your account. This amount can be used for the resources you plan to use and has a validity of one month :).
As you can see below, I have created two instances of SQL Server, located in South India and South UK. The sole purpose of the two different geographic locations was to test speed and latency.
Get the database connection details to use inside Talend.
I tried using the default driver that comes with Talend and found that it is extremely slow, loading just 20 records per second. The screenshots below are from the default SQL Server connection.
Retrieve Schema will show you the schema and object details.
Since it was painfully slow, I decided to use the latest JDBC driver provided by Microsoft. To use this driver we have to create a generic JDBC connection. Update the Azure instance parameters as shown below. Notice I have added the external JAR “sqljdbc42.jar”. The only limitation with this driver is that you cannot retrieve the schema like we did above.
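Outside Talend, the same generic connection boils down to a few lines of plain JDBC. This is only an illustrative sketch: the URL uses placeholder names, and sqljdbc42.jar must be on the classpath for the call to actually succeed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class AzureConnectDemo {
    // With a JDBC 4.x driver such as sqljdbc42.jar on the classpath,
    // the driver self-registers and no Class.forName call is needed.
    public static Connection connect(String url) throws SQLException {
        return DriverManager.getConnection(url);
    }

    public static void main(String[] args) {
        // Placeholder connection string, same shape as the Azure portal provides.
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   + "database=customerdb;user=talend@myserver;"
                   + "password=secret;encrypt=true;";
        try (Connection conn = connect(url)) {
            System.out.println("Connected to " + conn.getCatalog());
        } catch (SQLException e) {
            // Fails here unless the driver jar and network access are available.
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```

This is essentially what the generic JDBC component does for you once the driver JAR and URL are configured in its properties.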
At the moment I have a customer table in both databases, and they have a one-to-one mapping with a simple transformation to uppercase the customer name.
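The transformation itself is a one-line tMap expression, roughly `row1.CUSTOMER_NAME.toUpperCase()` (the column name here is an assumption based on my schema, adjust for yours). In plain Java the same logic, with a null guard so empty source rows do not fail the job, looks like:

```java
public class UppercaseName {
    // Same logic as the tMap expression: uppercase the customer name,
    // passing nulls through so empty source rows do not break the job.
    static String toUpperName(String customerName) {
        return customerName == null ? null : customerName.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toUpperName("john doe")); // prints JOHN DOE
    }
}
```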
After execution you can see the number of rows processed per second. This time it was super fast :). It took around 12 minutes to process 710K records. I am sure this could be tuned further had I used a wired LAN connection instead of WiFi. There are many other parameters that can be taken into consideration to bring it down from 12 to 6 minutes.
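As a quick sanity check on those numbers (710K records in roughly 12 minutes):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        int records = 710_000;          // rows loaded
        int elapsedSeconds = 12 * 60;   // ~12 minutes
        double rowsPerSecond = records / (double) elapsedSeconds;
        // Roughly 986 rows/sec, versus ~20 rows/sec with the default driver,
        // so switching drivers gave close to a 50x improvement.
        System.out.printf("~%.0f rows/sec%n", rowsPerSecond);
    }
}
```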
I was also monitoring DTU consumption, CPU percentage and a few other parameters relevant to my POC. You can notice it was hitting around 8 to 9% of the total DTUs allocated as part of the S2 Standard tier (50 DTUs). If you want to know more about DTUs, refer to this link.
That's all for today.
Let me know if you have any questions.