Memory Usage in SQL Server for R

While R is an open source language, there are a number of different versions of R and each handles memory a little differently. Knowing which version is being used is important, especially when the code is going to be migrated to a server. As part of a SQL Server implementation, there are three different versions of R which come into play. The first is standard open source R, commonly known as CRAN R. This is the standard open source version of R which runs code in memory and is single threaded. The next version which will be installed as part of a SQL Server Installation is Microsoft R Open. This version of R was written to take advantage of the Intel Math Kernel Libraries [MLK]. Using the libraries speeds up many statistical calculations which use matrix operations. It also adds multi-threading capability to R as the rewrite provides the ability to use all available cores and processors and process in parallel. More information on how it works and how much faster Microsoft R Open is compared to standard R is available here. To use Microsoft R Open, once it is installed, in Rstudio should automatically start using it. To check out what version of R that is in use, within R Studio, go to Tools->Global Options and look at the R version.

rstudioversion

R studio here is using Microsoft R Open. To take a look at the version in Visual Studio which has R Tools installed, go to Rtools and Options. Microsoft R Open is open source and can be used by anyone to speed up the execution of their R code. The version of R which is implemented within SQL Server is not this version of R. The R server version is proprietary, and Microsoft is not giving it way for free, as to run it on a server you need to purchase either SQL Server 2016 or R Standalone. That version is R Server. There is full compatibility on all of the versions of R. Code written in either CRAN R or Microsoft R Open will work in R Server. To write code for R Server, the R Client needs to be installed. The Visual Studio Screen shows this version of R installed on a PC which also is running SQL Server 2016 Developer edition.

vsr

R and Memory Consumption

One of R’s strengths and weaknesses is the fact that R runs in memory. This is good thing because it means R is very fast. It is not such a good thing when you need more memory than your machine has. When reading through the product information surrounding Revolution Analytics’ version of R [Revo R] which Microsoft bought, one of the features is the ability of this version of R to be able to use not only memory but disk, by rxapplying chunking logic. The code is broken into smaller pieces, executed then put back together. Using Revo R, it’s possible to do this by using the specifically designed functions which all start with rx which implement this functionality, known as ScaleR. This does not mean that all R code running on SQL Server will be using these functions. Chances are most of it won’t, especially if the R code is being migrated from the existing data science team. Prior to being bought by Microsoft, Revolution Analytics charged money for their product. If your organization didn’t have a license, chances are no one was writing rx Functions. For people who don’t know anything about R, and have been asked to create packages to run R, there is an easy way to tell if the code has been written to use the chunking. If there are no words in the code that start with rx, like rxSummary, rxCube or rxHistogram to name a few, the code will not use chunking.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Power BI Auditing

Auditing is one of the recent features in Power BI Preview which can be a great tool for monitoring what is going on in Power BI. There were a number of people who asked for this feature on the Power BI Ideas site, which shows that Microsoft is responding to what is posted there. When I read about auditing, I thought that it was a Power BI feature. While that is true, the scope of auditing is much bigger than Power BI. Auditing is designed for Office 365. Yes, you can turn it on for Power BI, as well as for SharePoint and a number of other Office applications. Since auditing is really an Office 365 tool, one has to be an Office 365 admin to be able to be able to access the tool, as it is turned on from within the Office 365 tenant. To be able to audit anything, you need to wait 24 hours until after it has been turned on to receive any data. It doesn’t go back in time, which is unfortunate as usually you figure out that you want an audit after something happens, and then it is too late. Best to just turn it on now. If you have free users and Pro Users, you will notice that you only get auditing for the pro users.

Accessing the Power BI Audit Log

Clicking on the gear in the upper right corner of the Power BI Service will take you to the Admin portal. If you select audit logs, the screen has a button on it which opens Office 365 Admin Center. Clicking on the search box will show all of the applications which have audit logs. Scroll down to Power BI activities.

powerbiaudit

As you can see by looking at the available Power BI options, there are a number of options to choose from. If you select the top item PowerBI activities, then everything gets selected. After doing that click outside of the menu for the menu to go away. Select a date and time range of your choosing, select specific users if you wish, then click on the Search button. Depending on how big your date range is, this may take some time to load. Once you see the results, you have the ability to filter as well.

auditdeletepowerbi

The screen starts responding as soon as you start typing. I got as far as Del, and the screen changed to show only the items Deleted.It’s a pretty neat feature and a simple way to monitor what is going on in Power BI.

ITDevConLogo

I am going to be talking about more methods to administer Power BI at IT/Dev Connections. If you are attending, please come by and say hello. I would love the chance to meet more people who read my blog. For those of you who are not in Vegas this week, I will be posting more information on Power BI and R to provide supplemental material for my sessions. Check back later in the week or please subscribe to my blog for the highlights.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Analyzing JSON in U-SQL

In USQL there are built-in extractors for parsing text, comma delimited or tab delimined files. Once again, parsing JSON becomes problematic. There is a solution built into USQL, write some C# code to extend it or use someone else’s C# code to extend USQL. Since I wanted to parse JSON, fortunately there are libraries available on github containing the information required to do it. Download the github package and open up the Microsoft.Analytics.Samples project in Visual Studio. When I did this the first time, there was a problem loading the Newtonsoft.Json reference, so I right clicked on the references and downloaded the missing parts again. Build the solution and check out the code in the directory …Examples\DataFormats\Microsoft.Analytics.Samples.Formats\bin\Debug\ . There will be two DLLs, Microsoft.Analytics.Samples.Formats.dll and Newtonsoft.Json.dll. These dlls then need to be registered in Data Lake Analytics and locally if you chose to run your USQL locally. As at some point the goal is to run from within Data Lake analytics, you will need to copy both of these dlls to the data lake. I created a folder for the dlls called Assemblies, and ran this command


USE DATABASE [master];
CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Newtonsoft.Json.dll";
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"Assemblies/Microsoft.Analytics.Samples.Formats.dll";

Notice I told the USQL where to find the dlls, in the Assemblies folder. This step only needs to be completed once per data lake. After this job successfully runs, then the dlls which allow the JSON to be parsed, can be referenced.

Here is my sample JSON, which I have copied to the folder Samples/Data/TestNew.Json, in the Data Lake

{
"appInstanceId": "357ced1e-cf05-459c-9317-794bq24f61c2",
"firmwareVersion": "1.0.2.4",
"serialNumber": "254542-694967",
"Side": "0",
"Latitude": "33.8848744",
"Longitude": "-128.403276",
"GeneratedDate": "2016-10-04T21:18:19Z"
}

Now that I have added the JSON to the Data Lake and the assemblies have been added, I can write some USQL to Parse the JSON. First I will need to reference the libraries, then create a schema, as there is no schema for a Data Lake. After those steps are completed, it’s possible to write SQL to query a JSON file. There is no UI to look at the results, so the results will be writing to a file. I am going to output the data to a csv file called JSONOutput.csv. Here’s the code to do that.

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @infile string="/Samples/Data/TestNew.json";

@logSchema =
EXTRACT name string
, appInstanceId string
, firmwareVersion string
, serialNumber string
, Side string
, Latitude float
, Longitude float
FROM @infile
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@testthis = SELECT appInstanceId
, COUNT(*) AS LocationCount
FROM @logSchema
GROUP BY appInstanceId;

OUTPUT @testthis
TO "/Samples/Data/JSONoutput.csv"
USING Outputters.Csv();

vsrunjson

Using Visual Studio, I am running the USQL Job. There isn’t much data to parse, and you can see in the summary widows that it took 21 seconds to prepare, and 33 seconds to run.

When go to the web and look at the Data Lake Analytics page, I can also see that the job completed. I have noticed that this appears pretty close to the same time on the web and on visual studio.

azuredlscreen

Clicking on the bar graph represented by today will allow me to select the job which ran, showing the same screen as appears in Visual Studio.

Thanks to Erik Zwiefel and Mark Vaillancourt b | t both of Microsoft for helping me figure out the process to use JSON in Data Lake Analytics, as I didn’t understand the steps which are required to parse JSON. I hope this blog makes it possible for you to figure out how to make it work.

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Using Visual Studio for U-SQL Data Analytics Jobs

The pricing for USQL is based upon how many Analytic Units and Completed Jobs. To decrease the amount of money being spent, it would be most efficient if only completed jobs ran on SQL, not the 27 times the job was run to debug it. Fortunately, all of the debugging can be performed locally and only working jobs need to be run. Another thing that you may notice if you are exclusively using the Azure Portal for running Data Lake Analytic jobs is there is no way to actually save a job. Once the job is completed, you can review the job then click on the View Script button. Don’t rely on the button though, because for reasons unknown, sometimes the View Script Button is not enabled, meaning that it is not possible to see what ran.

Data Lake Analytics Setup for Visual Studio

There are a few steps required before any code is run. If the Data Lake Analytics Tools are not installed within Visual Studio, download them here and install them. When the tools are installed, the menu item Data Lake appears in Visual studio. The second step is to model your PC with the same file structure as your data lake. The default location which the Data Lake tools will look for your data structure is C:\Users\<<insertyourname>>\AppData\Local\USQLDataRoot . What this means is if you have folders and subfolders created in your data lake, your PC needs to have the same structure, including the data.

Running Data Lake Jobs Locally

If you take a look at the screen picture of Visual Studio with the data lake installed, you will notice a series of buttons at the top of the screen. The middle button currently is set to (Local). The drop down box at the top of the top of the screen will allow you to set the job to either your Azure Data Lake Analytics instance or locally. If it is running locally there will be no charges incurred on Azure. Also in Visual Studio, of course you can save the name of the USQL file.

vsdatalakescript

 

When the context is switched to the Data Lake Analytics instance from Azure, you will probably want to check out the Summit button. There is only one option, Advanced. In this window, you can change the Job name. It is default set to the name of the script being run, but if you are running the same script over and over, you may wish to change this name so that the different instances can be identified. Parallelism can also be set to the value that is actually being used in the job. Take a look at the job view, which is the tab to the left of Script. This screen shows the processes in use when the job is run and set the value accordingly. You will be charged for the Parallelism value that is set, not the amount actually used. Setting to a lower value can decrease the cost of running a job.

The tab on the far left, shows the job with the same view shown in the Azure Portal Stream Analytics job. That screen is shown below.

usqlscreenRunning on Visual Studio also has the benefit of less changing between screens than the Azure Portal, which is another reason to develop here. Now that I have this environment set up, I plan on writing all of my Data Lake Analytics jobs here, as I find the development environment works better for me. Let me know what you think of it by commenting below. If you are interested in finding out more about running Data Lake Analytics Jobs, especially if you are trying to parse JSON, please subscribe to my blog as that topic will be in an upcoming post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Event Hub Troubleshooting

When creating an Azure Event Hub, chances are there will be no errors. This is not always a good thing, as it may mean that the errors exist but do not appear. Maybe the event hub is sending data, but the data cannot be read by a stream analytics job. Maybe the event hub really is working, but nothing appears in the dashboard. If any of these problems sound familiar, this post should help.

Testing the Event Hub

If you don’t have a source of data, like a raspberry pi or a sensor sending data, you can use this guide to create a C# program to send data to your event hub. Chances are though, this code is going to have to be modified even more than the instructions indicate, because the data sent is not in JSON. While it is not a requirement that data sent to the event hub be in JSON, if you want to read it with stream analytics that is one of the acceptable formats needed. If you are using the code provided and you want to insert a record into a database field input01, the message needs to be changed to the following to add the double-quotes and brackets required by JSON.
var message = "{'input01':\"" + Guid.NewGuid().ToString()   + "\"}";

Validating the Event Hubs Receive Messages

To ensure that the event hub is actually receiving data, validation can only be done in the old Azure portal. The service bus icon is two down from the HDInsight elephant. Double-clicking on the service bus namespace will bring up the a list of event hubs. Double-clicking on that will show this screen. This screen picture was taken roughly at 7:10. How many messages are there at 7:00? None.

eventhubbefore

This screen print was taken at 7:17. Notice anything different about the message count at 7:00?

eventhubafter

Oh look, there are 144 Messages which came in between 7:00 and 7:05. This means that everything really was working, I just needed to wait to see them appear. The wait time tends to vary from 10-20 minutes. Perhaps nothing is wrong. Lucky if this is you as you can stop reading

Stream Analytics with Event Hubs

If you are using an event hub to pass data to a stream analytics job, step one, make sure the stream analytics job is started. Created does not mean started as it should say Running as shown in the clip below.

The input for this stream is set to an event hub which has a standard subscription. The basic subscription, which is of course cheaper, has one default consumer group. With a standard subscription multiple consumer groups can be created and more importantly named. When setting up the inputs there is a blank for the name of the consumer group. If you have a basic subscription this will be empty. If it is empty, then the event hub won’t pass data to the stream analytics job. Perhaps there is a way to get a basic event hub to work with a stream analytics job, but I couldn’t make it happen. When I created an event hub with a standard subscription and created a consumer group and added that name to the input of a streaming analytics job, it worked.

If you have found these troubleshooting tips helpful, please subscribe to my blog, as I will be passing along more tips in my next post which will detail the steps of how to get data from the event hub to a Azure Database.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

Group Workspaces and File Sharing in Power BI

When working on Power BI Reports, much of the time the task is spread among multiple people. If this is the case, you probably will also want to provide limited access to the PBIX files used in your Power BI reports. It would also be nice to have the ability to check out a file to insure that one person doesn’t over-write the changes of someone else. It would be nice to track changes that have been made to the files too. Based on this description, it may sound like the next topic will be implementing source control software, like TFS or Subversion. Instead, all you need to do is use the Group Workspaces functionality with Power BI.

Report Development Teams needs Power BI Group Workspaces

Power BI is licensed by the individual. This means Power BI reports created within a My Workspace belong to that person. This can be a problem if say that person goes on vacation or quits. If the reports are all contained within one person’s account and that individual’s account is deleted when they separate from the company, what do you think happen to all of the reports? Answer they are gone too. Maybe a call to Microsoft Power BI Support will be able to help if that occurs, or maybe not. It’s would be better to create the ability within the Power BI environment where the company owns the company reports. Even if the Power BI reports are all created by one person, I would create a workgroup. After all that one person might want to go on vacation sometime, and for disaster recovery reasons someone else needs to be able to access the reports Of course by that I mean, create a Workspace within Power BI, and add other people to it. When you create a workspace within Power BI, not only do you get a location where multiple people can visualize the same reports, a Group One Drive is also created where PBIX files can be shared within the group.

Source Control with Power BI Reports

PowerBIGroupWorkSpaceMany of the functionality people associate with source control programs live inside the group one drive which is created for Power BI. Looking at the picture of the group screen, which was created when a Power BI Workspace was created, you will see that this group contains 7 members and four files. The members of this groups are the only ones who have access to the files. The file AcmeThree.pbix is selected, and clicking on the ellipse(…) brings up a menu for the file. Notice one of them is Check Out. If I check out a file, the icon next to the name changes, providing a visual queue to all who wish to edit the file that it is being working on. The menu option for me would change to Check In, providing the ability to check the file in to the directory, allowing others to check out the file and work on it. Notice Version History also exists. This feature allows previous versions of the file to be loaded, which means that changes made to a file can be rolled back.

Using One Drive with Workgroups provides a convenient way for groups of users to work on reports, ensuring that the reports are stored in a secure location and can be access by multiple people within the organization. If you are looking for way for a team of people to collaboratively work on reports, create a Group Workspace in Power BI then use the One Group functionality to provide a secure method for report development.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

U-SQL and Azure Data Lake Analytics

There are a number of different SQL Flavors–HQL, PL/SQL, MySQL, U-SQL, T-SQL — all of which are a derivative of Ansi-SQL, which is I suppose in today’s parlance, A-SQL. Many people have not heard of U-SQL, which Microsoft introduced on September 28, 2015. Since the announcement was in the Visual Studio Blog, a number of data people may have missed it. U-SQL is meant to combine the ease SQL with the functionality of C# to create a language which can process any kind of data, like videos or text, by creating the ability to customize the code and infinitely scale. This is very useful if for example all of the data is stored in an Azure Data Lake.

Using U-SQL in Azure Data Lake Analytics

In my previous series on Stream Analytics, I wrote some U-SQL. That U-SQL didn’t look much different than Ansi-SQL, which is sort of the point of porting the functionality to a different yet familiar language. Another application which heavily uses U-SQL is Azure Data Lake. Data Lake stores its data in HDInsight, but you don’t need to write hive to query the data, as U-SQL will do it. Like Hive, U-SQL can be used to create a schema on top of some data, and then query it.

For example, to write a query on this csv file stored in a Data Lake, I would need to create the data definition for the data, then I could easily write a statement to query it.

PopsicleDataLake

@searchlog =
EXTRACT SaleDate string,
SaleLocation string,
Lemon   int,
Orange     int,
Temperature     int,
Leaflets int,
Price                      string
FROM "Samples/Data/Popsicle.tsv"
USING Extractors.Tsv();


@testthis = SELECT SaleLocation
, COUNT(*) AS LocationCount
FROM @searchlog
GROUP BY SaleLocation;


OUTPUT @testthis
TO "Samples/Data/Output/SaleLocCount.csv"
USING Outputters.Csv();

In this U-SQL code, I am creating a structure for the data, querying some fields, and writing the output to another file. Make sure that you don’t forget the semi-colons as that will cause errors. Also if any of your fields are blank you will have to code for that as well. From with Data Lake Analytics, the U-SQL is run as a job, creating a new file. Note the time that it took to finish the job.

USQLJob

 

The reason data is stored in a Data Lake is to provide a single storage location for the data, which will be used in analytics. U-SQL provides a powerful tool for getting the data out.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Data Factory – Troubleshooting Errors

IBM PunchCard computer reader

Punch card and punch card reading machine

If you have been reading my Blog Series on Data Factory, you will notice that I didn’t talk about what to do when you have errors, until now. Data Factory is different from most other programming languages which you may be familiar with, such as C#, Java, R, SSIS or VB. In those languages one can step through the code and look at the variables or values while the code is running. perform any of the standard set of troubleshooting techniques which are used in other languages to help resolve errors. Data Factory doesn’t provide a way to really determine what is going on internally when the code is running. Debugging Data Factory harkens back to day when people lined up punch cards and waited to get the output at the other side.

Data Factory Error Codes

Unfortunately, while developing Data Factory I became very familiar with errors. All of the errors show up at the end and provide very little insight as to what in the process failed. Here’s an example.

Database operation failed on server ‘Sink:DBName01.database.windows.net’ with SQL Error Number ‘40197’. Error message from database execution : The service has encountered an error processing your request. Please try again. Error code 4815. A severe error occurred on the current command. The results, if any, should be discarded.

I didn’t get any results to discard, so that helpful hint was not applicable, ever. I was able to resolve this error a few times by resolving different reasons why the error might have occurred by looking through the code and guessing at what line could possibly have caused the error. Generally speaking these errors were found after the first few things I tried did not work. Metaphorically, I had to wait for my punch cards to be read through the machine to see if I had correctly guessed what might be wrong. I have heard this process described as “black box”, but I think a more accurate description would be a punch card computer, as black box is too cool of a name for a process this heinous. In one instance, this error was received when the data in the field was longer than the field definition. That took a while to find since I had over 25 fields to review. Another time I got this same error when I had a typo in a field name, making the message appear completely arbitrary.

Data Factory Error 1000

Batch Execution failed. The response from the Machine Learning service at endpoint…(excluded specific job reference codes here)… {“Exception”:{“ErrorId”:”LibraryException”,”ErrorCode”:”1000″,”ExceptionType”:”ModuleException”,”Message”:”Error 1000: TLC library exception: Exception of type ‘Microsoft.Numerics.AFxLibraryException’ was thrown

If you search on the internets, you will notice that Error 1000 comes up a lot. The reason for that is it is a catch-all error number. There are 999 actual error messages coded and anything that isn’t one of those errors is error 1000. This problem could be anything. In my case it was because my web service wanted me to strongly type the data coming in, instead of having the data default to text, which worked outside of the web service. I did not resolve my problem after reading the error message and coming up with a logical guess based on my code. I had to create an Official support ticket to answer this question as the answer eluded me.

Data Factory Troubleshooting Error Strategies

Regrettably, there is no really good way of resolving errors. It’s not possible to look at anything in process and see how the data is being processed. Instead errors pop up when the execution fails. There are a few things you can do though. If you are calling an Azure Machine Learning Web service, I recommend modifying your input data so that if it looks like a number but it is read in as text, to use the meta data editor to type the data. Make sure you test the batch execution prior to loading it in Data Factory. For other pipelines, I employed a binary sort to determine what field might have the error. I commented out half the fields, and half the fields until I could determine what field had an error. Eventually I figured out the field which must be short. I also decreased the size of data using the code to 3 rows so I wouldn’t have to wait so long for it to fail.

Data Factory Series

This post is the end of my five part series on Azure Data Factory. I hope you have found it both interesting and useful when trying to learn Data Factory. If you have found this series interesting, please subscribe to my blog to be notified of the latest topics. Given that I plan on doing a lot of speaking in the very near future on topics such as R, SQL Server 2016 and Power BI, those topics are going to be on my blog in the very near future.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Migrating Data from Azure Blob Storage to SQL Server

There are instances where data resides in Azure Blob Storage and the data is needed in a SQL database. For example, if one ran a Machine Learning experiment in Data Factory, the results would be stored in Azure Blob storage, and for analysis purposes, it may make a lot more sense to move the data to SQL database. Moving data around in Data Factory, means writing JSON. In this example we will be using an Azure SQL DB, but it is not essential that the data be stored in Azure. An on-premises SQL Server could also be used, as long as a gateway was added for the connection, the other steps would be the same. There are five different Data Factory elements required to move data from an Azure blob to a database: a pipeline for the data, a data set containing the definition for the blob, a linked service for the blob, a data set containing a definition for the SQL Data, and a linked service to connect to the SQL database.

JSON Data Service

The data to be moved to SQL is stored in a blob storage container called OutputML, and both the linked service and that data set are included in a previous post on running an ML experiment.  In this linked service, the JSON creates the field definition to be written to a table in a SQL database called CensusMLOutput. There are fewer data types than there are in SQL, meaning the JSON here doesn’t exactly match the table definition, but the less granular datatypes are accepted by SQL.

"name": "OutputML",
"properties": {
"structure": [
{
"name": "Age",
"type": "Int32"
},
{
"name": "workclass",
"type": "string"
},
{
"name": "education-num",
"type": "Int32"
},
{
"name": "marital-status",
"type": "String"
},
{
"name": "occupation",
"type": "String"
},
{
"name": "relationship",
"type": "String"
},
{
"name": "race",
"type": "String"
},
{
"name": "sex",
"type": "String"
},
{
"name": "hours-per-week",
"type": "Int32"
},
{
"name": "native-country",
"type": "String"
},
{
"name": "Scored Labels",
"type": "Int32"
},
{
"name": "Scored Probabilities",
"type": "Decimal"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "CensusMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

JSON for Linked Service Output

The Data set defined references a Linked Service named LinkedServiceOutput. In this JSON the information needed to connect to the database where the table is for the code to write to it.

{
"name": "LinkedServiceOutput",
"properties": {
"description": "",
"hubName": "GingerDataFactoryTest_hub",
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=jytr4gph.database.windows.net;Initial Catalog=MLData;Integrated Security=False;User ID=gingerg;Password=**********;Connect Timeout=30;Encrypt=True"
}
}
}

The code includes my id and a password, which is encrypted when the linked service is saved. Now that we have the data components defined, all that is required is for a Azure Data Factory pipeline to move the data.

 

JSON Data Factory Pipeline to Move Data to SQL

The pipeline PipelineCopyMLOutput is pretty straightforward, as it defines the action which should take place, copy and implements it. One thing to note that unlike copying a csv file, the data in a table is appened, meaning every time that this pipeline runs, more data will be added to the table. This code does not contain anything to prevent data from being duplicated, which will happen if the input does not change.


{
"name": "PipelineCopyMLOutput",
"properties": {
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "BlobSource",
"skipHeaderLineCount": 1
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
},
"inputs": [
{
"name": "OutputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputML"
}
],
"policy": {
"timeout": "01:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval"
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "Copy Activity"
}
],
"start": "2016-08-24T16:44:00Z",
"end": "2016-08-25T19:00:00Z",
"isPaused": true,
"hubName": "GingerDataFactoryTest_hub",
"pipelineMode": "Scheduled"
}
}

To run all of this JSON, you can wait an for it to run on schedule or run it Ad-hoc, which I detail in this post .

Data Factory Workflow

Combing all of the Data Factory components which are included in this and in previous posts, the entire work flow diagram is shown below. In the first pipeline, data is copied from the database to blob storage. Next the blob storage data is used to run an Azure ML experiment which outputs data to blob storage. Lastly the results from the experiment are copied to a database.  Notice all of the lovely green checks which exist in the diagram.

DataFactory

This blog series on data factory has covered everything from creating Azure components needed to using Data Factory to run a ML Web service and sending the results to the data base.  In my next and last post for a while on Data Factory, I will be discussing troubleshooting, an essential process in getting all the code to work. To be notified when new posts appear, please subscribe to my blog to receive the latest. I hope that you have found this to be useful. If so, please leave me comments or message me on Twitter as I would love to hear what others are doing with Data Factory.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Data Factory – Executing an Azure Machine Learning Web Service

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] Web Service. When calling a ML Web Service, the data must be in an Azure Blob Storage account. Once a ML model has been trained, and a web services has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

ML DF PipelineTwo new steps need to be added to the existing Data Factory Pipeline, one to call the ML Web Service and one for the output. The ML pipeline requires two pieces of JSON code, a linked service to make the connection to the web service and a pipeline to invoke the job and specify the inputs and the Outputs. For the Output, the first step requires no JSON as first a blob storage container in Azure needs to be created to store it. The next steps involve writing JSON to create a linked service to connect to it and lastly an Output dataset needs to be defined.

Calling Machine Learning Service

The Linked Service for ML is going to need some information from the Web Service, the URL and the API key. Chances are neither of these have been committed to memory, instead open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Page grid, there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”. Copying the entire URL will cause an error. Going back to the web Services page, The API Key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService

{
"name": "AzureMLLinkedService",
"properties": {
"description": "Connecting ML Experiment”
"hubName": " GingerDataFactoryTest_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
"apiKey": "**********"
}
}
}

We will need another linked service for the Output, which takes the data from the experiment and writes it to a blob. The field names in the experiment are listed.

{
"name": "OutputML",
"properties": {
"structure": [
{"name": "Age", "type": "Int32" }
,
{ "name": "workclass", "type": "string" }
,
{ "name": "education-num", "type": "Int32" }
,
{ "name": "marital-status", "type": "String" }
,
{ "name": "occupation", "type": "String" }
,
{ "name": "relationship", "type": "String" }
,
{ "name": "race", "type": "String" }
,
{ "name": "sex", "type": "String" }
,
{ "name": "hours-per-week", "type": "Int32" }
,
{ "name": "native-country", "type": "String" }
,
{"name": "Scored Labels","type": "Int32"}
,
{"name": "Scored Probabilities","type": "Decimal"}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "LinkedServiceOutput",
"typeProperties": {
"tableName": "ExperimentMLOutput"
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

The API key will show the actual value until you save it, at which point it will change to the stars you see here. This Linked Service will be referenced in the next bit of JSON for the pipeline


"name": "PipelineML",
"properties": {
"description": "Use Azure ML Model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "InputDataSetBlob",
"webServiceOutputs": {
"output1": "OutputDataSetBlob"
},
"globalParameters": {}
},
"inputs": [
{
"name": "InputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputDataSetBlob"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "Execute Experiment",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2016-08-19T10:30:00Z",
"end": "2016-08-20T23:30:00Z",
"isPaused": true,
"hubName": " GingerDataFactoryTest_hub ",
"pipelineMode": "Scheduled"
}
}

Lastly another Dataset needs to be created to process the output. The data will be written to a file called Output.csv, which is in a folder called mloutput01/ which is located in the Blob storage container, which is the same one I used previously for the input folder used earlier.   This file will be overwritten every single time this is run.

{
"name": "OutputDataSetBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "output.csv",
"folderPath": "mloutput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

If you add this code onto the previous Data Factory code, you can take data from the database and use it to run a Azure ML experiment and run as much data as you want through the experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur