Data Factory – Copying Data from a Database to an Azure Blob Store

My previous post provides instructions on Getting Started with Data Factory. To borrow a line from Ron White, “I told you that story so I could tell you this one”. If you followed the steps in that post, you have a Data Factory resource. Time to use it. This post describes how to process some data by copying data currently stored in a database to a Blob Storage account. To complete that task, you will need to write some JSON. Please be careful when you do this, as writing JSON may involve cursing JSON. Since Jason was in the list of top four male baby names in the US from 1973 – 1982, chances are you may work near someone named Jason when writing JSON, who may think the cursing is directed at him.

Author and Deploy Linked Services

To be able to pass data from a database, a connection needs to be established to it. That connection is made by creating a Linked Service. In Data Factory, if you create a New data store, a new Linked Service JSON template will be created. After clicking on the New data store icon, a list of data stores will appear. In this example I am going to pick an Azure SQL Database. If I wanted to access a local SQL Server database, a gateway to connect to it would also be required. The JSON Linked Service template will appear under the Drafts menu and will be labeled Draft-1. A copy of the code is included below.

{"name": "AzureSqlLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"description": "",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>; UserID=<username>@<servername>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
}
}
}

If nothing is changed and the Deploy icon at the top of the screen is clicked, a new item, AzureSqlLinkedService, will appear under Linked services. There will be a circle with an exclamation point next to it because the connection does not contain valid data, but you can deploy it anyway. To get rid of the error, replace all of the default information provided after each colon with valid values, starting with the name, which defaults to AzureSqlLinkedService. The name property dictates what the Linked Service will be called, and the name cannot be changed later. To rename it after deployment, highlight the name of the service, click on Clone, change the name, deploy, then highlight the old name, click on the …More text at the top of the screen, select Delete, and confirm the deletion of the Linked Service by clicking on the Yes button. Alternatively, just select a good name before clicking on the Deploy icon. Within the connection string, replace all of the items between < and > with valid values. If there are any errors after deploying, fix the data, click on Deploy, and the circle with the exclamation point will go away. Invalid database names are accepted as long as they are in the right format; if the name is not valid, an error will be received when the data factory is run.
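For reference, here is a minimal sketch of what the Linked Service might look like once the placeholders are filled in. The server, database, and user names below are made up for illustration only; substitute the values for your own Azure SQL Database and keep the password out of plain text wherever you can.

{
    "name": "GingerAzureSqlLinkedService01",
    "properties": {
        "type": "AzureSqlDatabase",
        "description": "Connection to the Azure SQL Database holding the census data",
        "typeProperties": {
            "connectionString": "Data Source=tcp:gingersqlserver01.database.windows.net,1433;Initial Catalog=CensusDB;User ID=dfreader@gingersqlserver01;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
        }
    }
}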

Because we need a connection to a database and to an Azure Blob store, two Linked Services are required, one for each type. Prior to completing this step, create an Azure Blob storage account by clicking on Add in All Resources. Create the second Linked Service like the first: click on New data store and then select Azure Storage. Using the template for an Azure Blob Storage Linked Service, I have modified it below, adding the “hubName” as it is required.


{
    "name": "GingerAzureBSLinkedService01",
    "properties": {
        "description": "Test Azure Blob Storage Account for DF",
        "hubName": "GingerDataFactoryTest_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=gingerblobstorage01;AccountKey=**********"
        }
    }
}

The hub name is the name of the Data Factory resource you created earlier, and the account name comes from your storage account. You will need to get the actual Account Key and replace the asterisks with it. Once the JSON is deployed, the value will appear as a line of asterisks again.

Create a Data Factory Dataset

Two datasets representing the data accessed in the two Linked Services must be written in JSON. Click on …More and select New dataset. Unlike Linked Services, you actually do get an option called New dataset, so the name is consistent. Creating the JSON for the database is a bit of a pain, as you have to define every single column accessed, as well as the datatype for each one. Both order and capitalization are important; they must exactly match what is in the database, or you will get an error after the code is run. You won’t see any errors when you type it. Here is a sample:

{
    "name": "InputDataSet",
    "properties": {
        "structure": [
            { "name": "Age", "type": "Int32" },
            { "name": "workclass", "type": "String" },
            { "name": "education-num", "type": "Int32" },
            { "name": "marital-status", "type": "String" },
            { "name": "occupation", "type": "String" },
            { "name": "relationship", "type": "String" },
            { "name": "race", "type": "String" },
            { "name": "sex", "type": "String" },
            { "name": "hours-per-week", "type": "Int32" },
            { "name": "native-country", "type": "String" }
        ],
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "InputLinkedServiceAzureDB01",
        "typeProperties": {
            "tableName": "vCensusInfo"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

In this sample, I am actually getting data from a view, not a table. Even though the JSON references a table, using a view instead works fine. The linked service name, of course, refers to the Linked Service created in the previous step.

Fortunately, it is not necessary to list every column when inserting values into a Blob, so the JSON for that is much easier. While it is possible to omit the file name, if you ever wish to use the data in the blob store later, you will need to specify one. This JSON will write a file inputdata.csv to the blob store under the path mlinput01. If this is run twice, the file will be overwritten the second time.

{
    "name": "InputDataSetBlob",
    "properties": {
        "published": false,
        "description": "Input Blob Dataset to feed Azure ML",
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobStorageLinkedService",
        "typeProperties": {
            "fileName": "inputdata.csv",
            "folderPath": "mlinput01/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": false,
        "policy": {}
    }
}

JSON for a Data Factory Pipeline

The last piece of JSON required provides the action, in our example copying data from a database to an Azure Blob store. Click on …More and select New pipeline.

The JSON template looks like this:

{
    "name": "PipelineTemplate",
    "properties": {
        "description": "<Enter the pipeline description here>",
        "activities": [],
        "start": "<The start date-time of the duration in which data processing will occur or the data slices will be processed. Example: 2014-05-01T00:00:00Z>",
        "end": "<The end date-time of the duration in which data processing will occur or the data slices will be processed. Example: 2014-05-05T00:00:00Z>"
    }
}

That JSON looks really simple. Unfortunately, for this to work, a lot more JSON is required, especially under the activities section.


{
    "name": "PipelineCopy01",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "blobWriterAddHeader": true,
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "InputDataSet"
                    }
                ],
                "outputs": [
                    {
                        "name": "InputDataSetBlob"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval"
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "Copy Activity"
            }
        ],
        "start": "2016-08-07T00:01:00Z",
        "end": "2016-08-08T00:01:00Z",
        "isPaused": false,
        "hubName": "GingerDataFactoryTest_hub",
        "pipelineMode": "Scheduled"
    }
}

Once the JSON has all of the curly braces and commas it needs, it can be deployed without a red circle. Scroll back over to the right to get to the Action menu again and select Diagram. The diagram should appear as shown below.
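As a side note, if you only want a subset of the rows copied, the SqlSource section of the copy activity can also take a sqlReaderQuery instead of reading the entire table or view. The query below is only a sketch against the vCensusInfo view used in this example; adjust it for your own data.

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT * FROM vCensusInfo WHERE [hours-per-week] > 20"
}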

Ad-Hoc Running Data Factory Pipelines

This job is scheduled to run on the hour, but chances are you want to run it as soon as the code is done rather than waiting. To run the pipeline, underneath the Contents section, click on Datasets. A list of datasets will appear; select the one for the Azure Blob storage output, which in my example is called InputDataSetBlob. Select a value under slices and another window will appear on the right. Click on one of the items in that window, and another screen will appear on the right; in the top left corner of that window, click on Run. Ideally, if your spelling and capitalization all work out, after some duration the status should say Succeeded. To validate that the transfer actually took place, go to your blob storage account and you should see the file created. You can even download the file to make sure it actually has data in it. If you don’t change the name and run it again, the data will be overwritten and no error will occur, so feel free to run this as much as you like.
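Another option, if you would rather not drive the run from the slice window, is to narrow the pipeline’s start and end properties to a single hour in the recent past; when the pipeline is deployed, only that one slice is scheduled and it runs right away. The timestamps below are just an example of a one-hour window.

"start": "2016-08-07T00:01:00Z",
"end": "2016-08-07T01:01:00Z",
"isPaused": false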

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Getting Started

For those unfamiliar with Azure Data Factory, it is the application in Azure which moves data and calls processes to use the data. In a lot of ways this makes Data Factory much like SSIS. In addition to moving data around, Data Factory also contains all of the monitoring tools found in a SQL Server Agent job. If you add those things together, you get a pretty good picture of what Data Factory does.

Setting up Data Factory in Azure

Data Factory is a resource which lives in a Resource Group in Azure. To add it to the list of available resources, click on the plus button in the Azure portal and search for Data Factory. Select the icon and a window will appear; click on the Create button at the bottom of the screen to enter the setup information. First pick a name for the service, then select the appropriate subscription from the dropdown box of the options you have available. Like everything else in Azure, the service exists within a container; either pick an existing container to use, or create a new one. Generally speaking I will create a new one, so if for any reason later on I want to destroy what I have done, I can just delete the container. Pick the location based on two factors. First, Azure Data Factory is not available everywhere, so you are limited to the locations where it is available; if you pick one where it isn’t available, you will get an error message letting you know why you cannot create the resource. Second, whenever possible within Azure, pick the same location where your data lives. There are charges within Azure if you move data across regions and no charge if you stay in the same one, so you may want to go look at where the data which will be used in Data Factory lives before deciding where to put the resource. I always check the Pin to Dashboard option so that I can find the resource later, but it is not required and can be done later. Click on the Create button to create a Data Factory resource. If you have selected Pin to Dashboard, you will see a little window which says Deploying Data Factory. This little window goes away once the deployment is complete, and you will have an entry in the list of resources for Data Factory.

Data Factory Tiles

Clicking on the Data Factory resource icon will bring up the following window. Initially the only button you will care about is the one on the far left, Author and Deploy. This is where you will create the Data Factory components. The authoring is all done in JSON. It is not the most intuitive language in the world and is very picky about brackets. There is some intellisense, and when you click on a bracket, it will highlight the bracket which it lines up with, which is somewhat helpful. In my experience, JSON can be frustrating, as it is rather picky. I don’t think my experience is unique, because Microsoft has created two tiles which write the JSON components for you, Copy data and Sample pipelines. The Copy data function has a wizard to walk through gathering the information required to write the JSON. I have had mixed success in getting it to work; sometimes it did, and sometimes I got an error. It is in preview, so I imagine over time it will improve. Sample pipelines has a few samples of tasks you might want to complete that can be deployed in your environment. Diagram view is very important, as it opens a window on the right-hand side of the screen showing you a picture of what has been created in JSON. Monitor and Manage brings up an entirely new tab, which is welcomed by those of us who are tired of sliding to the right to see new windows; this screen shows a diagram and what has run. Metrics and Operations opens a window showing what has run.

I hope you found this post handy for creating an Azure Data Factory resource. In a later post I will discuss what you can do with Azure Data Factory and provide more detail around the Contents section of this screen and discuss Datasets, Pipelines and Linked Services. Please subscribe to my blog to be notified of my latest posts.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

On-Premises Data Gateway – the Replacement for Power BI Enterprise and Personal Gateway

With the announcement on July 6, Microsoft has consolidated the gateways used in Power BI. The new On-Premises Data Gateway not only replaces both of those applications, but also replaces the data gateway used to access on-premises data for other applications, such as Azure Machine Learning. This reminds me of Lord of the Rings, as there is now One Gateway to bring them all, and One Gateway to bind them.

Replacing the Old Power BI Gateways with the New On-Premises Data Gateway

When installing the new gateway, the place to look for the new installer is not on some Microsoft download page, but from within Power BI itself. The new gateway needs to be installed from Power BI, rather than from a generic location; as the gateway is providing data for your account, installing it in this fashion will record the gateway used on the account. To download the gateway, log in to the Power BI Service, click on the <<insert icon here>> download menu and select Power BI Gateway. This gateway is truly the one gateway to rule them all, as the installer gives you the ability to install either the On-Premises Data Gateway, which is the direct replacement for the Enterprise Gateway, or the Power BI Personal Gateway. You can install both gateways, but you have to pick one to install first, and both can run on the same device. Once the software is installed, you can convert your existing gateways to the new gateway, so there is no need to enter the configuration twice.

New Gateway, New Security Model

While the Personal Gateway works in a similar fashion as before, using a single user’s credentials, the new gateway uses a different security model than was employed in the past. Instead of using the credentials of the ID which installed the software, it uses a new service account, NT SERVICE\PBIEgwService. This account is for the gateway only; it does not need to be granted any data access. The new application creates a connection to the Azure Service Bus, which acts as the clearing house for all data requests, whether they come from Power BI or other applications. The gateway application is where the credentials for accessing the data are applied. This security model is used not only for Power BI, but for other applications like Azure Machine Learning where there is a need to get data from an on-premises source and load it to the cloud. Previously it was not possible to have the gateway used for Power BI loaded on the same machine as the other data gateway that was used for Azure Machine Learning. Now that they use the same gateway, this restriction is removed.

Functional Differences in the New Gateway and the Previous Gateways

The underlying security model and application are different, but operationally the new gateway does not behave differently from the Power BI Enterprise and Personal Gateways; there is just a new application to do what they did. If you want to use the data access rights of a single specific user to refresh and update data, the Personal Gateway will do that. The Enterprise model provides the ability to inherit the security of the user accessing the data; if you have a tabular model stored locally which is used as a data source for Power BI, users will see the data they were granted access to from within the tabular model. The scheduled update features are not changed and are still accessed from the same area in the Power BI Service. Unfortunately, if you are interested in connecting to Azure Machine Learning or Data Factory, you will still need a different gateway, and that gateway cannot be installed on the same PC as a Power BI gateway. I will have more information on that gateway in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Why Developers Should Not Deploy Their Own Code

Code is a very expensive business asset, and needs to be treated that way. The code needs to be stored in a source control application which is in a secured, well-known location, and the process to release it to production needs to be documented and understood. Like backups, it’s important to ensure that code in source control can be modified and installed before a crisis arises where there is a time crunch to fix a huge production issue. To ensure that the code stored in source control is the expensive, important business asset that is counted on to make the business operate every day, the code needs to be deployed by someone other than the person who wrote it.

Save Money by Validating Code in Source Control

When I first worked in a location which had another team deploy code, I thought it was pointless bureaucracy which did nothing but slow down progress. Watching the problems caused when simple processes went wrong changed my mind. Checking code in and out of source control is a simple process, whether you are using an open source application like Subversion or a full blown TFS Server. If no one checks that the code in source control is the code which is deployed, all sorts of bad things can and do happen. Being the poor slob who came in when everything was a mess, I got stuck figuring out some old code, which was made even worse when I found out that the code in source control was not the code in production, an area I didn’t have access to view. Unfortunately for me, this discovery did not occur until after I’d changed what I thought was the released code. Writing the code twice and/or going on a code hunt for the right version became a necessary part of the process, adding needless hours to an already complicated task. If only the code in production had been deployed from source control, this mess would have been avoided.

Improving Code Quality

All sorts of things can happen when one person writes and deploys. I know someone who worked in the IT department for a large cell phone company. At the time, working there meant free phone service. One of the devs was a heavy user of the free phone service, and so was his large extended family. His job was to maintain the billing code. After several questionable incidents at work, HR got involved and he was perp-walked out of the building. Due to the circumstances surrounding his departure, his cell phone accounts were checked to ensure that, from that point on, he would get a bill. Although his account showed a number of active phones, his balance was always zero. The code in source control was checked, and there was nothing in it which provided a reason why his bill was zero. Upon further investigation, my friend noticed the version number in production did not match the version number in source control. The code in source control was compiled, and a huge balance appeared for the former employee. If someone else had deployed the code in source control, this chicanery would not have been possible.

Code Deployment Needs to be a Well-Understood Process

Today in many companies, the code may exist a lot longer than the employment of the person who wrote it. Given the life of the code, there need to be well established, obvious processes to deploy it. Recently I heard from someone who told me about their SQL Server 2012 SSIS project which used package deployment instead of project deployment because only some of the SSIS packages are deployed to production. The packages are installed in many different locations, and they all exist in one project. This project organization turns a simple one-button deployment task into an involved process requiring copiously maintained documentation to ensure that everyone involved knows what to do and where to deploy which code. Most ETL code runs at night, and often that means a person on call is woken up to fix it. This tired person’s complicated job is compounded when the code deployment moves from a straightforward, one-button deploy process to a byzantine location determined by copious documentation. I can see many potential errors which would all be avoided if the organization was changed from one SSIS project containing everything to projects containing logically grouped packages which are created and deployed via a project to folders in an Integration Services Catalog. If the person who developed this project had to explain and document the process they were using to another person who was doing the deployment, chances are this kind of project organization would be exposed like a Sooky Non-Sparkly Vampire to sunlight, and would be burned to ash.

Ensuring the code is in source control and can be modified and moved to production are important steps in maintaining code. Whether that code is a stored proc or a webservice is not important; securing it is. Having someone other than the developer deploy the code to production ensures that this valuable asset is truly protected and can live on as long as the company needs it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

T-SQL Tuesday #80 – SSIS Projects, Packages, and Deployments

http://michaeljswart.com/2016/06/t-sql-tuesday-079-its-2016/comment-page-1/#comment-186750

Recently I was talking to someone who is just getting started with SSIS and wanted to know about the package deployment model. I hadn’t thought about the package deployment model in a while. TL;DR: Do not use package deployment for SSIS. Why? Because SSIS code deployed in a project model is easier to maintain and migrate. Maintaining a large number of packages can be painful, something I know all too well.

Making Maintenance Difficult One Package at a Time

Prior to SQL Server 2012, there was no project deployment. SSIS code was all deployed as packages. These packages could be stored within MSDB, or they could be stored in and run from the file system. In disorganized places like the one where I worked, they were deployed both ways. Assuming nothing much has changed since I left, they have every version of SQL Server which was released prior to the day the new IT Director started in 2012: DTS on SQL Server 7.0 and 2000, and SSIS on 2005, 2008 and 2008 R2. No reason to upgrade anything which still worked was their motto. When space was a problem, one could always go build another server. I think the LAN administrator was happiest when he could justify building a new server, as he could spend hours shopping for parts on the internet and building the latest machine.

I was given the task of supporting all of the SSIS code, which of course broke periodically. There were 300 packages on a myriad of different servers, all named package1, and deployed every way possible. As a bonus, one could not trust the open source source control application to have the latest code, unless it was a package I had worked on previously. The hunt for the code and the config file used to drive it, and getting access to wherever they were stored, was just part of the maintenance process. If one package called another package, then both packages needed to be found and reviewed. Each of those packages would have different ways of connecting to the same database too. My favorite was when one package called another package and they each used different IDs to access the exact same database. Ah, the joys of troubleshooting SSIS packages. It was on the top five list of reasons I was very happy when I quit that job.

SSIS Projects

If you write SSIS code and use a project deployment model, you can create one data connection for all the packages which need one. The code is deployed to one place, the Integration Services Catalog, and all the related code is deployed to one folder. If you need to change a connection which all the packages use, you can do it in one place. You want to pass some parameters for all of the packages to use? No problem. SSIS project deployment offers some great advantages. I cannot think of any reason to use package deployment for SSIS 2012 and beyond. If you are writing SSIS code, project deployment is how you want to deploy packages.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Views Calling Views – A Bad Idea

Recently I talked to someone about a view calling another view, a practice which is known as nested views. TL;DR: Do not write nested views. Give this person a little credit: he had heard it was a bad idea, but he had a really good reason. There is no such thing. There is also no reason to take my word for it; here are a number of other very good posts which detail why nested views are bad from a performance perspective.

7 sins against TSQL Performance

Nested Views Causing Query Performance Woes

Poor Little Misunderstood Views

Proving Views Can Hurt Performance

T-SQL Anti-Patterns: Nested Views

I understand that on the internet, no matter how cockamamie your idea is, you can find someone or some post which will agree with you. That is why I rely on sources I know and trust rather than the random chatter of places like Stack Overflow.

Non-Performance Reasons Not to Use Nested Views

In addition to the performance reasons, there are other reasons not to use nested views. Supporting and maintaining nested views can be a nightmare. If there is an issue with the accuracy of the data, finding the problem is just that much harder. And what about when one needs to go investigate or fix a problem? Then someone has to test all of the places the view is called. I worked with some code recently where a view was created on a set of data just to create 3 new values based upon 3 case statements. That view was nested many levels below where the field was actually called. I know people from a development background look at nested views as modular development. Don’t. T-SQL is not the same as object oriented code.

Good Use Cases for Creating Views

Views are not always a bad idea. If you are creating an Analysis Services or Power BI project, it is a good idea to create views of the tables you are going to be using, even if there is no difference between the view and the table. Why? That way if the table is changed, your project will still work. Exposing a single view to users who are going to use self-service BI against it can also be a good idea. None of these is a reason to nest views, or to have the view’s source be anything other than a table.

If all you have is a Hammer Everything Looks like a Nail

It takes a while to learn how to write good SQL, and I have seen people rely on a view because doing so takes less time than writing good SQL. Relying on a view, though, is like having one tool in the toolbox; if you don’t have a screwdriver, you may be tempted to use a hammer instead. Too often nested views are written because sometimes writing SQL is hard, and it is easier to just write a query and save it as a view rather than use a CTE or a derived table. Do not succumb to the temptation. There is a lot of code written where someone was in a hurry and was planning on going back and fixing it someday. Someday appears with the same frequency as a lottery win. Make it easier on the person who has to figure out or maintain the code left in your wake. Don’t write nested views. Your replacement will thank you for it.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Creating New SSRS and SSIS Projects for SQL Server 2016

Now that SQL Server 2016 has been released, it is time to start creating new SSIS and SSRS projects for it. Since SQL Server 2014, SSIS development has been done in Visual Studio. The latest version, Visual Studio 2015, has a free Community edition, and can be found here. If you have it installed and try to create a new Reporting or Integration Services project, you will notice that there are no templates listed which will allow you to create one of these projects.

Making SSIS and SSRS Projects for SQL Server 2016

To create SSIS projects in Visual Studio, you will need to click on this link to download the SQL Server Data Tools [SSDT] in the language of your choice. Visual Studio must not be running during the install. After about 5 minutes, when the install completes, you will have a new application installed, SQL Server Data Tools 2015. You will still have the Visual Studio 2015 application as well, providing two methods for creating new packages, which means that you can click on this icon instead of opening up Visual Studio. SSDT also contains the templates for database projects, so you can start using it for those as well.

Creating SSIS and SSRS Packages for Different Versions of SQL Server

In this version of SQL Server Data Tools, Microsoft has finally addressed the common problem of needing to maintain multiple versions of SSIS packages for different server versions. No longer do you need three different applications to maintain code for SQL Server 2012, 2014 and now 2016; all of these versions are supported with SSDT for Visual Studio 2015. SSDT will detect which version the code was last saved in so that you don’t have to worry about accidentally migrating code. You also have the ability to create an SSIS package targeting 2012, 2014 or 2016. To select the version you want, right click on the project and select Properties. Under Configuration Properties->General, as shown in the picture, the TargetServerVersion, which defaults to SQL Server 2016, has a dropdown box making it possible to create a new package in Visual Studio 2015 for whatever version you need to support. Supporting the ability to write for different versions is a great new feature and one which I am really happy is included in SSDT for Visual Studio 2015.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Resolving Errors Running R code on SQL Server

SQL Server 2016 includes the ability not only to run R code from within SQL Server Management Studio, but also to use an R client to run code which executes on SQL Server, using SQL Server’s memory instead of the client’s. To make this work, the following must be loaded on your PC.

  • Open source R tools
  • Microsoft R Open
  • R Client
  • R Studio or Visual Studio 2015 (Pick one, I’m using Visual Studio)

For those people who have read most of the documentation out there on setting up R on your PC, you will notice this is a longer list than usual. There is a difference between just running R and running R on SQL Server. Why? Because R Server is not open source R but an enhanced version of R containing features which are not found in the open source version, including the ability to run R code on the SQL Server from within the R UI, which is R Studio or Visual Studio 2015.

SQL Server needs R Client 8.0.3

I was working with SQL Server 2016 in two different environments, so I got two different errors. Running SQL Server 2016 Enterprise Edition on a server, I got the error [Microsoft][ODBC Driver Manager] Function sequence error. On my laptop, I received the error shown below.

[Screenshot: RInteractiveError]

If you look at the code from the interactive window, you will notice that the error occurred when trying to run rxSummary. In both cases I didn’t get the error when I changed the compute context to SQL Server from local, but when I tried to run a function which runs on the server. In both cases the R tools were installed prior to installing SQL Server 2016. The open source R tools install to C:\Program Files\R\R-3.3.0 (your version number may be higher). Microsoft R Open installs to C:\Program Files\Microsoft\MRO\R-3.2.5. To use the RevoScaleR libraries included in R Server, the version of Microsoft R required is Microsoft RRE, which is installed here: C:\Program Files\Microsoft\MRO-for-RRE\8.0. Unfortunately, SQL Server 2016 shipped with version 8.0.3, not 8.0.0. If you are getting data and using a local compute context, you will have no problems. However, when you want to change your compute context to run on SQL Server, you will get an error.

While I received a different error on the server than on my laptop, the reason for both messages was the same: neither computer was running version 8.0.3 of the R client tools. On the server I was able to fix the error without downloading a thing. After installing a stand-alone version of R Server from the SQL Server Installation Center, the error went away and I got results when trying to run rxSummary. Unfortunately, it was not possible for me to run R Server on my laptop, as R Server is disabled from within the Installation Center. I believe that is because I have SQL Server 2016 Developer Edition on a laptop, not on a server. I needed to do something else to make it work.

Problems with Installing R Client Tools

On June 6th, Microsoft released the R Client tools, which install version 8.0.3 on the client so it will be compatible with SQL Server. Here’s the link. This is where it got tricky. In order to get the tools, you need to have an ID for Visual Studio. No problem, I have two Visual Studio accounts, a work one and a non-work one. I was already logged in with my work account, so I just clicked the link, and got this screen.

[Screenshot: ScreenPrintNoTools]

No downloads for me?! What does that mean? Well, it means it is broken. I could not get the client tools, so I could not resolve my problem. I wondered if this issue was unique to me, so I asked someone else I work with who has a Visual Studio account to click on the link and try to install the tools from his account. That didn’t work either. I emailed Microsoft, and I got an answer on a Saturday morning, which frankly shocked me. They told me that the link was working for them. At that point I read the screen more carefully: “To continue Please join Visual Studio Dev Essentials…”. That sounded like it could be a permissions issue on my account. Fortunately, I have two accounts, a work one and a personal one. I logged out of my work account and logged into my personal account. This is a picture of what the same page looked like while logged into the other account.

[Screenshot: WorkingRClientTooks]

I have contacted Microsoft about this error, and they are looking into it. What I thought was interesting is that this update, instead of being freely available, is account dependent. If you don’t have an account, or as in my case the account isn’t working correctly, the ability to use R on SQL Server is unavailable. While I understand that SQL Server 2016 is a brand new release, it is supposed to be ready to use, yet it isn’t unless you have the R Client tools, which you may or may not be able to download depending upon your Visual Studio account.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

T-SQL Tuesday #79 – Creating R Code to run on SQL Server 2016

As SQL Server 2016 was recently released, many people have not yet used R with SQL Server. I thought that T-SQL Tuesday would be a great way to introduce this topic. This post contains everything you need to run your first R program from the UI, get data from SQL Server, and run the R code on SQL Server from the UI. If you are running open source R, this code will not work. If you are using Microsoft R Open, this code will not work. Only if you are running the version of R which Microsoft released with SQL Server 2016 will this code work.

The Two Versions of Microsoft R

Microsoft has not one version of R, but two, and these two different versions exist because they have two different purposes in mind. Microsoft R Open is open source, fully R compatible, and faster than open source R because a number of the algorithms were rewritten to include multi-threaded math libraries. If you want to run R code on SQL Server, this is not the version you want to use. You want to use the non-open source version designed to run on R Server, which is included with SQL Server 2016, Microsoft RRE. This version will run R code not only in memory but will also swap to disk, can access SQL Server data without needing to create a file, and can run code on the server from the client. The version of RRE included in SQL Server 2016 is 8.0.3.

Running R on SQL Server

As a handy mnemonic device, all the RRE functions start with Rx, like prescription drugs. None of these features will work in R, unless you are using the Microsoft RRE Open version. For more information on how to set up Visual Studio 2015 to use the correct libraries, please read my previous post for instructions.

SQL Server R Code Walk-through

This code was created on a PC with SQL Server 2016 Developer Edition installed with the R tools, and the Community Edition of Visual Studio 2015. On my SQL Server instance, I have created a database called TestR and loaded the sample file AirlineDemoSmall.csv, which is included with R Server, into a table with the same name. If you have SQL Server 2016 installed, the directory for the sample files can be found here:

C:\Program Files\Microsoft SQL Server\130\R_SERVER\library\RevoScaleR\SampleData

The table dbo.AirlineDemoSmall has 600,000 rows. Prior to running this code, create a table on SQL Server to hold the data. Using some Rx commands, the code will connect to SQL Server, load the data, run on the R Server, and draw a histogram.

# Load the RevoScaleR library which provides the rx* functions
library(RevoScaleR)

# Connection string and sample table on the SQL Server 2016 instance
sqlConnString <- "Driver=SQL Server;Server=MyLaptop\\SQLSERVER2016;Database=TestR;Uid=ReadData;Pwd=readd@t@"
sqlsampleTable <- "AirlineDemoSmall"

# Set ComputeContext: share directory for serialized objects plus blocking and output options
sqlShareDir <- paste("C:\\Ginger\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
serverside <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
    wait = sqlWait, consoleOutput = sqlConsoleOutput)

# Run the rx* commands which follow on the SQL Server, not on the local PC
rxSetComputeContext(serverside)

# Define the SQL Server data source, inspect a few rows, and draw a histogram
sqlPlaneDS <- RxSqlServerData(connectionString = sqlConnString, verbose = 1, table = sqlsampleTable)
rxGetInfo(data = sqlPlaneDS, getVarInfo = TRUE, numRows = 3)
rxHistogram( ~ CRSDepTime, data = sqlPlaneDS)

Detailed Description of the R Code

To better understand each line of code, I have provided a description for each one, along with some tips to resolve possible errors.

library("RevoScaleR")

If you get an error running this line, chances are the R compiler doesn’t know where to find the library. Maybe you need to install it. If so run this command in the interactive window

install.packages('RevoScaleR')

If this command gives you an error, R can’t find where the library is. Resolve this issue by adding the path; run this command in the immediate window. Notice the slashes go the opposite way from how File Explorer puts them.

.libPaths(c(.libPaths(),"C:/Program Files/Microsoft SQL Server/130/R_SERVER/library"))

After setting the path, run the previous command to resolve the package, and then run the first line again, as this should resolve any previous errors.

sqlConnString <- "Driver=SQL Server; Server=MyLaptop\\SQLSERVER2016;Database=TestR;Uid=ReadData;Pwd=readd@t@"

This line sets the value of the connection string. I am running SQL Server 2016 on my laptop, in an instance called SQLServer2016. Notice I had to put two slashes going the “wrong” way to set my connection. I have hard coded a user id and password in plain text. For anything other than a quick test, I would use Windows authentication, which does require an ODBC connection, so that I would not have to put the user id and password in the code in plain text.

sqlsampleTable <- "AirlineDemoSmall"

This line of code sets a variable to the name of the table created in SQL Server with the data from the csv file.

sqlShareDir <- paste("C:\\Ginger\\AllShare\\", Sys.getenv("USERNAME"), sep = "")

R needs a temporary directory to serialize the R objects when the connection is created, which I am creating here.

sqlWait <- TRUE

Setting the state to wait means that I am creating a blocking transaction which will prevent the later code from being run until this statement is complete. This is a good setting for testing, and also if you have other commands which cannot be run until you have data, such as rxHistogram, which requires the dataset to be ready.

sqlConsoleOutput <- FALSE

Setting the console output to false decreases the amount of informational messages I get in the immediate window. Since the messages aren’t really that helpful as they show things like how many records were read at the time, I generally set it to false.

serverside <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)

This line uses the Revo R function RxInSqlServer (remember, unlike SQL, case is important) to create a connection to SQL Server, using the variables created earlier, and assigns it to a variable called serverside.

rxSetComputeContext(serverside)

Setting the compute context dictates where my code is going to run. If the compute context is set to local, the code runs on my local PC. Since I set it to the variable holding my SQL Server connection, all of my R code will use the available memory on the SQL Server machine, not mine. Yes, this does mean that I can starve out the resources on the server, a topic I will address at a later time. Since I am running everything on my laptop it doesn’t matter, but it could.

sqlPlaneDS <- RxSqlServerData(connectionString = sqlConnString, verbose = 1,
table = sqlsampleTable )

This line gets the data from SQL Server, using the connection string, and specifies what data to get. I could have used a query to get data as well, but in this case I grabbed everything from the table.

rxGetInfo(data = sqlPlaneDS, getVarInfo = TRUE, numRows = 3)

To validate that some data was retrieved, rxGetInfo shows the information retrieved from three rows. Why three rows? Because numRows = 3.

rxHistogram( ~ CRSDepTime, data = sqlPlaneDS)

One of the big strengths of R is the ability to create data visualizations, so I felt compelled to include the command which creates a histogram. The ~ (tilde) is in front of the column name CRSDepTime from the table AirlineDemoSmall, and the data comes from the variable sqlPlaneDS, where all of the data was loaded.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Asking for Help

When I was a kid, I liked to climb trees, and there was a time or two when I climbed up pretty high and then got too scared to come down. The way I came up looked more dangerous when I was trying to come down than it did going up. I panicked, said I could never come down, and my sister went and got my mom, who talked me out of the tree. This blog is proof that I was wrong: with help, I came down. With the clarity that often comes with youth, my sister later told me that I was being stupid. If I had just tried harder and not panicked, I could have come down by myself. While I didn’t appreciate her directness at the time, she was right. I could have helped myself, and probably should have, that time. But there are also times when I should have asked for help but didn’t feel comfortable asking, so I wasted a lot of time trying to figure out things that a phone call would have cleared up in an instant. I like to think that I have gotten better at knowing when to ask and when to figure it out on my own. There is a wide body of knowledge available via search engines to answer a ton of questions. I am also very fortunate to know people who, when I have asked for help, have literally forgone sleep to help me out. These resources have been invaluable when I have been stuck in a virtual tree with a problem I don’t know how to solve.

The Lonely Leading Edge of Technology

Recently there have been a number of new releases of software. Whenever this happens, the number of answers to be found is sparse because people haven’t had a chance to accumulate a large body of knowledge. One reason the internet is such a great place to find answers is other people ask the same questions I have and have posted the questions and answers, either on forums or blog posts. I know I have written a few blog posts after finding the answers to questions I had. I am happy to share what I know, as a way of paying back for all of the help I have received. When software is released, chances are the answers are very difficult or nearly impossible to find. There are few people to ask and the internet comes back empty. This is a problem we all can fix, starting with me.

Call for Answers

Recently I have been working with some new features of SQL Server 2016 and have had questions for which blogs, TechNet and Stack Overflow provided no answers on the internet. Fortunately, I have found people to help me resolve them. If you go searching for the same errors I had, you will find answers now, as I have posted them. If you have had a problem unique to the latest release of SQL Server, I hope you will take the time to post the question, and the answer if you have it. I’m going to try to be better at answering forum questions, especially now that I have learned a few interesting factoids. I am looking forward to the fact that next time I go looking for an answer, thanks to all of us who have done the same, we will all be able to help each other out. The next person who finds themselves in the same jam will thank you for talking them out of the tree.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur