Using Visual Studio to develop R for SQL Server 2016

Now that Microsoft has released SQL Server 2016 on June 1, a lot of people are starting to investigate how to write R code that runs in SQL Server rather than on their local machines. People with a background in R will naturally gravitate to RStudio, the open source UI that has been around for years, but there may be a reason to switch. Visual Studio 2015 Community is a free IDE that can also be used to write R code, and it is definitely worth investigating.

Which R tool should I use: R Studio or Visual Studio?

For those people who haven't yet decided which tool to use, let me offer two compelling reasons to pick Visual Studio [VS] instead of RStudio: IntelliSense and improved debugging tools. RStudio does not have IntelliSense, and it does not let you step through your code in a debugger in the way many VS developers are already quite familiar with. You will need to configure VS to use R tools; the steps are detailed below.

Configuring Visual Studio to Run R

Only Visual Studio 2015 can be configured to use R, and you must be running a 64-bit operating system to load R Tools. If you have a different version of VS, download it here. The next step is to download VS R Tools, and lastly download Microsoft R Open. There are two versions of Microsoft R Open: one for R Server 2016, which is the one you want if you plan to integrate R with SQL Server 2016, and the standard version of Microsoft R Open, which does not include any of the R Server features. If you like, you can use either version of Microsoft R Open in RStudio as well. The standard version is only available for 64-bit platforms, but does include versions for Windows and various flavors of Linux, including Red Hat, SUSE, and Ubuntu. The R Open for Microsoft R Server 2016 can be found here.

***Update*** On June 6, 2016, Microsoft released a new tool called R Client. Installing the version of R found in R Client (8.0.3) is required to match the version of R released with SQL Server 2016. You will need to log into Visual Studio to be able to access the R Client download link.

After the tools have been installed, they appear in VS under R Tools, as shown on my screen below. The VS environment looks no different, with the exception of the new menu item for R Tools. This really isn’t an IDE set up for writing R, yet. Time to fix that.

Visual Studio R Tools

Click on R Tools -> Data Science Settings and the layout changes from the standard VS screen shown above to an environment tailored to writing R code, with the specific panes used when writing R, such as R Interactive and R Plot. If you want to move these windows around, or close the start page, feel free to arrange the windows in VS in the same manner as you would for any other development task or language.

If you have multiple R versions loaded, or you just want to see how it works, go to R Tools -> Options and look at the R Engine entry. This could be pointing to C:\Program Files\R\R-3.3.0 for the open source version of R, or C:\Program Files\Microsoft\MRO\r-3.2.4 for Microsoft R Open. For R with SQL Server 2016, after installing the R Client, the R Engine needs to point to C:\Program Files\Microsoft SQL Server\130\R_SERVER, assuming you have the Developer edition of SQL Server 2016 loaded on your PC. If you change this entry, you will need to restart VS.

Location for R Tools in SQL Server 2016

After you click OK, it is a good idea to check the IntelliSense settings for R, which can be done by going to R Tools -> Editor Options -> Advanced.

Running R in SQL Server 2016

Now that I am using Microsoft's version of R, I can use the libraries which allow the code to run on the server, which is what the R code below does. My instance is named MYSERVER\SQLSERVER2016. Notice that I need to use two backslashes in my code in order to connect to the SQLSERVER2016 named instance. To connect, you can use either a SQL login or integrated Windows authentication. For this example I am using a SQL Server ID to access the data, and yes, I do need to put the password in readable text within my code if I use that option. For Windows authentication, an ODBC connection would be needed instead. The user also needs rights granted within SQL Server in order to run R code from inside SQL Server. The command rxSetComputeContext(runonServer) changes where the code runs from my local machine to SQL Server 2016.

library(RevoScaleR)

# Define the SQL connection string
connStr <- "Driver=SQL Server;Server=MYSERVER\\SQLSERVER2016;Database=Review;Uid=ReadData;Pwd=P@$$word"

# Set ComputeContext.
sqlShareDir <- paste("C:\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
runonServer <-  RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(runonServer)
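With the compute context pointed at the server, subsequent RevoScaleR calls run there rather than locally. As a minimal sketch (dbo.SalesData is a hypothetical table, used purely for illustration), you could point an RxSqlServerData source at a table and summarize it:

# dbo.SalesData is a hypothetical table, used only to illustrate the pattern
salesData <- RxSqlServerData(connectionString = connStr, table = "dbo.SalesData")
rxSummary(formula = ~., data = salesData)   # executes on the server because of rxSetComputeContext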

As this post hardly scratches the surface of running R code on SQL Server, I intend to cover more in greater detail in a later post. Please subscribe to my blog to be notified when that post, covering the coding techniques unique to running R in SQL Server 2016, is available.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Using R in SQL Server 2016

As SQL Server 2016 will be released soon, and to follow up on the talk I gave at SQL Saturday Atlanta, this post will guide you through the steps needed to make R work. Like many features of SQL Server 2016, R is optionally installed. To install R, make sure the option for R Services (In-Database) is checked during setup, as shown below.

SQL Server 2016 Install R

Microsoft is incorporating the version of R sold by Revolution Analytics, which it calls R Server. This version, while fully compatible with open source R, has additional features which allow R code to run not only in memory, but also to use a chunking technology that swaps data to disk so that the R code will not run out of memory. The commands which use this additional functionality all start with rx and are part of the proprietary ScaleR feature set; a small sketch of how they are used appears below. To use R Server as a separate standalone product, instead of selecting the R Services (In-Database) option, select the R Server Standalone shared feature. A standalone R Server could be useful if you want to perform large scale data analysis on a Hadoop cluster, or on another non-SQL data source such as Teradata.
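As a rough sketch of the rx functions (the file paths and column names here are made up for illustration), data can be imported into the .xdf format in chunks and then modeled without ever loading the whole file into memory:

library(RevoScaleR)
# Hypothetical paths and columns, for illustration only
rxImport(inData = "C:\\Data\\LargeSales.csv", outFile = "C:\\Data\\LargeSales.xdf",
         rowsPerRead = 500000, overwrite = TRUE)   # reads and writes the data a chunk at a time
model <- rxLinMod(SalesAmount ~ Quantity, data = "C:\\Data\\LargeSales.xdf")
summary(model)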

SQL Server Steps to Enable R

If SQL Server 2016 is installed, especially if it was installed by someone else, you may be wondering whether the R service really is installed. Take a look at the services running on the machine with SQL Server. If the SQL Server Launchpad service is running, as shown below, the R services are installed and running.

Services Running R

The last thing needed to run R is to configure and then restart the SQL Server service. In a new query window, run the following command:
sp_configure 'external scripts enabled', 1
GO
Reconfigure
GO

After restarting the SQL Server service, SQL Server will run R code. The following script can be run on any SQL Server 2016 instance to check whether R has been configured correctly:

EXEC sp_execute_external_script @language = N'R',
@script = N'OutputDataSet <- InputDataSet',
@input_data_1 = N'SELECT 1 AS CheckToSeeIfRIsWorking'
WITH RESULT SETS (([CheckToSeeIfRIsWorking] int NOT NULL));
GO

The code is executed as an external script, specifying that the language used is R. @script contains the R code, which in this case simply returns the InputDataSet as the OutputDataSet without changing it. @input_data_1 contains the query which supplies the data to be processed. The R code could of course be more complicated, but this example is generic enough for anyone to confirm that R is set up properly on SQL Server 2016; a slightly more involved sketch follows.
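As a hedged sketch of something a little more interesting (the column name SomeValue and the literal values are invented for illustration), the R script can calculate a value, such as a mean, and return it:

-- The R script computes the mean of the input column and returns it as a one-row result
EXEC sp_execute_external_script @language = N'R',
@script = N'OutputDataSet <- data.frame(AverageValue = mean(InputDataSet$SomeValue))',
@input_data_1 = N'SELECT CAST(Value AS float) AS SomeValue FROM (VALUES (2),(4),(6)) AS v(Value)'
WITH RESULT SETS (([AverageValue] float NOT NULL));
GO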

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Time to Use Custom Visualizations in Power BI

After my previous blog post, and subsequent complaints to several people on the Power BI product team, I am really happy to be writing a blog post encouraging everyone to use custom visuals. Power BI no longer has the horrible error messages for custom visuals which prevented me from using them in the past. Look at the fish with no error messages!

Warnings For Custom Visualizations Go Away

Since Power BI custom visualizations are not provided by Microsoft, they feel compelled to give you a warning message letting users know this. Here is the message box you get in Power BI Desktop when using a custom visualization. Notice that I clicked on the check box next to the text Don't show this dialog again. Since words mean things, checking this box means the warning message never appears again. When you import the visualization into Power BI, there are no warning messages. Now I can use and propose custom visualizations to clients because they really are neat, and they no longer carry warnings. Thanks so much to the Power BI product team for fixing this major issue.

This visualization update makes me somewhat sad that I am not talking about Power BI at SQL Saturday Atlanta, but I am really excited to be talking about R and connecting with many people in the SQL Server community. If you are in Atlanta, I hope to see you tomorrow.


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur


3 Best Practice Steps for Implementing a Power BI Project

Power BI is a powerful tool with a rapidly evolving set of features. While it's great to have more features, it is hard to keep up with the best ways to use them all, and some of the information out there regarding best practices has changed with the new releases. If you are planning on implementing a Power BI project where you work, following these steps will minimize implementation and development issues while saving time. The steps are listed in the order needed for creating a new Power BI project, but if you have a Power BI implementation already, incorporating these steps will provide a more scalable, easier to maintain system over time.

Step 1 – Create a Group Workspace

One person should never own a report. One person cannot be responsible for the reports 100% of the time, as they may want to go on vacation or do something other than modify a report right now. The reports should belong to a group of people or accounts within the organization, which gives the reports the appropriate status of organizational assets, not personal items. If the group owns the workspace, then the progress of work on any report can be monitored, and if one person gets busy another person has the ability to help out. When creating a workspace, a group OneDrive is also created. The pbix files used by the group should be stored in the group OneDrive, which will allow others to edit the report over time. I have seen it take up to 30 minutes for the workspace to be created, which means this step needs to be done in advance of when it is needed.

Using workspaces with Power BI does mean that you will need an Office 365 Exchange license for each Power BI user, because workspaces create groups, which are technically an Exchange feature, not a Power BI feature. This does not mean that you need to do anything with your current Exchange servers; you just need the license so that you will have the ability to use the group OneDrive. If you do not have an Exchange license, the group drive will not be accessible to all users. Updates will then consist of replacing the reports and datasets, which can adversely impact the dashboards relying upon them.

Step 2 – Select and Implement a Gateway

Reports are based on data which needs to be updated to reflect what is current. Power BI offers two methods of updating the data: the Personal Gateway and the Enterprise Gateway. If you select an Enterprise Gateway, you may not need to create a data model, as you may rely upon one already created in your organization, which is why this step must be completed at the beginning of the project.

If a data model is going to be created for the data Power BI requires, the type of security used to access the data needs to be determined. While both the Enterprise Gateway and the Personal Gateway can be used to handle data mashups, meaning for example you can combine an Excel file with data from various SQL Server databases, the difference is in the security model. An Enterprise Gateway needs to be set up by someone who understands the underlying database security model, as it passes the user's credentials to the source, inheriting that individual's existing data security. A Personal Gateway uses the security of the person who creates it. A Personal Gateway is generally created based on a data model from a report.

Step 3 – Create a Content Pack

Content packs can be used to ensure that the data used is coming from the gateway set up in the previous step, and that any other reporting standards which you wish to implement are also included. Using Power BI Desktop, create a simple working report using the data refreshed through the gateway created in the previous step. This report may not be anything that you want to release to users, but rather a report which you can use to test the gateway to ensure that all of the data connection and refresh issues are resolved prior to spending a lot of time working on a series of complex reports. Load this file to the Power BI service and use it to create a content pack. By providing a working report which everyone in the workspace can use as a guide for how to start a report, many issues regarding how to get access to data, and multiple sources for the same data, can be eliminated. If you have multiple gateways, each should have a content pack. Demonstrate how to use the content pack to the people in the workgroup, and then everyone can create reports where the data will be refreshed.

Follow these Steps, Eliminate Problems

These best practice steps can eliminate many problems which arise when people start working on Power BI visualizations. Going back to my previous post, following these steps addresses the important components of Security and Business Continuity first, before any reports are even created. Addressing these components first can prevent many issues which tend to happen when different models are employed.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

The Five Key Components for a Successful Data Analysis Project

As a consultant, I have been a part of some very successful implementations of data analysis projects. I have also been brought in to fix implementations which didn't go so well. Having been on a number of different projects, I have seen certain common components emerge. When these things are addressed early in the project, things go well. These items really cannot be put off to the end or ignored completely. These items are not related to the software used in data analysis; no matter what tool is selected, the solution will be incomplete if the following five areas are not addressed.

5 Components Every Project Should Include

Security

Business Continuity

Reliability

Distribution

Management Direction

Each of these items is an important component of a successful data analytics management practice. Problems with any of them can turn a successful project into a failed one.

Security

Security is an obvious consideration which needs to be addressed up front. Data is a very valuable commodity, the new gold, and only people with appropriate access should be allowed to see it. What steps are going to be employed to ensure that happens? How much administration is going to be required to implement it? These questions need to be answered up front.

Business Continuity

Business Continuity means the solution cannot rest on the shoulders of one person, as that is a risky situation. Everyone needs a break to go on vacation or simply not work, so there needs to be a backup who is skilled and able to understand the system and run it alone. This can be a really big problem, especially for smaller organizations which have relied on only one person. I have been brought in to assist companies who until very recently thought that they had a very successful data analytics platform. The problem with it was that there was only one person who had the skill to maintain it, and that person quit.

Business Continuity can be a specific problem for Power BI users, as often one user owns a report. Reports for an organization should never be owned by one person. All companies using Power BI should keep the reports in a series of group workspaces, not belonging to any single person. That way, if the person who wrote a report quits and their account is deleted, the reports are not deleted along with it.

Reliability

Reliability is critical, because who cares about a data analysis project if no one believes the data is correct? In addition to accuracy, the system used needs to be reliable, with data updated on a scheduled basis. How and when is the data going to be updated? How is that schedule communicated? The answers to these questions need to be addressed at the beginning of the project.

I remember working for one client whose visualization, which we created, showed over a 100 million dollar loss in a single month. I asked if the data was correct, as that was a huge one-month loss. I was assured that the data was not correct, but no one knew how to resolve the data issue. The reporting tool, whatever it happens to be, is not the place where data is fixed; it should reflect the contents of the source data. Where this rule is not followed, the reports are ignored, as the data is suspect and no one knows why it should be believed when it doesn't match the source system. How is the source system data going to be fixed? This is often a management issue, as people need to be appropriately incentivized to fix things.

Management Direction

All data analysis needs management direction to set priorities. As there are only so many hours in a day, the important items need to be identified so that they can be addressed. What is important? Everything cannot be a number one priority, as that means nothing is. In many data analytics projects, someone wants a dashboard. Great idea. What numbers show whether or not the company is currently successful? In most companies where I am helping create a data analysis project, the answer to what the Key Performance Indicators [KPIs] are is that no one has made up their mind yet. Management needs to provide the direction for the KPIs.

Distribution

How are people going to get their hands on whatever it is that you just created? What are people looking for? Reports in their email? Visualizations on their phones? What do people want? Only if you ask the question do you know whether you are providing the data in a way people want to consume it. In order to pick the most appropriate tool, or design visualizations people are actually going to use, these questions need to be asked up front. Recently I worked for a client who had selected Tableau as their reporting solution, but they were getting rid of it. Why? The users wanted to do ad-hoc analysis in Excel, so they were using Tableau not to visualize their data or do ad-hoc analysis, but to select data for Excel pivot tables. A lot of money and time would have been saved if the question of how the users wanted to use the data had been asked up front.

Hopefully all of your data analysis projects include these components. In today's environment, data is the new gold. This valuable commodity needs a system which is reliable, secure, and important to management, and which can be distributed to continually provide value to the organization.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

SSIS: What to do when AcquireConnection Fails

The deadline required X number of SSIS packages to be completed each day, a deadline which was tight but doable, as long as there were no technical surprises. The packages were written in SSDT for SQL Server 2012, and most of them were connecting to other SQL Server 2012 databases with no problem. The next package contained an OLE DB connection to a SQL Server 2008 R2 database with a SQL Server defined ID and password. I made the connection, which tested successfully, opened up the data task, and was able to preview the data. Everything looked fine. I right clicked on the package to execute it, and it went along fine until it reached the data task to get data from SQL Server 2008 R2. Surprise! The dreaded red X appeared. Failure.

AcquireConnection method call to the connection manager failed with error code 0xC0202009/0xC020801C

Try as I might, the only thing I was able to do after an hour was periodically change the error code from 0xC0202009 to 0xC020801C. Nothing I did worked. I created a new connection, created a new OLE DB source, changed Run64BitRuntime to False in the Debugging section of the project's Configuration Properties, and set the Data Flow task's DelayValidation from False to True. None of these suggestions, gathered from various websites, worked at all. I thought about changing the SSIS service ID's execution properties, but since I was running in debug mode I determined that this would not make any difference, so I abandoned that idea. Nothing worked. The only thing I was able to do was change the error code, not eliminate it. I could log into SQL Server with the same ID and password in my package and run the simple query in the data flow task and return data. I could preview the data; what I couldn't do was execute the SSIS package. Out of desperation I rebooted, which also did nothing.

EncryptSensitiveWithUserKey fixes error code 0xC0202009/0xC020801C

While I was thinking about what was left to try, I was wondering why I was getting a failure to connect message, and that got me thinking about passwords. Because the project, and of course the package, were set to DontSaveSensitive, every time I opened the connection I had to enter the password again. News flash: DontSaveSensitive means what it says. It doesn't save the password, so when you run the package, you have no password and therefore cannot acquire a connection. The fix was to change the package ProtectionLevel to EncryptSensitiveWithUserKey. Once I did that, the magical green check appeared.

Hopefully if you ever get this message you will find this post quickly and not waste the time I did figuring this out.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur


Making Linear Regression Understandable for Machine Learning

When starting with Azure Machine Learning, it is sort of hard to wrap one's brain around what kind of insight machine learning can provide. When doing data analysis, often we are looking for patterns. Does the volume of data really go up at the end of the month, or is it just the additional processes that make it seem that way? Does anyone really know if sales pick up in August, or is that just legerdemain from the sales department? Linear regression can help determine that.

Relationships between Different Items

There are two types of indicators for linear correlation, positive and negative, as shown on the following charts. The Y axis represents grades, and the X axis is changed to show the positive or negative correlation of the amount of X on grades. When X is the number of study hours, there is a positive correlation and the line goes up. When X is changed to hours spent watching cat videos, there is a negative correlation. If you can't draw a line around the points, there is no correlation. If I were to create a graph where X indicated the quantity of bags of Cheese Doodles consumed, it would not be possible to draw a straight line that the data points cluster around. Since this is Line-ar regression, if that line doesn't exist there is no correlation. Knowing there is no correlation is also useful.


Calculating Linear Regression

While the variable relationship is really easy to see without math, there is an underlying formula that describes linear regression, and lest all of the math majors get upset, I thought I would include it (the standard least-squares estimates for a and b are shown after the definitions):

Yi = a + bXi + ei

Y – is the value of the Y axis, which in our example is grades

a – is the intercept, the point where the line crosses the Y axis. Ideally your data points would sit right on the line, but since the line is only a guide, the individual points won't match it exactly.

b – is the slope of the line

X – is the value of the X axis, which depending on the example is study hours, hours of cat videos watched, or bags of Cheese Doodles consumed

e – is the error term, the part of each Y value the line does not explain
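For reference, here are the usual least-squares estimates for the slope and intercept, written in standard textbook notation (this is general math, not anything specific to Azure ML):

\hat{b} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X}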

Machine Learning with Linear Regression

In the blog examples, there are only two values: grades and something else. Machine learning can take all of your input variables and determine which values, if any, impact the result; a small R sketch of the same idea appears below. Hopefully this information provides you with a good use case for machine learning. In case you were unaware, Azure ML is available for free. All you need to do is sign up for an account at https://studio.azureml.net . There are a few size limitations as far as how much data you can load, but you can load enough to determine if machine learning will work in your environment.
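As a rough sketch of that idea in plain R (studentData and its columns are hypothetical, used only for illustration), a model can be fit with several candidate predictors to see which ones appear to matter:

# studentData is a hypothetical data frame with a Grade column and several candidate predictors
fit <- lm(Grade ~ ., data = studentData)
summary(fit)   # the coefficient table suggests which inputs have a real relationship with Grade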


Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Platform MVP

I am very excited to be able to announce that Microsoft has made me a Data Platform MVP. This is a big thrill. The right words escape me, so I will have to make do with these.

If I Only had a Brain

I love this song from the Wizard of Oz. Unfortunately, the scarecrow never gets a brain; instead he gets an honorary degree. I wish having an MVP award would make me smarter, but unfortunately, it does not do that. Frankly it means I am in very intimidating mental company, as when I wrote this there were only 370 Data Platform MVPs. Most likely I need to learn a lot more, and maybe write a book, so I can keep up.


One thing I do try to do is share what I know by blogging and speaking, if for no other reason than I don't want to be a hypocrite. When I was learning SSIS, the person leading the project was tuning SSIS and he would not show me how. He obfuscated, and made SSIS tuning out to be wizardry. I thought to myself at the time that he should share what he knew, as that is what I would do. Later I found out the rules, and gave a few talks about SSIS, including one for the PASS Data Warehousing and Business Intelligence Virtual Chapter which was recorded here. If I learn something, I want to tell other people, which is why I blog and speak. I think this is the greatest profession in the world, and I feel bad for people who have chosen to do something other than data platform work, because they are missing out.

Keeping Up

There are a ton of new technologies to learn coming out all the time. I keep up as much as I can, and when I do learn something, I tend to blog or speak about it. If you subscribe to this blog or follow me on Twitter, hopefully keeping up will be easier. I don't want Microsoft to think that they made a mistake, so I plan on trying to increase the number of blog posts and to speak whenever I am afforded the chance.

SQL Saturday Phoenix

I wanted to make sure to talk about the next place I will be speaking, SQL Saturday Phoenix, the largest data-related technology event in the state of Arizona. I know it is going to be a great event thanks to Joe Barth and the rest of us on the organizing committee who have volunteered to put it together. The Arizona SQL Server Users Group was where I learned about the SQL Server community and where I really got motivated to start learning, and I am happy to be a part of it. I hope to see you there.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2016 Polybase Setup

After reading my last post you should be all ready to install SQL Server 2016 to be able to use Hadoop.  I went through all of these steps in my webinar, but I thought it might also be a good idea to include them here so you don’t have to watch the video.

Polybase Install Feature

When installing SQL Server, PolyBase Query Service for External Data must be selected in the Feature Selection list, as shown below.

SQL Server 2016 PolyBase

To check whether PolyBase has been successfully installed, go to Control Panel -> Administrative Tools -> Services. There are two services added for PolyBase, SQL Server PolyBase Data Movement and SQL Server PolyBase Engine, as well as a bunch of other new ones for SQL Server 2016. The PolyBase services and SQL Server will need to be restarted in a later step. When starting SQL Server Management Studio for SQL Server 2016, it is hard to spot the differences between it and previous versions of SQL Server, as even the icon looks exactly the same. Once you create a database, you will notice a slight difference under the Tables folder. When PolyBase is installed there are two new folders, External Tables and External Resources. Tables which access data via PolyBase will appear under the External Tables folder, and External Resources will contain the references to the external resources, the HDFS cluster where the data is stored, as well as the file formats of the underlying data.
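Another quick way to check, from a query window, is a small sketch using a standard server property:

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

A value of 1 indicates PolyBase is installed.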

Configuring Polybase on SQL Server

SQL Server needs to be configured to use polybase by using the sp_configure command

EXEC sp_configure 'hadoop connectivity', 5;
GO
RECONFIGURE; 

The number 5 indicates the kind of Hadoop connectivity desired.  If you were using HDInsight’s Azure Blob storage or Hortonworks on Windows, the number listed would be 4.  The number 5 indicates connectivity to Hortonworks Data Platform on Linux.  Go ahead and exit out of SQL Server as you will be restarting it soon.
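If you want to confirm the setting after the restart, a quick query (just a sketch) against sys.configurations shows both the configured value and the value currently in use:

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'hadoop connectivity';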

Configuration File Modification

Open up your favorite text editor to modify the Hadoop.config file.  The default location is

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

Keep in mind that if you installed SQL Server with a named instance, the path will include the name of that instance. The config file contains a default Hadoop user name. That user, which is initially set to pdw_user, is probably a holdover from pre-2016 days when PolyBase was only available on Microsoft's big data appliance, the Analytics Platform System [APS], which was previously called Parallel Data Warehouse [PDW]. Hortonworks' default user is hue, so you will want to modify the file so that HadoopUserName is set to hue. I circled it below in a clip of the Hadoop.config file.

Once the changes to the Hadoop.config file are saved, the last thing which must be done to get PolyBase working is to restart the two new PolyBase services and SQL Server 2016.

Setting up the External Data Source and External Data Files

At this point, we can tell SQL Server where the Hadoop files are located. Open SQL Server Management Studio [SSMS] and create a new query to create an external data source, which is going to appear under the External Resources folder.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

After running this command and refreshing, the source HDP2 will appear under External Resources -> External Data Sources.

SQL Server needs to know the underlying file format of the data stored in Hadoop. Assuming the data is stored in a tab delimited format, this command tells SQL Server how to read the tab delimited data and the format of the dates. This will allow PolyBase to read a sample file which comes with Hortonworks, the file SAMPLE_07.

CREATE EXTERNAL FILE FORMAT TSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '\t',
DATE_FORMAT = 'MM/dd/yyyy'
)
)

SQL Server needs a schema to read the data, as it doesn't know where the fields are without one. The following code creates a table which will appear under the External Tables folder in SSMS, and tells SQL Server where the data lives and which file format, the one we just added, to use.

CREATE EXTERNAL TABLE Sample_07
(
code nvarchar(255),
description nvarchar(255),
total_emp int,
salary nvarchar(255)
)
WITH
(
LOCATION = '/apps/hive/warehouse/sample_07',
DATA_SOURCE = HDP2,
FILE_FORMAT = TSV,
REJECT_TYPE = value,
REJECT_VALUE=0
)

After this step is complete, you can use the new table to join data stored in the HDFS cluster to data in SQL Server.
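For example, the external table can be queried like any other table. This is a hedged sketch: dbo.JobCategory is a hypothetical local SQL Server table used only to illustrate the join.

-- dbo.JobCategory is a hypothetical local table, shown only for illustration
SELECT s.code, s.description, s.total_emp, j.CategoryName
FROM dbo.Sample_07 AS s
JOIN dbo.JobCategory AS j ON j.JobCode = s.code
WHERE s.total_emp > 10000;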

I hope this helps you get started with using polybase with SQL Server 2016.  Let me know what you think by posting a comment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY – Getting started with Hadoop and Virtual Machines

In my recent presentation on PolyBase in SQL Server 2016, I demonstrated how to get started using Hadoop with SQL Server 2016. In this post I will show you everything you need to do to run the demonstration yourself. The first step is to create a virtual machine so that you can run a Linux instance for Hadoop. As I know that installing a virtual machine can be intimidating, this post explains what you need to do, and how to fix a problem you may run into when running a virtual machine.

Creating Your Own Virtual Machine

Previously, spinning up a virtual machine meant purchasing software. No more, as there is now an open source application. In the example shown here, the Linux operating system will be installed, but you can put any operating system you want on your virtual machine, provided of course you have a license for it. If you don't feel comfortable installing non-released versions of code like SQL Server 2016 on your PC, a virtual machine is a great way to test it out. You will need to provide your own operating system, but there are trial versions you can use for limited periods of time as well. Oracle VM VirtualBox is an open source virtual machine application, and you can download it here. This software is needed prior to installing the Hortonworks Sandbox. Obviously Hortonworks is not the only version of Hadoop available; Cloudera has a Hadoop VM too, which you can download as well. Personally I am not a huge fan of the Cloudera Manager, which is why I prefer Hortonworks, but either will work with PolyBase.

Troubleshooting Why the Virtual Machine Won’t Run

If you are using a Windows operating system, chances are it may be running Hyper-V, Microsoft's own virtualization platform. For you to be able to use Hyper-V, you need a virtual machine image saved in the VHD format so you can load it in Hyper-V. Hortonworks provides two versions of their sandbox, one for VMware and one for VirtualBox. For this reason, Hyper-V is not an option as a virtual machine host for Hortonworks. And if Hyper-V is running, you can't run another VM.

The boot configuration data store could not be opened. Access is denied

The error message received when Hyper-V is running is pretty cryptic. What is "The boot configuration data store could not be opened. Access is denied" supposed to mean anyway? When wandering around the internets trying to find an answer to this question, there is a lot of <sarcasm>wonderful advice</sarcasm> which states that the only thing to do is to replace your computer entirely because the BIOS won't support virtual machines, even after you check the BIOS and find out that yes, your BIOS does support virtualization. Don't be deterred. The error can be fixed without new hardware. To resolve this error, Hyper-V needs to be turned completely off. In case you were wondering, stopping the Hyper-V services won't fix this. Instead, Hyper-V must be disabled at the command prompt, and then a reboot is required. Make sure you run the command prompt as administrator, because the command won't run if you don't. To run the command prompt as administrator, search for CMD. When the result comes back, right click on the cmd entry and select Run as administrator. To turn off Hyper-V, run this command:

bcdedit /set hypervisorlaunchtype off

Exit out of the command prompt. Remember, the error won't be resolved until after rebooting. After the reboot, the boot configuration message won't appear when running the VM, and you can successfully get started running the Hortonworks Sandbox.
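If you later need Hyper-V back, the same setting can be switched the other way, again from an administrator command prompt and followed by a reboot:

bcdedit /set hypervisorlaunchtype auto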

TL;DR – Links

If you just want to get started running a VM and PolyBase, here are the links needed to make that happen.

Oracle Virtual Box

Hortonworks Sandbox

Cloudera Quickstart

SQL Server 2016 RC1

Microsoft Server 2016 Preview

Once the VM is running with Hadoop, install SQL Server 2016 so that you can follow along with my next post, where I talk about how to use PolyBase. If you want to be notified when that post is available, please subscribe to my blog and you will find out automatically.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur