SQL Server 2016 Polybase Setup

After reading my last post you should be all ready to install SQL Server 2016 to be able to use Hadoop.  I went through all of these steps in my webinar, but I thought it might also be a good idea to include them here so you don’t have to watch the video.

Polybase Install Feature

When installing SQL Server in the Feature Selection List shown below,  PolyBase Query Service for External Data must be selected. SQL Server 2016 Polybase

To check to see if Polybase has been successfully installed, go to Control Panel->Administrative Tools->Services.  There are two services added for polybase, SQL Server PolyBase Data Movement and SQL Server Polybase Engine, as well as  a bunch of other new ones for SQL Server 2016. The polybase services and SQL Server will need to be restarted in a later step.  When starting SQL Server Management Studio for SQL Server 2016, it is hard to spot the differences between it and previous versions of SQL Server as even the icon looks exactly the same. Once you create a database for SQL Server, you will notice a slight difference under the tables folder. When polybase is installed there are two new folders, External Tables and External Resources. Data which is accessed via polybase will be stored under the External Files folder and the External Resources will contain the references to the external resources, the HDFS cluster where the data is stored, as well as the file formats of the underlying data.

Configuring Polybase on SQL Server

SQL Server needs to be configured to use polybase by using the sp_configure command

EXEC sp_configure ‚'hadoop connectivity', 5;
GO
RECONFIGURE; 

The number 5 indicates the kind of Hadoop connectivity desired.  If you were using HDInsight’s Azure Blob storage or Hortonworks on Windows, the number listed would be 4.  The number 5 indicates connectivity to Hortonworks Data Platform on Linux.  Go ahead and exit out of SQL Server as you will be restarting it soon.

Configuration File Modification

Open up your favorite text editor to modify the Hadoop.config file.  The default location is

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Settings\Hadoop.config

Keep in mind, if you installed SQL Server with a named instance the path will include the name of that instance.  The config file contains a default password for Hadoop.  The password, which is initially set to pdw_user is probably a holdover from pre-2016 days when polybase was only available on Microsoft’s Big Data Appliance, the Analytics Platform System [APS] which was previously called Parallel Data Warehouse [PDW].  Hortonworks’ default password is hue, so you will want to modify the file so that the HadoopUserName has the right password, hue. I circled it below in a clip of the Hortonworks.config file.

Once the changes to the Hadoop.config file are saved, to get polybase to work, the last thing which must be done is to restart the two new polybase services and SQL Server 2016.

Setting up the External Data Source and External Data Files

At this point, we can now tell SQL Server where the Hadoop files are loaded. Open SQL Server Management Studio [SSMS] and create a new query to create the new an external data source, which is going to appear in the new folder, External Data Source.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
TYPE = HADOOP,
LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
)

After running this command and refreshing, the source HDP2 will appear in the folder External Data Source ->Data Sources

SQL Server needs to know what the underlying data file formats of the data stored in Hadoop. Assuming the data is stored in a tab delimited format, this command will tell SQL Server how to read the tab delimited data and the format of the dates.  This will allow polybase to read a sample file which comes with Hortonworks, the file SAMPLE_07

CREATE EXTERNAL FILE FORMAT TSV
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '\t',
DATE_FORMAT = 'MM/dd/yyyy'
)
)

SQL Server needs a schema to read the data, as it doesn’t know where the fields are without it. The following code creates a table which will appear under the External Tables folder in SSMS as well as load the data by telling it where the data lives and to use the file format which we just added.

CREATE EXTERNAL TABLE Sample_07
(
code nvarchar(255),
description nvarchar(255),
total_emp int,
salary nvarchar(255)
)
WITH
(
LOCATION = '/apps/hive/warehouse/sample_07',
DATA_SOURCE = HDP2,
FILE_FORMAT = TSV,
REJECT_TYPE = value,
REJECT_VALUE=0
)

After this step is complete, you can use the new table to join to data on SQL Server from inside a HDFS cluster.

I hope this helps you get started with using polybase with SQL Server 2016.  Let me know what you think by posting a comment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY – Getting started with Hadoop and Virtual Machines

In my recent presentation on Polybase in SQL Server 2016, I demonstrated how to get started using Hadoop with SQL Server 2016. In this post I will show you everything you can do to run the demonstration yourself. The first step is to create a virtual machine so that you can run a Linux instance for Hadoop. As I know that installing a virtual machine can be intimidating, this post explains what you need to do, and how to fix a problem you may run into when running a virtual machine.

Creating Your Own Virtual Machine

Previously, spinning up a virtual machine meant purchasing software. No more, as there is now an open source application. In the example shown here, the Linux operating system will be installed, you can put any operating system you want on your virtual machine, provided of course you have a license for it. If you don’t feel comfortable installing non-released versions of code like SQL Server 2016, on your pc, a virtual VirtualBoxmachine is a great way to test it out. You will need to provide your own operating system, but there are trial versions you can use for limited periods of time as well. The open source virtual machine Oracle VM Virtual Box is the only open source version of a virtual machine software. You can download it here. This software is needed prior to installing the Hortonworks Sandbox. Obviously Hortonworks is not the only version of Hadoop available, Cloudera has a Hadoop VM too, which you can download as well. Personally I am not a use fan of the Cloudera Manager, which is why I prefer Hortonworks, but either will work with polybase.

Troubleshooting Why the Virtual Machine Won’t Run

If you are using a Windows operating system, chances are it may be running Hyper-V. Hyper-V is Microsoft’s Virtual Machine. For you to be able to use Hyper-V, you will need to have a virtual machine file which is saved in the VHD format so you can load the image in Hyper-V. Hortonworks provides two versions of their sandbox, one in the VM Ware version and one in the Virtual Box format. For this reason, Hyper-V is not an option as a virtual machine for Hortonworks. If you are running Hyper-V, you can’t run another VM.

The boot configuration data store could not be opened. Access is denied

The error message received when Hyper-V is running is pretty cryptic. What does “The boot configuration data store could not be opened. Access is denied” supposed to mean anyway? When wandering around the internets trying to find an answer this question, there is a lot of <sarcasm>wonderful advice</sarcasm> which states that the only thing to do is to replace your computer entirely as the bios won’t support Virtual Machines, even after you check the bios and find out that yes, your bios does support virtualization. Don’t be deterred. The error can be fixed without new hardware. To resolve this error,  Hyper-V needs to be turned completely off. In case you were wondering, stopping the Hyper-V services won’t fix this. Instead Hyper-V must be disabled at the command prompt, then a reboot is required. Make sure you run the command prompt as administrator, because the command won’t run if you don’t. To run the command prompt as administrator, search for CMD. When the result comes back, right click on the cmd selection and select run as administrator. To turn off Hyper-V, run this command

bcdedit /set hypervisorlaunchtype off

Exit out of the command prompt. Remember the error won’t be resolved until after rebooting. After that the reboot, the boot configuration message won’t appear when running the VM, and you can successfully get started running the Hortonworks Sandbox.

TL;DR – Links

If you just want to get started running a VM and polybase and here are the links needed to make that happen.

Oracle Virtual Box

Hortonworks Sandbox

Cloudera Quickstart

SQL Server 2016 RC1

Microsoft Server 2016 Preview

Once the VM running is running with Hadoop, install SQL Server 2016 so that you can follow along on my next post where I talk about how to use polybase.  If you want to be notified when that post will be available, please subscribe to my blog and you will find out automatically.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016 is continuing with a convention which was employed in previous releases, which I call the Cadillac release system.  At General Motors, in the past new features were first offered on their most luxurious brand, Cadillac, and if these features prove successful, they are rolled out to Buick and the rest of the product lines.  Microsoft does the same thing.  Their ‘Cadillac’  is the PDW [Parallel Data Warehouse], Microsoft’s Data Appliance. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, or APS as is was known then, and Microsoft later added column store indexes to SQL Server 2012. Later that same year, at SQL PASS Summit 2012, I heard about a really neat feature available in the PDW, PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explained PolyBase in great detail. I have been waiting to hear that PolyBase was going to be released to SQL Server ever since.  On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016. Listed in the release announcement was the feature I’d been waiting for, PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop Cluster from SQL Server. While the Apache environment provided the Sqoop HadoopSqoopapplication to integrate Hadoop with other relational databases, it wasn’t really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you needed to know before you ran Sqoop that you had enough room within your database to hold all the data. I remembered this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS. After all Sqoop is performing a data move, just like SSIS code. The caveat is SSIS is generally faster than Sqoop, and provides a greater feature set too.

Polybase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in the HDFS clusters. Since Hadoop is schema-on-read, within SQL server you generate the schema to apply to your data stored in Hadoop. After the table schema is known, PolyBase provides the ability to then query data outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing freedom to store the data in either place. No longer will people start automatically equating retrieving data in Hadoop with MapReduce. With PolyBase all of the SQL knowledge accumulated by millions of people becomes a useful tool which provides the ability to retrieve valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and better yet, integration with existing data. I am really looking forward SQL Server 2016.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure Data Lake: Why you might want one

On April 29, 2015 Microsoft announced they were offering a new product Azure Data Lake. For those of us who know what a data lake is, one might have thought that having a new data lake product was, perhaps redundant, because Microsoft already supported data lakes with HDInsight and Hadoop. To understand why you might want a separate product, let’s look at what a data lake is.  I think the best definition of a data lake that I read recently was here. Here’s the TL;DR version “A ‘data lake’ is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.” Ok so here’s the question, one  can spin up an HDInsight Hadoop cluster on Azure and put all of your data there, which means you can already create a data lake. Since you can already create a data lake, why did Microsoft go and create a new product?

Hardware Optimization and the Data Lake

If you look at Microsoft’s most recent Azure release, you’ll see they are releasing products designed to operate together. Service Bus, Event Hubs, Streaming Analytics, Machine Learning and Data Factory are designed to process lots of data, especially a lot of short pieces of data, like Vehicle GPS messages, or other types of real time status messages. In reading the product release for Azure Data Lake, they highlight it’s ability to store and more importantly retrieve this kind of data.  DataFactory The difference between the HDInsight already on Azure and the Data Lake product is the hardware dedicated to make the storage and the integration designed to improve access to the data. Data Factory is designed to move your data in the cloud to anywhere, including a data lake. If you look at the graphic Microsoft provides to illustrate what Data Factory is designed to integrate, the rest of the outputs listed have products associated with them. Now there is a product associated with the data lake too. Data lakes are designed to store all data, but unlike a database operational data store, data lakes are designed to have the database schema applied when the data is read, not when the data is written. This allows for faster writing of the data, but it does tend to make accessing the data slower. The Azure Data Lake hardware, according to the release, is designed to address this issue by providing computing power designed for massively parallel processing to provide the data when needed, which would be on the reading and analysis of the data, not when it is written. This sort of targeted computing power differs from the HDInsight Hadoop offering, which is uses a standard hardware model for storage and access. By tailoring the hardware to meet the needs of the specific type of data stored, in theory this will greatly improve performance, which will increase the adoption of not only the Azure Data Lake, but the tools to analyze and collect the data too. It’s going to be interesting to see how the marketplace responds as this could really push massive amounts of data to the Azure cloud. Time will tell.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

 

 

Introduction to Hadoop Presentation Follow-up

Thank you so much for everyone who was able to attend my webinar http://pragmaticworks.com/Training/FreeTraining/ViewWebinar/WebinarID/676 . (If you weren’t able to attend, you can always click on the link for a recording)

It’s always hard to talk about Hadoop as the subject is so broad that there were a lot of things that I had to leave out, so it is fortunate that I have this blog to discuss the topics I wasn’t able to cover. I thought that I would take this time to respond to the questions I received.

Presentation Q & A

Do you need to Learn Java in order to develop with Hadoop?

No. If you wish to develop Hadoop in the cloud with HD Insight, you have the option of developing with .net. If you are working in the Linux environments, which is where a lot of Hadoop is being developed, you will need to learn Java.

Do you know of any courses or sessions available where you can learn about Big Data or Hadoop?

My friend Josh Luedeman is going to be teaching an online class on Big Data next year.  If you don’t want to wait that long I recommend checking out a code camp in your area, such as Desert Code Camp where they are offering courses in Azure,  or SQL Saturday, especially the BI editions

How do you recommend a person with a BI background in SQL get started in learning Hadoop and where can I get the VMs?

The two ways I recommend for a person with a BI background to get involved with Hadoop is either through a Hortonworks VM or in the Microsoft’s Azure cloud with HD Insight.  Hortonworks provides a VM and Microsoft’s environment is hosted on their cloud. As the company that Microsoft partnered with to develop their Hadoop offerings, Hortonworks has very good documentation targeted to people who have more of a Microsoft BI stack background.  If you chose to go with HD Insight, there is a lot of really good documentation and video training available as well.

How do you compare Hadoop with the PDW?

While both Hadoop and Microsoft’s PDW, which they now call APS, were both designed to handle big data, but the approaches are wildly different. Microsoft built the APS to handle the larger data requirements of people who have structured data, mostly housed in SQL Server.  Hadoop was developed in an open source environment to handle unstructured data.

How can I transfer data into HD Insight?

This is a great question, which I promise to devote an entire blog post to very soon. I’ll give you the Reader’s Digest version here.  There are a number of ways you can transfer data into HD Insight.  The first step is to transfer the data into the Azure cloud, which you can do via SSIS, with a minor modification of the process I blogged about earlier here.  The other methods you could use to transfer data are via secured FTP or by using Powershell.  You will need to call the REST API which you use to provision an HDInsight Cluster.  There is also a UI you can use within HDInsight to transfer data as well.

I really appreciate the interest in the Webinar.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur