Creating Power BI Date Tables by Marking as Date Table

In a previous post, I wrote about the fact that Power BI did not have the ability to mark a table as a date table, but that has changed! Now you can use numeric keys to join a date table to other tables and still have the time intelligence functionality work. There is no longer any need for some kind of workaround to get a date table; you can simply mark one.

Marking a Date Table in Power BI

Naturally there is more than one way to mark a table as a date table in Power BI. The first way is to click the ellipsis (you know, the three dots) next to the date table name in the Power BI Fields pane, which I highlighted in yellow. The menu contains an option to Mark as date table. If you select, say, the locations table instead, Power BI is smart enough to know that it is not a date table and the option will not appear. The first time you select the option, Power BI will show the window shown above and prompt you to select the column in the table which contains the actual date values. Once you click OK, the menu item will have a check mark next to Mark as date table, and if you select it again you will see this window. Generally speaking there is no reason to unmark a date table, but if you want to, you can.

The other place in Power BI where you can mark a date table is the Modeling tab of the report view, which contains a Mark as Date Table option. Note that this option will be disabled unless you have clicked on a table first. If you are in the grid view or the relationship view you won't see the option; it is only available in the Modeling tab of the report view.

Time Dimension Functions: The reason for marking a Date Table

For those of you who may be wondering why you would want to mark a table as a date table at all, there is a very good reason. Marking a table as a date table allows you to use all of the time intelligence features which Microsoft added to DAX to do date calculations for you. While it is possible to avoid time-based DAX functions, unless you are rather expert at DAX, like these guys, you will probably want to use the handy functions Microsoft created, such as TOTALYTD or SAMEPERIODLASTYEAR. Just because you have added a date table does not mean you can use any date column in those calculations; you need to reference the table you have marked as a date table, not a date field in some other table. That means that CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(DateTime[DateKey])) will return the correct value, while CALCULATE(SUM(ResellerSales_USD[SalesAmount_USD]), SAMEPERIODLASTYEAR(ResellerSales[DateKey])) will not. You need to reference the marked date table every single time you use time intelligence functions. For an example of a Power BI Desktop DAX calculation not working when you are not using a marked date table, check out this post.

Power BI Update Requests

Power BI updates every month, and you may miss some of the things the team released last month if you do not check out this site regularly. I know that the Power BI team takes ideas for upcoming releases from https://ideas.powerbi.com, and there is one idea that I really wish would get more votes so that it can be added soon. Like marking date tables, it started out in SSAS Tabular. Having recently worked on a project with a ton of measures, the next feature I hope Power BI inherits from SSAS Tabular is the ability to create folders for measures. If you would like that too, vote for it here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Limitations in Time Series Data Analysis and the growth of Advanced Data Analytics

As someone who regularly analyzes data, I have done my share of time series analysis to determine trends over time, and I am struck by the fallibility of this sort of analysis. For those who are unfamiliar with this type of analysis, time series analysis tries to identify patterns in the noise of data to help predict future trends, using algorithms like ARIMA. As a kid I remember hearing, in a fast announcer voice, the phrase "Past performance is no indication of future results." In fact, the SEC requires mutual funds to tell all of their investors exactly that. Yet I get asked to do it anyway. While I enjoy working with data and using advanced analysis techniques, including R and Python, I think it is important to realize the limitations of this sort of analysis. In machine learning, an experiment is considered good if it is 85% right. That is not acceptable if you are talking about a self-driving car, as running people over 15% of the time is generally not considered acceptable. There are times when, looking at the future, the data is simply not going to provide an answer. When looking for answers in data, that is something people need to keep in mind. While you can find some answers in data, other answers will require prognostication or plain old guessing.

Impact on Technology realized in 2026

Data analysis is all about pattern matching, and while I don't find it to be infallible, looking at a wide set of data has led me to plan accordingly. While I am no Faith Popcorn, my analysis of what I see in the marketplace has led me to make some changes in my own life, as I believe change is coming to the industry. Adam Machanic's prompting for looking ahead to 2016 provided the impetus to publish these theories. What I see in the marketplace is that the tools used to support databases are improving. I see the ability of software to provide relevant hints and automate the tuning of database queries and performance continually improving, meaning there will be less of a need to employ people to perform this task. With databases being pushed more and more to the cloud and to managed services, there will be less and less need to employ many people in DBA roles. Where I see the industry moving is toward more people being employed to analyze the data and determine meaning from it. I expect that in 2026 very little data analysis will be performed with R and most analysis will be performed in Python. This means that if you are looking ahead, and are employed in areas where people are being supplemented with tools, the time is now to learn skills in areas where there is growth. If you have been thinking about learning data science, Python, and advanced analytics tools, now is the time to start so that you will be prepared for the future.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Preparing for Exam 70-774 – Perform Cloud Data Science with Azure Machine Learning

There are a number of reasons why you might want to take a Microsoft certification exam. Maybe you want to focus your studies on something tangible, or you think it will help further your career, or you work for a Microsoft Partner that requires a certain number of people to pass the exam to maintain its partner status. I am not going to get into the long argument over whether or not a certification will help your career, but I can tell you why you might want to take the 70-774 exam. Machine learning, or data science if you prefer, is an important analytic skill to have, and I believe it will only become more useful over time. Azure Machine Learning is a good tool for learning the analysis process; once you have the concepts down, using other tools for analysis is just a matter of learning a new tool. I talk to a number of people who are trying to learn new things and who study them in their spare time. It is very easy to spend time vaguely studying something, but you may find that having a target set of items to study focuses your time, and as a bonus you get a neat badge and some measure of proof that you were spending time on the computer learning new things and not just watching cat videos.

Exam 70-774 Preparation Tips


While you could always buy the book for the exam (shameless plug, as I was one of the authors), the book will not be enough; you will still need to write some code and do some additional studying. This exam is one of two needed for the MCSA in Data Science, and you can take the exams in any order. The best place to start is by looking at the 70-774 exam reference page from Microsoft. There are four different sections in the exam, and I have created some links for each section which will help you prepare. When studying for exams in the past, the best way I have found to prepare is to look at everything on the outline and make sure that I know it.

Prepare Data for Analysis in Azure Machine Learning and Export from Azure Machine Learning

Normalizing Data
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data

TanH
https://reference.wolfram.com/language/ref/Tanh.html

ZScore
http://stattrek.com/statistics/dictionary.aspx?definition=z-score
http://howto.commetrics.com/methodology/statistics/normalization/

Min Max
https://www.quora.com/What-is-the-meaning-of-min-max-normalization

PCA
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/principal-component-analysis
https://stackoverflow.com/questions/9590114/importance-of-pca-or-svd-in-machine-learning

SVD
http://andrew.gibiansky.com/blog/mathematics/cool-linear-algebra-singular-value-decomposition/

Canonical-correlation analysis (CCA)
https://en.wikipedia.org/wiki/Canonical_correlation

Develop Machine Learning Models

Team Data Science
https://docs.microsoft.com/fi-fi/azure/machine-learning/team-data-science-process/python-data-access

K-Means
https://www.datascience.com/blog/k-means-clustering

Confusion Matrix
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
https://en.wikipedia.org/wiki/Confusion_matrix
https://en.wikipedia.org/wiki/F1_score

Ordinal Regression
https://en.wikipedia.org/wiki/Ordinal_regression

Poisson regression
https://en.wikipedia.org/wiki/Poisson_regression

Mean Absolute Error and Root Mean Squared Error
http://www.eumetrain.org/data/4/451/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Cross Validation
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f

Operationalize and Manage Azure Machine Learning Services

Connect to a published Machine Learning web service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/publish-a-machine-learning-web-service
https://docs.microsoft.com/en-us/azure/machine-learning/studio/consume-web-service-with-web-app-template
https://docs.microsoft.com/en-us/azure/machine-learning/studio/manage-new-webservice

Use Other Services for Machine Learning

Microsoft Cognitive Toolkit
https://www.microsoft.com/en-us/cognitive-toolkit/

BrainScript
https://docs.microsoft.com/en-us/cognitive-toolkit/brainscript-basic-concepts

Streamline development by using existing resources
https://docs.microsoft.com/en-us/azure/machine-learning/studio/gallery-how-to-use-contribute-publish

Perform database analytics by using SQL Server R Services on Azure
https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/provision-vm
https://docs.microsoft.com/en-us/machine-learning-server/install/r-server-vm-data-science
https://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf
http://blog.revolutionanalytics.com/2017/07/xgboost-support-added-to-rattle.html
https://github.com/JohnLangford/vowpal_wabbit/wiki

I hope you have found this test preparation material helpful.  If you passed the exam, let me know by sending me a comment.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Azure Machine Learning Workbench

Microsoft released Azure Machine Learning Workbench as a public preview at the Ignite conference on September 25, 2017. It is a new addition to the Azure ecosystem, which already includes the machine learning tool introduced three years ago, Azure Machine Learning Studio. Microsoft has said it plans on keeping both products. When asked about the two, Microsoft said that the earlier tool, Azure Machine Learning Studio, is targeted at developers who want to add machine learning to their current applications, as it is an easy-to-use tool that doesn't require a person to be a trained data scientist. Azure Machine Learning Workbench is targeted at data scientists who want to bring in other libraries, like TensorFlow for Python, and delve deep into the data.

Microsoft Moves into Machine Learning Management

Microsoft intends Azure Machine Learning Workbench to be more than a tool for machine learning analysis. It is part of a system to manage and monitor the deployment of machine learning solutions through Azure Machine Learning Model Management, and the management aspects are part of the application installation. The Azure Machine Learning Workbench download is available only by creating an account in Microsoft's Azure environment, where a Machine Learning Model Management resource will be created as part of the install. Within this resource, you will be directed to create a virtual environment in Azure where you will deploy and manage machine learning models.

This move into management of machine learning components is part of a pattern first seen in the on-premises version of data science functionality. First, Microsoft helped companies manage the deployment of R code with SQL Server 2016, which includes the ability to move R code into SQL Server. Providing this capability decreased the time it took to implement a data science solution, because the code can be deployed easily without needing to be rewritten or included in another application. SQL Server 2017 expanded on this idea by allowing Python code to be deployed into SQL Server as well. With the cloud service Model Management, Microsoft is hoping to centralize implementation so that all machine learning services created can be managed in one place.

Hybrid Cloud, Desktop, and Python

While you must have an Azure account to use Machine Learning Workbench, the application is designed to run locally on either a Mac or a Windows computer. There is a developer edition of the tool so that one can learn it without incurring a bill, as is the case with the previous product, Azure Machine Learning Studio. The Machine Learning Workbench download must be accessed from within an Azure account and is installed on your local computer. When you run the application from your computer, it will prompt you to log into your Azure account to load Azure Machine Learning Workbench.

The application is designed to use and create Python code. Azure Machine Learning Workbench does not contain any accommodation for incorporating machine learning components written in R, just Python. If you have created machine learning components using R, they can be incorporated into Azure Machine Learning Model Management if you create web services which encapsulate the R code. The R code does not interface with Workbench, but it can be made part of the managed projects in Azure. While it is possible to create a web service for R with the earlier product, Azure Machine Learning Studio, there is no direct way to include R with Azure Machine Learning Workbench. There are a number of sample templates to get started with Python, including the ubiquitous Iris dataset, linear regression, and several others. Once the project is created, you can use your favorite IDE, since the project creates Python code which can be read anywhere.

Staying within the Machine Learning Workbench application gives you access to arguably one of its neatest parts, the data parser. This tool, originally code-named Project Pendleton, was designed to be an intuitive way to modify the contents of data, even better than the previous leader in parsing data, Power BI's Power Query.

You can select the option "Derive column by example" or "Split column by example" and then start typing in a new column. For example, if you want to separate a column which contains both the date and the time, right-click on that column, select "Split column by example", and type the date in the new column provided; the application will immediately determine that you want two columns and create them. A date column and a time column will be created for you after you type in a single date. After the sample columns have been created, you can approve the change or reject it if it does not work the way you want.
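The code generated for these preparation steps is Python, though the exact output depends on your data. Purely as an illustration of the same date/time split (this is not the code the tool produces, and the Sighted column name is hypothetical), a pandas version might look like this:

import pandas as pd

# Hypothetical example: split a combined timestamp column into separate
# date and time columns, in the same spirit as "Split column by example".
df = pd.DataFrame({"Sighted": ["2017-09-25 21:15:00", "2017-10-01 03:30:00"]})
stamps = pd.to_datetime(df["Sighted"])
df["Sighted_Date"] = stamps.dt.date   # e.g. 2017-09-25
df["Sighted_Time"] = stamps.dt.time   # e.g. 21:15:00
print(df)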

Like Power Query, each change made to the data is listed in the Steps window on the right side of the application window. When you are done modifying the data, right-click on the Data Preparations source icon, which in my example is called UFO Clean, and the UI changes made to the data are used to generate Python code that performs those changes. The generated Python code can then be used to transform the source data programmatically.

The next step in the process is to write the Python code needed to evaluate the data and create a model, which in my case would determine where and when you are most likely to see the next UFO, based on the dataset I included in my project. Unlike its counterpart Azure Machine Learning Studio, Azure Machine Learning Workbench requires you to know how to write the code needed to create a machine learning analysis in Python. One could write that Python code in any Python editor. If you choose to use Azure Machine Learning Workbench, the Python library scikit-learn is installed as part of the application. Other libraries which you may want to use, such as the common library matplotlib, you will need to load within Azure Machine Learning Workbench.
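To give a sense of what that hand-written modeling step can look like, here is a minimal scikit-learn sketch. It is not the code from my UFO project; the file name, feature columns, and label are all hypothetical placeholders for a prepared dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical prepared data; a real project would use the output of the data preparation step.
df = pd.read_csv("ufo_clean.csv")
X = df[["latitude", "longitude", "hour_of_day"]]   # assumed feature columns
y = df["sighting_reported"]                        # assumed binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))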

Web Service: How Azure Machine Learning Workbench Solutions are Deployed

To deploy a package, you will need to export the completed model as a serialized Python object using the Python module pickle. This creates a file with a .pkl suffix, which is the file you will deploy. Azure Machine Learning Workbench expects that you will be deploying via Docker containers or by creating an Azure cluster. You will need to register the Docker container in Machine Learning Model Management for it to be deployed.
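As a minimal sketch of that export step (the file name and model here are purely illustrative), pickling a trained scikit-learn model and loading it back looks like this:

import pickle
from sklearn.linear_model import LogisticRegression

# Any trained estimator will do; this trivial model is only for illustration.
model = LogisticRegression().fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with open("ufo_model.pkl", "wb") as f:   # hypothetical file name; .pkl is the suffix you deploy
    pickle.dump(model, f)

with open("ufo_model.pkl", "rb") as f:   # a scoring script loads the same file to make predictions
    loaded_model = pickle.load(f)
print(loaded_model.predict([[1.5]]))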

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 2

As advanced analytical techniques become more popular, companies are looking to hire people to find answers in their data. What kind of answers? Predicting the future, determining what factors make customers leave, what kinds of products they can get good customers to buy, which conditions are related, whether a given transaction is valid, and similar questions. What answers can be provided has everything to do with the available data. The data I chose to analyze was the salary data provided by Brent Ozar, which is publicly available here. I started looking at the data in a previous post, where I did an initial review of the data and discussed the data analysis process.

Regression Analysis – What kind of relationships can I find in the Data?

Looking at how data is related is a very important step in data analysis. Most often, variables are analyzed using linear regression algorithms to compare one or more variables to each other. For this kind of analysis, generally speaking, all of the data needs to be represented numerically, which means that if data exists instead as categories of items, it will need to be transformed. For example, on the salary data which I analyzed and published in Power BI, job experience and salary are compared using the ggplot library from R, with the two values on their respective axes. What I was hoping to find was a strong relationship between these two values. If you look at the graph, you can see this is not the case. Interestingly enough, while the line shows an upward trend, you can see a drop in salaries for those with a lot of job experience. Those people with the most experience, above 35 years, are making less money than those with less experience. The graph also shows that those who are just starting their careers are not necessarily making very little money. What the data shows is that there is no guarantee that the longer you work, the more money you make.
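The original chart was built with ggplot in R; for readers who prefer Python, a rough sketch of the same experience-versus-salary trend line is below. The column names YearsWithThisTypeOfJob and SalaryUSD are assumptions about the survey export, not guaranteed to match the file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Column names are assumptions about the survey export.
df = pd.read_csv("salary_survey.csv").dropna(subset=["YearsWithThisTypeOfJob", "SalaryUSD"])
x = df["YearsWithThisTypeOfJob"].astype(float)
y = df["SalaryUSD"].astype(float)

slope, intercept = np.polyfit(x, y, 1)    # simple linear trend
plt.scatter(x, y, alpha=0.3)
plt.plot(x, slope * x + intercept, color="red")
plt.xlabel("Years of job experience")
plt.ylabel("Salary (USD)")
plt.show()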

Data Cleansing for Analysis

Because I am looking for data trends and not anomaly detection, I normalized the survey data. I eliminated the 100 people who did not fill in the salary amount and cut off the high and the low values, using the box plots generated in Power BI as a guideline for the ranges to exclude. I was also interested in determining the difference in the responses between male and female, so I did some data substitution on some of the values because I wanted to include more records. In 2018, the only year that this question was asked, 87.6% of the respondents were male. I made the decision to recode as male all of the gender responses that each made up less than .22% of respondents, so that I would have more data to evaluate. I modified all of the data in Power BI using M code. You can take a look at all of the modifications I made to the data in the Power BI report I created, as I am making it available for the next 30 days.
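The actual cleansing was done with M in Power Query; purely to illustrate the same three steps in Python (the bounds and column names here are placeholders, not the values I used), a pandas sketch might look like this:

import pandas as pd

df = pd.read_csv("salary_survey.csv")   # column names below are assumptions

# 1. Drop the respondents who did not fill in a salary amount.
df = df.dropna(subset=["SalaryUSD"])

# 2. Trim the extreme highs and lows; the post used Power BI box plots to pick the real cutoffs.
low, high = 10000, 500000               # placeholder bounds
df = df[(df["SalaryUSD"] >= low) & (df["SalaryUSD"] <= high)]

# 3. Recode gender responses that each make up less than 0.22% of the data as Male.
freq = df["Gender"].value_counts(normalize=True)
rare = freq[freq < 0.0022].index
df.loc[df["Gender"].isin(rare), "Gender"] = "Male"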

Examining the Top 5%

Recently I have had some conversations with colleagues regarding salary, and that led me to want to review what people would like to make. Most people would like to be making the most money possible in their profession and are not interested in moving, which is why I chose not to do much with the geographic data. I ran a number of different machine learning algorithms on the data trying to find a definitive set of results among those who reported making the most money. The results of those experiments were inconclusive; while I found some items which were common among the highest earners, the results were not statistically conclusive. There are a number of preconceptions that people have regarding salary, and I chose to illustrate some of them to dispel some myths surrounding the data. I also grouped the salaries into bands: the 95th percentile and above ($153,565 and up), the 75th percentile and above ($67,789 and up), and the rest as average. These numbers were based on the values in the box chart in the top left of the Power BI report.
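Purely as an illustration of that banding step, which was done in Power BI in the original, a pandas version using the same thresholds might look like the following; the SalaryUSD column name is an assumption.

import pandas as pd

df = pd.read_csv("salary_survey.csv")            # column name is an assumption
bins = [0, 67789, 153565, float("inf")]          # thresholds taken from the Power BI box chart
labels = ["Average", "75th percentile and up", "95th percentile and up"]
df["SalaryBand"] = pd.cut(df["SalaryUSD"], bins=bins, labels=labels)
print(df["SalaryBand"].value_counts())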

Salary Conclusions – Myth vs Reality

I know that I have heard that if you want to make money you need to get into management. Being a good manager is not the same skill set as being a good database professional, and there are many people who do not want to be managers. According to the data in the survey, you can be in the top 5% of wage earners and not be a manager. How about telecommuting? What is the relationship between telecommuting and the top 5%? Well, it depends on whether you are looking at the much smaller female population: the majority of females in the top 5% telecommute. Those who telecommute 100% of the time do very well, as do those who spend every day at a job site. Males also report working more hours and telecommuting less than females do. If you look at people who are in the average category, they tend not to telecommute, and the average category also has 25% of people who work less than 40 hours a week. If you look at the number of responses in each category by country, you can determine that in many cases, like Uganda, there are not enough survey respondents to draw any conclusions about salary by location.

After spending quite a bit of time analyzing and visualizing the data, I was unable to determine a specific set of skills that provides a roadmap of exactly what one needs to do to be in the top 5% of salaries for a data professional. What I can tell you is that, more than likely, there is someone with your level of work experience and position who is doing really well, and there is no reason why, by the time the next survey comes out, you cannot be the person who is in the top 5 percent. This may mean working harder at your job and perhaps changing employers, as the analysis shows that is the best way to make more money.

 

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

Analyzing Salary Data with Power BI and R – Part 1

CRISP-DM Diagram

The standard method for analyzing data is the CRoss Industry Standard Process for Data Mining [CRISP-DM]. Rather than describe the method, this post will walk through the process to illustrate how to analyze data using it. The data I selected for analysis is the Brent Ozar Salary Survey information. The data is available under an open source license and contains two years of answers to salary questions, with a total of over 6,000 responses. Understanding what is in the data determines what kind of answers it can provide. What can the data reveal? Prior to drawing any conclusions, one needs to examine the data to determine its level of completeness and correctness and whether or not there is enough data to make decisions based upon it.

Data Understanding

The first step in the process is to analyze the data to evaluate what kind of knowledge you can gather from it. The primary perspective of the data is salary, and the survey describes the characteristics of people at a certain salary level. The survey used to gather the data contained a number of drop-down boxes, and those fields can be used as categorical variables since there is a fixed number of possible responses. Other survey items allowed people to freely enter anything, which makes some of the data harder to analyze statistically.
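As a small illustration of that distinction (the column names here are hypothetical), the drop-down fields can be treated as categorical variables in a tool like pandas, while free-text fields stay as plain strings:

import pandas as pd

df = pd.read_csv("salary_survey.csv")   # column names below are assumptions

# Fields that came from drop-down boxes have a fixed set of responses,
# so they can be treated as categorical variables.
for col in ["Country", "EmploymentStatus", "JobTitle"]:
    df[col] = df[col].astype("category")

# Free-text fields remain plain string columns and are harder to analyze statistically.
print(df.dtypes)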

 

Where Do Data Professionals Make the Most Money?

In the survey for 2018, the people who made the most money were from Hong Kong, with an average salary of $263,289. Before you start planning a move, you might want to look at the data a little closer. There were two people who responded from Hong Kong, and one of them said he was making over 1.4 million dollars, the highest amount reported in the survey. Given that we only have two responses from Hong Kong, we are unable to draw a definitive conclusion from two records. To answer that question, more analysis will need to be done on the location and salary information, and you will probably want to add market basket (cost of living) criteria, because a dollar doesn't go as far in Hong Kong, where the average apartment rents for $3,237 a month, as it does in, say, Uganda, where the rent is around $187 a month.

Using Power BI to Provide Data Understanding

The data preparation step of the CRISP-DM process anticipates that you will want to modify the raw data. There may be records containing null or erroneous values, where you will need to eliminate the entire record or substitute entries for a particular value. You can also use this analysis to determine what conclusions you will be able to derive from the data. For example, if you wanted to analyze what criteria are required for Microsoft Access developers to make over $100,000 a year, you could easily find out in Power BI that this analysis is not possible, as there is no data for that set of criteria. If you want to do a year-over-year analysis of people who are working as DBAs, which I show in the second tab, you will need to change some of the categories, as they changed from 2017 to 2018.

Analyzing Data with R and Power BI

Many times, when providing a final report to explain your analysis, you will need to provide some documentation to demonstrate your conclusions. In addition to creating some visualizations in Power BI, I also created some in R to include R-based visualizations and analysis. While I can include any R library I wish in Power BI Desktop, there are only 364 libraries currently added to the Power BI service in Azure. If there is an R library you would like to use that is not there, you can send an email request to rinpbifb@microsoft.com and ask for it to be added.

Power BI Salary Data

For more information on analyzing data with Power BI and R, I recorded a video for Microsoft's Power BI team, which is available here. The video shows the cleaning process and some information regarding the analysis. The analysis of the salary data itself will be included in another post. If you would like to find out when the next post in this series is available, please subscribe to my website for all of the latest updates.

*** UPDATE: My next post on further analysis of this data is available here.

Yours Always,

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 3 – Internals

After you have installed SQL Server 2017 with Machine Learning Services, you may notice a couple of interesting things. One is that by default you will have 20 new users created. These user IDs are by default named MSSQLSQLServer01, MSSQLSQLServer02, MSSQLSQLServer03… MSSQLSQLServer20, but if you have a named instance, like mine called SQLServer2017, the users are named after the named instance. There is a subdirectory created for each user ID, which is by default located in \Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityDataExternal. You do not want to remove these user IDs or rename them. You may be wondering why you have all of these user IDs for Machine Learning Services and what they are for. Keep on reading to find the answer.

SQL Server Launchpad and User IDs

When calling external processes, SQL Server internally uses these user IDs to call the Launchpad service, which is installed as part of Machine Learning Services and must be running for SQL Server to be able to execute code written in R or Python. The number of users is set by default. To change the number of users, open SQL Server Configuration Manager by typing SQLServerManager14.msc at the Run prompt. For some unknowable reason Microsoft decided to hide this application, which was previously available by looking at the installed programs on the server; now they apparently think everyone should memorize this obscure command. Once you have SQL Server Configuration Manager open, right-click on the SQL Server Launchpad service and select Properties, which will show the window shown below. You will notice I am running an instance called SQLServer2017, which is listed in parentheses in the window name.

SQL Server 2017 Launchpad Configuration

Clicking on the Advanced tab shows an entry for External Users Count, which is shown highlighted. This value is set by default to 20 users, which means that 20 different threads can concurrently call an R or Python process. If you reduce this number to 0, no R or Python code can be run, and the SQL Server Launchpad service will not run. The minimum number of users you can have with the Launchpad service still running is two, but changing the count to that low a number is not recommended, as those processes are needed to run Machine Learning Services. If you have more than 20 concurrent R or Python processes running, SQL Server will wait until one of these threads is no longer in use and, once one is free, will use it to call another process. While a process is running you may see some GUIDs or other non-decipherable data appear in the folders for a user. The garbage cleanup runs soon after to delete anything in the folders, so eventually they will all be empty. What does the Launchpad service do, and what does that have to do with Machine Learning Services and SQL Server? Well, the short answer is that launchpad.exe is used to call R and Python.

SQL Server Internal Machine Learning Components

To run R or Python code in SQL Server, you will need to execute an external script, which I talked about in the first post of this series. The following diagram illustrates what happens when that call is made and which executables are called. When a request to run R or Python code is received by sqlservr.exe, SQL Server calls launchpad.exe using a named pipe. Every time a stored procedure or call to run R or Python is requested, an RLauncher or Python launcher process is run. Windows job objects to process the work are also created if none exist; if there are unused Windows job objects initiated by a previous call and not presently in use, they will be reused.

The job object containers execute the code using rterm.exe or Python.exe. The rxlink.dll passes messages to the BxlServer to process any SQL/R functions written in the R code, sends monitoring information to SQLPAL, and creates XEvents. Python35.dll runs the Python code; if the Python code uses the revoscalepy library, it calls SQLPAL to create XEvents, otherwise it calls the BxlServer and sqlsatellite.dll to send and retrieve data from SQL Server. The named pipe used to call launchpad.exe is created internally and is not part of any other named pipe process. launchpad.exe uses the user IDs to call the R or Python external processes. The R and Python code is executed outside of SQLPAL, and the processed data is returned to SQL Server by sqlsatellite.dll.

Hopefully this post answered the questions you had about what SQL Server is doing when you run Machine Learning Services. If you have any additional questions, please let me know by asking me on twitter @desertislesql or leaving me a comment on this post.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 2 – Memory Allocation

SQL Server 2017 fundamentally changed the underlying structure of SQL Server for reasons that had nothing to do with Machine Learning Services. Understanding this new architecture will help you configure SQL Server to optimally run R and Python. When Microsoft set out to get SQL Server to work on Linux, the goal was to bring nearly 30 years of development effort to a new operating system without having to rewrite all of the code to make SQL Server run on Linux. For SQL Server 2005, Microsoft created SQLOS, an abstraction layer between the hardware and SQL Server. This abstraction layer allowed SQL Server to take advantage of hardware changes even when the operating system had not yet implemented all of the code needed to fully support them. From a practical perspective, this meant that when you configured SQL Server to use 100% of available memory, this didn't mean all of the memory on the server; it meant 100% of the memory allocated to SQL Server.

For SQL Server 2017, Microsoft created the SQL Server Platform Abstraction Layer [SQLPAL]. Like SQLOS before it, SQLPAL abstracts the calls to the operating system. It makes SQL Server operating system independent by separating SQL Server code from the operating system through an abstraction layer which handles the management of memory, processing threads, and IO. This layer of abstraction makes it possible to create one version of SQL Server code which can then be run on both platforms, the Linux and Windows operating systems. SQLPAL manages all memory and threads used by SQL Server.

Machine Learning Resources and SQL Server Memory Allocation

Enabling Machine Learning Services on SQL Server, which I discussed in a previous blog post, requires you to enable external scripts. Machine Learning Services run as processes external to SQLPAL, which means that when you run Python or R code you are running it outside of the managed processes of SQL Server and SQLPAL. This design means that the resources used to run Machine Learning Services sit outside of the resources allocated for SQL Server. If you are planning on using Machine Learning Services, you will want to review the max server memory option you may have set for SQL Server. For example, if your server has 16 GB of RAM, you have allocated 8 GB to SQL Server, and you estimate that the operating system will use an additional 4 GB, that means Machine Learning Services will have the remaining 4 GB which it can use.

By design, Machine Learning Services will not starve SQL Server of memory, because it does not use the memory SQL Server has allocated. This means DBAs do not have to worry about SQL Server processes failing because some R program is using all the memory. You do have to worry about the amount of memory allocated to Machine Learning Services, though. By default, using our previous example where there was 4 GB which Machine Learning Services can use, it will only use 20% of that available memory, or roughly 0.8 GB (about 819 MB). That is not a lot of memory. If you are doing a lot of Machine Learning Services work, you will most likely want to use more memory, which means you will want to change the default memory allocation for external services.
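To make the arithmetic explicit, here is the same calculation as a small Python sketch; the 16, 8, and 4 GB figures are just the example numbers used above.

# Worked example of the default external memory calculation described above.
total_ram_gb = 16
sql_server_gb = 8
operating_system_gb = 4

remaining_gb = total_ram_gb - sql_server_gb - operating_system_gb   # 4 GB left over
default_external_gb = remaining_gb * 0.20                           # 0.8 GB (about 819 MB) by default
increased_external_gb = remaining_gb * 0.50                         # 2 GB after raising the cap to 50%, as shown below
print(default_external_gb, increased_external_gb)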

SQL Server Resource Allocation

SQL Server manages all resources using the abstraction layer SQLOS, which is the interface between SQL Server and all of the underlying hardware resources, including, of course, memory. Using the Resource Governor within SQL Server, it is possible to allocate the resources used by specific processes to ensure that no single process will, for example, use all the memory and starve out other processes running on the machine. Configuring and using resource pools provides even more important functionality, such as allowing production applications to be allocated the majority of the SQL Server resources managed by SQLOS. This ensures, for example, that an ad-hoc reporting query will not adversely impact the primary application.

Machine Learning Services Resource Allocation within SQL Server

The allocations for the Resource Governor for all SQLPAL functions can be found by running

SELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'

By default, the max CPU, max memory, and CPU cap are all set to 100 percent. To look at the resource allocation for Machine Learning Services, you will need to look at the external resource pools.

SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'

By default, the maximum memory that Machine Learning Services can use, outside of the memory that has been allocated to SQL Server, is 20% of the remaining memory. If the running processes require more memory, the allocated percentage for the external pool resources may need to be adjusted. The following statements increase the memory allocated to external processes from 20% to 50%.

ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 50);
ALTER RESOURCE GOVERNOR reconfigure;
GO

Using our previous example of 4 GB of memory available after the allocations to SQL Server and the OS, the memory available for Machine Learning Services would go from roughly 0.8 GB to 2 GB. Setting resources for the external resource pool will in no way impact the resources SQL Server itself uses; if you run the queries listed above again, you will see the changes made to the external pool while the standard Resource Governor pool is unchanged.

Determining How Much Memory is needed for Machine Learning Services with SQL Server

How do you know how much memory SQL Server needs for Machine Learning Services? Well, since I am a consultant, I feel compelled to say: it depends. Given the relative newness of the machine learning tools, there are not any really good guidelines, as the memory you use depends greatly on the complexity and quantity of the R or Python code you are running, as well as how much data those processes are running against. It also depends on which language you are using. R is more memory intensive than Python and, unless you are using the Rx functions which are part of Machine Learning Services, will not swap items in and out of memory. The best way to determine how much memory you are using is to monitor its use over time, and the best way to do that is to create a process for monitoring the external resources.

Best Practice Method for Monitoring Machine Learning Services Resources

Creating resource pools for machine learning is considered a best practice for ongoing monitoring of resource use over time. The following code creates an external resource pool for processes running Machine Learning Services and classifies those processes to use it. If you are familiar with setting up resource pools in SQL Server, this process is the same; it just needs to be applied to external resources as well. To monitor Machine Learning Services, the first step is to create an external resource pool, called ML_Resources here, instead of just using the default, and I am going to allocate all of the external resources to it.

CREATE EXTERNAL RESOURCE POOL ML_Resources WITH (max_memory_percent = 100);

The next step in the process is to create a workload group.  The workload group, named MLworkloadGroup  in the code, is used as a container to hold processes which have been classified as ML processes.

CREATE WORKLOAD GROUP MLworkloadGroup WITH (importance = medium) USING "default", EXTERNAL "ML_Resources";

The next step is to create a function for classifying processes running as R or Python so that they can be monitored in the workload group.

USE master
GO
CREATE FUNCTION is_ML_app()
RETURNS sysname
WITH schemabinding
AS
BEGIN
IF program_name() IN ('Microsoft R Host', 'RStudio', 'Python', 'Pythonw') RETURN 'MLworkloadGroup';
RETURN 'default'
END;
GO

Once the function has been created, the Resource Governor is directed to use it so that all of the Python and R code is monitored in the external resource pool, and the Resource Governor is turned on with the reconfigure command.

ALTER RESOURCE GOVERNOR WITH (classifier_function = dbo.is_ML_app);
ALTER RESOURCE GOVERNOR   reconfigure;
GO

Going forward, all processes running R or Python will be classified into the ML_Resources pool, which can use all of the available external memory. After these steps are completed, you can obtain performance information from the DMVs sys.dm_resource_governor_resource_pools and sys.dm_resource_governor_workload_groups by creating a query like this.

USE master
GO
SELECT a.session_id, a.login_name,  b.name
FROM sys.dm_exec_sessions AS a
JOIN sys.dm_resource_governor_workload_groups AS b
ON a.group_id = b.group_id

 

Using the Windows Performance Monitor, you will now be able to take a look at the resources being used for Machine Learning Services and can then determine how much memory is needed based upon actual usage on the server.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

SQL Server 2017 Machine Learning Services Part 1- Installation

With the release of SQL Server 2017, you now have the capability to incorporate both R and Python into SQL Server. As there is a lot of material on the topic, this first post in the series covers installation. Future posts will cover the internal components, monitoring R and Python code to determine the performance impact on SQL Server, creating and maintaining R code, creating and maintaining Python code, and other related topics.

Installing Machine Learning Services

R was first introduced in SQL Server 2016, where it was called R Services. For SQL Server 2017, this same service was renamed Machine Learning Services and expanded to include Python. In the install there are three options: installing Machine Learning Services, then selecting R and/or Python, as you can see in the attached picture.

Why you want to select Machine Learning Services (In-Database)

There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and have no idea what the load may be, start by selecting Machine Learning Services (In-Database). There are several reasons why you want to select the In-Database option by default. One of the problems Microsoft was looking to solve by incorporating advanced data analytics was improving performance by greatly reducing data latency; if you are analyzing a lot of data which is stored within SQL Server, performance will be improved if the data does not need to be moved around on a network. Also, the licensing costs of installing R Server standalone need to be evaluated with a Microsoft representative. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server, should be performed prior to deciding to install the Machine Learning Server standalone.

Internet Access Requirements for installing R and Python

Machine Learning Services is an optional part of the SQL Server install. Because R and Python are both open source, Microsoft cannot include the R or Python executables within the SQL Server install; the executables must be downloaded from their respective locations on the internet, and the installation process is a little different if there is no internet access. Each language has two installs, one for the open-source executable and one for the server components. If you do not have access to the internet on the server where you are installing SQL Server 2017, you can download the files needed for the install ahead of time.

Here are the links for SQL Server 2017 CU2.

Microsoft R Open

Microsoft R Server

Microsoft Python Open

Microsoft Python Server

If you are installing a different version, use the links provided on the Offline Installation screen. Each of these links will download a .cab file. You will need to copy the cab files to a location where the server can access them and provide the path in the Offline Installation window.

 

What is Installed with Machine Learning Services

Machine Learning Services runs as an external process and communicates with launchpad.exe to access either R or Python. For a quick check to see whether Machine Learning Services was installed, look for the SQL Server Launchpad service in the list of running services. The install will also create, by default, 20 different external users which are used to call R or Python. There will be a subdirectory created for each user, named after the SQL Server instance name, which is MSSQLSERVER by default, followed by a number from 00 to 20. These subdirectories are normally empty, as they are used temporarily when R or Python needs them, and the information in them is eventually deleted by SQL Server. The default location is C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData.

The R tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES directory and include RLauncher.dll, which loads R. If you want to run R on the server, which can be handy when updating R libraries, run RGUI.exe, which loads an interface where you can run R code. In SQL Server 2017 CU2, Microsoft R Open 3.3.3 is installed along with R Server 9.2.0. Microsoft R Open is a version that is 100% open source and completely compatible with the standard open source version of R, which is commonly referred to as Comprehensive R Archive Network [CRAN] R. Microsoft rewrote some of the underlying functions so that they would be multi-threaded, which base R is not, and incorporated the Intel Math Kernel libraries to improve performance. R Server is the version of R which contains the proprietary functions Microsoft created for SQL Server, including the ability to swap data in and out of memory, which will be discussed in a future post.

The Python tools are located in the C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES directory and include Pythonlauncher.dll, which loads Python 3.5. The installation includes the Anaconda distribution for Python, which contains not only the data science components of Anaconda but also scikit-learn and pandas. Microsoft has also included machine learning algorithms created for Python in the microsoftml and revoscalepy libraries. There is a lot of interesting content there, and I will be discussing more about these libraries in a later post.

Configuring SQL Server to Run R or Python

Once the Machine Learning components are installed, there are some configuration steps which must be completed to permit R or Python to run on SQL Server. If this is a new server, make sure to install SQL Server Management Studio, since it is not included in the SQL Server install. From within an SSMS query window, the following script needs to be run to enable external scripts.

SP_CONFIGURE 'external scripts enabled', 1
GO
RECONFIGURE
GO

After this step completes successfully, a restart of the SQL Server service is required. When stopping the service, you will be notified that SQL Server Launchpad will also need to be restarted. I have noticed that for some reason Launchpad does not always restart when SQL Server is restarted, so you might want to check to make sure it is running, as you cannot run R or Python unless the SQL Server Launchpad service is running. After the restart, to check whether R is working properly, run the following code from within an SSMS query window.

EXEC sp_execute_external_script @language =N'R',
@script=N'OutputDataSet <-InputDataSet',
@input_data_1 =N'SELECT 1 as CheckToSeeIfRIsWorking'
WITH RESULT SETS (([CheckToSeeIfRIsWorking] int not null));
GO

When run successfully, this script will return a 1. SQL Server is now ready to run R.  To check to see if Python can be run successfully, run this script.

EXEC sp_execute_external_script  @language =N'Python',
@script=N'OutputDataSet = InputDataSet',
@input_data_1 = N'SELECT 1 AS CheckToSeeIfPythonIsWorking'
WITH RESULT SETS ((CheckToSeeIfPythonIsWorking int not null));
GO

In my next post I will cover the SQL Server internal components which are run when R or Python code is run.  Please subscribe to my blog to be notified when the next installment will be available. If you have any questions, comments or ideas for future post topics, please leave me comments as I would really enjoy any feedback.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Applying Data Science to SQL Server

Data has been getting a lot of attention in the business world for a while now. First there was big data, which was another way to store data so that it could be analyzed later. Recently the talk has been all about analyzing the data with new tools such as R and Python. The reality is that people who have been working with databases and doing business intelligence work have been analyzing data for a while. Learning a different toolset for analyzing data is not such a big leap, but rather an expansion of what they already know.

As the field is rapidly expanding and demand for these skills is huge, now is a great time to learn the tools.

Traditional Data Science Development

Data scientists have created analysis solutions with data for a number of years. The data is analyzed, cleaned, processed with various algorithms, and results are created. When the process is complete, code has been created to provide meaning from a portion of the data and is ready to be migrated to production. Traditionally there has been a big gap between creating a solution and implementing the solution to run against data on a regular basis. Data scientists traditionally are not part of the IT organization; they are actuaries or analysts, not the people who have anything to do with system processing. Recently I did some work for a company where, after the data scientists were done creating a solution, they turned over all of their code to the Java team, and six weeks later the code was released into production. This arrangement made no one happy. Management thought it took too long, the data scientists didn't believe the code they created was what was implemented into production, and the Java developers were tired of being blamed for wrong code which took a long time to implement.

SQL Server Implementation of Data Science

Since SQL Server 2016 incorporates R and SQL Server 2017 adds the ability to include Python code, data science solutions can be incorporated into a scheduled process within SQL Server. There is now a DevOps path for incorporating R and Python into SQL Server. One way of learning about the technology is through blogs and other online training, which can help you get up to speed, though many times there is no substitute for hands-on learning. If you are attending PASS Summit 2017 and want to learn not only about data science but also how to incorporate it into SQL Server, I hope you can sign up for my all-day training session on Applied Data Science for the SQL Server Professional. I hope to see you there.

I have recently created a YouTube channel where I plan on sharing more data-related content, and I have included my first video there about this conference.

If you are at PASS Summit, please introduce yourself as I would love to meet people who read my blog personally.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur
