Articles

Machine Learning with R

While there are a number of different applications designed to implement machine learning, such as Azure Machine Learning, MATLAB and Octave, a specific package is not required to perform machine learning. The algorithms used to build machine learning experiments can be applied in other languages, such as R.

Machine Learning Algorithms

Learning is often described as a method of applying rules to situations. “Don’t put your finger on the stove. The stove is hot and will burn you.” A child can extrapolate this to irons, fire and other hot things after being told about stoves. Computers process learning a little differently, by applying rules, or algorithms, to data to determine a result. A great example of this was the Kaggle competition to determine from looking at a picture whether it showed a cat or a dog. The computer reviewed a number of different pictures which were labeled as cat or dog, then applied those rules to pictures which were not labeled. The winning algorithm was right 98.914% of the time at identifying dogs and cats. Sorting pictures into groups is a classification function, one of the common functions used in machine learning. Other popular functions include anomaly detection, regression and clustering. Once experiments are created, there are a number of different methods used to determine their effectiveness, such as Receiver Operating Characteristic [ROC] graphs or a confusion matrix.
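
To make the confusion matrix idea concrete, here is a minimal R sketch; the actual and predicted vectors are made-up stand-ins for the labels a real classifier would produce, not data from the Kaggle competition.

# Hypothetical labels: what the pictures really were vs. what a model predicted
actual    <- factor(c("cat", "cat", "dog", "dog", "cat", "dog"))
predicted <- factor(c("cat", "dog", "dog", "dog", "cat", "cat"))

# A confusion matrix is simply a cross-tabulation of predicted vs. actual labels
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Overall accuracy: the correct predictions on the diagonal divided by the total
print(sum(diag(confusion)) / sum(confusion))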

Algorithm Determination

Oftentimes, determining which algorithm to use can take a while. Here is a pretty good flowchart for determining which algorithm should be used, given some examples of what the desired outcomes and data contain. The diagram lists the algorithms which are implemented in Azure ML. The same algorithms can be implemented in R, where there are libraries to help with nearly every task. Here’s a list of libraries, and their accompanying links, which can be used in machine learning. This list is by no means comprehensive, as there are libraries and functions other than the ones listed here, but if you are trying to write a machine learning experiment in R while looking at the flowchart, these R functions and libraries will provide the tools to do the types of machine learning analysis listed. A short example using the first library follows the list.

Drawing ROC Curves – ROCR

Anomaly Detection

Regression

There is a really good list of all of the R regression functions here.

Clustering

Binary Classification

Multi-Class Classification
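
As an example of working with the first library above, here is a minimal sketch of drawing an ROC curve with the ROCR package; the predictions and labels vectors are hypothetical placeholders for the scored output of a real experiment.

# install.packages("ROCR")  # uncomment if the package is not yet installed
library(ROCR)

# Hypothetical scored output: model probabilities and the true binary labels
predictions <- c(0.91, 0.22, 0.64, 0.83, 0.10, 0.45, 0.76, 0.38)
labels      <- c(1, 0, 1, 1, 0, 0, 1, 0)

# Build a prediction object, then compute true and false positive rates
pred <- prediction(predictions, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

# Plot the ROC curve and print the area under it
plot(perf, main = "ROC Curve")
auc <- performance(pred, measure = "auc")
print(auc@y.values[[1]])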

 

Applied Machine Learning

Hopefully this list of R libraries will help you apply machine learning to data within R. To see how R can be used in machine learning, please join me for my upcoming webinar on Machine Learning with R and SQL Server 2016, where I will show how an R program can be created and applied to a production environment.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Data Factory – Executing an Azure Machine Learning Web Service

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] Web Service. When calling a ML Web Service, the data must be in an Azure Blob Storage account. Once a ML model has been trained and a web service has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML model to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

Two new steps need to be added to the existing Data Factory pipeline: one to call the ML Web Service and one for the output. The ML step requires two pieces of JSON code, a linked service to make the connection to the web service and a pipeline to invoke the job and specify the inputs and the outputs. For the output, the first step requires no JSON, as a blob storage container first needs to be created in Azure to store it. The next steps involve writing JSON to create a linked service to connect to it, and lastly an output dataset needs to be defined.

Calling Machine Learning Service

The Linked Service for ML is going to need two pieces of information from the Web Service: the URL and the API key. Chances are neither of these have been committed to memory; instead, open up Azure ML, go to Web Services and copy them. For the URL, look under the API Help Page grid, where there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page, Batch Execution API Document. The URL can be found under Request URI. When copying the URL, do not include any text after the word “jobs”; leave off the rest of the URL, “?api-version=2.0”, as copying the entire URL will cause an error. Going back to the Web Services page, the API Key appears on the dashboard section of Azure ML, and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService.

{
  "name": "AzureMLLinkedService",
  "properties": {
    "description": "Connecting ML Experiment",
    "hubName": "GingerDataFactoryTest_hub",
    "type": "AzureML",
    "typeProperties": {
      "mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
      "apiKey": "**********"
    }
  }
}

We will need another piece of JSON for the Output: a dataset which takes the data from the experiment and writes it to a table via the LinkedServiceOutput linked service. The field names from the experiment are listed in its structure.

{
  "name": "OutputML",
  "properties": {
    "structure": [
      { "name": "Age", "type": "Int32" },
      { "name": "workclass", "type": "String" },
      { "name": "education-num", "type": "Int32" },
      { "name": "marital-status", "type": "String" },
      { "name": "occupation", "type": "String" },
      { "name": "relationship", "type": "String" },
      { "name": "race", "type": "String" },
      { "name": "sex", "type": "String" },
      { "name": "hours-per-week", "type": "Int32" },
      { "name": "native-country", "type": "String" },
      { "name": "Scored Labels", "type": "Int32" },
      { "name": "Scored Probabilities", "type": "Decimal" }
    ],
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "LinkedServiceOutput",
    "typeProperties": {
      "tableName": "ExperimentMLOutput"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": false,
    "policy": {}
  }
}

The API key will show the actual value until you save it, at which point it will change to the stars you see here. This Linked Service will be referenced in the next bit of JSON, which defines the pipeline.


"name": "PipelineML",
"properties": {
"description": "Use Azure ML Model",
"activities": [
{
"type": "AzureMLBatchExecution",
"typeProperties": {
"webServiceInput": "InputDataSetBlob",
"webServiceOutputs": {
"output1": "OutputDataSetBlob"
},
"globalParameters": {}
},
"inputs": [
{
"name": "InputDataSetBlob"
}
],
"outputs": [
{
"name": "OutputDataSetBlob"
}
],
"policy": {
"timeout": "02:00:00",
"concurrency": 3,
"executionPriorityOrder": "NewestFirst",
"retry": 1
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"name": "MLActivity",
"description": "Execute Experiment",
"linkedServiceName": "AzureMLLinkedService"
}
],
"start": "2016-08-19T10:30:00Z",
"end": "2016-08-20T23:30:00Z",
"isPaused": true,
"hubName": " GingerDataFactoryTest_hub ",
"pipelineMode": "Scheduled"
}
}

Lastly, another dataset needs to be created to process the output. The data will be written to a file called output.csv, in a folder called mloutput01/, which is located in the same blob storage container used earlier for the input. This file will be overwritten every time the pipeline runs.

{
  "name": "OutputDataSetBlob",
  "properties": {
    "published": false,
    "type": "AzureBlob",
    "linkedServiceName": "AzureBlobStorageLinkedService",
    "typeProperties": {
      "fileName": "output.csv",
      "folderPath": "mloutput01/",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    },
    "external": false,
    "policy": {}
  }
}

If you add this code onto the previous Data Factory code, you can take data from the database, use it to run an Azure ML experiment, and push as much data as you want through the experiment.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

I’m Not Good at Math

How many times have you heard someone say, “I’m not good at math”? Oftentimes this statement is used as a reason why something technical cannot possibly be pursued. It’s a self-inflicted limitation, used to write off entire areas of study. If you have ever said this, stop it. Don’t repeat it, even if you believe you are not good at math. Why? Because while you may not be good at math now, there is no reason why that should stop you from learning it.

Math, Music and Programming

Years ago, back in the days before PCs and, more importantly, computer science degrees offered by major universities, IBM was working on developing mainframe computers and needed people to help develop them. Since there were no computer science degrees being offered at that time, they hired people with degrees in Math and Music. Music? Why Music? Music uses the same part of the brain as math does. This is one of the reasons educators think that music should be taught to small children, as it has been shown to improve math scores. Personally, I have found it interesting to ask technical people if they play or have played an instrument. Ask around yourself and you may be surprised at the large number of people in technical fields who play or have played a musical instrument. Musicians have the brain training needed to be good technical people, regardless of their math skills.

Learning Limits

There are no limits to what you can learn, other than the limits you put on yourself. The brain is very complex, and there are infinite ways to train it to do something. Generally speaking, one is not good at math because one hasn’t learned it. Oddly enough, discouraging one’s ability to learn often starts in school. If this sounds familiar, remember life isn’t school. Oftentimes a school setting isn’t the best way to learn anything. Performance in class is not indicative of one’s ability to learn; it may reflect the ability of the instructor to teach, or your willingness to focus at the time. I am willing to bet you don’t view the world the way you did when you were sixteen, so why would you judge your ability to learn with that same filter?

Machine Learning is a Skill Which Can Be Learned

I know a very smart developer who told me recently that he wasn’t good at math, so he couldn’t possibly do machine learning. Really. PowerShell, networking, TSQL, C#, SSIS, MDX and DAX you could learn, but you can’t teach yourself machine learning? I am not going to say it is easy, but I wouldn’t say that about any of the other development and IT tasks either. If you can learn one of those, you can learn machine learning too, regardless of your belief in your math skills. There is no reason why not. I think Yoda said it best: “Do or do not. There is no try.” There is nothing really stopping you.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Incorporating Azure Stream Analytics with Azure ML – Part 1

Using the Azure Stream Analytics Query Language to Drive an ML Experiment

In the past I have talked about some of the components of Azure Machine Learning, but I thought it might make more sense to talk about creating a solution, rather than the individual components. As that will take a while, this post begins a multi-part series which brings in some real-world examples to make the concepts around streaming data and Azure Machine Learning [ML] less abstract, by starting with the data, adding several ML experiments, then talking about ways to implement the solution. The series is focused on the streaming data from a sample company, the concrete company Eohs.

Streaming Data in Azure

Eohs has installed a vehicle tracking system which sends GPS positioning and sensor data back in near real time to the dispatching company. The dispatchers are able to monitor on their screens each truck’s location, speed and heading, plus some sensor information delivered every 20 seconds, which lets them know if the truck is loading concrete, pouring concrete or adding water, along with seat belt information and whether the passenger door is opened. Eohs has some policies for their drivers which can involve termination if they are violated. Drivers are not permitted to stop the truck anywhere other than the assigned delivery location, which cuts down on fraud and helps reduce insurance costs. This data is streamed via Azure Stream Analytics [ASA].

Cortana Analytics Implementation of Azure ML

Since Eohs is streaming their data with ASA, we want to implement an Azure ML experiment to notify dispatch in real time of any violation of their policies. As I discussed in a previous blog, since Cortana Analytics includes Azure ML and Stream Analytics, using these components together is considered a Cortana Analytics implementation. We have created a machine learning experiment which will look at the GPS position of the delivery location and determine if a driver is stopped for an extraordinarily long time at a delivery location, or stopped at a non-delivery location. The dispatchers are immediately notified, so they can call the driver to figure out what is happening to the truck. What kind of data needs to be sent to the Azure ML experiment to analyze?

Sliding Windows in Azure Stream Analytics

The Azure ML experiment needs to evaluate all of the vehicle data which shows that the truck is stopped for a while, generally speaking greater than 90 seconds. After all, some traffic lights take 90 seconds to get through, so eliminating the short stops helps decrease the amount of data which needs to be evaluated. ASA uses a SQL-like query language which makes it easy to split the data so only the data that the experiment needs will be sent. We want to evaluate a window of time and return only the data where the vehicle shows it has been stopped for 91 seconds. Finding the 91-second stops calls for a sliding window. Here’s the code you would need to do this.

SELECT VehicleID,
       AVG(GPSLat),
       AVG(GPSLong),
       MIN(Speed),
       MAX(PourSensor),
       MAX(WaterSensor),
       DATEADD(second, -91, System.Timestamp) AS StartEvalTime,
       System.Timestamp AS EndEvalTime
FROM VehicleTrackingSystem TIMESTAMP BY SensorTime
GROUP BY VehicleID, SlidingWindow(second, 91)
HAVING MIN(Speed) < 1

 

EndEvalTime is the time when this event was calculated by the system. Since I wanted both the start and end evaluation times, the start time was calculated using the DATEADD function to subtract 91 seconds. If any of the data elements arrive out of order, using TIMESTAMP BY ensures that the events will be evaluated in the order they happened, instead of the order in which the data was received.

Other Windowing in Azure Stream Analytics

ASA also supports two other windowing functions, Tumbling and Hopping. In my next post I will be discussing how and when to use a Tumbling Window. If you are interested in reading the posts as they occur, please subscribe to desertislesql.com to be notified when the next post is available.

 

 

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

What is the difference between Machine Learning and Data Mining?

An Example of Machine Learning: Google’s Self-Driving Car

Oftentimes when I give a talk about machine learning, I get a question about what is data mining and what is machine learning, which got me thinking about the differences. Data mining has been implemented as a tool in databases for a while; SSIS even has a data mining task to run prediction queries against an SSAS data source. Machine learning is commonly represented by Google’s self-driving car. After reading the article I linked about Google’s car, or studying the two disciplines, one can come to the understanding that they are not all that different. Both require the analysis of massive amounts of data to come to a conclusion. Google uses that information in the car to tell it to stop or go. In data mining, the software is used to identify patterns in data, which are used to classify the data into groups.

Data Mining is a subset of Machine Learning

There are four general categorizations of machine learning: anomaly detection, clustering, classification, and regression. To determine the results, algorithms are run against data to find the patterns that the data contains. For data mining, the algorithms tend to be more limited than for machine learning. In essence, all data mining is machine learning, but not all machine learning is data mining.
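
To ground one of those categories, here is a tiny R sketch of clustering with k-means on synthetic data; the data and the choice of two clusters are arbitrary illustrations, not taken from a real mining project.

# Synthetic two-dimensional data with two built-in groups
set.seed(7)
points <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)

# Run k-means and let the algorithm find the pattern, the groups, in the data
clusters <- kmeans(points, centers = 2)

# Each point is assigned to a cluster; the centers summarize the groups found
print(clusters$centers)
print(table(clusters$cluster))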

Goals of Machine Learning

There are some people who will argue that there is no difference between the two disciplines, as the algorithms, such as Naïve Bayes or decision trees, are common to both, as is the process of finding the answers. While I understand the argument, I tend to disagree. Machine learning is designed to give computers the ability to learn without specifically being programmed to do so, by extrapolating from the large amounts of data which have been fed to it to come up with results which fit that pattern. The goal of machine learning is what differentiates it from data mining, as it is designed to find meaning from the data based upon patterns identified in the process.

Deriving Meaning from the Data

As more and more data is gathered, the goal of turning data into information is being widely pursued. The tools to do this have greatly improved as well. Like Lotus 1-2-3, the tools that were initially used to create machine learning experiments bear little resemblance to the tools available today. As the science behind the study of data continues to improve, more and more people are taking advantage of the ability of new tools such as Azure Machine Learning to use data to answer all sorts of questions, from which customer is likely to leave, aka customer churn, to whether it is time to shut down a machine for maintenance. Whatever you choose to call it, it’s a fascinating topic, and one I plan on spending more time pursuing.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

DIY Machine Learning – Supervised Learning

When I first heard about supervised learning, I pictured a kindergarten class with a teacher trying to get the small humans to read. Perhaps that isn’t a bad analogy for machine learning in general, as it is based on the same principles as school: repetition and trial. After that the analogy falls apart, though, when you get to the specific criteria needed for supervised learning. There are two broad categories of machine learning, Supervised and Unsupervised, which means you only have to know the one set of criteria for supervised learning to determine which type you need.

Training Data

A problem solved with supervised learning will have a well-defined set of variables for its sample data and a known outcome choice. Unsupervised learning has an undefined set of variables, as the task is to find structure in data where it is not apparent, nor is the type of outcome known. An example of supervised learning would be determining if email is spam or not. You have a set of emails which you can evaluate by examining a set of training data, and you can use the elements of the email, such as recipient, sender, IP, topic, number of recipients, field masking and other criteria, to determine whether or not the email should be placed in the spam folder. Supervised learning is very dependent upon the training data to determine a result. Too much training, and your experiment starts to memorize the answers rather than developing a technique to derive solutions from them, a problem known as overfitting.
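
As a rough sketch of that spam example in R, assuming a made-up data frame of email features with a known is_spam label (none of these names come from a real data set), a supervised experiment might look like this:

# Hypothetical labeled data: each row is an email, is_spam is the known outcome
set.seed(42)
emails <- data.frame(
  num_recipients = sample(1:50, 200, replace = TRUE),
  num_links      = sample(0:20, 200, replace = TRUE),
  is_spam        = sample(c(0, 1), 200, replace = TRUE)
)

# Hold out 30% of the labeled data so the model is tested on emails it never
# saw, which guards against memorizing the training answers
train_rows <- sample(nrow(emails), 0.7 * nrow(emails))
train <- emails[train_rows, ]
test  <- emails[-train_rows, ]

# Fit a logistic regression on the training set
model <- glm(is_spam ~ num_recipients + num_links, data = train, family = binomial)

# Score the held-out emails and classify anything above 0.5 as spam
probs <- predict(model, newdata = test, type = "response")
print(table(Predicted = ifelse(probs > 0.5, 1, 0), Actual = test$is_spam))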

When Supervised Learning Should be employed in a Machine Learning Experiment

As the field of data science continues to proliferate, more people are becoming interested in machine learning. Having the ability to learn with a free tool like Azure Machine Learning helps too. As with many tools, while there are many things you can do, knowing when you should do something is a big step in the right direction. While unsupervised learning provides a wide canvas for making a decision, creating a successful experiment can take more time, as there are so many concepts to explore. If you have a good set of test data and a limited amount of time to come up with an answer, the better solution is to create a supervised learning experiment. The next step in the plan is to figure out which category the problem uses, a topic I plan to explore in depth in a later post.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Upcoming and Recent Events

The PASS organization is a professional organization which sponsors a number of different events in the technical community and provides a lot of great resources for improving knowledge of all things SQL Server and related technologies. Recently, I have been honored to be selected to speak at not one but two events hosted by PASS. The PASS Business Intelligence Virtual Chapter provides training on all things related to Business Intelligence via the web, and I was selected to talk at its last meeting in May. Thank you to all of the people who were able to attend my talk on Top 10 SSIS Tuning Tricks live. If you had to work, no problem: all of the talks hosted by the PASS Business Intelligence Virtual Chapter are recorded and available on www.Youtube.com. The recording of my Top 10 SSIS Tuning Tricks session is available here.

24 Hours of PASS

Periodically PASS provides a 24-hour training session on SQL-related topics, broadcast live to every time zone in the world. As this event is watched by people around the world, it is a real honor to be selected for it. This time the speakers were selected from people who had not yet spoken at the PASS Summit convention, as the theme was Growing Our Community. The theme is just another way the PASS organization is working to improve people’s skills. Not only do they provide the opportunity to learn all things data, but they also provide professional development by growing speaking skills, offering many avenues to practice them.

Data Analytics with Azure Machine Learning

My abstract on Improving Data Analytics with Azure Machine Learning was selected for 24 Hours of PASS. As readers of my blog are aware, I have been working with Azure Machine Learning [ML] this year and look forward to discussing how to integrate Azure ML into current environments. Data analytics with ML is yet another way to derive meaning from data being collected and stored. I find the application of data analytics fascinating, and hope to show you why if you are able to attend. There are a number of wonderful talks scheduled at this event, so I encourage you to check out the schedule and attend as many as you can. To be sure, I’ll be signing up for a number of sessions as well.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Azure ML, SSIS and the Modern Data Warehouse

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I was able to speak on Azure Machine Learning first at the Arizona SQL Server Users Group meeting. I really appreciate all who attended, as we had quite a crowd. Since the meeting is held practically on Arizona State University’s Tempe campus, it was great to see a number of students attending, most likely due to Ram’s continued marketing efforts on meetup.com. After talking to him about it, I was impressed at his success at improving attendance by promoting the event on Meetup, and wonder if many SQL Server User Groups have experienced the same benefits. If you have, please let me know. Thanks, Joe, for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was the precon at SQL Saturday in Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment, including cloud-based storage, analytics, Hadoop, handling ever-increasing amounts of data from different sources, and the increasing demands of users, and reviewed technology solutions that demonstrate ways to resolve these issues in their environments.

Talking and More Importantly Listening

The following day was SQL Saturday #389 in Huntington Beach. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event, as I know a little bit about the work that goes into planning and pulling off an event like this. My sessions, Predicting the Future with Machine Learning and Top 10 SSIS Tuning Tricks, were both selected, and I had great turnout at both. To follow up on a question I received during my SSIS session: Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it for versions prior to SQL Server 2012. I’ve posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened.

Not only did I get a chance to speak, but also a chance to listen. I really enjoyed the sessions from Steve Hughes on Building a Modern Data Warehouse and Analytics Solution in Azure, Kevin Kline, and Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always, it’s wonderful to get a chance to visit in person with the people whose technical expertise I read. In addition to listening to technical jokes which people outside of the SQL community would not find humorous, it’s great to discuss technology with other practitioners. Thanks to Mr. Smith for asking a question to which I didn’t know the answer, one I now feel compelled to go find. I’ll be investigating the scalability of Azure ML and R so that I will have an answer for him the next time I see him. I really enjoy the challenge of not only investigating and applying new technology, but also figuring out how to explain what I’ve learned. I look forward to the opportunity to present again, and when I do I’ll be sure to update this site, so hopefully I get a chance to meet the people who read this.
Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

 

Complex Data Analysis and Azure Machine Learning Presentation Wrap Up

Thank you to all of the people who signed up for my webinar on Data Analysis with Azure Machine Learning [ML]. I hope after watching it that you find reasons to agree that the most important thing you need to know to get started in machine learning is not math, but good knowledge of the data you want to analyze. There’s no reason not to investigate, as Azure Machine Learning is free. In order to take more time with the questions than the webinar format allowed, I am posting my answers here, where I am able to answer them in greater detail.

How would one choose a subset of data to “train” the model? For example, would I choose a random 1000 rows from my data set?

It is important to select a subset of data which is representative of the data you wish to evaluate. Sometimes a random 1000 rows will do that; other times you will need to use other criteria, like transactions throughout a given date range, to get a more representative sample. It all comes down to knowing your data well enough to know that the data used for testing is similar to what you will ultimately be analyzing.
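
As a small illustration of both approaches, here is a hedged R sketch; my_data and its TransactionDate column are invented names for demonstration, not anything from the webinar:

# Hypothetical data set: 5,000 transactions spread over six months
set.seed(123)
my_data <- data.frame(
  TransactionDate = as.Date("2016-01-01") + sample(0:180, 5000, replace = TRUE),
  Amount          = runif(5000, 5, 500)
)

# Option 1: a random 1000 rows, reasonable when the data is fairly uniform
training_random <- my_data[sample(nrow(my_data), 1000), ]

# Option 2: a date-range slice, better when representativeness depends on time
training_dated <- subset(my_data,
                         TransactionDate >= as.Date("2016-01-01") &
                         TransactionDate <= as.Date("2016-03-31"))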

Do you have to rerun or does it save results?

An experiment does not save results, so for each run you need to re-run the data through it.

Does Azure ML use the same logic as data mining?

In a word, no. If you look at the algorithms used for data mining, you will see that they overlap with some of the models available in Azure ML. Azure ML provides a richer set of models, plus a greater ability to either call models created by others or write custom models.

How much does Azure ML cost?

There is no cost for Azure ML. You can sign up and use it for free.  Click here for more information on Azure ML.

If I am using Data Factory, can I use Azure ML ?

Data Factory added the ability to call Azure ML in December, providing another place to incorporate Azure ML analytics. When an Azure ML experiment is complete, it is published as a web service so that the experiment can be called by any program which chooses to call it. Using Azure ML experiments directly within Data Factory decreases the need to write custom code, while allowing the logic to be incorporated into routine data collection processes.

http://azure.microsoft.com/blog/2014/12/16/azure-data-factory-updates-integration-with-azure-machine-learning-2/

If you have more questions about Azure ML, or would like to see me present on the topic live in Southern California, I hope you can attend SQL Saturday #389 – Huntington Beach, where I will be presenting on Azure ML and Top 10 SSIS tips. I hope to see you there.

 

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur