What is the difference between Machine Learning and Data Mining?

An Example of Machine Learning: Google's Self-Driving Car

When I give a talk about machine learning, I am often asked what the difference is between data mining and machine learning, which got me thinking about how they differ. Data mining has been implemented as a tool in databases for a while. SSIS even has a data mining task to run prediction queries on an SSAS data source. Machine learning is commonly represented by Google’s self-driving car. After reading the article I linked about Google’s car, or studying the two disciplines, one can come to the understanding that they are not all that different. Both require the analysis of massive amounts of data to come to a conclusion. Google uses that information in the car to tell it to stop or go. In data mining, the software is used to identify patterns in data, which are used to classify the data into groups.

Data Mining is a subset of Machine Learning

There are four general categories of machine learning: anomaly detection, clustering, classification, and regression. To determine the results, algorithms are run against data to find the patterns that the data contains. In data mining, the range of algorithms tends to be more limited than in machine learning. In essence, all data mining is machine learning, but not all machine learning is data mining.

Goals of Machine Learning

There are some people who will argue that there is no difference between the two disciplines, as algorithms such as Naïve Bayes or decision trees are common to both, as is the process of finding the answers. While I understand the argument, I tend to disagree. Machine learning is designed to give computers the ability to learn without being explicitly programmed to do so, by extrapolating from the large amounts of data which have been fed to it to come up with results which fit that pattern. The goal of machine learning is what differentiates it from data mining, as it is designed to find meaning in the data based upon patterns identified in the process.

Deriving Meaning from the Data

As more and more data is gathered, the goal of turning data into information is being widely pursued. The tools to do this have greatly improved as well. Like Lotus 123, the tools that were initially used to create machine learning experiments bear little resemblance to the tools available today. As the science behind the study of data continues to improve, more and more people are taking advantage of new tools such as Azure Machine Learning to use data to answer all sorts of questions, from which customer is likely to leave (customer churn) to whether it is time to shut down a machine for maintenance. Whatever you choose to call it, it’s a fascinating topic, and one I plan on spending more time pursuing.

Yours Always

Ginger Grant

Data aficionado et SQL Raconteur

Creating a Date Dimension with a CTE

I am looking forward to SQL Server 2016, which includes an update to DAX which will generate dates for you, but since it hasn’t been released yet, the need still exists to create a date table. Since I am creating a Tabular Analysis Services instance, an identified date dimension is needed for time functions. There are a number of ways you can create a date dimension. Here I am going to describe a neat way of creating one using a CTE, which decreases the SQL code required to generate one.

Date Dimension Fields

There are a number of fields which are pretty standard for date tables. The fields can vary if you need to add things like Fiscal years or Month End dates which have business specific rules like last Friday of the month. The most common fields needed for a Date Dimension are listed here.

  • DateKey – Integer key representing the date, for example 20150524
  • CalendarDate – Date time field
  • CalendarYear – Four digit year
  • QuarterNumber – Number from 1 to 4
  • QuarterName – First Quarter, Second Quarter, 1 Qtr 2015, First Quarter 2015 are some of the ways the name is listed. Pick whichever format people would like to see.
  • MonthNumber – 1 for January, 2 for February, 3 for March etc.
  • NameMonth –   January, February, March etc.
  • MonthDayNumber – Day of the month. July 15 would have the value of 15 here, and July 4 would have a 4.
  • CalendarWeekofMonth – Counts the weeks within each month
  • NameDay – Monday, Tuesday etc.
  • DayNumber – Whether you choose 1 for Sunday or 1 for Monday is a business rule you need to find out first. This example shows 1 for Sunday, 2 for Monday etc.
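  • YearDay – Sometimes referred to as the Julian number, this counts the days from 1 to 365 (366 in a leap year) and can be useful in some calculations
  • YearWeek – Counts the weeks of the year starting at 1. Partial weeks at the start and end of the year mean the count can go past 52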
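
The DayNumber rule in the list above depends on a session setting, not only on a business decision. As a small sketch (the variable name is made up for illustration), SET DATEFIRST controls which weekday DATEPART(dw, ...) treats as day 1. May 24, 2015, the date used in the DateKey example, fell on a Sunday:

```sql
/* Illustrative only: @SampleDate is a hypothetical variable name */
DECLARE @SampleDate date = '20150524';   -- a Sunday

SET DATEFIRST 7;   -- Sunday is day 1 (the US English default)
SELECT DATEPART(dw, @SampleDate) AS DayNumberSundayFirst;   -- returns 1

SET DATEFIRST 1;   -- Monday is day 1
SELECT DATEPART(dw, @SampleDate) AS DayNumberMondayFirst;   -- returns 7
```

Whichever rule the business picks, setting DATEFIRST explicitly at the top of the load script keeps the DayNumber column from depending on the language of the login running it.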

As a best practice, remember that reserved words such as DATE or Index, or any other word which shows up as blue or pink in SQL Server Management Studio, should never be used as column names. Always select a name other than a reserved word when creating tables. Also, friendly names containing spaces are great for exposing to users, but they are annoying in SQL Server, so leave the spaces out of the column names. Evaluate what dates you are going to need in the table so that you don’t have to go back and redo it. I am creating a smaller table here, but that is just because it is an example. Look at the dates you will be storing in your data warehouse when determining your starting date, and set the end date about five years later than you think the data warehouse will still be in use.

Using a CTE to Generate a Date Dimension

CTEs, which Microsoft added in SQL Server 2005, are a great way to generate a date table by harnessing the power of the computer to spin through your code, decreasing the need to write a lot of code. I am using a CTE method first published by Itzik Ben-Gan to generate the number table used in the code below. Each CTE in the chain cross joins the one before it with itself, so the row count squares at every step, and no CTE actually refers to itself recursively.

CREATE TABLE [dbo].[DimDate](
[DateKey] int NULL,
[CalendarDate] [smalldatetime] NULL,
[CalendarYear] [int] NULL,
[QuarterNumber] [int] NULL,
[QuarterName] [varchar](14) NULL,
[MonthNumber] [int] NULL,
[NameMonth] [nvarchar](30) NULL,
[MonthDayNumber] [int] NULL,
[CalendarWeekOfMonth] [int] NULL,
[NameDay] [nvarchar](30) NULL,
[DayNumber] [int] NULL,
[YearDay] [int] NULL,
[YearWeek] [int] NULL
) ON [PRIMARY]

GO

/*Make sure you change the start and end dates listed here to the dates you wish to use in your table*/
DECLARE @StartDate smalldatetime = '20140101' /* yyyymmdd is read the same way under every language setting */
DECLARE @EndDate smalldatetime = '20161231' ; /* don't forget the semi-colon or you will get an error*/

/*This CTE is used to create a list of numbers used to generate the calendar*/
WITH
A00(N) AS (SELECT 1 UNION ALL SELECT 1),
A02(N) AS (SELECT 1 FROM A00 a, A00 b),
A04(N) AS (SELECT 1 FROM A02 a, A02 b),
A08(N) AS (SELECT 1 FROM A04 a, A04 b),
A16(N) AS (SELECT 1 FROM A08 a, A08 b),
A32(N) AS (SELECT 1 FROM A16 a, A16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM A32),
/*Calendar dates are created here*/
CalendarBase as (
SELECT
DateKey = n
, CalendarDate = DATEADD(day, n - 1, @StartDate )
FROM cteTally
WHERE N <= DATEDIFF(day, @StartDate , @EndDate +1)
)

/*Using the list of dates created above will populate your date table here*/
insert into dbo.DimDate(DateKey
,CalendarDate
,CalendarYear
,QuarterNumber
,QuarterName
,MonthNumber
,NameMonth
,MonthDayNumber
,CalendarWeekOfMonth
,NameDay
,DayNumber
,YearDay
,YearWeek)

SELECT
DateKey       = CONVERT(char(8), CalendarDate, 112)
, CalendarDate
, CalendarYear  = YEAR(CalendarDate)
, QuarterNumber =  (DATEPART(QUARTER,CalendarDate) )
, QuarterName = 'Quarter ' + cast((DATEPART(QUARTER,CalendarDate) ) as char(1)) +' ' + cast(YEAR(CalendarDate) as char(4))
, MonthNumber = MONTH(CalendarDate)
, NameMonth     = DATENAME(Month, CalendarDate)
, MonthDayNumber   = DATEPART(DAY, CalendarDate)
, CalendarWeekOfMonth = DATEDIFF(week, DATEADD(day,1, CalendarDate - DAY(CalendarDate) + 1) -1, CalendarDate) +1
, NameDay       = DATENAME (Weekday,CalendarDate )
, DayNumber       = DATEPART(Dw, CalendarDate)
, YearDay       = DATEPART(DAYOFYEAR, CalendarDate)
, YearWeek      = DATEPART(WEEK, CalendarDate)
FROM CalendarBase

After running this code you will have a date table created and loaded.
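
A couple of quick sanity checks can confirm the load. This is only a sketch, and it assumes the script above was run unchanged, with a start date of January 1, 2014 and an end date of December 31, 2016:

```sql
/* Assumes dbo.DimDate was loaded by the script above with the
   original start and end dates (2014-01-01 through 2016-12-31) */
SELECT COUNT(*)          AS TotalRows,   -- 365 + 365 + 366 = 1096 expected
       MIN(CalendarDate) AS FirstDate,
       MAX(CalendarDate) AS LastDate
FROM dbo.DimDate;

/* Any DateKey returned here was loaded more than once,
   for example because the insert was run twice */
SELECT DateKey, COUNT(*) AS Copies
FROM dbo.DimDate
GROUP BY DateKey
HAVING COUNT(*) > 1;
```

If you changed the dates, the expected row count is the number of days between them plus one.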

SQL Server 2016 and PolyBase

The next release of SQL Server, SQL Server 2016, is continuing with a convention which was employed in previous releases, which I call the Cadillac release system. At General Motors, new features were in the past first offered on their most luxurious brand, Cadillac, and if these features proved successful, they were rolled out to Buick and the rest of the product lines. Microsoft does the same thing. Their ‘Cadillac’ is the PDW [Parallel Data Warehouse], Microsoft’s data appliance. One notable example of this release model was the addition of column store indexes to SQL Server. Column store indexes were first available on the PDW, or APS as it is known now, and Microsoft later added column store indexes to SQL Server 2012. Later that same year, at SQL PASS Summit 2012, I heard about a really neat feature available in the PDW, PolyBase. The recording I heard is available here, where Dr. David DeWitt of Microsoft explained PolyBase in great detail. I have been waiting to hear that PolyBase was going to be released to SQL Server ever since. On May the Fourth, 2015, Microsoft announced the preview release of SQL Server 2016. Listed in the release announcement was the feature I’d been waiting for, PolyBase.

Sqoop Limitations

PolyBase provides the ability to integrate a Hadoop cluster with SQL Server, which will allow you to query the data in a Hadoop cluster from SQL Server. While the Apache environment provided the Sqoop application to integrate Hadoop with other relational databases, it wasn’t really enough. With Sqoop, the data is actually moved from the Hadoop cluster into SQL Server, or the relational database of your choice. This is problematic because you need to know before you run Sqoop that you have enough room within your database to hold all the data. I learned this the hard way when I ran out of space playing with Sqoop and SQL Server. From a performance perspective, this kind of data transfer is also, shall we say, far from optimal. Another way to look at Sqoop is that it provides the Hadoop answer to SSIS. After all, Sqoop is performing a data move, just like SSIS code. The caveat is that SSIS is generally faster than Sqoop, and provides a greater feature set too.

PolyBase – Hadoop Integration with SQL Server

Unlike Sqoop, PolyBase does not load data into SQL Server. Instead it provides SQL Server with the ability to query Hadoop while leaving the data in the HDFS clusters. Since Hadoop is schema-on-read, within SQL Server you generate the schema to apply to your data stored in Hadoop. After the table schema is known, PolyBase provides the ability to then query data outside of SQL Server from within SQL Server. Using PolyBase it is possible to integrate data from two completely different file systems, providing freedom to store the data in either place. No longer will people automatically equate retrieving data from Hadoop with MapReduce. With PolyBase all of the SQL knowledge accumulated by millions of people becomes a useful tool which provides the ability to retrieve valuable information from Hadoop with SQL. This is a very exciting development which I think will encourage more Hadoop adoption and, better yet, integration with existing data. I am really looking forward to SQL Server 2016.
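
As a rough sketch of what generating that schema can look like, here is the external table syntax as it shipped in SQL Server 2016, which may differ from the preview. Every object name, the HDFS address, the file layout, and the local dbo.Customer table are hypothetical, and the statements need PolyBase installed and configured against a reachable cluster:

```sql
/* Hypothetical names and addresses throughout */

-- Where the Hadoop cluster lives (made-up address)
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://namenode.example.com:8020');

-- How the files are laid out: comma-delimited text in this sketch
CREATE EXTERNAL FILE FORMAT CommaDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The schema applied on read; the data itself stays in HDFS
CREATE EXTERNAL TABLE dbo.WebLogExternal (
    LogDate    date,
    CustomerId int,
    PageUrl    varchar(400)
)
WITH (LOCATION = '/data/weblogs/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = CommaDelimited);

-- Query the Hadoop data with ordinary T-SQL, joined to a local table
SELECT c.CustomerName, COUNT(*) AS PageViews
FROM dbo.WebLogExternal AS w
JOIN dbo.Customer AS c
  ON c.CustomerId = w.CustomerId
GROUP BY c.CustomerName;
```

The external table holds only metadata, so space in the database is not consumed by the Hadoop data the way it is with a Sqoop import.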

When IsDate Does Not Work

IsDate is an often-used function in T-SQL for determining whether or not a field contains a date. I am sure there are a number of people who have used this successfully and may wonder what I could possibly be talking about. To maintain my technical bona fides I’ve included some code, which you can run inside of SQL Server Management Studio.

 Declare @ValidDateTest as table ( Somefield varchar(15), Works bit)
 Insert into @ValidDateTest values ('Not a date' , 0)
 Insert into @ValidDateTest values ('4-1-2-14' , 0)
 Insert into @ValidDateTest values ('5-2-7' , 0)
 Insert into @ValidDateTest values ('2014.2.3' , 1)
 Insert into @ValidDateTest values ('08/02/10' , 0)
 Insert into @ValidDateTest values ('7/3/2015' , 1)
 Insert into @ValidDateTest values ('2014-3-14' , 1)
 Insert into @ValidDateTest values ('12-3-1' , 0)
 Insert into @ValidDateTest values ('14-3-4' , 0)
 Insert into @ValidDateTest values ('20140301' , 1)
 Insert into @ValidDateTest values ('201123' , 1)
 Insert into @ValidDateTest values ('2011204' , 0)
 Insert into @ValidDateTest values ('7/023/2015' , 0)
 Insert into @ValidDateTest values ('6/02/014' , 0)
 Insert into @ValidDateTest values ('003/02/014' , 0)
 Insert into @ValidDateTest values ('3/010/2014' , 0)
 Insert into @ValidDateTest values ('4/02/012' , 0)
Select case when isdate(Somefield) = 0 then --False, not a date
 '1/1/1900'
 else Convert(datetime, Somefield, 110)
 end as ConvertedDate , Somefield
 from @ValidDateTest
 where Works = 1

It is obvious from looking at my example that some of the values inserted into the table variable @ValidDateTest are not dates. The query below the insert statements calls Convert only on the values which IsDate says are dates, and because of the where condition it only looks at the rows I have given a Works value of 1, which are the values that Convert without giving me an error. If you run the next query, you will see the difference between the values returned by IsDate and the Works values I have provided.

 Select Somefield, isdate(Somefield) as IsDateValue, Works
 from @ValidDateTest

Here are the results:

[Screenshot: results of the query above]

Now if you run the first query again and take off the where condition:

 Select case when isdate(Somefield) = 0 then --False, not a date
 '1/1/1900'
 else Convert(datetime, Somefield, 110)
 end as ConvertedDate , Somefield
 from @ValidDateTest

You’ll get this error:
Msg 241, Level 16, State 1, Line 21
Conversion failed when converting date and/or time from character string.

Casting Around for a Bad IsDate Solution

If you are fishing around for a solution, the simplicity of this will probably make you smack your desk. Now ideally, the source system should be fixed, as it really should be making sure that the dates are valid. I can’t make that happen. But I do have a solution to resolve this problem: just Cast it. Cast and Convert are on the same page in Microsoft online help, and a lot of the time you might think they are synonymous, but they don’t work the same way.

 Select case when isdate(Somefield) = 0 then --False, not a date
 '1/1/1900'
 else Cast (Somefield as datetime)
 end as ConvertedDate , Somefield
 from @ValidDateTest

This works with no errors whatsoever. Simple. Easy. If you ever run into this, I hope this saves you some time.
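
A different technique from the Cast fix above, offered only as a hedged aside: on SQL Server 2012 and later, TRY_CONVERT returns NULL instead of raising an error when a value cannot be converted, so the IsDate check is not needed. A minimal sketch, using a made-up table variable and a few of the sample values:

```sql
/* @Samples is a hypothetical table variable for illustration */
DECLARE @Samples TABLE (SomeField varchar(15));

INSERT INTO @Samples (SomeField)
VALUES ('Not a date'), ('2014-3-14'), ('20140301'), ('7/023/2015');

-- TRY_CONVERT yields NULL for anything it cannot convert,
-- and COALESCE swaps that NULL for the 1/1/1900 placeholder
SELECT SomeField,
       COALESCE(TRY_CONVERT(datetime, SomeField),
                CONVERT(datetime, '19000101')) AS ConvertedDate
FROM @Samples;
```

This keeps the query from failing on a bad row, at the cost of hiding which rows were bad unless you also select the raw TRY_CONVERT result.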
