Archive

Archive for the ‘SSAS Data Mining’ Category

Use SSIS to access SAS data

I’ve spent some time to look for the products that can access the SAS data files (.sas7bdat format in particular) .

And I’ve found two:

  1. SAS to Windoes utility (http://www.savian.net/#Utilities ). It is free.
  2. CozyRoc just released New SAS Adapters for SSIS (http://www.cozyroc.com/blog/new-sas%C2%AE-adapters-and-price-change . It enables SSIS to be able to read and write to sas7bdat (SAS format) files. The cost is USD$2,499.95/Year.
Advertisements

FW: Cleaning Azure Marketplace Data with PowerPivot and Predixion Insight

January 26, 2011 Leave a comment

Jamie McLennan has a great post taling about “Cleaning Azure Marketplace Data with PowerPivot and Predixion Insight

Predixion Insight – Data Mining

January 22, 2011 1 comment

As I’m based in Sydney, I missed the webinar due to the time difference between US and AU. But luckily, I received a replay link and watched the demo.  Although I’ve play with Predixion software before, it is still exciting to watch over the demo.

Basically, in addition to the data mining functionality similar to the DM add-in for Excel, Predixion Insight is able to talk to PowerPivot. It is able to apply the trained mining model to the data set in PowerPivot and tag the record with the mining result. Then, using DAX to further format the dataset in PowerPivot and you can easily get a nicely analyzed predictive analytic report. Even more, with PMML support, you can even import SAS, SPSS model into Predixion Insight and then apply it to the dataset in Excel/PowerPivot as well as validation tools built into Predixion Insight to validate the model. Also, with Excel Services in SharePoint 2010, the final report can be uploaded and shared by everyone.

Before I had a bit concern about the data volume supported by Excel as well as Internet speed. Now I am much more confident after watching the demo. The PowerPivot would overcome the Excel data volume limitation. Regarding the data transferring across the Internet, the case given by Predixion software is that one of their client sent a 10 million records PowerPivot to the cloud server and it only takes around 4 minutes to get the mining results from the remote server. I believe in most cases, it would satisfy most of the small to medium organisations.

Now the only concern I have at the moment is the operational BI support. Imaging I have a model in the cloud used for identifying the customer category (e.g. good/bad). If a new customer is set up in the OLTP system, it would be fantastic if the system can query the model and tag the customer. In this case, Excel is not used at all. Instead, a web-service or APIs are required so that the OLTP systems can interact with the cloud mining services.

IBM to pay US$1.2 billion for SPSS

 
Categories: SSAS Data Mining

Preparing for Data Mining – Row Sampling

February 19, 2009 Leave a comment
The following code is what I used for my past Data Mining projects. I download some where long time ago and  now I can’t really remember where I got. If anyone happens to read the code and know where the code is originally from, leave your comment please. I’ll appreciate your help.

You can run the code in SSMS or Excute SQL Task in SSIS.

USE AdventureWorksDW
GO

–Extracting 20% of data for test data
–20 percent = 0.2; 30%=0.3
SELECT *
  INTO TK445_Ch09_TestSet
  FROM vTargetMail
 WHERE RAND(CHECKSUM(NEWID())%1000000000+CustomerKey) < 0.2
GO

— 80% training set
SELECT *
  INTO TK445_Ch09_TrainingSet
  FROM vTargetMail
EXCEPT
SELECT *
  FROM TK445_Ch09_TestSet
GO

— Check if we get back 18484 cases with UNION ALL
SELECT *
  FROM TK445_Ch09_TrainingSet
UNION ALL
SELECT *
  FROM TK445_Ch09_TestSet
GO

— Check the first four moments
— Now check them on training & test data sets
SELECT Mean = MIN(m.mean),
       [StDev] = MIN(m.[StDev]),
       Skew = SUM(((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])) * MIN(m.corrfact1),
       Kurt = SUM(SQUARE(SQUARE(((Age*1.0-m.mean)/m.[StDev])))) * MIN(m.corrfact2) – MIN(m.subfact)
  FROM TK445_Ch09_TrainingSet v CROSS JOIN
       (SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
               corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2), 
               corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
               subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
          FROM TK445_Ch09_TrainingSet v) AS m 
UNION
SELECT Mean = MIN(m.mean),
       [StDev] = MIN(m.[StDev]),
       Skew = SUM(((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])) * MIN(m.corrfact1),
       Kurt = SUM(SQUARE(SQUARE(((Age*1.0-m.mean)/m.[StDev])))) * MIN(m.corrfact2) – MIN(m.subfact)
  FROM TK445_Ch09_TestSet v CROSS JOIN
       (SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
               corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2), 
               corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
               subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
          FROM TK445_Ch09_TestSet v) AS m 
GO
— we can see there are no significant differences, so split is done well

Categories: SSAS Data Mining