Archive
Use SSIS to access SAS data
I’ve spent some time looking for products that can access SAS data files (the .sas7bdat format in particular), and I’ve found two:
- SAS to Windows utility (http://www.savian.net/#Utilities). It is free.
- CozyRoc just released new SAS Adapters for SSIS (http://www.cozyroc.com/blog/new-sas%C2%AE-adapters-and-price-change). They enable SSIS to read from and write to sas7bdat (SAS format) files. The cost is USD $2,499.95 per year.
FW: Cleaning Azure Marketplace Data with PowerPivot and Predixion Insight
Jamie MacLennan has a great post talking about “Cleaning Azure Marketplace Data with PowerPivot and Predixion Insight”.
Predixion Insight – Data Mining
As I’m based in Sydney, I missed the webinar due to the time difference between the US and Australia. Luckily, I received a replay link and watched the demo. Although I’ve played with the Predixion software before, it was still exciting to watch the demo.
Basically, in addition to data mining functionality similar to the DM add-in for Excel, Predixion Insight can talk to PowerPivot. It can apply a trained mining model to a data set in PowerPivot and tag each record with the mining result. Then, using DAX to further shape the data set in PowerPivot, you can easily produce a nicely formatted predictive analytics report. Even more, with PMML support you can import SAS and SPSS models into Predixion Insight, apply them to the data set in Excel/PowerPivot, and use the validation tools built into Predixion Insight to validate the model. Also, with Excel Services in SharePoint 2010, the final report can be uploaded and shared with everyone.
Previously I had some concerns about the data volume supported by Excel as well as Internet speed, but I am much more confident after watching the demo. PowerPivot overcomes Excel’s data volume limitation. As for transferring data across the Internet, the case given by Predixion Software is that one of their clients sent a PowerPivot workbook with 10 million records to the cloud server, and it took only around 4 minutes to get the mining results back from the remote server. I believe that would satisfy most small to medium organisations.
The only concern I have now is operational BI support. Imagine I have a model in the cloud used for identifying a customer category (e.g. good/bad). If a new customer is set up in the OLTP system, it would be fantastic if the system could query the model and tag the customer. In this scenario Excel is not used at all; instead, a web service or API is required so that the OLTP systems can interact with the cloud mining services.
IBM to pay US$1.2 billion for SPSS
IBM has made another acquisition to enhance its BI platform.
Preparing for Data Mining – Row Sampling
You can run the following code in SSMS or in an Execute SQL Task in SSIS. It splits vTargetMail from AdventureWorksDW into a 20% test set and an 80% training set, and then compares the first four moments of Age to verify that the split is representative.
USE AdventureWorksDW
GO
-- Extract 20% of the data as the test set
-- 20 percent = 0.2; 30% = 0.3
SELECT *
INTO TK445_Ch09_TestSet
FROM vTargetMail
-- Seeding RAND() per row via NEWID() gives each row its own random number between 0 and 1
WHERE RAND(CHECKSUM(NEWID())%1000000000+CustomerKey) < 0.2
GO
-- The remaining 80% goes to the training set (EXCEPT removes the test rows)
SELECT *
INTO TK445_Ch09_TrainingSet
FROM vTargetMail
EXCEPT
SELECT *
FROM TK445_Ch09_TestSet
GO
-- Check if we get back 18484 cases with UNION ALL
SELECT *
FROM TK445_Ch09_TrainingSet
UNION ALL
SELECT *
FROM TK445_Ch09_TestSet
GO
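-- Optional sanity check: the two row counts should add up
-- to the 18484 cases in vTargetMail
SELECT (SELECT COUNT(*) FROM TK445_Ch09_TrainingSet) AS TrainingRows,
       (SELECT COUNT(*) FROM TK445_Ch09_TestSet) AS TestRows,
       (SELECT COUNT(*) FROM vTargetMail) AS SourceRows
GO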
-- Check the first four moments (mean, standard deviation, skewness, kurtosis)
-- on both the training and test data sets
SELECT Mean = MIN(m.mean),
[StDev] = MIN(m.[StDev]),
Skew = SUM(((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])) * MIN(m.corrfact1),
Kurt = SUM(SQUARE(SQUARE(((Age*1.0-m.mean)/m.[StDev])))) * MIN(m.corrfact2) - MIN(m.subfact)
FROM TK445_Ch09_TrainingSet v CROSS JOIN
(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2),
corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet v) AS m
UNION
SELECT Mean = MIN(m.mean),
[StDev] = MIN(m.[StDev]),
Skew = SUM(((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])*((Age*1.0-m.mean)/m.[StDev])) * MIN(m.corrfact1),
Kurt = SUM(SQUARE(SQUARE(((Age*1.0-m.mean)/m.[StDev])))) * MIN(m.corrfact2) - MIN(m.subfact)
FROM TK445_Ch09_TestSet v CROSS JOIN
(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2),
corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TestSet v) AS m
GO
-- We can see there are no significant differences between the two sets, so the split is done well
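For reference, corrfact1, corrfact2 and subfact above implement the usual bias-corrected sample skewness and excess kurtosis. A sketch of the underlying formulas, with n = COUNT(*), \bar{x} the mean and s the standard deviation:

\text{Skew} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}

\text{Kurt} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{4} - \frac{3(n-1)^{2}}{(n-2)(n-3)}

If the two result rows are close to each other, the random split has preserved the shape of the Age distribution.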