As I write this entry I'm keeping an eye on circa 113 million records exporting from a SQL Server database and thinking "why do I have to do this?". The answer is that I probably don't.
My objective is to use a particular analytical tool that needs the data in its own file format, so I'm extracting the fields that I think I need to produce some Rule Sets. The chances are, though, that I will need to go back to the database and get some more later as I iterate in the usual style.
There has long been a recognition that this data extraction step - when the data is already inside a database - is, for the most part, a wasted one. Hence the leading database vendors have implemented data mining algorithms inside their database servers. As with this project, though, my sense is that the majority of analytical modelling still happens outside the database (even though more data than ever sits inside one). I suspect that one of the main reasons for this is that the native user interfaces to these database server algorithms are not as mature as they could be, and that most analytical tool vendors don't typically enable us to do the mining in-database. They may have better interfaces but you are usually going to have to extract the data to use them.
There are exceptions to the rule and SPSS Clementine is a notable one. In Clementine you can model using the algorithms inside SQL Server, Oracle or IBM DB2. Moreover, when you manage or transform data - sort it, for example - that will also happen inside the database. However, Clementine is the Rolls-Royce among data mining tools.
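To make the contrast between extracting and staying in-database a little more concrete, here is a minimal sketch. It is not how Clementine or any vendor does it: it uses Python's built-in sqlite3 module as a stand-in for a real database server, and the `transactions` table, its columns and the function names are all made up for illustration. The first function pulls every row out and sums on the client; the second asks the database to do the grouping and only moves the summary.

```python
import sqlite3


def build_demo_db(n_rows=10_000):
    """Create an in-memory database with a hypothetical transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
    # Synthetic rows: 100 distinct customers, small whole-number amounts.
    rows = ((i % 100, float(i % 7)) for i in range(n_rows))
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def totals_by_extraction(conn):
    """Pull every row out of the database and aggregate on the client."""
    totals = {}
    rows_moved = 0
    query = "SELECT customer_id, amount FROM transactions"
    for customer_id, amount in conn.execute(query):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
        rows_moved += 1
    return totals, rows_moved


def totals_in_database(conn):
    """Let the database aggregate and return only one row per customer."""
    query = (
        "SELECT customer_id, SUM(amount) FROM transactions "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    result = conn.execute(query).fetchall()
    return dict(result), len(result)


if __name__ == "__main__":
    conn = build_demo_db()
    extracted, moved_out = totals_by_extraction(conn)
    in_db, moved_summary = totals_in_database(conn)
    print("same totals:", extracted == in_db)
    print("rows moved by extraction:", moved_out)
    print("rows moved by in-database aggregation:", moved_summary)
```

Both routes give the same totals, but the second one only ever moves one row per customer across the wire. In-database mining takes the same idea further by running the modelling algorithm itself, and not just a sum, next to the data.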
This is very much at the front of my mind because we've just re-connected with Oracle after a number of years and I'm hopeful that we are going to be able to start using the Oracle Data Mining (ODM) algorithms more often in our projects. Many moons ago Oracle acquired a data mining tool called Darwin (it was so long ago that there isn't even a Wikipedia entry on it). Over the years they have integrated (or should I say "evolved" - sorry!) that technology into the Oracle database server and developed some new - and frequently innovative - algorithms into what is now ODM.
So a belated resolution for Analytical People this year is to get stuck into the database-resident algorithms more. 30 million and counting ...