Data Matters

Posts

Showing posts from 2007

The segmentation of Time Series

In 2003, Keogh, E., Chu, S., Hart, D., Pazzani and M. Segmenting had written Time Series: A Survey and Novel Approach. in Data Mining in Time Series Databases published by World Scientific Publishing Company. The article talks about how to convert time series data with high frequency noise into some linear regression segmentations. If the original time series data has N timestamps, the linear regression result has K segments(K is far smaller than N), the storage and transmission space required of the data would be minimized. The article summarized three typical method of segmenting time series: sliding window, top-down and bottom-up. Sliding window method is to consider the time series as a data stream, and segment the series from early time to later time data. Every time the method steps forward a timestamp, adds the current value to exist window, or closes the current window and establishes a new window for incoming data. The method could be used as online processing ...

Deep Web

Is deep web the web where no search engine had gone before? It may be long road for general search engines to get inside deep web, and, doing so will leave deep web no more than a part of the database of search engines. Deep webs may provide some detailed topics or hot search queries for general search engines to bring users, not just only some titles or a few key words. This may be a task of data mining on the query log of deep webs.

Ruby

Ruby is remarkably cool a programming language! The tutorial in http://pine.fm/LearnToProgram is very interesting!

The layers of web platforms

It seems that the operating systems are the first layer of flatforms for building multiple applications. But operating systems are not enough for various requirements, and building applications directly upon operating systems layer is just very difficult and hard to transplant. So, we build applications inside our internet navigators. It is the first layer of web platforms, solve a lot of cross-OS problems. Now, multiple APIs arise and become very popular, such as javascript based libraries, AJAX, Flash, Silverlight etc... Maybe we could play game like Quake 4 inside the Firefox with a 3D engine built by javascript. There comes the next layer, which is the gadget layer. The web 2.0 websites open their API for third party developers, the typical is Facebook. Applications built on this layer could easily make use of the social networks resources of the platform itself.

The Two Basic Tasks in Data Mining

We know there are so many different methods and tasks in Data Mining: Classification, Predication, Clustering, Sturctured Data Mining, Association Rule Mining, Frequent Itemset Mining, Social Networks Mining, ... ... To distinguish them would lead a boring case. But in some phylosophical view, I think there are 2 only basic tasks in Data Mining: the first one is to distinguish different items, mainly including Classification and Clustering, their task is to discover or make use of the differences among the target dataset elements; the second one is to discover the combanations, or the patterns in the target dataset, mainly including Frequent Pattern Mining(Itemset, Sequence, Graph), their task is to find the similarity of different elements. Most of Data Mining Tasks make use of both basic tasks.

The trick of counting support in structured data mining

In most of structured data mining problems, there is trick of counting pattern support in different types of problems. Take sequece mining for example, suppose we have a dataset of 2 sequences as below: (AB)(C)(AB)(BC) (BC)(A)(BC) We have 2 ways of counting the support of pattern (A)(C): 1) Count every appearances of (A)(C), and in this case, the support of (A)(C) would be 5（We call it support-all）. 2) Count once for all appearances of (A)(C) in one sequence, in this case, the support of (A)(C) would be 2(We call it support-byseq). Now, we have 2 types of supports. When dealing with practical problems, we usually name one of the supports as the threshold for frequent patterns. In fact, the other support do have some properties related with the chosen one. The BIDE algorithm setup an example, it uses support-byseq as threshold, and use the other support to form a pruning schema.

A solution for two problems in deep first search

The problems are exercises in . Exercises 22.3-7 Give a counterexample to the conjecture that if there is a path from u to v in a directed graph G, and if d[u] Exercises 22.3-8 Give a counterexample to the conjecture that if there is a path from u to v in a directed graph G, then any depth-first search must result in d[v] ≤ f[u]. There is a counterexample for both problems in the following graph.

"does not exist in the current context" error in Visual Studio 2005

In ASP.NET development, there may be error "... does not exist in the current context" and the component ... lies obviously where it is. This may be a bug of Visual Studio 2005. When creating an aspx page, the component named "Label1" may be assigned a variable named "ctll00_Content1_Label1" in CLR, when the name "Label1" is changed, the name "ctll00_Content1_Label1" may not follow, and lead a error. The solution of the problem is very simple. Just copy+X your code away, save the file, and then copy+V the code back for Visual Studio 2005 to perform refactoring of the code.

App_Code, App_Data and Namespace in Asp.net 2.0

Directory App_Code and App_Data in Asp.net 2.0 should be located on the root directory of the website. The shared soured code should be in App_Code directory, and shared data file in App_data Directory. The Asp.net 2.0 compilation environment searches the namespaces whin the App_Code directory. For example, if the namespace of Default.aspx.cs is Test.Site, Default.aspx.cs could find the classes of the same namespaces in the App_Code Directory directly. But if Default.aspx.cs is not in the namespace of Test.Site, it should contain "using Test.Site;".

C++ implementation of quick sort using std::list

An C++ implementation of quick sort using std::list. The code is as follow: #include using namespace std; typedef list ::iterator ITER; void quicksort(list * seq, ITER middle, ITER end){ ITER iter = end; ITER begin = middle; if(middle == end) { return; } else{ --iter; } ITER temp; while(iter != middle){ if(*iter temp = iter; --iter; begin = seq->insert(begin,*temp); seq->erase(temp); } else{ --iter; } } quicksort(seq,begin,middle); quicksort(seq,++middle,end); }

Pseudocode of Teiresias algorithm

//the candidate stack Stack S; //check if P is prefix less than Q //P=AC..T and Q=T.G, P is prefix less than Q. //P=AC..TA and Q=AC..T, P is prefix less than Q. bool PrefixLess(P,Q); //check if P is suffix less than Q //P=AC..T and Q=AC.GT, Q is suffix less than P. //P=C.GT and Q=AC.GT, Q is suffix less than P. bool SuffixLess(P,Q); //check if P and Q are convolutable bool isConvolutable(P,Q); //convolute P and Q Sequence Convolute(P,Q); //check from the result list, judge if P is maximal bool isMaximal(P); //add P to result list bool AddToResultList(P); Teiresias(){ P=S.top(); for Q=S.Location(P).next() to S.bottom(){ if isConvolutable(P,Q){ R=Convolute(P,Q); S.push(R); Teiresias(); } } for Q=S.Location(P).next() to S.bottom(){ if isConvolutable(Q,P){ R=Convolute(Q,P); S.push(R); Teiresias(); } } if isMaximal(P){ AddToResultList(P); } }