Skip to main content

Notes on Sequential Pattern Mining (1)

1.

Here I write about some definitions about Sequential Pattern Mining.

We define the basic elements items, and S the set of all items. The subset of S, defined as event, inside a event, the items are unordered. The ordered set of event is defined as Sequence.

We define size of a Sequence the number of events, and length of a Sequence the total count of the instances of items. The sequence of length l is called l-sequence.

2.

Sequences is the basic logical elements of Sequential Pattern Mining. While implementing some mining algorithm, we need a high efficiency impletation of Sequence. So to design the class is a very important and fundamental work.

We need to design the Sequence class to meet the following requirments:

1) We could visit the a sequence by items and by event, and for a single item, we could locate it in the form "event X item Y" or in the form "item T from beginning of the sequence";

2) We need to get a sub-sequence of sequence easily and need to save memory of storing the sub-sequences;

3) We need a efficient way to find the instances of prefix in sequence, and then build the projectd-database with high efficiency.

3.The mapping between suquences is a very fundamental algorithm for mining sequential patterns. While judging the support relationships, finding the prefix, a huge amounts of such operations need to be proformed. We will consider a algorithm for the mapping here.

We build the algorithm in a layerd way. First, check all the events. Second, inside each events, check the items. We use two pointer for both layer, one sub-objects pointer, the other the parents-objects pointer. The basic algorithm is as below:

while parents-objects pointer not reach the end of the parents-sequence if map(parents-objects, sub-objects)==true sub-ojbects ++; parents-objects++; if sub-objects reach the end of the sub-sequence return true; end while;return false.

The function map, in the first layer should call the second layer for cheking the events.

4.

Here, we shall design the class for sequence.

First, we design a iterator fo sequence, it can get or set a item by "event X item Y" form, or just "the i-th item from the beginning of the sequence" form; and could move forward or backward by a item.

Second, the Event class, just using a vector in STL to store items by lexicographic order, and provide a function to judge the relationship of containment between events.

Third, the Sequence class, store a pointer to the vector composing of a series of events(this is the "real" sequence in the SDB), and two iterators, one pointing at the beginning of this Sequence, the other pointing at the end of this Sequence. When new Sequence generated, just "new" a vector, and modify the pointer to it and two iterators.

5.

The application of Sequential Pattern Mining is to mining the patterns in the opensource code examples(or something like), and sometimes the pattern could show you then common usages of the method in API.

The application is denoted in paper MAPO: Mining API Usages from Open Source Repositories(2005).

MAPO is to search the methods of API by some search engine. With the large amount of the search results, mining the code containing the wanted items.

Comments

Popular posts from this blog

A simple implementation of DTW(Dynamic Time Warping) in C#/python

DTW(Dynamic Time Warping) is a very useful tools for time series analysis. This is a very simple (but not very efficient) c# implementation of DTW, the source code is available at  https://gist.github.com/1966342  . Use the program as below: double[] x = {9,3,1,5,1,2,0,1,0,2,2,8,1,7,0,6,4,4,5}; double[] y = {1,0,5,5,0,1,0,1,0,3,3,2,8,1,0,6,4,4,5}; SimpleDTW dtw = new SimpleDTW(x,y); dtw.calculateDTW(); The python implementation is available at  https://gist.github.com/3265694  . from python-dtw import Dtw import math dtw = Dtw([1, 2, 3, 4, 6], [1, 2, 3, 5],           distance_func=lambda x, y: math.fabs(x - y)) print dtw.calculate() #calculate the distance print dtw.get_path() #calculate the mapping path

Change the default user when start a docker container

When run(start) a docker container from an image, we can specify the default user by passing -u option in command line(In https://docs.docker.com/engine/reference/run/#user ). For example docker run -i -t -u ubuntu ubuntu:latest /bin/bash We can also use the USER instruction in DOCKERFILE to do the same thing(In https://docs.docker.com/engine/reference/builder/#user), note that the option in command line will override the one in the DOCKERFILE. And there is actually another way to start a container with neither DOCKERFILE nor -u option, just by a command like: docker run -i -t ubuntu:latest /bin/bash # with ubuntu as the default user This happens when your start the container from an image committed by a container with ubuntu as the default user. Or in detail: Run a container from some basic images, create ubuntu user inside it, commit the container to CUSTOM_IMAGE:1 . Run a container from CUSTOM_IMAGE:1 with "-u ubuntu" option, and commit the container to CUSTOM...

Notes on Sequential Pattern Mining (2) -- Partial Order Pattern Mining and Contrast Mining

1. In , the authors induce TEIRESIAS algorithms to mining combinatorial patterns with gap constraints in biological sequences. The patterns TEIRESIAS mined is similiar with the common sequential patterns, but it could contain "." the wild card which is also in the alphbel of the sequences database standing for any other item available, for example pattern "A..B" is a length-4 pattern, with two arbitrary items between the first A and the last B. Patterns "AC.B", "AADB" are all said to be more specific than pattern "A..B". TEIRESIAS mining all the maximal patterns () with a support over a min threshold K. There some key points of TEIRESIAS algorithms: 1)The growth of the patterns The growth of the patterns is accomplished by convolute current pattern by a short length pattern. Pattern A and pattern B are convolutable if the last L(very small) characters of pattern A is the same as the first L characters of pattern B, then ...