

Showing posts from 2010

Solve an error when migrating Mysql tables

When migrating MySQL tables, we may simply copy the files under /var/lib/mysql. But the mysql client may then report an error like:

mysql> select * from test;
ERROR 1017 (HY000): Can't find file: './testdb/test.frm' (errno: 13)

This is a permission issue; to fix it, you need to:

sudo chown -R mysql.mysql /var/server/mysql/testdb/
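The chown command above is all you need on a real server, but the same recursive ownership fix can be sketched in Python (assuming you run it with sufficient privileges; the user/group names are the ones from the command above):

```python
import os
import shutil

def chown_tree(path, user, group):
    """Recursively hand ownership of a copied MySQL data directory
    back to the mysql user (equivalent to `chown -R mysql.mysql`)."""
    shutil.chown(path, user=user, group=group)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            shutil.chown(os.path.join(root, name), user=user, group=group)
```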

pygapbide -- Python implementation of Gap-Bide algorithm

I have implemented a Python version of the Gap-BIDE algorithm, following the paper "Efficiently Mining Closed Subsequences with Gap Constraints". The algorithm mines a set of gap-constrained patterns from a set of sequences. The source code is released at http://code.google.com/p/pygapbide/ under the MIT License. Note that the code has since been migrated to GitHub: https://github.com/socrateslee/pygapbide .
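To illustrate what a gap constraint means (this is only a naive occurrence checker I wrote for illustration, not the Gap-BIDE mining algorithm itself): a pattern occurs in a sequence if each pair of consecutive pattern elements is separated by between min_gap and max_gap other symbols.

```python
def occurs_with_gaps(pattern, sequence, min_gap, max_gap):
    """Return True if `pattern` occurs in `sequence` with the number
    of symbols strictly between consecutive matched positions lying
    in [min_gap, max_gap]. Naive recursive search, for illustration."""
    def search(p_idx, prev_pos):
        if p_idx == len(pattern):
            return True
        if prev_pos is None:
            # first pattern element may match anywhere
            candidates = range(len(sequence))
        else:
            # next match must leave a gap of min_gap..max_gap symbols
            candidates = range(prev_pos + 1 + min_gap,
                               min(len(sequence), prev_pos + 2 + max_gap))
        for j in candidates:
            if sequence[j] == pattern[p_idx] and search(p_idx + 1, j):
                return True
        return False
    return search(0, None)
```

For example, "ab" occurs in "axb" with a gap of exactly 1, but not in "ab" if a gap of at least 1 is required.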

test youtube html5 embed video


Some ssh tricks

1. ssh login without a password
1) On localhost, run ssh-keygen
2) On localhost, run ssh-copy-id username@remotehost

2. Remove entries from known_hosts
If the remote host changes its IP address, you may get an error like the following:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
...
Add correct host key in /home/username/.ssh/known_hosts to get rid of this message.
Offending key in /home/username/.ssh/known_hosts:$line_number$
...

To solve this:
1) edit ~/.ssh/known_hosts
2) remove line $line_number$
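(These days `ssh-keygen -R remotehost` will remove the offending entry for you.) The manual edit in steps 1)-2) can also be sketched in Python, taking the 1-based line number reported in the warning:

```python
def remove_known_hosts_line(path, line_number):
    """Drop the offending 1-based line from an ssh known_hosts file,
    mirroring the manual edit described above."""
    with open(path) as f:
        lines = f.readlines()
    del lines[line_number - 1]  # line_number is 1-based in the warning
    with open(path, "w") as f:
        f.writelines(lines)
```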

Some svn tricks

1. Solve a checksum mismatch on update
When performing "svn up test.c", you may get the error:

svn: Checksum mismatch for ...

To solve the issue, back up the files mentioned below first, then perform the following actions:
1) Edit .svn/entries, find and delete the record for the file, which looks like:
^L
test.c
file
712
2010-10-24T09:39:39.000000Z
d41d8cd98f00b204e9800998ecf8427e
2008-10-21T19:51:09.435185Z
3
foo
0
^L
2) Delete the related files in .svn/text-base/ (e.g., test.c.svn-base)
3) Delete the related files in .svn/tmp/text-base/
4) Delete test.c
5) svn up test.c
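Step 1) above can be sketched in Python. This is a hedged illustration assuming the svn 1.x working-copy format, where records in .svn/entries are separated by form-feed (^L) characters and start with the file name:

```python
FORM_FEED = "\x0c"  # the ^L separator used in svn 1.x .svn/entries files

def drop_entry(entries_text, filename):
    """Remove the record for `filename` from the text of an
    .svn/entries file; records are separated by form feeds and the
    first line of each record is the file name."""
    blocks = entries_text.split(FORM_FEED)
    kept = [b for b in blocks
            if b.strip().splitlines()[:1] != [filename]]
    return FORM_FEED.join(kept)
```

Back up .svn/entries before rewriting it, exactly as the post advises.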

The "hello world" for celery

This is a simple "hello world" for the Python-based task queue celery, on Ubuntu 9.10.

Install:

sudo apt-get install rabbitmq-server
sudo easy_install celery

In a celerytest directory, create celeryconfig.py:

BROKER_HOST = "localhost"
BROKER_PORT = 5672
BROKER_USER = "guest"
BROKER_PASSWORD = "guest"
BROKER_VHOST = "/"
CELERY_RESULT_BACKEND = "amqp"
import os
import sys
sys.path.append(os.getcwd())
CELERY_IMPORTS = ("tasks", )

Create tasks.py:

from celery.decorators import task

@task
def add(x, y):
    return x + y

Start a celery worker:

celeryd

Execute the program:

python -c "from tasks import *;r=add.delay(3,5);print r.wait()"

A simple test on Kyoto Cabinet

Kyoto Cabinet is the successor of Tokyo Cabinet as a lightweight key-value database solution. Kyoto Cabinet and its Python library can be installed with the following script:

wget http://fallabs.com/kyotocabinet/kyotocabinet-1.2.9.tar.gz
tar vxzf kyotocabinet-1.2.9.tar.gz
cd kyotocabinet-1.2.9/
./configure
make
sudo make install
cd ..
wget http://fallabs.com/kyotocabinet/pythonlegacypkg/kyotocabinet-python-legacy-1.5.tar.gz
tar vzxf kyotocabinet-python-legacy-1.5.tar.gz
cd kyotocabinet-python-legacy-1.5/
make
sudo make install
cd ..

I ran a very simple test on Kyoto Cabinet: setting and getting 1,000,000 entries in the db, compared with a Python dict. The results:

kc write time: 4.4425470829
kc read time: 1.49812507629
dict write time: 3.50502705574
dict read time: 1.01603198051
kc key/value iteration time: 2.5863609314
dict key/value iteration time: 3.59536600113
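The benchmark above can be sketched roughly as follows. This is a hedged sketch: it takes any dict-like store, so running it against Kyoto Cabinet assumes you wrap the kyotocabinet DB handle (or its set/get methods) in a mapping interface; the dict case runs as-is.

```python
import time

def bench(store, n=1_000_000):
    """Time n set operations then n get operations against a
    dict-like store, mirroring the kc-vs-dict comparison above.
    Returns (write_seconds, read_seconds)."""
    t0 = time.time()
    for i in range(n):
        store[str(i)] = str(i)
    write_t = time.time() - t0
    t0 = time.time()
    for i in range(n):
        _ = store[str(i)]
    read_t = time.time() - t0
    return write_t, read_t
```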

oauth2

The OAuth protocol provides a secure way for users to share their information with third-party web applications. Twitter is pushing its OAuth 1.0 login while Facebook has adopted OAuth 2.0. The differences between OAuth 1.0 and 2.0 are well described in this article. Mainly, OAuth 2.0 relies on an SSL connection to ensure security and simplify the authorization process, while OAuth 1.0 uses signatures to avoid the overhead of SSL. For Python users, this library can be used for both OAuth 1.0 (the Client class) and OAuth 2.0 (the Client2 class).

Solve "verify error:num=20:unable to get local issuer certificate" in openssl

Using openssl s_client to test an SSL connection, we may get the following error:

verify error:num=20:unable to get local issuer certificate

For example:

openssl s_client -connect facebook.com:443
CONNECTED(00000003)
depth=2 /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert High Assurance EV Root CA
verify error:num=20:unable to get local issuer certificate
... ...
Server certificate
-----BEGIN CERTIFICATE-----
MIIGiDCCBXCgAwIBAgIQCoLvg+TMQDau82d6KfXrwDANBgkqhkiG9w0BAQUFADBp
... (certificate and the rest of the post truncated in this excerpt)

Custom logging handler in Turbogears

Configuring a custom logging handler (such as a subclass of logging.Handler) in the config file may cause errors. Instead, we can add the custom logging handler in controllers.py. Consider a prod.cfg or dev.cfg like:

[logging]
...
[[[allinfo]]]
level='INFO'
handlers=['debug_out']
[[[access]]]
level='INFO'
qualname='turbogears.access'
handlers=['access_out']
propagate=0

We can add a FileHandler for allinfo with:

logging.getLogger().addHandler(logging.FileHandler('allinfo.log','a'))

and for access:

logging.getLogger("turbogears.access").addHandler(logging.FileHandler('access.log','a'))

In prod.cfg, we can direct the default logging output to a NullHandler or simply leave it blank.
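The two addHandler calls above can be collected into one helper (a minimal sketch using only the stdlib logging module; the file paths and the "turbogears.access" logger name are the ones from the post):

```python
import logging

def attach_file_handlers(allinfo_path="allinfo.log", access_path="access.log"):
    """Attach a FileHandler to the root logger and another to the
    turbogears.access logger, mirroring the controllers.py snippet."""
    logging.getLogger().addHandler(logging.FileHandler(allinfo_path, "a"))
    access = logging.getLogger("turbogears.access")
    access.addHandler(logging.FileHandler(access_path, "a"))
    access.propagate = False  # matches propagate=0 in the config
```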

Scribe client/proxy in Python

I have tested two Python scribe clients/proxies: python-scribe-logger and ScribeHandler. python-scribe-logger has a Writer and a Logger, where the Writer handles writing to the scribe server and the Logger provides a handler for the Python logging module. The Writer in python-scribe-logger uses a threading.RLock to serialize access to the TTransport. ScribeHandler is much simpler: it only implements a handler for the Python logging module, with no locking, which may be better for single-threaded use.
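The locked-writer design described above can be sketched like this. This is not the python-scribe-logger code, just an illustration of the idea: a logging.Handler whose emit serializes writes with an RLock; `writer` here is any object with a write(str) method, standing in for the scribe TTransport.

```python
import logging
import threading

class LockedWriterHandler(logging.Handler):
    """Sketch of a thread-safe handler: all writes to the underlying
    transport go through a single reentrant lock."""
    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self._rlock = threading.RLock()

    def emit(self, record):
        with self._rlock:  # serialize access to the transport
            self.writer.write(self.format(record) + "\n")
```

(The stdlib logging.Handler actually already provides per-handler locking via createLock/acquire/release; an explicit RLock as above is how the post describes the Writer's approach.)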

Serve file downloading in Turbogears 1.0.x

You may want to generate a .csv file in memory with StringIO for downloading; here is how to do it in TurboGears 1.0.x:

@expose(content_type='text/csv', format="csv")
def csvdownload(self, **kw):
    file = cStringIO.StringIO()
    file.write('a,b,c\n')
    cherrypy.response.headerMap["Content-Disposition"] = 'attachment;filename=stats.csv'
    return file.getvalue()
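Instead of hand-writing comma-joined lines into the buffer, the CSV body itself can be built with the stdlib csv module (a small sketch in modern Python with io.StringIO in place of cStringIO; it also handles quoting of fields containing commas):

```python
import csv
import io

def build_csv(rows):
    """Build CSV content in memory; `rows` is an iterable of
    row sequences. Returns the full CSV text to hand back as the
    response body."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue()
```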

Install Scribe on Ubuntu

Scribe is a very useful tool for collecting logs in the cloud. Here is a simple guide to installing scribe on Ubuntu (based on these articles: link1 link2 link3).

First, install libevent, boost and thrift:

sudo apt-get install libevent-dev
sudo apt-get install libboost-dev=1.38.1 flex bison libtool automake autoconf pkg-config
wget http://archive.apache.org/dist/incubator/thrift/0.4.0-incubating/thrift-0.4.0.tar.gz
cd thrift && ./bootstrap.sh && ./configure && make && sudo make install
cd contrib/fb303/
./bootstrap.sh && sudo make && sudo make install

If libboost-dev 1.38.1 is not found, version 1.40 is OK.

Second, get scribe (from http://github.com/facebook/scribe/downloads ) installed:

./bootstrap.sh
./configure
sudo make && sudo make install

During the make of scribe, you may get the following errors:
1) "configure: error: Could not link against !" -- try sudo apt-get install libboost-all-dev
2) " ... (excerpt truncated)

400 Bad Request Issue with urllib2

In Python 2.5, urllib2.urlopen may sometimes raise "HTTPError: HTTP Error 400: Bad Request" (for example, with the Google Checkout XML API). This may be caused by a bug. According to this post, we may use httplib instead. Take Google Checkout for example:

url = '/checkout/api/checkout/v2/request/Merchant/%s' % merchant_id
conn = httplib.HTTPSConnection('sandbox.google.com')
headers = {}
headers['Authorization'] = 'Basic %s' % (
    base64.b64encode(merchant_id + ':' + merchant_key))
headers['Content-type'] = 'application/xml; charset=UTF-8'
headers['Accept'] = 'application/xml; charset=UTF-8'
conn.request("POST", url, shopcart_xml_info, headers)
response = conn.getresponse()
return response.read()
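The Authorization header in the snippet above is standard HTTP Basic auth. In modern Python (bytes-aware, unlike the Python 2 snippet) it can be built like this:

```python
import base64

def basic_auth_header(user, secret):
    """Build the HTTP Basic Authorization header value: the base64
    encoding of "user:secret", prefixed with "Basic "."""
    token = base64.b64encode(f"{user}:{secret}".encode("ascii")).decode("ascii")
    return "Basic " + token
```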

Erlang nodes and /etc/hosts configuration

The basic unit of Erlang distributed programming is the Erlang node. Finding an Erlang node is a prerequisite for running a program on it, and the configuration of /etc/hosts and domain resolution may affect this. The issue came up when I tested discoproject on two computers and got a "Node failure". Let's start with the classic example from Programming Erlang (Pragmatic Bookshelf). First, we start two Erlang shells on the same machine, one named gandalf, the other named bilbo:

erl -sname gandalf
erl -sname bilbo

In node bilbo, you try "net_adm:ping(gandalf@localhost)." and are supposed to get a "pong" in response. Unfortunately, this may not happen on every computer; it is quite possible you will get a "pang" or some error message. But you may start the Erlang shells another way:

erl -sname gandalf@localhost
erl -sname bilbo@localhost

This time, you should finally get a "pong" (if you got a "pang" before). The magic lies here: if you start an Erlang ... (excerpt truncated)

NameError when easy_install tc-0.7.2

tc is a Python interface for Tokyo Cabinet, supporting the Table DB feature. I got "NameError: global name 'debian' is not defined" when running "sudo easy_install tc-0.7.2". I managed to get the library installed as follows:

1) Download and unzip the package. The download address is http://pypi.python.org/packages/source/t/tc/tc-0.7.2.tar.gz .
2) Edit setup.py, and comment out the following lines:

try:
    shell_cmd('which dpkg-buildpackage')
    self.cmdclass['debian'] = debian
except IOError:
    pass

3) Build and install tc:

python setup.py build
python setup.py install

Key/Value Store using Tokyo Cabinet

In this post, the test results show that the Tokyo Cabinet hashtable has remarkable performance, which makes it pretty convenient as a key/value store for data analysis. pytc is a Python interface to Tokyo Cabinet. pytc doesn't ship documentation, but some examples can be found here.

Aardvark paper in WWW 2010: about social search

Aardvark, which powers vark.com, published an interesting paper entitled "Anatomy of a Large-Scale Social Search Engine" at WWW 2010. The paper describes the social search engine behind vark.com, which is based on social graphs and topics instead of keywords. The paper refers to search engines like Google as the library paradigm and to Aardvark as the village paradigm. A village-paradigm search engine gets answers by asking those who are experts on the underlying topic in the social graph. In the library paradigm, the search engine needs to figure out what a user wants based on keywords, search history and user profile, which is considered a very difficult task. The village paradigm leaves the difficult part to human beings, so the only remaining problem is to find the right person. The model of Aardvark considers that a user u1 asks a question q, and the search engine should find the right user u2 to provide the answer. Aardvark associates both users and questions with topics. ... (excerpt truncated)

Tutorials on social networks mining

Social Network Mining. Jennifer Neville (Purdue University) and Foster Provost (New York University, Stern School of Business). AAAI 2008 Tutorial, Chicago, IL, USA. webpage | slides | additional resources

Modeling Social and Information Networks: Opportunities for ML. Jure Leskovec. ICML 2009. webpage | slides | video