Using HTMLParser to extract links from HTML files

We can use Python 2's htmllib.HTMLParser to extract links and their corresponding text from HTML files (the htmllib module was removed in Python 3):


import urllib
from formatter import AbstractFormatter, NullWriter
from htmllib import HTMLParser


# htmllib.HTMLParser requires a formatter; this one discards all rendered output.
formatter = AbstractFormatter(NullWriter())

class LinkParser(HTMLParser):
    def __init__(self, *sub, **kw):
        HTMLParser.__init__(self, *sub, **kw)
        self.current_link = self.current_text = None

    def handle_starttag(self, tag, method, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            for attr in attrs:
                if attr[0] == 'href':
                    self.current_link = attr[1]
        return HTMLParser.handle_starttag(self, tag, method, attrs)

    def handle_endtag(self, tag, method):
        if tag == 'a':
            text = (self.current_text or '').strip()
            if self.current_link and text:
                print self.current_link, text
            self.current_link = self.current_text = None
        return HTMLParser.handle_endtag(self, tag, method)

    def handle_data(self, data):
        if self.current_link:
            # Link text may arrive in several chunks (e.g. around nested tags),
            # so accumulate it instead of overwriting the previous chunk.
            self.current_text = (self.current_text or '') + str(data)
        return HTMLParser.handle_data(self, data)

To use the code:

p = LinkParser(formatter)
p.feed(urllib.urlopen('http://www.slashdot.org').read())
p.close()
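
For readers on Python 3, where htmllib is not available, the same idea can be sketched with the standard library's html.parser.HTMLParser. This is only a minimal sketch, not a drop-in port: the names LinkCollector and extract_links are made up for illustration, the links are collected into a list instead of being printed, and the input is a literal string rather than a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has both.

    Hypothetical helper for illustration; not part of the original post.
    """

    def __init__(self):
        super().__init__()
        self.links = []      # collected (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> with an href
        if self._href:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a':
            # Join the chunks and collapse runs of whitespace
            text = ' '.join(''.join(self._chunks).split())
            if self._href and text:
                self.links.append((self._href, text))
            self._href = None
            self._chunks = []


def extract_links(html):
    """Return a list of (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == '__main__':
    sample = '<p><a href="http://example.com/a">First <b>link</b></a></p>'
    for href, text in extract_links(sample):
        print(href, text)
```

To run it against a live page, the HTML could be fetched with urllib.request.urlopen(url).read() and decoded to str before being passed to extract_links; the right encoding depends on the page.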
