
Using HTMLParser to extract links from HTML files

We can use Python's htmllib.HTMLParser (a Python 2 module) to extract each link and its corresponding text from HTML files:


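import urllib
from formatter import AbstractFormatter, NullWriter
from htmllib import HTMLParser


# htmllib.HTMLParser requires a formatter; NullWriter discards the rendered output.
formatter = AbstractFormatter(NullWriter())
class LinkParser(HTMLParser):
    def __init__(self, *sub, **kw):
        HTMLParser.__init__(self, *sub, **kw)
        self.current_link = self.current_text = None
    def handle_starttag(self, tag, method, attrs):
        # Remember the href of the <a> tag being opened.
        if tag == 'a':
            for attr in attrs:
                if attr[0]=='href':
                    self.current_link = attr[1]
        return HTMLParser.handle_starttag(self, tag, method, attrs)
    def handle_endtag(self, tag, method):
        # On </a>, print the link with its text, then reset the state.
        if tag == 'a':
            if self.current_link and self.current_text:
                print self.current_link, self.current_text
            self.current_link = self.current_text = None
        return HTMLParser.handle_endtag(self, tag, method)
    def handle_data(self, data):
        # Accumulate text inside a link, so nested tags such as <b> do not
        # overwrite the text collected so far.
        if self.current_link:
            text = str(data).strip()
            if text:
                self.current_text = ((self.current_text or '') + ' ' + text).strip()
        return HTMLParser.handle_data(self, data)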

To use the code:

p = LinkParser(formatter)
p.feed(urllib.urlopen('http://www.slashdot.org').read())
p.close()
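
The htmllib module was removed in Python 3, so the code above only runs on Python 2. The same idea can be sketched for Python 3 with the standard html.parser module. This is only a minimal sketch, not a drop-in replacement: the names LinkCollector and extract_links are made up for this example, and it collects the (link, text) pairs in a list instead of printing them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has both."""

    def __init__(self):
        super().__init__()
        self.links = []      # collected (href, text) tuples
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            if self._href and text:
                self.links.append((self._href, text))
            self._href = None
            self._chunks = []


def extract_links(html):
    """Hypothetical helper: parse an HTML string and return its links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/a">First</a> and <a href="/b">Read <b>more</b></a></p>'
    for href, text in extract_links(sample):
        print(href, text)
```

To run it against a live page, the HTML can be fetched with urllib.request.urlopen(url).read() and decoded to a string before being passed to extract_links.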
