Using whoosh with web2py

This tutorial is about creating a search engine application using whoosh and web2py.

To create a search engine, we first need some documents. For this tutorial I crawled some sample documents from the Reuters archive. The code below covers crawling the pages and parsing the HTML to extract the content we want to index. It helps to analyze the structure of the HTML files up front so that the important information can be extracted efficiently.

Let's start with the first phase of the search engine:

Crawling:

First, we will create a browser class which imitates a web browser (user agent) and requests the pages we want to crawl. For this we will use the urllib2 Python library. Our browser class identifies itself as Mozilla Firefox.

The methods of the browser class are described below:

The requests sent by our crawler will therefore look like requests from a Firefox browser. The constructor of the browser class takes care of this: it sets up a cookie jar and adds a Firefox user-agent description to the headers, which are sent along with every HTTP request.
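A minimal sketch of such a constructor, assuming Python 2 with urllib2 and cookielib (the class name and the exact user-agent string are placeholders of mine, not necessarily what the tutorial code uses):

    import urllib2
    import cookielib

    class Browser(object):
        """Imitates a Firefox browser when requesting pages."""

        def __init__(self):
            # A cookie jar so the server can treat us like a normal browser session
            self.cookie_jar = cookielib.CookieJar()
            self.opener = urllib2.build_opener(
                urllib2.HTTPCookieProcessor(self.cookie_jar))
            # User-agent description identifying us as Mozilla Firefox;
            # these headers are sent along with every HTTP request
            self.opener.addheaders = [
                ('User-Agent',
                 'Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0'),
                ('Accept-Encoding', 'gzip, deflate'),
            ]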

Next we have the get_html function, which is called with a URL to be crawled. It requests the URL as a browser would and stores the response. Sometimes the server sends the page content in a compressed format, so it is helpful to check for this and decompress; zlib is used for decompression.

We handle some of the exceptions that commonly occur while crawling:

  • URLError: the handlers raise this exception (or exceptions derived from it) when they run into a problem, such as an I/O error.
  • Socket timeout: raised when the request takes longer than the timeout we set.
  • HTTPError: useful for handling exotic HTTP errors, such as requests for authentication.

The get_html function either returns the HTML content or a dictionary containing error information.
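A sketch of how get_html could look as a method of the Browser class above; the error dictionary keys and the zlib auto-detection trick are assumptions on my part:

    import socket
    import zlib
    import urllib2

    class Browser(object):
        # ... constructor as shown earlier ...

        def get_html(self, url, timeout=10):
            """Fetch a URL and return its HTML, or a dict describing the error."""
            try:
                response = self.opener.open(url, timeout=timeout)
                html = response.read()
                # Some servers send the body compressed; 32 + MAX_WBITS lets
                # zlib auto-detect the gzip/zlib header and decompress it
                if response.info().get('Content-Encoding') in ('gzip', 'deflate'):
                    html = zlib.decompress(html, zlib.MAX_WBITS | 32)
                return html
            except urllib2.HTTPError as e:
                # Exotic HTTP errors, e.g. a request for authentication (401)
                return {'error': 'HTTPError', 'code': e.code, 'url': url}
            except urllib2.URLError as e:
                # Raised by the handlers when they run into a problem (I/O error etc.)
                return {'error': 'URLError', 'reason': str(e.reason), 'url': url}
            except socket.timeout:
                return {'error': 'Timeout', 'url': url}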

The URLs to crawl are listed in the links.txt file.

Next, we have the crawl function, which creates a browser instance, reads links.txt line by line, crawls each URL and stores the response as an HTML file. If an error occurs for a URL, it is logged to a log file (error.txt in our case).
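Putting the pieces together, crawl might look roughly like this; links.txt and error.txt are the files mentioned above, while the output directory and the file-naming scheme are assumptions:

    import os

    def crawl(links_file='links.txt', out_dir='pages', log_file='error.txt'):
        """Read URLs line by line, fetch each one and save it as an HTML file."""
        browser = Browser()
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        with open(links_file) as links, open(log_file, 'w') as log:
            for i, url in enumerate(links):
                url = url.strip()
                if not url:
                    continue
                result = browser.get_html(url)
                if isinstance(result, dict):
                    # get_html returned error information; log it and move on
                    log.write('%s\t%s\n' % (url, result))
                else:
                    with open(os.path.join(out_dir, 'doc_%d.html' % i), 'w') as page:
                        page.write(result)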

HTML Parsing:

The functions above take care of crawling the documents. Now comes the parsing part.

For that, let's create a class StoryExtractor, which is instantiated with a document and provides methods to parse the DOM structure of the HTML. We are going to use the parser library BeautifulSoup. A BeautifulSoup instance builds a parse tree for the HTML document, with the markup tags as nodes. Using BeautifulSoup, the desired content can be found by searching for a tag or its attributes (class, id, etc.). So we need to do a quick analysis of the HTML documents and note down the important tags and attributes.
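For illustration, here is roughly how those lookups work with BeautifulSoup 4 (the tag names and attribute values below are made up; the real ones come from inspecting the Reuters pages):

    from bs4 import BeautifulSoup

    html = ('<html><body><span class="article-text">'
            '<p>First paragraph.</p><p>Second paragraph.</p>'
            '</span></body></html>')

    soup = BeautifulSoup(html, 'html.parser')
    # Search by tag name plus an attribute such as class or id
    span = soup.find('span', attrs={'class': 'article-text'})
    paragraphs = span.find_all('p')
    print [p.get_text() for p in paragraphs]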

The methods of the StoryExtractor class are described below:

The constructor of the class takes the HTML content and creates a BeautifulSoup instance from it.

While analyzing the HTML document structure, we found that the article content lives inside paragraph tags. So the get_story_content function finds all the paragraph tags as a list, then traverses the list and appends the text of each one to form the main content of the document. Note that we only cut out the span where the content resides, by specifying the attribute and its value.

The other functions, get_title and get_url, are implemented similarly.
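A condensed sketch of StoryExtractor along those lines; the span class, the title lookup and the canonical-link trick for the URL are my assumptions, not necessarily what the Reuters pages (or the tutorial code) actually use:

    from bs4 import BeautifulSoup

    class StoryExtractor(object):
        """Pulls the title, URL and main text out of one crawled page."""

        def __init__(self, html):
            # Build the parse tree once; every getter works on this soup
            self.soup = BeautifulSoup(html, 'html.parser')

        def get_story_content(self):
            # The article body sits in paragraph tags inside a specific span;
            # 'article-text' is a placeholder for the real class name
            container = self.soup.find('span', attrs={'class': 'article-text'})
            paragraphs = container.find_all('p') if container else self.soup.find_all('p')
            return ' '.join(p.get_text(' ', strip=True) for p in paragraphs)

        def get_title(self):
            title = self.soup.find('title')
            return title.get_text(strip=True) if title else ''

        def get_url(self):
            # Many news pages expose their address in a <link rel="canonical"> tag
            link = self.soup.find('link', attrs={'rel': 'canonical'})
            return link.get('href', '') if link else ''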

Finally, the extract function loops through all the crawled stories and pulls out the desired content using the StoryExtractor. The extracted title, content and URL are stored in a dictionary, and the dictionary is serialized to a file using the Python pickle library (i.e. the dictionary is stored in a file).
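A sketch of that last step, assuming the crawled pages live in the directory used by crawl above (the output file name is arbitrary):

    import os
    import pickle

    def extract(pages_dir='pages', out_file='stories.pkl'):
        """Run StoryExtractor over every crawled page and pickle the results."""
        stories = {}
        for name in os.listdir(pages_dir):
            with open(os.path.join(pages_dir, name)) as page:
                extractor = StoryExtractor(page.read())
            stories[name] = {
                'title': extractor.get_title(),
                'content': extractor.get_story_content(),
                'url': extractor.get_url(),
            }
        # Serialize the dictionary to disk so the indexing step can load it later
        with open(out_file, 'wb') as out:
            pickle.dump(stories, out)
        return stories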

So, this was the first part of the tutorial. You can find the code for this part here, and the whole code for the demo search engine in my GitHub repository.

In the next part we will focus on indexing the documents using whoosh (a Python library for text indexing).