This tutorial is about creating a search engine application using whoosh and web2py.
To create a search engine first we do require some documents. For this tutorial I have crawled some sample documents from the Reuters archive. The code below describes crawling and parsing the HTML document to extract the desired content for indexing. It is desired to analyze the structure of the HTML file for efficient extraction of important information.
Lets start the first phase for a search engine,
Crawling:
Firstly, we will create a browser class which will imitate a web browser(User Agent) and request the desired pages for crawling. For this we will use the urllib2 python library. Our browser class is configured as Mozilla Firefox.
|
class Browser: def __init__(self, newbrowser=None): """ Constructor to initializes the Cookies and add user agent information in the request header. """ def get_html(self, link): """ Get html content of the link passed as argument """ |
The methods of the browser class are described below:
|
def __init__(self, newbrowser=None): """ Constructor to initializes the Cookies and add user agent information in the request header """ # Initialize Cookies CHandler = urllib2.HTTPCookieProcessor(cookielib.CookieJar()) self.newbrowser = urllib2.build_opener(CHandler) self.newbrowser.addheaders = [ ('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0')] urllib2.install_opener(self.newbrowser) self.error_dict = {} # to be returned by get_html if any thing goes wrong |
The request which is send by our crawler will be seen as a request by a Firefox browser. The browser class constructor takes care of this part, for that we set up cookie and added user agent description in the header. This description will be sent along with the HTTP request.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
|
def get_html(self, link): """ Get html content of the link passed as argument """ print 'opening ' + link try: res = self.newbrowser.open(link) if res.headers.getheader('Content-Encoding') == 'gzip': data = zlib.decompress(res.read(), 16 + zlib.MAX_WBITS) else: data = res.read() return data except urllib2.HTTPError as e: self.error_dict['e_code'] = e.code self.error_dict['e_reason'] = e.reason print 'HTTPError in link=%s' % link return self.error_dict except urllib2.URLError: self.error_dict['e_code'] = 404 self.error_dict['e_reason'] = 'URLError' print 'UrlError in link=%s' % link return self.error_dict except socket.timeout: time.sleep(60 * 2) # wait self.error_dict['e_code'] = 408 self.error_dict['e_reason'] = 'SocketTimeOut' print 'SocketTimeout Error in link=%s' % link return self.error_dict |
Next we have the get_html function which is called with a URL(to be crawled). It requests for the URL as a browser and stores the response. Sometimes the server sends the page content in a compressed format, so its helpful to keep a check on it to decompress. Zlib is used for decompressing.
We are handling some of the commonly occurring exceptions while crawling,
- URL ERROR : The handlers raise this exception (or derived exceptions) when they run into a problem. Like I/O Error.
- SOCKET TIMEOUT ERROR: It handles timeout error.
- HTTP ERROR : This is useful when handling exotic HTTP errors, such as requests for authentication.
The get_html function either returns the HTML content or a dictionary containing error information.
I am considering some URLs for crawling which are available in links.txt file.
Next, we have the function crawl which creates a browser instance. It reads the links.txt file line by line, crawls the URLs and stores them as HTML files. If errors occur for an URL, it is logged into a log file(error.txt for our case).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
|
def crawl(): try: os.mkdir(document_dir) except OSError: print "Directory already exists !" browser = Browser() article_id = 1 print "Crawling the links ..." with open("links.txt", 'r') as links_file: for link in links_file: link = link.strip() print 'Opening link %s ...' % link html_page = browser.get_html(link) if type(html_page) == 'dict': with open('error.txt', 'w') as error_file: # to keep track of failed urls error_file.write(link) f = open(document_dir + str(article_id) + '.html', 'w') f.write(html_page) f.close() article_id += 1 |
HTML Parsing:
The above functions will take care for crawling the documents. Now the part for parsing follows.
For that let’s create a class StoryExtractor which will get instantiated with a document, and some methods to parse the DOM structure of the HTML. We are going to use a parser library BeautifulSoup to parse the documents. An instance of the beautiful soup creates a parse tree for the HTML document, with markup tags as node. Using BeautifulSoup, the desired content within the HTML tags can be found by searching for the tag or its attributes(class, id etc). So, we are required to do a quick analysis of the HTML document and make note of the important tags and attributes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
class StoryExtractor: """ Class for parsing the HTML document and extract the desired contents from the HTML source For this tutorial this class is parsing news articles selected for REUTERS' archives and its specific to it. """ def __init__(self, html): """ Takes the HTML page of story and initializes the soup for it """ self.soup = bs4.BeautifulSoup(html) def get_story_content(self): """ Function to get the content of the article. """ def get_url(self): """ Function to get the URL of the artices """ def get_title(self): """ Function to get the title of the artice """ |
The method of the StoryExtractor Class are described below:
|
def __init__(self, html): self.soup = bs4.BeautifulSoup(html) |
It’s the constructor of the class which takes the HTML content an create a BeautifulSoup instance.
While analyzing the HTML document structure, we found that the content of the articles are available inside paragraph tags. So, the function get_story_content finds all the paragraph tags in a form of list, then we traverse the list and append the text to form the main content of the document. Here we see that we have only cut out the main span where the content resides, specifying the attribute and its value.
|
def get_story_content(self): article = '' content_block = self.soup.find('span', attrs={'id':'articleText'}) story = content_block.findAll('p') if story is None: return "None" else: for paragraph in story: s = paragraph.get_text().strip() s = s.encode('ascii', 'ignore') article += ' ' + s.replace('\n', ' ') return article |
Similarly the other function get_title and get_url are implemented .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
|
def extract(): """ Function to extract the desired contents from the html source corresponding to each document and store them to a dictionary and finally store them to files using pickle """ temp_dict = dict() try: os.mkdir(extracted_dir) except OSError: print extracted_dir + ' already exists' for file in os.listdir(document_dir): # loop through each document in the html_docs directory with open(document_dir + file, 'r') as html_file: html = html_file.read() extractor = StoryExtractor(html) temp_dict['title'] = extractor.get_title() temp_dict['content'] = extractor.get_story_content() temp_dict['url'] = extractor.get_url() f = open(extracted_dir + file.split('.')[0] + '.pkl', 'w') pickle.dump(temp_dict, f) f.close() |
Finally, the extract function that loops through all the crawled story and extracts the desired content using the StoryExtractor. The extracted title, content and URL are stored in dictionary object, and the dictionary object is serialized to a file using the python pickle library(i.e. storing the dictionary into a file).
So, this was the first part of the tutorial. You can find the code for this tutorial here.
and the whole code for the demo search engine in my Github repository.
Next part we will be focusing on indexing the document using whoosh(python library for text indexing).