Hi! So, for fun, I made couch-crawler, a search engine and crawler on top of the very excellent couchdb-lucene. I wanted to create a hackable search engine for my work intranet using modern tools. Lucene is great, but the Nutch search engine/crawler was kind of annoying to work with. I couldn’t figure out how to get it to update the search indexes without a restart of the server, which sucks. Also, I just really, really like CouchDB.
There’s no real web tier: CouchDB hosts static JavaScript/HTML/CSS files, and the UI gets built up dynamically with AJAX calls to CouchDB. It’s kind of nice to be able to cut out a whole layer of glue code.
Templating is done with mustache.js, a JavaScript templating language that does a good job of being a dumb template language, making you define your presentation logic in JavaScript, where it should be.
On the indexing side of things, there’s a crawler written in Python that pulls down HTML, parses it with BeautifulSoup, extracts useful text content to be indexed, and then follows links within the page to a specified max depth. It probably could be smarter and parallel-er, but I wanted to start with a simple design and iterate on it.
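To make the crawl loop concrete, here is a minimal sketch of the fetch, parse, extract, and follow-links cycle described above. It is not couch-crawler's actual code: it swaps BeautifulSoup for the standard library's `html.parser` so it runs with no dependencies, and the names `PageParser`, `fetch`, and `crawl` are made up for illustration. It also assumes pages are UTF-8 and returns the documents instead of saving them to CouchDB.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of one page."""

    SKIPPED = {"script", "style"}  # tags whose contents are not useful text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Default fetcher (assumption): download a page and decode it as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch):
    """Breadth-first crawl from start_url, following links up to max_depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    docs = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        docs.append({
            "url": url,
            "title": parser.title.strip(),
            "contents": " ".join(parser.text_parts),
        })
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return docs
```

Passing `fetch` as a parameter keeps the loop easy to try out against a dictionary of canned pages, and the `seen` set stops the crawler from revisiting pages that link to each other.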
The couchdb-lucene indexer indexes the title, URL and contents, and stores the first 140 characters of the contents in the index to display with search results.
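As a rough model of that split between what is searchable and what is kept for display, the sketch below separates a crawled page into indexed fields and a stored snippet. Real couchdb-lucene index functions are JavaScript living in a design document, so this Python is only an illustration of the data shape; `index_fields`, `SUMMARY_LENGTH`, and the `summary` field name are assumptions, not names from the project.

```python
SUMMARY_LENGTH = 140  # characters kept for display, as described in the post


def index_fields(doc):
    """Return (indexed, stored) fields for one crawled page.

    `doc` is assumed to be a dict with "url" and optional "title"
    and "contents" keys.
    """
    indexed = {
        "title": doc.get("title", ""),
        "url": doc["url"],
        "contents": doc.get("contents", ""),
    }
    # Only a short prefix of the contents is stored for the results page.
    stored = {"summary": indexed["contents"][:SUMMARY_LENGTH]}
    return indexed, stored
```

Storing only a short prefix keeps the index small while still giving each search result a snippet to show.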
Ch-ch-check it out and let me know what you think.
P.S. If you use Homebrew for your OS X packaging needs, I have a fork of Homebrew with a couchdb-lucene formula for easy installation.
Comments 2
Posted 28 Jan 2010 at 7:14 pm
Hi! You’ve got a cool idea with the CouchDB crawler.
Have you seen the Scrapy crawler (scrapy.org)?
It’s built on top of Twisted and it’s smart.
Hey Victor,
Thanks for the suggestion of Scrapy, I’ll definitely check it out. I created a branch for an Erlang-based crawler, but I definitely missed Python’s HTML parsing niceness. The fact that it uses Twisted (and not threads) is a big plus.
Posted 30 Jan 2010 at 1:31 am