Syntactic Bay Leaves

Announcing Couch Crawler, a CouchDB search engine/crawler

Hi! So, for fun, I made couch-crawler, a search engine and crawler built on top of the very excellent couchdb-lucene. I wanted a hackable search engine for my work intranet, built with modern tools. Lucene is great, but the Nutch search engine/crawler was kind of annoying to work with: I couldn’t figure out how to get it to update the search indexes without restarting the server, which sucks. Also, I just really, really like CouchDB.

There’s no real web tier: CouchDB hosts the static JavaScript/HTML/CSS files, and the UI gets built up dynamically with AJAX calls to CouchDB. It’s kind of nice to be able to cut out a whole layer of glue code.

Templating is done with mustache.js, a JavaScript templating library that does a good job of being a dumb template language: it makes you define your presentation logic in JavaScript, where it should be.

On the indexing side of things, there’s a crawler written in Python that pulls down HTML, parses it with BeautifulSoup, extracts the useful text content to be indexed, then follows links within the page up to a specified maximum depth. It could probably be smarter and parallel-er, but I wanted to start with a simple design and iterate on it.
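To make the crawl loop concrete, here is a minimal sketch of the same idea: fetch a page, pull out the title, text and links, then follow links breadth-first up to a maximum depth. It is not the actual couch-crawler code. To stay dependency-free it swaps in the standard library's html.parser for BeautifulSoup, and the names PageParser, crawl and the fetch callable are all made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageParser(HTMLParser):
    """Collects the title, visible text and outgoing links of one page."""

    SKIP = {"script", "style"}  # tags whose contents are not useful text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self.links = []
        self._stack = []  # currently open <title>/<script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or tag == "title":
            self._stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current in self.SKIP:
            return
        if current == "title":
            self.title += data.strip()
        elif data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl. `fetch(url)` is a hypothetical callable that
    returns the page's HTML as a string, or None if it can't be fetched."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    docs = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        docs.append({
            "url": url,
            "title": parser.title,
            "contents": " ".join(parser.text),
        })
        if depth >= max_depth:
            continue  # index this page, but don't follow its links
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return docs
```

Passing fetch in as a parameter keeps the network out of the loop, so you can try it against a plain dict of pages (crawl(start, pages.get)) before pointing it at a real intranet. Each returned dict is the kind of document you would then save to CouchDB for indexing.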

The couchdb-lucene indexer indexes the title, URL and contents, and stores the first 140 characters of the contents in the index to display with search results.
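As a rough model of that split between what gets searched and what gets stored for display, here is a small Python sketch. The real index function lives in a CouchDB design document and is written in JavaScript, so this only illustrates the shape of the data. The function name index_fields and the field name summary are assumptions, not names taken from couch-crawler.

```python
SUMMARY_LENGTH = 140  # from the post: first 140 characters kept for display


def index_fields(doc):
    """Return (searchable, stored) field dicts for one crawled page.

    `doc` is assumed to look like {"url": ..., "title": ..., "contents": ...}.
    """
    contents = doc.get("contents", "")
    searchable = {
        "title": doc.get("title", ""),
        "url": doc["url"],
        "contents": contents,
    }
    # Only a short snippet is kept alongside the index, so a results page
    # can show a preview without loading the full document.
    stored = {"summary": contents[:SUMMARY_LENGTH]}
    return searchable, stored
```

Storing only the snippet keeps the index small while still letting the results page render something useful for each hit.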

Ch-ch-check it out and let me know what you think.

P.S. If you use Homebrew for your OS X packaging needs, I have a fork of Homebrew with a couchdb-lucene formula for easy installation.

Comments 2

Victor Mireyev wrote:
Hi!
You’ve got a cool idea with the CouchDB crawler.
Have you seen the Scrapy crawler (scrapy.org)?
It’s built on top of Twisted and it’s smart.
Posted 28 Jan 2010 at 7:14 pm

Carlo Cabanilla wrote:
Hey Victor,

Thanks for suggesting Scrapy; I’ll definitely check it out. I created a branch for an Erlang-based crawler, but I definitely missed Python’s HTML parsing niceness. The fact that it uses Twisted (and not threads) is a big plus.
Posted 30 Jan 2010 at 1:31 am

