Language Detection With N-Grams

November 11, 2009

So far when we’ve been looking at text we’ve been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.

Alternative Term Weighting

November 6, 2009

The term weighting and ranking function is at the core of any information retrieval system. The vector space model with the cosine similarity is maybe the best known and most widely used, but there are plenty of alternatives. We’re looking at two here, the BM25 function based around a probabilistic model, and a function based around language modeling.

Spelling Correction

November 1, 2009

Peter Norvig’s spelling corrector is an interesting example of using some statistical techniques for the very practical purpose of spelling correction, inspired by a conversation on the Google ‘Did You Mean’ spelling suggestion functionality. There’s an excellent explanation of the background in his article, so I’ll skim over the ideas and how you might implement them in PHP.

Simple Collaborative Filtering

October 6, 2009

Collaborative filtering is a way of trying to present more relevant information to users, by choosing what to show based on how other users have acted. The “You might also like” boxes on Amazon and similar are probably the most popular example of this kind of technology, but it’s applicable in recommending all kind of content outside of ecommerce, particular anything that often has user ratings, such as video or games.

Shingling - Near Duplicate Detection

October 6, 2009

Determining whether two documents are exactly the same is pretty easy, just use some suitably sized hash and look for a match. A document will generally only hash to the same as another though if they are identical - the smallest change, or the same content on another site with a different header and footer, for example, will cause the hash to be quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one process for relatively cheaply detecting these duplicates.

Text Classification (And Twitter)

September 29, 2009

Classification techniques are used for spam filters, author identification, intrusion detection and a host of other applications. They can be used to help organise data into a structure, or to add tags to allow users to find documents. While the latest classification algorithms are at the cutting edge of machine learning, there are still thousands of systems using simpler algorithms to great effect.

Tries And Wildcards

September 27, 2009

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren’t sure whether the title you’re trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.

Simple Search: Phrases

September 21, 2009

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even ‘natural’ searches like that, assuming that a search like php information retrieval is “look for any document containing the words php AND information AND retrieval”, but sometimes the user is searching for that specific phrase in that specific order.

Simple Search: The Vector Space Model

September 17, 2009

One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there’s no real way of saying which are ‘better’. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results by the ones that were the best match for the query. This is the idea that forms the basis for the vector space model.

K-Means Clustering

September 14, 2009

My friend Vincenzo recently posted up a review of academic work on clustering that he compiled while working at the University of Naples. It’s worth a look if you’re interested in the field, going from the basic methods all the way up to the latest techniques like Support Vector Clustering (which I believe you can read about in Enzo’s masters thesis).