Language Detection With N-Grams

So far when we’ve been looking at text we’ve been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.

Read More

Alternative Term Weighting

The term weighting and ranking function is at the core of any information retrieval system. The vector space model with the cosine similarity is maybe the best known and most widely used, but there are plenty of alternatives. We’re looking at two here, the BM25 function based around a probabilistic model, and a function based around language modeling.

Read More

Spelling Correction

Peter Norvig’s spelling corrector is an interesting example of using some statistical techniques for the very practical purpose of spelling correction, inspired by a conversation on the Google ‘Did You Mean’ spelling suggestion functionality. There’s an excellent explanation of the background in his article, so I’ll skim over the ideas and how you might implement them in PHP.

Read More

Simple Collaborative Filtering

Collaborative filtering is a way of trying to present more relevant information to users, by choosing what to show based on how other users have acted. The “You might also like” boxes on Amazon and similar are probably the most popular example of this kind of technology, but it’s applicable in recommending all kind of content outside of ecommerce, particular anything that often has user ratings, such as video or games.

Read More

Shingling - Near Duplicate Detection

Determining whether two documents are exactly the same is pretty easy, just use some suitably sized hash and look for a match. A document will generally only hash to the same as another though if they are identical - the smallest change, or the same content on another site with a different header and footer, for example, will cause the hash to be quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one process for relatively cheaply detecting these duplicates.

Read More

Text Classification (And Twitter)

Classification techniques are used for spam filters, author identification, intrusion detection and a host of other applications. They can be used to help organise data into a structure, or to add tags to allow users to find documents. While the latest classification algorithms are at the cutting edge of machine learning, there are still thousands of systems using simpler algorithms to great effect.

Read More

Tries And Wildcards

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren’t sure whether the title you’re trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.

Read More

Simple Search: Phrases

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even ‘natural’ searches like that, assuming that a search like php information retrieval is “look for any document containing the words php AND information AND retrieval”, but sometimes the user is searching for that specific phrase in that specific order.

Read More

Simple Search: The Vector Space Model

One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there’s no real way of saying which are ‘better’. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results by the ones that were the best match for the query. This is the idea that forms the basis for the vector space model.

Read More