Buried away in a Google patent application from 2006 entitled “DOCUMENT SCORING BASED ON DOCUMENT INCEPTION DATE” is a somewhat obscure reference to using the “entropy” of a document. “Entropy” here is not used in the sense familiar from physics, where your daughter’s room tends towards a maximum state of disorganization; instead, it is used as defined in the field of Information Theory, which applies the concept to information rather than atoms.
Wikipedia has a lengthy entry on this, but you can think of Shannon entropy as essentially measuring how much information is in a document.
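As a rough illustration of the idea, here is a minimal sketch that estimates Shannon entropy from word frequencies, using H = sum over words of p(w) * log2(1 / p(w)). Treating lowercased, whitespace-separated words as the symbols is an assumption made for this example, and the function name `word_entropy` is hypothetical; the patent application does not spell out how entropy would actually be computed.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Estimate Shannon entropy, in bits per word, from word frequencies.

    Assumption for this sketch: symbols are lowercased words split on
    whitespace. Other choices (characters, n-grams) give other numbers.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = sum of p * log2(1 / p) over the distinct words in the text,
    # where p = n / total is each word's relative frequency.
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

Under this estimate, a text that repeats a single word scores 0 bits per word, while a text that uses four distinct words equally often scores 2 bits per word. A frequency count like this ignores word order, so it is only a crude first approximation of how much information a document carries.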
If you have a 20,000-word document that simply consists of “all work and no play makes jack a dull boy” repeated 2,000 times …