One of the outcomes of applying an exploratory interface onto a dataset is that the values within the metadata are immediately foregrounded. This means that any mistakes, typos or incorrect values are made very clear. The effect of this is usually of surprise and occasionally confusion and denial by providers of data, that their metadata included those values in the first place. The result manifests itself in that the default ordering of strings is by character code — punctuation, then numbers and letters. This means that any values that start with punctuation, such as spaces, newlines, parentheses, quotes, apostrophes, hyphens, backticks, colons, semicolons and other very ugly characters not only come first, but fill the initial screen of the interface.
The main problem with having non-alphanumerics pushed to the top of lists is that when encoding problems are caused (anywhere in the data workflow), they might manifest themselves by spewing out characters with incorrect and low-numbered (causing incorrect punctuation) characters. This may seem weird, but it really is commonplace. By the time a data source is exposed/dumped/copied to someone like myself who is going to use that data to provide an interface, it has typically gone though many systems over a lifetime that could span decades of software changes, merging of multiple sources, modification by badly written scripts/programs. When only a small percentage of the metadata is wrongly encoded/translated by software changes, it may not be noticed, and as such “dirty” data seems to exist in almost all datasets that I’ve encountered.
While having “correct” and “clean” is of course the expectation and dream of data providers, it is just that – a dream. Even my own limited personal experience in providing faceted interfaces over a wide variety of types of data for the last 6 years has shown that data will be dirty, and some level of modification will always be necessary before it can be exposed in an exploratory interface.
One fairly wide-reaching and effective solution for the UI is alter the way that ordering of strings is handled in the database, so that instead of promoting punctuation to the top of the list, fold it in, based on the characters that follow the punctuation. The net result of this, would be that values beginning with numbers would rule the top of the list, with alphabetic strings following — much better than floating things that start with weird punctuative characters.
For our preferred backend, MySQL, the ordering of results is handled by collations. In short, a collation is a map for a given character set that gives weighting to each character in that set, and uses those weightings to sort strings. Thus, A has a weighting one higher than B, which is one higher than C, and so on. For unicode you may likely wish to use a collation that weights a spanish character ñ the same as n, so that Espana orders next to España. Likewise, in order to prevent punctuation from living at the top of results listings, we can give a null (or zero) weight to punctuation characters. An example of this is given in a presentation available on mysql.com, where a phone number table has some characters folded in, since brackets and hyphens aren’t really parts of numbers.
Using the knowledge gleaned from the above presentation, and from the web, I found a unicode table and ran through characters that were (and could in future) be promoted erroneously to the top of results listings, and included them in a script that generated my collation. This small perl script takes an array of hex unicode character codes, and null weights them. If you create this collation and then alter the columns that need to have the correct ordering to use the new collation, they will then order with numbers and alphas first, rather than punctuation.