Thursday, August 13, 2009

On sorting files... part 2, the metadata

Continuing from my previous post on sorting files, here are my own findings on the metadata required.

First of all, not all metadata is created equal. There is a concept known as "metacrap": the idea that metadata is sometimes less useful than no metadata at all. Metadata is a fragile concept: if that part of the system is badly implemented, it will crash before it has even left the ground.

Thus the source of the metadata is crucial, and it must be verifiable. But I want to keep that part of the system for a future post. First I want to describe the metadata itself.

Let's take an example of existing, well-known metadata: the ID3 tag of mp3 files. That container works great for mp3 files, but it has several downsides. The one relevant to this discussion is that the ID3 tag is part of the file. So if, for example, you decide to change the genre of a song from "country" to "western", the file itself becomes a different file. The cryptographic hash of the file changes completely, and a previously published md5sum no longer matches. This hash is important because it guarantees that the file you get is the file you want.
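To make the hash problem concrete, here is a minimal sketch in Python. It does not parse real ID3 tags: `build_file` and its "TAG genre=" header are a made-up stand-in for a file that carries its metadata inline, and the audio bytes are fake. It only shows that editing inline metadata changes the hash of the whole file even though the audio is untouched.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a hex string (what md5sum prints)."""
    return hashlib.md5(data).hexdigest()


# Fake audio payload; a real mp3 would hold encoded audio frames here.
audio = b"\x00\x01fake audio frames\x02\x03"


def build_file(genre: str) -> bytes:
    """Hypothetical inline-metadata layout: a tag header followed by the audio."""
    return b"TAG genre=" + genre.encode("utf-8") + b"\n" + audio


country = build_file("country")
western = build_file("western")

# The audio is byte-for-byte identical in both files...
same_audio = country.endswith(audio) and western.endswith(audio)
# ...but the hashes of the two files differ because the tag is part of the file.
same_hash = md5_hex(country) == md5_hex(western)

print("same audio:", same_audio)
print("same hash: ", same_hash)
```

Hashing only the audio part would stay stable, but then every tool would have to agree on where the tag ends and the audio begins, which is format-specific.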

Mp3 files are not the only type of file with inline (inside the file) metadata. Microsoft Win32 binary files also contain some metadata. That information can be seen by opening "Properties" on any executable file (it's in the "Details" tab), and developers can read it with the GetFileVersionInfo function. That information, like the ID3 tag, is contained inside the file. But that metadata is more business data than actually useful 'sorting' information. And Microsoft never designed this information to be user-editable: only the developer of the application can set it. It was never designed as a generic container; it's used to identify binaries.

There is a way to attach additional information to a file without modifying the file's contents, using either resource forks or extended attributes, but it's not supported on all filesystems. Mac OS X supports resource forks on the HFS+ filesystem, and NTFS offers something similar with Alternate Data Streams (ADS). Extended attributes are supported on Mac OS X's HFS+, Linux's ext3 and BSD's UFS.

The only problem with storing this information in NTFS ADS, resource forks or extended attributes is that it will be lost if the file is moved to a FAT32 drive (commonly used on thumb drives). Simply downloading the files might also strip them of their resource forks or extended attributes.

And thus, the metadata needs to be distributed alongside the file, as either a single file or a bundle of information about several files.
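As a rough illustration of the "alongside the file" idea, here is a small Python sketch. Everything about the layout is an assumption made for this example: the ".meta.json" suffix, the JSON fields and the helper names `write_sidecar` and `sidecar_matches` are hypothetical, not a proposed exchange format. The point is only that the metadata can change as often as you like while the hash of the file itself stays the same.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 of the file's contents, as md5sum would report it."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def sidecar_path(path: Path) -> Path:
    """Hypothetical naming rule: song.mp3 -> song.mp3.meta.json."""
    return path.with_name(path.name + ".meta.json")


def write_sidecar(path: Path, metadata: dict) -> Path:
    """Store the metadata next to the file, tied to it by the file's hash."""
    record = {"md5": file_md5(path), "metadata": metadata}
    target = sidecar_path(path)
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target


def sidecar_matches(path: Path) -> bool:
    """Check that the sidecar still describes this exact file."""
    record = json.loads(sidecar_path(path).read_text(encoding="utf-8"))
    return record["md5"] == file_md5(path)


with tempfile.TemporaryDirectory() as tmp:
    song = Path(tmp) / "song.mp3"
    song.write_bytes(b"fake audio frames")  # stand-in for real mp3 data

    hash_before = file_md5(song)
    write_sidecar(song, {"genre": "country"})
    write_sidecar(song, {"genre": "western"})  # edit the metadata only
    hash_after = file_md5(song)

    still_verifies = sidecar_matches(song)
    print("file hash unchanged:", hash_before == hash_after)
    print("sidecar still matches:", still_verifies)
```

Because the sidecar is an ordinary file, it survives a copy to FAT32 or a plain download, as long as it travels with the file it describes.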

Next post will be about the metadata exchange format.
