Thursday, August 13, 2009

On sorting files... part 3, exchange format

In this post I want to talk about the content of the metadata itself and the format in which it is distributed...

Let's take a similar example of metadata exchange: del.icio.us (or delicious.com now). I have to admit that delicious is what first got me thinking about this problem of exchanging metadata for sorting files. What I liked about the delicious example was the API. Like a lot of web applications, the API is a simple web page that you call using arguments in the URL. (You can see it here.)

I propose either XML or JSON, as these formats are well understood and easy to implement, but really any sort of interchange format could be used.
That metadata needs to include:
  • Filename
  • Md5 & Sha1 Hash (optionally Sha256 / Sha512)
  • Description
  • List of Tags
  • optional website
A bit like the data del.icio.us stores about each bookmark.
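
As a rough sketch, the list above could map onto a small Python record like this. The class name, field names and defaults are assumptions made for illustration, not part of any agreed format.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class FileMetadata:
    """One metadata record; the field names are illustrative only."""
    filename: str
    hashes: dict                                # e.g. {"md5": "...", "sha1": "..."}
    description: str = ""
    tags: list = field(default_factory=list)    # a real list rather than one string
    website: str = ""                           # optional

    def to_json(self) -> str:
        # asdict() turns the dataclass into a plain dict that json can serialise.
        return json.dumps(asdict(self), indent=2)


record = FileMetadata(
    filename="executable.exe",
    hashes={"md5": "d41d8cd98f00b204e9800998ecf8427e"},
    description="Description",
    tags=["Info", "executable"],
)
print(record.to_json())
```

Keeping the tags as a list instead of a comma-separated string saves every consumer from having to split and trim them.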

A method of distributing all that information also needs to be created. It could be something like a URL on the distribution server (like http://server/file-metadata/name/file.exe or http://server/file-metadata/index.php?filename=file.exe) that would produce a page with the XML or JSON, something that looks like:

{
"Filename": "executable.exe",
"hashsum": {
"md5": "d41d8cd98f00b204e9800998ecf8427e",
"sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
"metadata": {
"Description": "Description",
"tags": "Info, executable",
"Website": "http://www.wwww.com"
}
}

Or like:

<?xml version="1.0" encoding='UTF-8' ?>
<File Filename="executable.exe" >
<hashsum>

<md5>d41d8cd98f00b204e9800998ecf8427e</md5>
<sha1>da39a3ee5e6b4b0d3255bfef95601890afd80709</sha1>
<sha256>e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855</sha256>

</hashsum>
<metadata>

<Description>Description</Description>
<tags>Info, executable</tags>
<Website>http://www.www.com</Website>

</metadata>
</File>
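
Either form carries the same hashes, so a client could read whichever one the server offers and check a downloaded file against it. Below is a minimal sketch using only the Python standard library. The function names are made up for this example, and it assumes the element and key names shown above. The sample hashes in the drafts happen to be those of empty input, which makes them easy to test with.

```python
import hashlib
import json
import xml.etree.ElementTree as ET


def hashes_from_json(text):
    """Return the {algorithm: hexdigest} mapping from the JSON form."""
    return dict(json.loads(text)["hashsum"])


def hashes_from_xml(text):
    """Return the same mapping from the XML form."""
    root = ET.fromstring(text)
    return {child.tag: child.text.strip() for child in root.find("hashsum")}


def matches(data, hashes):
    """True if every listed hash that hashlib supports matches the data."""
    known = {name: value for name, value in hashes.items()
             if name in hashlib.algorithms_available}
    # An empty set of usable hashes proves nothing, so treat it as a mismatch.
    return bool(known) and all(
        hashlib.new(name, data).hexdigest() == value.lower()
        for name, value in known.items()
    )


sample = '{"hashsum": {"md5": "d41d8cd98f00b204e9800998ecf8427e"}}'
print(matches(b"", hashes_from_json(sample)))
```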

Of course, the devil is in the details, and this should be considered a draft, not a definitive thing. The idea of a web API is not new; I really just wanted to give an idea of how I think things could be done.
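
The two URL styles mentioned earlier could be built on the client side roughly as follows. This is only a sketch: `http://server/file-metadata` is the placeholder from this post, not a real service, and the function names are invented for the example.

```python
from urllib.parse import quote, urlencode

BASE = "http://server/file-metadata"  # placeholder host, not a real server


def path_style_url(filename):
    """Build the .../name/<file> form, escaping unsafe characters."""
    return f"{BASE}/name/{quote(filename)}"


def query_style_url(filename):
    """Build the index.php?filename=<file> form."""
    return f"{BASE}/index.php?{urlencode({'filename': filename})}"


print(path_style_url("file.exe"))
print(query_style_url("file.exe"))
```

Escaping matters here because real download names often contain spaces or other characters that are not valid in a URL as-is.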

Update: I was specifically talking about a RESTful API. I could not remember the name when I wrote this post. Here are a couple of links on the subject:

On sorting files... part 2, the metadata

Continuing from my previous post on sorting files, here are my own findings on the metadata required.

First of all, not all metadata is created equal. There is a concept known as "metacrap": the fact that metadata is sometimes less useful than no metadata at all. Metadata is a fragile concept: if that part of the system is badly implemented, it will crash before it has even left the ground.

Thus the source of metadata is crucial, and it must be verifiable. But I want to keep that part of the system for a future post. I want first to describe the metadata itself.

Let's take an example of existing, well-known metadata: the ID3 tag of MP3 files. That container works great for MP3 files, but it has several downsides. The one relevant to this discussion is the fact that the ID3 tag is part of the file. So if, for example, you decide to change the genre of a song from "country" to "western", the file itself becomes a different file. The cryptographic hash of the file changes completely, and a previously published md5sum becomes useless. This hash is important because it guarantees that the file you get is the file you want.
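
To make the point concrete, here is a toy illustration. The byte strings are stand-ins and do not follow the real ID3 layout; they only show that editing metadata stored inside a file changes the hash of the whole file, while the payload alone still hashes the same.

```python
import hashlib

audio = b"\x00" * 1024                      # stand-in for the audio data
country = b"TAG genre=country" + audio      # "file" with an inline genre tag
western = b"TAG genre=western" + audio      # same audio, genre edited

digest_country = hashlib.md5(country).hexdigest()
digest_western = hashlib.md5(western).hexdigest()

print(digest_country)
print(digest_western)                       # differs from the line above
print(hashlib.md5(audio).hexdigest())       # the payload alone is unaffected
```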

MP3 files are not the only type of file that has inline (inside the file) metadata. Microsoft Win32 binary files also contain some metadata. That information can be seen using "Properties" on any executable file (it's in the "Details" tab), or, for developers, by using the GetFileVersionInfo function. That information, like the ID3 tag, is contained inside the file. But that metadata is more business data than actually useful 'sorting' information. And Microsoft never designed this information to be user-editable: it can only be set by the developer of the application. It was never designed as a generic container; it's used to identify binaries.

There is a way to attach additional information to a file without modifying the file's contents, with either resource forks or extended attributes, but it's not supported on all filesystems. Mac OS X supports resource forks in the HFS+ filesystem, and NTFS offers something similar with Alternate Data Streams (ADS). Extended attributes are supported on Mac OS X's HFS+, Linux's ext3 and BSD's UFS.

The only problem with storing this information in NTFS ADS, resource forks or extended attributes is that it will be lost if the file is moved to a FAT32 drive (the format used on most thumb drives). Simply downloading a file might also strip it of its resource fork or extended attributes.
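
For the curious, extended attributes can be tried from Python, with caveats: `os.setxattr` and `os.getxattr` exist only on Linux, and the filesystem may still refuse the attribute. The attribute name `user.sorting.tags` is invented for this sketch.

```python
import os
import tempfile


def try_store_tag(path, value, name="user.sorting.tags"):
    """Try to store `value` in an extended attribute; return True on success."""
    if not hasattr(os, "setxattr"):       # these calls are Linux-only in Python
        return False
    try:
        os.setxattr(path, name, value.encode("utf-8"))
        return os.getxattr(path, name).decode("utf-8") == value
    except OSError:                       # the filesystem refused the attribute
        return False


with tempfile.NamedTemporaryFile() as tmp:
    stored = try_store_tag(tmp.name, "Info, executable")
    print("extended attribute stored:", stored)
```

Whether this prints True or False depends on the platform and on the filesystem holding the temporary directory, which is exactly the portability problem described above.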

And thus the metadata needs to be distributed alongside the file, as either a single file or a bundle of information about several files.
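
One possible shape for such a bundle is a single JSON document keyed by hash, so that a file is still recognised after it has been renamed. The layout and the `lookup` helper below are assumptions for illustration only.

```python
import hashlib
import json

# A bundle describing several files, keyed by SHA-1 of the file contents.
# The single entry uses the SHA-1 of empty input so the sketch is easy to run.
bundle = json.loads("""
{
  "da39a3ee5e6b4b0d3255bfef95601890afd80709": {
    "Filename": "executable.exe",
    "tags": ["Info", "executable"]
  }
}
""")


def lookup(data, bundle):
    """Return the bundle entry for these bytes, or None if the file is unknown."""
    return bundle.get(hashlib.sha1(data).hexdigest())


print(lookup(b"", bundle))
```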

Next post will be about the metadata exchange format.

Wednesday, August 05, 2009

On sorting files... part 1

For months now, I've been looking for a way to automatically sort my files. There is a lot of software for creating and maintaining some sort of media library for pictures, movies or music, but there is almost nothing for "the rest". On my home computer, "the rest" is composed of 2600 files that occupy a grand total of 45 GB (although 23 GB of that is ISO files of Linux and BSD). Almost all of these files were downloaded from a website somewhere: applications from sourceforge.net, game add-ons from The Elder Scrolls Nexus, Fallout 3 Nexus or Simtropolis, or drivers from various hardware manufacturers.

Right now, I manage these files mostly by hand. Everything goes into a "Download" folder (which I have had since my BBS days), and it's then sorted by hand into several different folders... and that folder is in a perpetual "TO DO: not completely sorted" state... (since those BBS days...)

I wish I could just simply drop these files in a folder where they would automatically get sorted...

...And I don't think it's impossible to implement. Each of the files I have specifically mentioned comes from a website where the content is already classified. Sourceforge classifies projects into various categories, and so do all the gaming communities, with mods, maps, etc. The only thing required is a method of exchanging metadata.
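
Once the metadata is available, the sorting step itself is small. Here is a minimal sketch, assuming the metadata arrives as a dictionary with a list of tags and that the first tag names the destination folder; both are assumptions, as are the function and folder names.

```python
import shutil
import tempfile
from pathlib import Path


def sort_file(path, metadata, library):
    """Move `path` into library/<first tag>/, or library/unsorted/ if untagged."""
    tags = metadata.get("tags") or ["unsorted"]
    target_dir = Path(library) / tags[0]
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(path), str(target_dir / Path(path).name)))


# Try it in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as root:
    download = Path(root) / "Download"
    download.mkdir()
    dropped = download / "mod.zip"
    dropped.write_bytes(b"")
    new_path = sort_file(dropped, {"tags": ["game add-ons", "maps"]},
                         Path(root) / "Sorted")
    relative = new_path.relative_to(root).as_posix()
    print(relative)
```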

Seems easy enough, but it's not. When you start to think about that problem you realize quickly that it's not really a technical problem. It's more a problem of getting everyone to work together with the same standard.

I have prepared a series of blog posts, one for each part of the problem: a sort of brain dump of what I came up with over the past months. A single post would create the "Wall of Text" or TL;DR (Too Long; Didn't Read). (And if I try to write something too big, well, as the Thinkgeek t-shirt says, "I never finish anyth")