Lost in the Cloud: File Format Investigations

Any software development process involves a fair amount of extraneous creation. Code is revised, documents created and destroyed, prototypes and demos constructed, all in the pursuit of a final, stable digital object. Digital games add even more to this crush of documentation with an unending multitude of art assets, proprietary file types, and a lack of internal documentation. Since most development today relies on cloud storage and backup, code repositories and all forms of digital spatio-temporal communication, just finding out where everything is stored necessitates significant technical effort and time.

The team for Prom Week, the object at the heart of my current research for the NEH (info here), made use of numerous cloud services throughout the duration of the project. Fortunately, most of the documents are stored on only two services, Dropbox and Google Docs. Unfortunately, the organization is about as structured as I would expect from a rotating development team with intense time pressures and significant distractions. The Dropbox repository proved particularly onerous for analysis. Each team member had their own individual directory, which usually duplicated some files from another major folder. Aside from duplicates, there is no real structure to the folder names or documentation. This is usually not a problem, however, as Dropbox is searchable and I’m assuming when this folder was active each person responsible for a file knew where and what it was. As an outsider to the Prom Week development process, I can usually ascertain what a document relates to, but that is definitely due to the last few months I’ve spent researching the project.

I’m going to continue the focus on Dropbox for two reasons: first, the Google Documents for the project, while interesting and post-worthy, are only 24 in number and 5 in type, and second, the points I want to make about file extensions and confusion in the cloud are easier to argue when I’m dealing with 1.8 gigabytes of haphazardly organized Dropbox data. Those nearly two gigabytes of information breakdown into 2,051 individual files spanning 4 years of creation and modification by 8 people. Now while this appears to be a rather small set of data, making sense of it and potentially using it turns out to be more difficult than I had even assumed. And I’m generally rather cynical about such things. The following post is mainly about file extensions, and is the first in a series on file formats and the cloud.

The major issue for archiving such a collection of documents isn’t necessarily about trying to figure out what they represent at the level of content. Although that does get quite hairy, the major issue I had with the documents is at a much lower level that I’ll discuss in a second. Ascertaining what a document might be about is usually available from context, like the name of the document and its related folders, and from personal development experience. I’ve worked on many software projects with multiple collaborators and so I’m generally keen to the types of documents created. To give a sense of Prom Week’s documentary complexity here is a rough outline of the types of documents I found in just the Dropbox folder:

Research Papers
- outlines
- shared notes
- major versions
- revisions,
- abstracts
- submissions
- figures
- templates
Conferences
- posters
- poster templates
- presentations
- ephemera documents and records
Assignments for undergraduate researchers
Data analysis maps
Demo and Test Programs for different game elements
Screenshots
Demo Videos
Game Files
- background images
- character art
- sound files
- video files
- application project files for a specific Integrated Development Environment (IDE) (Actionscript project files)
- application files
- structured data files
- operating system scripting files
- system configuration files
- IDE configuration files
- source code files
  - game processes
  - data processing
  - asset management
  - software objects
  - data structures
- user interface files
  - mock ups
  - test applications
  - database files
Developer Specific Folders
Secondary Creative Software Files
Web Site Resources
- web embedding files
- website icons
Backup Files

This list is assembled from document names and my personal understanding of Flash game development files. However, not all the files were easy to identify, which leads to what I consider the most pressing issue: file types.

There are many files in the folder that are obviously named but not easy to open. Essentially, you can know the context of a file (what type of file it should be) and still have no idea what program created it or how the data is organized. This leads to the three major ways I figure out how to read a file:

Examine the Context
Search the Internet
Mine the Headers

I’m going to illustrate these approaches through examples of problematic files from the Prom Week Dropbox folder, illuminating the pitfalls of each approach.

The first example is the aptly named e0000d3a.au. Now you probably know exactly what type of file this is (aren’t you smart!) but I had no flippin’ clue. So the first thing I did was examine the context and the file’s full path gave me a pretty big clue:

…/altprom/Jacob’s SFX/VOCAL/utterances/mohawk/mohawk_data/e00/d00/e0000d3a.au

Evidently this is a part of an audio file for the vocal sound effects for Prom Week. The ‘mohawk’ refers to (I think) an earlier version of the be-mohawked character present in demos of the game.

I still don’t know what program is associated with the file extension .au, so I use the second approach, I search the Internet. The first hit is a wikipedia page describing an audio format created by Sun Microsystems and popular on NeXT Workstations and early Web sites. This seems totally off, since I’m positive that no one has used a NeXT machine at UCSC for at least 15 years and possibly never (cue angry UCSC NeXT users). Now NeXT is the progenitor of Apple’s OS X, the former being purchased by Apple in 1996, and is a totally interesting topic not for this blog post. In fact, my iTunes application detected the .au as an audio file but could not run it, which is good sign it’s not a valid Sun .au file.

Looking at the Internet results again, I noticed that the second result is a FAQ answer for the open-source application Audacity. The site asks, “Why does Audacity create a folder full of .au files when I save a project?” Looking at the file in question, our friend e0000d3a.au, I see that she’s in a folder with a bunch of other .au files, there’s a big (.au)dacious party up in there. The FAQ page also mentions that there should be an .aup (Audacity Project File) associated with the .au Audacity Block Files and sure enough, there’s a .aup file in a parent directory.

Now I’ve figured out what type of file .au is referring to, and it makes sense that a student researcher on the project would use a free, open-source audio editor for an academic project. However, I still haven’t mentioned the third identification method, mine the headers, because I actually didn’t need to do that for this particular file. If I had I would have seen this:

Audacity Block File in vi editor

The file clearly states in the header information that it’s associated with Audacity, so I could have examined that first and probably saved a bit of work. Regardless, I’ll explain the process for doing dirt-cheap header analysis on UNIX-based systems. I don’t generally use a Windows PC for anything but gaming, so most of the methodology on this blog will be from the technical context of Mac OS X available tools. All OS X’s flashy graphical flourishes are underwritten by the BSD-derived Darwin operating system and it is UNIX compliant, therefore I’m using mostly common UNIX tools for my surface analysis.

A good deal of file formats, though not all, have some text-based header information at the ‘head’ of the file. You know, at the top. So if you open those files in a format-agnostic text editor and if they have encoded text, you can see what type of file it is. To obtain the screenshot of the Audacity file above I used a terminal application, basically a program that lets a user interact with the command line interface to the Darwin OS running my Mac. Every Mac has the Terminal application installed, so you can follow along if you open it. If you’re on Linux I’m assuming you are already aware of how to access the terminal. When I’m in a command line interface, I just use the command: vi path-to-file to open the file in the vi editor. vi will open pretty much anything, though if it’s not encoded as text it will be gobble-de-gook like the Audacity file above. I keep active files in a convenient place if I just want to snoop so I copy them to my desktop temporarily. Therefore, the command to look at the audio file was:

vi ~/Desktop/e0000d3a.au

Okay, so I’ve covered the types of files and common methods I use to find out about file types. In the next few blog posts I’ll elaborate on how these methods can lead to some confusion with particularly knotty files, and discuss some other issues related to file formats, like versioning and dependent applications. I’ll also try to make them less than 1,526 words in length. Bye for now.

Lost in the Cloud: File Format Investigations

RSS feeds

Archives

Friends of EIS

Gaming Discussion

Lost in the Cloud: File Format Investigations

Tags

RSS feeds

Archives

Friends of EIS

Gaming Discussion