the day the data stood still

I still remember the day I first understood how data is communicated between programs. I was 17 years old and programming Java applets about microscopes. When it first sank in, I was happy to be able to solve the task in front of me, but I had little idea how fundamental it was. I was trying to get a set of curves that had been modeled in 3ds Max into my applet so I could draw them on a graph and make them move, and I didn’t know anything about how file formats work. My mentor at the time had the 3D curves converted to 2D (that’s all we needed) and wrote them out in a file format he defined. So what was the 2D curve, and what was the file format?

The 2D curve was the information I needed to get my task done. It was nothing more than a list of pairs of numbers: the (x, y) coordinates of the curve, which I could use to draw it. The file format was nothing more than the number of pairs, followed by each x and y coordinate in order. Since I needed to draw a bunch of these curves, and they were essentially “written down” in the file, all my program had to do was read them.

As most people are aware these days, computers store their information as ones and zeros, or binary. Each 1 or 0 takes up one bit of space, 8 bits make a byte, and we describe most things in terms of bytes (kilobytes, megabytes). To read in my curves, I had to read in a series of numbers. In most cases computers deal with two kinds of numbers: integers and floating point numbers. Integers are the whole numbers, and are usually stored in 4 bytes (a Java int is always 4 bytes). Floating point numbers are approximations of real numbers (not perfect), and in Java they are stored either as floats (4 bytes) or as doubles (8 bytes). Doubles can be more accurate because they have more bits in which to store information about the number.
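To make those sizes concrete, here is a small Java sketch. The class name NumberSizes and the helper roundTripError are made up for this illustration; the constants Integer.BYTES, Float.BYTES and Double.BYTES come from the standard library (Java 8 and later).

```java
final class NumberSizes {
    /** How far a value moves when squeezed into a 4-byte float and widened back to a double. */
    static double roundTripError(double value) {
        float narrowed = (float) value;       // keeps only about 7 significant decimal digits
        return Math.abs((double) narrowed - value);
    }

    public static void main(String[] args) {
        // Sizes in bytes, fixed by the Java language on every platform.
        System.out.println("int:    " + Integer.BYTES + " bytes");
        System.out.println("float:  " + Float.BYTES + " bytes");
        System.out.println("double: " + Double.BYTES + " bytes");

        // The same decimal value stored in each floating point type.
        System.out.println((double) 0.1f);    // prints 0.10000000149011612
        System.out.println(0.1);              // prints 0.1
        System.out.println(roundTripError(0.1));
    }
}
```

Neither type stores 0.1 exactly, but the double lands much closer to it, which is why coordinates that need to be precise tend to be stored as doubles.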

So if the first number in the file is an int (integer), I need to read 4 bytes, and then I can use the value found there to see how many points I need to read. Since I wanted my x and y values to be precise, they were stored as doubles. That means if I have 10 points, I will need to read 20 coordinates (10 x’s and 10 y’s), for a total of 160 bytes after the 4-byte count. So I tell the program to read 8 bytes 20 times, each time saving the double value to a list in the program. Once I’m done reading, I can use this list to draw the curve from the coordinates.
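Here is a minimal sketch of that reading loop in Java, not the original applet code. It assumes the coordinates are interleaved (x0, y0, x1, y1, and so on) and that the numbers were written in the byte order Java’s DataInputStream expects (most significant byte first). The class name CurveFile and both method names are invented for this example; the write method is only there so the example can produce its own data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class CurveFile {
    /** Writes the point count as a 4-byte int, then every coordinate as an 8-byte double. */
    static void write(OutputStream out, double[] coords) throws IOException {
        if (coords.length % 2 != 0) {
            throw new IllegalArgumentException("need an x and a y for every point");
        }
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(coords.length / 2);     // header: number of (x, y) pairs
        for (double c : coords) {
            data.writeDouble(c);
        }
        data.flush();
    }

    /** Reads the header first, then exactly 2 * count doubles. */
    static double[] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int pointCount = data.readInt();      // 4 bytes
        if (pointCount < 0) {
            throw new IOException("negative point count: " + pointCount);
        }
        double[] coords = new double[2 * pointCount];
        for (int i = 0; i < coords.length; i++) {
            coords[i] = data.readDouble();    // 8 bytes each
        }
        return coords;
    }

    public static void main(String[] args) throws IOException {
        double[] curve = new double[20];      // 10 points, as in the text
        for (int i = 0; i < 10; i++) {
            curve[2 * i] = i;                 // x
            curve[2 * i + 1] = i * i / 10.0;  // y
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(buffer, curve);
        System.out.println(buffer.size() + " bytes");   // 4 + 20 * 8 = 164
        double[] back = read(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(back.length / 2 + " points read");
    }
}
```

To read from a real file, the same read method could be handed a FileInputStream instead of the in-memory stream used here.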

This idea, that all you need to know is the order in which the information is written and how it is represented (integer vs. float vs. double vs. character vs. string…), is very powerful. It is essentially how all file formats work. There is usually some sort of header, which contains information about the data that will be found in the file. Just like the number of points, this header allows the amount of data in the file to be flexible. The program doesn’t need to know ahead of time how much data to read, but it will always start by reading the header. How big the header is gets decided when you design the file format; in our case it was just one integer.

Things like XML and CSV take this concept to a higher level. With these formats, your program defines its own format within the rules of XML or CSV. CSV stands for comma-separated values, and it’s just a list of numbers or words separated by commas on each line. This way, a program that can read CSVs can automatically load the values between the commas into variables.
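As a rough illustration, the same curve could be stored as CSV with one “x,y” pair per line. The sketch below is a deliberately naive reader: it assumes plain numeric fields with no quoting, no embedded commas, and no header row, which real CSV files do not always guarantee. The class name SimpleCsv and the method parsePoints are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleCsv {
    /** Turns text like "0.0,0.0\n1.0,0.5\n" into a list of {x, y} pairs. */
    static List<double[]> parsePoints(String text) {
        List<double[]> points = new ArrayList<>();
        for (String line : text.split("\\R")) {       // \R matches any line break
            if (line.trim().isEmpty()) {
                continue;                             // skip blank lines
            }
            String[] fields = line.split(",");
            if (fields.length != 2) {
                throw new IllegalArgumentException("expected x,y but got: " + line);
            }
            double x = Double.parseDouble(fields[0].trim());
            double y = Double.parseDouble(fields[1].trim());
            points.add(new double[] {x, y});
        }
        return points;
    }

    public static void main(String[] args) {
        List<double[]> points = parsePoints("0.0,0.0\n1.0,0.5\n2.0,2.0\n");
        System.out.println(points.size() + " points");   // prints "3 points"
    }
}
```

No count is needed up front here, because the line breaks and commas mark where each value ends. The price is that every number has to be parsed from text.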

So as long as you know the format, you can read the data, and you can communicate with other programs as well as with programs you write. Open formats are great, and usually you won’t have to worry about writing the code to read them. Sometimes you want to get data into a program that you weren’t supposed to get, or maybe the data was only meant to be read by one program (like maybe a wc3/sc/sc2 game replay file?) but enough smart people have messed with it that it’s pretty well documented. Some people like to figure out formats that are not supposed to be read without permission… and sometimes you just want to get some data from your JavaScript to your web server!

This is how I think about data formats. I use this thought process every day at school and in just about every project I work on.

2 thoughts on “the day the data stood still”

  1. SteveC

    You mention storing native formats in files. You didn’t mention byte ordering. For example, if you write a four-byte int out to a file on an x86 box, then transfer that file over to a PowerPC box and read the int into a program, you’re going to be hosed unless you take special precautions. That’s because x86 stores integers as “little endian”; that is, the four bytes that make up the integer are ordered so that the least significant byte is first and the most significant last, or “little end first.” The PowerPC thinks that 4-byte integers are stored in the reverse order, with the first byte being the most significant and the last byte being the least significant, or “big end first”, or “big endian”. The same problem occurs with networking. The solution to this is to define the file format such that it specifies a byte order. Typically, big-endian (also known as “network byte order”) is used for external data formats. There are functions in the C library like ntohl and htonl (“net to host long” and “host to net long”) for converting between “host” byte order (be that big or little endian or some weird mixture like DEC machines could do) and “network” byte order, which is big endian.
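The byte-order problem described in this comment can be reproduced in Java with java.nio.ByteBuffer, which lets you pick the order explicitly. This is only a sketch: the class name ByteOrderDemo and its two helpers are made up for the example, and the value 0x0A0B0C0D is arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    /** Lays out a 4-byte int in the requested byte order. */
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    /** Interprets 4 bytes as an int, assuming the given byte order. */
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        int value = 0x0A0B0C0D;
        byte[] little = encode(value, ByteOrder.LITTLE_ENDIAN);   // bytes 0D 0C 0B 0A
        byte[] big = encode(value, ByteOrder.BIG_ENDIAN);         // bytes 0A 0B 0C 0D

        // Reading with the order the writer used gives the value back.
        System.out.println(Integer.toHexString(decode(big, ByteOrder.BIG_ENDIAN)));
        // Reading little-endian bytes as big-endian scrambles the value.
        System.out.println(Integer.toHexString(decode(little, ByteOrder.BIG_ENDIAN)));
    }
}
```

Java’s DataInputStream and DataOutputStream always use big-endian order, so a file written and read only through those classes is consistent on any machine. The mismatch appears when the other side writes raw native-order bytes, for example from C on x86.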
