I still remember the day I first understood how data is communicated between programs. I was 17 years old and programming Java applets about microscopes. When it first sank in, I was happy to be able to solve the task in front of me, but I had little idea how fundamental it was. I was trying to get a set of curves that had been modeled in 3ds Max into my applet so I could draw them on a graph and make them move, and I didn’t know anything about how file formats work. My mentor at the time had the 3D curves converted to 2D (that’s all we needed) and output them into a file format he defined. So what was the 2D curve, and what was the file format?
The 2D curve was the information I needed to get my task done. It was nothing more than a list of pairs of numbers: the (x, y) coordinates of the curve, which I could use to draw it. The file format was nothing more than the number of pairs, followed by each x and y coordinate in order. Since I needed to draw a bunch of these curves, and they were essentially “written down” in the file, all my program had to do was read them.
As most people are aware these days, computers store their information as ones and zeros, or binary. Each 1 or 0 takes up one bit of space, 8 bits make a byte, and we describe most things in terms of bytes (kilobytes, megabytes). To read in my curves, I had to read in a series of numbers. In most cases computers deal with two kinds of numbers: integers and floating point numbers. Integers are the whole numbers, and can usually be stored in 4 bytes (a Java int is always 4 bytes). Floating point numbers are approximations of real numbers (close, but not perfect), and in Java they can be stored as either floats (4 bytes) or doubles (8 bytes). Doubles can be more accurate because they have more bits available to describe the number than a float does.
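To make those sizes concrete, here is a small Java sketch. The class name `NumberSizes` and the helper `roundTripThroughFloat` are made up for illustration; the `BYTES` constants are part of the standard library. It prints the size of each type and shows what happens to 0.1 when it is squeezed into a float and widened back to a double.

```java
class NumberSizes {
    // Hypothetical helper: store a value in a 4-byte float, then widen it
    // back to an 8-byte double so any lost precision becomes visible.
    static double roundTripThroughFloat(double value) {
        return (double) (float) value;
    }

    public static void main(String[] args) {
        // Storage sizes of the number types discussed above.
        System.out.println("int:    " + Integer.BYTES + " bytes");
        System.out.println("float:  " + Float.BYTES + " bytes");
        System.out.println("double: " + Double.BYTES + " bytes");

        // 0.1 has no exact binary representation, so both types round it;
        // the float keeps fewer bits and lands further from 0.1.
        System.out.println("0.1 as double:        " + 0.1);
        System.out.println("0.1 through a float:  " + roundTripThroughFloat(0.1));

        // 0.5 is exactly representable in both, so nothing is lost.
        System.out.println("0.5 through a float:  " + roundTripThroughFloat(0.5));
    }
}
```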
So if the first number in the file is an int (integer), I need to read 4 bytes, and then I can use the value found there to see how many points I need to read. Since I wanted my x and y values to be precise, they were stored as doubles. That means if I have 10 points, I will need to read 20 coordinates (10 x’s and 10 y’s), which is 160 bytes of coordinate data on top of the 4-byte count. So I tell the program to read 8 bytes 20 times, each time saving the double value to a list in the program. Once I’m done reading, I can use this list to draw the curve from the coordinates.
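The reading steps above can be sketched in Java roughly like this. It is not my mentor’s original code: the class and method names are invented, and it assumes the coordinates are interleaved (x0, y0, x1, y1, ...) and written in big-endian byte order, which is what Java’s `DataInputStream` and `DataOutputStream` use. A file written by another tool might use a different byte order. The `writeCurve` method is only there to produce sample bytes to read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

class CurveReader {
    // Reads one curve: a 4-byte point count, then an 8-byte x and an
    // 8-byte y for each point. Returns [x0, y0, x1, y1, ...].
    static double[] readCurve(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        int pointCount = in.readInt();            // the header: 4 bytes
        if (pointCount < 0) {
            throw new IOException("negative point count: " + pointCount);
        }
        double[] coords = new double[pointCount * 2];
        for (int i = 0; i < coords.length; i++) {
            coords[i] = in.readDouble();          // 8 bytes per coordinate
        }
        return coords;
    }

    // Writes the same layout (assumed format), so there is something to read.
    static byte[] writeCurve(double[] coords) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(coords.length / 2);          // number of (x, y) pairs
        for (double c : coords) {
            out.writeDouble(c);
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        double[] original = {0.0, 0.0, 1.0, 2.5, 2.0, 4.0}; // three sample points
        byte[] file = writeCurve(original);
        double[] coords = readCurve(new ByteArrayInputStream(file));
        System.out.println(file.length + " bytes, " + (coords.length / 2) + " points");
        for (int i = 0; i < coords.length; i += 2) {
            System.out.println("(" + coords[i] + ", " + coords[i + 1] + ")");
        }
    }
}
```

To read from a real file instead of an in-memory array, the same `readCurve` method can be handed a `java.io.FileInputStream`, ideally wrapped in a `java.io.BufferedInputStream`.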
This idea that all you need to know is the order in which the information is written, and how it is represented (integer vs. float vs. double vs. character vs. string…) is very powerful. It is essentially how most binary file formats work. There is usually some sort of header, which contains information about the data that will be found in the file. Just like the number of points, this header allows for the amount of data in the file to be flexible. The program doesn’t need to know ahead of time how much data to read, but it will always start by reading the header. How big the header is gets decided when you design the file format; in our case it was just one integer.
Things like XML and CSV take this concept to a higher level. With these, your program defines its own format within the rules of XML or CSV. CSV stands for comma-separated values, and it’s just lines of text where each line is a list of numbers or words separated by commas. This way, a program that can read CSVs can automatically load the values between the commas into variables.
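As a rough illustration of the CSV idea, here is a minimal Java sketch that reads the same kind of curve from text, with one "x,y" pair per line. The class name `SimpleCsv` is made up, and it assumes every field is a plain number. Real CSV files can contain quoted fields with commas inside them, which this simple split on commas does not handle.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleCsv {
    // Splits the text into lines, splits each line on commas, and parses
    // every field as a double. Blank lines are skipped.
    static List<double[]> parse(String text) {
        List<double[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any line break
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split(",");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "0.0,0.0\n1.0,2.5\n2.0,4.0\n"; // sample data, one point per line
        for (double[] row : parse(text)) {
            System.out.println("x=" + row[0] + " y=" + row[1]);
        }
    }
}
```

Notice that no point count is needed up front here: the line breaks and commas mark where each value ends, which is the job the header and the fixed 8-byte sizes did in the binary format.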
This is how I think about data formats. I use this thought process every day at school and in just about every project I work on.