Wednesday, 27 July 2016

How to Extract Exif Data From a JPEG Photo Using Java

A typical JPEG photo.
Every time you take a selfie, or share a picture of a cute dog with your friends, you make use of a rather old file format called the Exchangeable image file format, or Exif for short. Exif itself is based on an even older file format called the Tagged Image File Format, or TIFF.

These days, .jpg files, which use Exif, are so ubiquitous and so ordinary that no one really cares how they work or how they are created. This weekend, however, I decided to take a look at the insides of a JPEG file, and tried to extract Exif metadata from it using just Java--without depending on any libraries! I mean, what's the point of the exercise otherwise?

I must say, JPEG files are really well structured and it is quite easy to extract information from them. Nevertheless, you must take steps to read the bytes in the right order because the same file can contain words represented in both little-endian and big-endian formats. Additionally, the file contains pointers to other locations inside it. Therefore, you will frequently find yourself moving back and forth inside the file.

Without further ado, let's get started. Here are a few details you'll need to know to extract Exif data:
  1. All JPEG files must start with 0xFFD8. These two bytes are often referred to as the Start of Image marker.
  2. The next two bytes can be either 0xFFE1 or 0xFFE0. If you see 0xFFE0, it means you are looking at a JFIF application marker--JFIF is, of course, short for JPEG File Interchange Format. If you see 0xFFE1, however, it means you've found the Exif application marker. Note that a file can contain both markers. Therefore, it is a good idea to scan the bytes of the file looking for the sequence 0xFFE1.
  3. After the marker, the next two bytes specify the size of the associated segment.
  4. The next four bytes form the string Exif.
  5. The next two bytes are just zeroes.
  6. The next two bytes specify whether the Exif data structure follows the Intel format, also known as the little-endian format, or not. If they are 0x4949, it means it uses the Intel format. If they are 0x4D4D, it means it uses the big-endian format. From this point on, for the sake of readability, I'll stick to the little-endian format. If you are dealing with the big-endian format, simply reverse the order of the bytes.
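  7. The next two bytes will be 0x2A00. This is the number 42 written in little-endian order, and it serves as a fixed check value.
  8. The next four bytes specify the offset to the first Image File Directory, or IFD for short. The offset is counted from the first of the two byte-order bytes described in step 6, not from the start of the file.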
  9. The next two bytes specify the number of entries inside the directory.
  10. Each directory entry is always 12 bytes long. The first two bytes of the entry specify the type of the entry, often called the tag type. The next two bytes specify the data format of the entry. The next four bytes specify the number of components or characters. The last four bytes either contain the value of the tag, or an offset where the value of the tag can be found. Like the IFD offset, this offset is counted from the byte-order bytes described in step 6.
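To make steps 4 through 8 a little more concrete, here is a minimal sketch that checks them against a hand-built byte array instead of a real file. The class name ExifHeaderSketch and the method firstIfdOffset() are names I made up for illustration, and the sketch assumes you have already read the bytes that follow the two-byte segment size into memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ExifHeaderSketch {

    // Parses the bytes that follow the 2-byte segment size (steps 4 to 8)
    // and returns the offset of the first IFD, or -1 if the layout is wrong.
    static int firstIfdOffset(byte[] payload) {
        if (payload.length < 14) {
            return -1; // "Exif" + two zeroes (6 bytes) + 8 more header bytes
        }
        String id = new String(payload, 0, 4, StandardCharsets.US_ASCII);
        if (!id.equals("Exif") || payload[4] != 0 || payload[5] != 0) {
            return -1;
        }
        ByteBuffer tiff = ByteBuffer.wrap(payload, 6, payload.length - 6);
        // Byte-order bytes: 0x4949 is little-endian, 0x4D4D is big-endian
        short order = tiff.getShort();
        if (order == (short) 0x4949) {
            tiff.order(ByteOrder.LITTLE_ENDIAN);
        } else if (order == (short) 0x4D4D) {
            tiff.order(ByteOrder.BIG_ENDIAN);
        } else {
            return -1;
        }
        // The number 42, stored in the byte order just selected
        if (tiff.getShort() != 42) {
            return -1;
        }
        return tiff.getInt();
    }

    public static void main(String[] args) {
        byte[] payload = {
            'E', 'x', 'i', 'f', 0, 0,   // steps 4 and 5
            0x49, 0x49,                 // step 6: Intel byte order
            0x2A, 0x00,                 // step 7
            0x08, 0x00, 0x00, 0x00      // step 8: offset 8
        };
        System.out.println(firstIfdOffset(payload));
    }
}
```

Running main() prints 8, the offset stored in the last four bytes of the sample array.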
That's a lot of details, but there's more. You need to know the hex values of the tag types to extract the information you need. Here are a few useful tags:
  • Tag 0x010F specifies the manufacturer of the camera.
  • Tag 0x0110 specifies the model of the camera.
  • Tag 0x0131 specifies the software used on the photo.
  • Tag 0x8298 contains copyright information.
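  • Tag 0x0132 contains a date string. Strictly speaking it records when the file was last changed, which for an unedited photo is normally when it was taken; the original capture time has its own tag in a separate Exif directory.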

For all of those tags, the data format is an ASCII string, specified by the hex value 0x0002.
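If you'd rather keep the tag values in one place, a small lookup table is one option. The sketch below is only an illustration: ExifTagNames and nameOf() are made-up names, and the table holds just the five tags listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExifTagNames {

    // Tag IDs from the list above, keyed as unsigned 16-bit values
    static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(0x010F, "Make");
        NAMES.put(0x0110, "Model");
        NAMES.put(0x0131, "Software");
        NAMES.put(0x0132, "Date/Time");
        NAMES.put(0x8298, "Copyright");
    }

    // Java's short is signed, so 0x8298 read as a short is negative.
    // Short.toUnsignedInt maps it back to 0..65535 before the lookup.
    // Returns null for tags that are not in the table.
    static String nameOf(short rawTag) {
        return NAMES.get(Short.toUnsignedInt(rawTag));
    }
}
```

The unsigned conversion matters mostly for the copyright tag, the only one in the list with its highest bit set.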

With all this information, let's create a simple Java class that can extract the values of the five tags I listed above.
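import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class Reader {

 private File file;
 private boolean isIntelFormat = false;

 private static final int ERR = 1;

 public Reader(String filename) {
  file = new File(filename);
 }

 public int read() throws IOException {
  // The body is filled in step by step below
 }
}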
As you can see, its constructor can take a filename. Additionally, I have a boolean to tell if the file uses the Intel format or not. I also have a read() method where I'll be reading the contents of the file.

Inside the read() method, we'll use the RandomAccessFile class because we need to be able to move freely from one position to another inside the JPEG file.
RandomAccessFile raf = new RandomAccessFile(file, "r");
To read two bytes at once, we can use the readShort() method of the RandomAccessFile class. In the following code, you see how I check if the first two bytes are 0xFFD8, and then search for the Exif application marker, 0xFFE1.
// Check if JPEG
if(raf.readShort() == (short)0xFFD8) {
 System.out.println("Looks like a JPEG file");   
} else {
 System.err.println("Not a JPEG file ");
 raf.close();
 return ERR;
}

// Find Exif marker
boolean exifFound = false;
for(long i=raf.getFilePointer(); i < raf.length() - 1; i++) {
 short current = raf.readShort(); // Read next two bytes
 if(current == (short)0xFFE1) {
  exifFound = true;
  System.out.println("Found Exif application marker");
  break;
 }
 // Move only one byte per iteration
 raf.seek(raf.getFilePointer() - 1);
}

if(!exifFound) {
 System.err.println("Couldn't find Exif application marker");
 raf.close();
 return ERR;
}
You can probably tell that it's not a terribly efficient search, but it's easy to understand and quite readable.
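If the byte-by-byte scan bothers you, one alternative is to hop from segment to segment using the size field described in step 3. The following is a rough sketch, not a drop-in replacement: it works on a byte array rather than a RandomAccessFile, the names SegmentWalkSketch and findApp1() are made up, and it assumes every marker before the image data is followed by a two-byte size with no padding bytes in between.

```java
final class SegmentWalkSketch {

    // Returns the index of the 0xFFE1 marker in jpeg, or -1 if none is
    // found before the compressed image data starts.
    static int findApp1(byte[] jpeg) {
        if (jpeg.length < 2 || (jpeg[0] & 0xFF) != 0xFF || (jpeg[1] & 0xFF) != 0xD8) {
            return -1; // no Start of Image marker
        }
        int pos = 2;
        while (pos + 4 <= jpeg.length) {
            if ((jpeg[pos] & 0xFF) != 0xFF) {
                return -1; // not positioned on a marker
            }
            int marker = jpeg[pos + 1] & 0xFF;
            if (marker == 0xE1) {
                return pos;
            }
            if (marker == 0xDA || marker == 0xD9) {
                return -1; // Start of Scan or End of Image: stop looking
            }
            // The segment size is big-endian and counts its own two bytes
            int size = ((jpeg[pos + 2] & 0xFF) << 8) | (jpeg[pos + 3] & 0xFF);
            if (size < 2) {
                return -1; // malformed size, avoid looping forever
            }
            pos += 2 + size;
        }
        return -1;
    }
}
```

Besides being faster, walking the segments avoids mistaking a stray 0xFFE1 inside some other segment's data for the marker.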

We'll skip the next 8 bytes because we're not interested in the data size, the string "Exif", and the two bytes filled with zeroes.
raf.skipBytes(8); // Skip data size and Exif\0\0
Then we check if the file uses the Intel format, and set the boolean class variable accordingly.
// Check if Intel format
isIntelFormat = raf.readShort() == (short)0x4949;
System.out.println("Format: " + (isIntelFormat?"Intel":"Not Intel"));
You can skip the next two bytes, but I'll use them just to confirm if I've got the right byte order.
// Check tag 0x2a00
if(raf.readShort() == (short)0x2A00) {
 System.out.println("Confirmed Intel format");
}
We must now read the next 4 bytes using the readFully() method. Note that I am using a method called convert() to convert the bytes into the right byte order.
// Get offset of IFD
byte[] offsetBytes = new byte[4];
raf.readFully(offsetBytes);
int offset = convert(offsetBytes) - 8;  
raf.skipBytes(offset);
Let's take a look at the convert() method now.
private int convert(byte... bytes) {
 // ByteBuffer is big-endian by default
 ByteBuffer buffer = ByteBuffer.wrap(bytes);
 if(isIntelFormat)
  buffer.order(ByteOrder.LITTLE_ENDIAN);
 // Two bytes are read as a (signed) short, four bytes as an int
 if(bytes.length == 2)
  return buffer.getShort();
 else
  return buffer.getInt();
}
As you can see in the code above, by simply using the ByteBuffer class, we can quickly change the order of the bytes and then convert the bytes into either a short (two bytes) or an int (four bytes).
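To see what the byte order actually changes, here is a small standalone sketch that decodes the same four bytes both ways. It is a variation on convert(), not the method used in the rest of this post: the name toInt() is made up, the byte order is passed in as a parameter, and two-byte values are returned unsigned.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderSketch {

    // Same idea as convert(), but the byte order is passed in explicitly
    // and two-byte values are returned unsigned (0..65535).
    // Expects exactly two or four bytes.
    static int toInt(boolean littleEndian, byte... bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        buffer.order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        if (bytes.length == 2) {
            return Short.toUnsignedInt(buffer.getShort());
        }
        return buffer.getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x08, 0x00, 0x00, 0x00};
        System.out.println(toInt(true, raw));   // little-endian: 8
        System.out.println(toInt(false, raw));  // big-endian: 134217728
    }
}
```

The same four bytes give 8 in one order and 134217728 in the other, which is why getting the byte-order check right matters before reading any offset.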

Let's get back to read() now. By fetching the next two bytes and passing them to the convert() method, we can determine the number of entries present in the IFD.
// Get number of directory entries
int nEntries = convert(raf.readByte(), raf.readByte());
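And now we must create a resetPoint, a point we can return to later. Its value is 4 bytes behind the current position, which is 2 bytes before the start of the IFD. When the IFD begins at offset 8, as it usually does, that places resetPoint 6 bytes after the byte-order bytes from which all offsets are counted. This is why 6 is subtracted from every data offset further down.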
long resetPoint = raf.getFilePointer() - 4;
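A way to sidestep this kind of offset arithmetic is to remember where the byte-order bytes are and treat that position as index zero. The sketch below does that in memory with ByteBuffer.slice(). It is an alternative to the resetPoint approach, not what the rest of this post uses, and TiffBaseSketch, tiffView() and entryCount() are names I made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class TiffBaseSketch {

    // tiffStart is the index of the byte-order bytes (0x4949 or 0x4D4D).
    // Returns a view in which index 0 is that position, so offsets
    // stored in the file can be used as indexes directly.
    static ByteBuffer tiffView(byte[] data, int tiffStart, boolean littleEndian) {
        ByteBuffer view = ByteBuffer.wrap(data, tiffStart, data.length - tiffStart).slice();
        view.order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        return view;
    }

    // Reads the IFD offset stored at index 4, then the entry count
    // stored in the first two bytes of that IFD.
    static int entryCount(ByteBuffer tiff) {
        int ifdOffset = tiff.getInt(4);
        return Short.toUnsignedInt(tiff.getShort(ifdOffset));
    }
}
```

With a RandomAccessFile, the same idea would be to store the file pointer just before reading the byte-order bytes and later call seek() with that value plus the offset. This keeps working even when the IFD does not start at offset 8.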
To facilitate the storage of the offsets, lengths and tag types we'll be encountering inside each directory entry, let us create a simple class called DirectoryDataPointer.
public class DirectoryDataPointer {
 private int length;
 private String type;
 private int offset; 
 
 public DirectoryDataPointer(int offset, int length, String type) {
  super();
  this.length = length;
  this.type = type;
  this.offset = offset;
 }
 
 public int getLength() {
  return length;
 }
 
 public void setLength(int length) {
  this.length = length;
 }

 public String getType() {
  return type;
 }

 public void setType(String type) {
  this.type = type;
 }

 public int getOffset() {
  return offset;
 }

 public void setOffset(int offset) {
  this.offset = offset;
 } 
}
As you can see, the class has three variables, a constructor to initialize them, and getters and setters for all the variables. If you are using an IDE like Eclipse, you can quickly generate most of the code present in this class.
Back inside the Reader class and its read() method, let's create an ArrayList for storing all the directory entries.
List<DirectoryDataPointer> directoryDataPointers = new ArrayList<>();
And now all we need to do is read all the directory entries and add them to the ArrayList. The following code, which has inline comments, shows you how to do that:
for(int i=0;i < nEntries;i++) {
 byte[] entry = new byte[12]; // Each entry is always 12 bytes
 raf.readFully(entry); 
 
 int tag = convert(entry[0], entry[1]);
 int length = convert(entry[4], entry[5], entry[6], entry[7]);
 int dataOffset = convert(entry[8], entry[9], entry[10], entry[11]) - 6;
 
 // Make
 if(tag == (short)0x010f) {    
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Make"));
 }
 
 // Model
 if(tag == (short)0x0110) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Model"));
 }
 
 // Software
 if(tag == (short)0x0131) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Software"));
 }
 
 // Copyright
 if(tag == (short)0x8298) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Copyright"));
 }
 
 // Date/Time
 if(tag == (short)0x0132) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Date/Time"));
 }
}
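The loop above ignores the data format field and always treats the last four bytes as an offset. As a sketch of how all four fields of an entry could be decoded, here is a small holder class. IfdEntrySketch and isInlineAscii() are hypothetical names, and the inline check reflects the rule from step 10 that the last four bytes can hold the value itself when it is small enough to fit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class IfdEntrySketch {
    final int tag;            // unsigned 16-bit tag type
    final int format;         // data format, 2 means ASCII string
    final long count;         // number of components
    final long valueOrOffset; // inline value or offset, see isInlineAscii()

    // entry must hold the 12 bytes of one directory entry
    IfdEntrySketch(byte[] entry, boolean littleEndian) {
        ByteBuffer b = ByteBuffer.wrap(entry);
        b.order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        tag = Short.toUnsignedInt(b.getShort());
        format = Short.toUnsignedInt(b.getShort());
        count = Integer.toUnsignedLong(b.getInt());
        valueOrOffset = Integer.toUnsignedLong(b.getInt());
    }

    // For ASCII data each component is one byte, so up to four characters
    // (terminator included) fit in the last four bytes of the entry itself.
    boolean isInlineAscii() {
        return format == 2 && count <= 4;
    }
}
```

Very short strings are the case to watch for: following the "offset" of an inline value would send you to a meaningless position in the file.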
At this point, we have a lot of information about the tags, but we still don't know their actual values. To fetch those values, we must go to the offsets we determined, and read bytes from there. The seek() and skipBytes() methods of the RandomAccessFile class make it really easy to do that.
System.out.println("\n===START EXIF DATA===");

for(DirectoryDataPointer ddp:directoryDataPointers) {
 raf.seek(resetPoint);
 raf.skipBytes(ddp.getOffset());
 byte[] data = new byte[ddp.getLength()];
 raf.readFully(data);
 // Exif strings end with a NUL byte; trim() removes it
 System.out.println(ddp.getType() + ": " + new String(data).trim());
}

System.out.println("===END EXIF DATA===");
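One detail worth knowing when turning the bytes into text: Exif strings are stored with a terminating NUL byte, and that byte is included in the length. Here is a minimal sketch of a helper that cuts the value at the first NUL and decodes it explicitly as ASCII instead of relying on the platform's default charset. The names AsciiValueSketch and decode() are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class AsciiValueSketch {

    // Decodes an Exif ASCII value, stopping at the first NUL byte.
    static String decode(byte[] data) {
        int end = 0;
        while (end < data.length && data[end] != 0) {
            end++;
        }
        return new String(data, 0, end, StandardCharsets.US_ASCII);
    }
}
```

For example, the six bytes of "Canon" followed by a NUL decode to the five-character string "Canon".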
Now that we have read all the data, we must close the file and return from the read() method.
raf.close();
return 0;
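Calling close() by hand works, but it is easy to miss a path, especially when an exception is thrown halfway through. A try-with-resources block is one way to let Java close the file for you. The sketch below only shows the pattern on the very first check; SafeOpenSketch and looksLikeJpeg() are names made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

final class SafeOpenSketch {

    // Returns true if the file starts with the Start of Image marker.
    // try-with-resources closes the file on every path, including
    // when readShort() throws because the file is shorter than 2 bytes.
    static boolean looksLikeJpeg(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return raf.readShort() == (short) 0xFFD8;
        }
    }
}
```

Wrapping the whole body of read() the same way would make the individual raf.close() calls unnecessary.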
Note that this code is very basic and not very fault-tolerant. But if you have a typical well-formed JPEG file, and you pass it as an argument to the Reader class, you'll see output that looks like this:
Looks like a JPEG file
Found Exif application marker
Format: Intel
Confirmed Intel format

===START EXIF DATA===
Make: Canon
Model: Canon EOS REBEL T2i
Software: Adobe Photoshop CS3 Windows
Date/Time: 2016:07:12 21:03:32
===END EXIF DATA===
I hope you now have a better understanding of how JPEG files store Exif data. If you found this tutorial useful, please share it with your friends.
