Thursday, 18 August 2016

Using Protocol Buffers 3.0.0 With Eclipse and Maven

A long time ago, programmers preferred sharing structured data in the form of XML files. However, in recent years, JSON has managed to replace XML as the most preferred data sharing format. Why? Well, compared to XML, JSON is not only easier to parse and generate, but also far more compact.

But what if you want a format that is more compact than JSON? And what if you want the structure of your data to be spelled out in an easily readable schema? Then, you should consider using Google's Protocol Buffers. Protocol buffers are extremely compact, and can very efficiently handle large amounts of data. Using them is as easy as working with objects in your preferred programming language.

In this tutorial, I'm going to show you how to work with Protocol Buffers 3.0.0 while using Eclipse Luna and Maven. To be able to follow this tutorial, you must have:
  • The latest version of Eclipse configured to work with Maven
  • A general understanding of the Java programming language
1. Create and Configure a New Project

Fire up Eclipse and create a new Maven Project. Make sure you specify an appropriate Group Id and Artifact Id.
Next, open the pom.xml file and, in the dependencies tab, add a new dependency for protobuf-java.
Then, add the exec-maven-plugin to it using the following code:
<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>1.2.1</version>
            <configuration>
                <mainClass>protobuftutorial.Main</mainClass>
            </configuration>
        </plugin>
    </plugins>
</build>
As you can see in the above code, we'll be using a class called Main that will be executed when we run this project. Therefore, add a new Java class called Main inside the src/main/java directory. I am using protobuftutorial as the package name.

Next, open the Run Configurations dialog and set the Goals field to clean verify exec:java.
At this point, our project is configured and ready.

2. Create a Schema for the Protocol Buffers

Protocol buffers need schemas, which are nothing but simple text files with the .proto extension. You can create these schemas inside the src/main/resources directory. For now, create just one file called country.proto.

This file will represent a country, and will have the following fields:
  • Name, which will be of type string
  • Capital, which will also be of type string
  • Cities, which will be a list of strings
  • Population, which will be an integer
  • HDI (Human Development Index), which can be one of very high, high, medium, low or very low
Accordingly, add the following code to the proto file:
syntax = "proto3";

option java_package = "protobuftutorial";
option java_outer_classname = "CountryProto";

message Country {
        string name = 1;
        string capital = 2;
        int32 population = 3;
        repeated string city = 4;
        enum HDI {
                VERY_HIGH = 0;
                HIGH = 1;
                MEDIUM = 2;
                LOW = 3;
                VERY_LOW = 4;
        }
        HDI hdi = 5;
}
As you can see, in the first line we specify the syntax we will be using in the file. It can either be proto2 or proto3.

Next, we specify the Java package of our project, followed by the name of the Java class that should be generated.

I am sure the rest of the file looks intuitive to you. Nevertheless, note that to represent a list, you must prefix the data type with the repeated keyword.

Every field in the message must have a unique tag associated with it. The number after the = sign represents the tag.
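
To make the role of tags a little more concrete, here is a minimal sketch of how a short string field could be laid out in the binary format. It assumes the standard protocol buffers encoding, in which every field starts with a key computed as (tag << 3) | wire type. The class TagDemo and its methods are hypothetical names made up for this illustration; in a real project the generated CountryProto class does all of this for you.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration, not part of the protobuf library
public class TagDemo {
    // Wire type 2 marks length-delimited values such as strings
    static final int LENGTH_DELIMITED = 2;

    // The key combines the tag (field number) with the wire type
    static int key(int tag, int wireType) {
        return (tag << 3) | wireType;
    }

    // Encodes a short string field as: key byte, length byte, UTF-8 bytes.
    // Only tags up to 15 and values up to 127 bytes are handled, so that
    // the key and the length each fit in a single byte.
    static byte[] encodeShortString(int tag, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (tag < 1 || tag > 15 || utf8.length > 127) {
            throw new IllegalArgumentException("Outside the range of this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(key(tag, LENGTH_DELIMITED));
        out.write(utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Tag 1 belongs to the name field in country.proto
        for (byte b : encodeShortString(1, "Peru")) {
            System.out.printf("%02X ", b);
        }
        System.out.println(); // prints: 0A 04 50 65 72 75
    }
}
```

Note that the field name never appears in the output. Only the tag does, which is why tags have to be unique within a message.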

3. Getting the Protocol Buffer Code Generator

Now that the proto file is ready, we need to compile it. In order to do that, you must download the protocol buffers compiler. The easiest way to get it is to download the binary from GitHub. Make sure you download the file appropriate for your operating system. For example, if you are using Ubuntu 64-bit, you would download protoc-3.0.0-linux-x86_64.zip.

After the download is complete, extract the ZIP file. You'll see that it contains a file called protoc, which is nothing but the compiler.

You can, if you want to, add protoc to your PATH variable.

All you need to do now is compile country.proto using protoc. To do so, open a new terminal and run protoc. As arguments, you must specify the import path using -I (a directory that contains country.proto, here the src/main directory of your Maven project), the directory in which you want to keep the generated files using --java_out (here the src/main/java source root; protoc creates the package subdirectory itself, so the generated file ends up next to Main.java), and the absolute path of country.proto. Here's a sample call:
./protoc -I=/home/whycouch/protobuftutorial/src/main/ \
--java_out=/home/whycouch/protobuftutorial/src/main/java/ \
/home/whycouch/protobuftutorial/src/main/resources/country.proto
After the compilation, if you refresh your Maven project, you should see a new file called CountryProto.java in it.

4. Working With Protocol Buffers

The CountryProto class allows you to create new Country objects. But first, you must create a Builder for the Country object. Using the Builder, you can set the values of all the fields in the Country object.
// Requires: import protobuftutorial.CountryProto.Country;
Country.Builder countryBuilder = Country.newBuilder();
countryBuilder.setName("United States of America");
countryBuilder.setCapital("Washington, D.C.");
countryBuilder.setHdi(Country.HDI.VERY_HIGH);
countryBuilder.setPopulation(309349689);

// Each call appends one entry to the repeated city field
countryBuilder.addCity("Houston");
countryBuilder.addCity("Los Angeles");
countryBuilder.addCity("Tucson");
You can write all the above code inside the Main class.

Once you have set all the fields, call the build() method to generate the object.
Country usa = countryBuilder.build();
The Country object can now be stored and shared in a compact manner. For example, here's how you can store it in a file. All you need to do is pass a FileOutputStream to its writeTo() method.
// try-with-resources closes the stream even if writeTo() throws
try (FileOutputStream output = new FileOutputStream("/tmp/usa.data")) {
 usa.writeTo(output);
} catch (IOException e) {
 e.printStackTrace();
}
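
To get a feel for why the stored form is compact, here is a rough sketch of the variable-length integer ("varint") scheme that protocol buffers use for fields such as population. The class VarintDemo is a hypothetical name made up for this illustration, and the sketch only covers non-negative values; the real work is done inside the protobuf library.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration, not part of the protobuf library
public class VarintDemo {
    // Writes the value 7 bits at a time, least significant group first.
    // The high bit of each byte signals that more bytes follow.
    static byte[] encodeVarint(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("This sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value > 0x7F) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(300).length);        // prints 2
        System.out.println(encodeVarint(309349689L).length); // prints 5
    }
}
```

The population used in this tutorial therefore takes 5 bytes plus one key byte, whereas a text format has to spell out all nine digits along with the field name.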
Reading from a protocol buffers file is just as easy. All you need to do is parse it using the appropriate parseFrom() method. You can then call various getters to fetch the values of the fields.
// try-with-resources closes the stream for us. FileNotFoundException
// is a subclass of IOException, so a single catch clause covers both.
try (FileInputStream input = new FileInputStream("/tmp/usa.data")) {
 // Named "loaded" so it doesn't clash with the usa variable created earlier
 Country loaded = CountryProto.Country.parseFrom(input);
 System.out.println(loaded.getName());
 System.out.println(loaded.getCapital());
 System.out.println(loaded.getPopulation());
 System.out.println(loaded.getHdi().name());
 for(int i=0;i < loaded.getCityCount();i++) {
  System.out.println("- " + loaded.getCity(i));
 }
} catch (IOException e) {
 e.printStackTrace();
}
And that's all there is to it. You now know how to use protocol buffers.

I've created a GitHub repository for this tutorial. You can take a look at the full project there.

If you found this tutorial useful, please do like it and share it.

Wednesday, 27 July 2016

How to Extract Exif Data From a JPEG Photo Using Java

A typical JPEG photo.
Every time you take a selfie, or share a picture of a cute dog with your friends, you make use of a rather old file format called the Exchangeable image file format, or Exif for short. Exif itself is based on an even older file format called the Tagged Image File Format, or TIFF.

These days, .jpg files, which use Exif, are so ubiquitous and so ordinary that no one really cares how they work or how they are created. This weekend, however, I decided to take a look at the insides of a JPEG file, and tried to extract Exif metadata from it using just Java--without depending on any libraries! I mean, what's the point of the exercise otherwise?

I must say, JPEG files are really well structured and it is quite easy to extract information from them. Nevertheless, you must take steps to read the bytes in the right order because the same file can contain words represented in both little-endian and big-endian formats. Additionally, the file contains pointers to other locations inside it. Therefore, you will find yourself frequently moving back and forth inside the file.

Without further ado, let's get started. Here are a few details you'll need to know to extract Exif data:
  1. All JPEG files must start with 0xFFD8. These two bytes are often referred to as the Start of Image marker.
  2. The next 2 bytes can either be 0xFFE1 or 0xFFE0. If you see 0xFFE0, it means you are looking at a JFIF application marker--JFIF is, of course, short for JPEG File Interchange Format. If you see 0xFFE1, however, it means you've found the Exif application marker. Note that a file can have both the markers. Therefore, it is a good idea to scan all the bytes of the file looking for the sequence 0xFFE1.
  3. After the marker, the next two bytes specify the size of the associated segment.
  4. The next four bytes form the string Exif.
  5. The next two bytes are just zeroes.
  6. The next two bytes specify whether the Exif data structure follows the Intel format, also known as the little-endian format, or not. If they are 0x4949, it means it uses the Intel format. If they are 0x4D4D, it means it uses the big-endian format. From this point on, for the sake of readability, I'll stick to the little-endian format. If you are dealing with the big-endian format, simply reverse the order of the bytes.
  7. The next two bytes will be 0x2A00.
  8. The next four bytes specify the offset to the first Image File Directory, or IFD for short. The offset is measured from the start of the byte-order bytes described in step 6.
  9. The next two bytes specify the number of entries inside the directory.
  10. Each directory entry is always 12 bytes long. The first two bytes of the entry specify the type of the entry, often called the tag type. The next two bytes specify the data format of the entry. The next four bytes specify the number of components or characters. The last four bytes either contain the value of the tag, or an offset where the value of the tag can be found.
That's a lot of details, but there's more. You need to know the hex values of the tag types to extract the information you need. Here are a few useful tags:
  • Tag 0x010F specifies the manufacturer of the camera.
  • Tag 0x0110 specifies the model of the camera.
  • Tag 0x0131 specifies the software used on the photo.
  • Tag 0x8298 contains copyright information.
  • Tag 0x0132 contains a date string specifying when the image was created or last modified.

For all of those tags, the data format is string, specified by the hex value 0x0002.
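
As a quick illustration of the 12-byte entry layout described in step 10, here is a small sketch that decodes a single directory entry. The class name EntryDemo and the sample bytes are made up for this example; they describe a little-endian entry for tag 0x010F whose 6-character value is stored at offset 158.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration
public class EntryDemo {
    // Returns {tag, format, count, valueOrOffset} for one 12-byte entry
    static long[] decodeEntry(byte[] entry, ByteOrder order) {
        if (entry.length != 12) {
            throw new IllegalArgumentException("A directory entry is 12 bytes long");
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(order);
        long tag = buf.getShort() & 0xFFFF;              // bytes 0-1
        long format = buf.getShort() & 0xFFFF;           // bytes 2-3
        long count = buf.getInt() & 0xFFFFFFFFL;         // bytes 4-7
        long valueOrOffset = buf.getInt() & 0xFFFFFFFFL; // bytes 8-11
        return new long[] {tag, format, count, valueOrOffset};
    }

    public static void main(String[] args) {
        // Made-up sample entry in the Intel (little-endian) format
        byte[] entry = {
            0x0F, 0x01,                   // tag 0x010F
            0x02, 0x00,                   // format 0x0002 (string)
            0x06, 0x00, 0x00, 0x00,       // 6 characters
            (byte) 0x9E, 0x00, 0x00, 0x00 // value stored at offset 158
        };
        long[] f = decodeEntry(entry, ByteOrder.LITTLE_ENDIAN);
        System.out.println(String.format(
                "tag=0x%04X format=%d count=%d offset=%d", f[0], f[1], f[2], f[3]));
        // prints: tag=0x010F format=2 count=6 offset=158
    }
}
```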

With all this information, let's create a simple Java class that can extract the values of the five tags I listed above.
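import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class Reader {

 private File file;
 private boolean isIntelFormat = false;

 // Returned by read() when the file cannot be processed
 private static final int ERR = 1;

 public Reader(String filename) {
  file = new File(filename);
 }

 public int read() throws IOException {
  // The body is built up step by step in the rest of this tutorial
 }
}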
As you can see, its constructor can take a filename. Additionally, I have a boolean to tell if the file uses the Intel format or not. I also have a read() method where I'll be reading the contents of the file.

Inside the read() method, we'll use the RandomAccessFile class because we need to be able to move freely from one position to another inside the JPEG file.
RandomAccessFile raf = new RandomAccessFile(file, "r");
To read two bytes at once, we can use the readShort() method of the RandomAccessFile class. In the following code, you see how I check if the first two bytes are 0xFFD8, and then search for the Exif application marker, 0xFFE1.
// Check if JPEG
if(raf.readShort() == (short)0xFFD8) {
 System.out.println("Looks like a JPEG file");   
} else {
 System.err.println("Not a JPEG file ");
 raf.close();
 return ERR;
}

// Find Exif marker
boolean exifFound = false;
for(long i=raf.getFilePointer(); i < raf.length() - 1; i++) {
 short current = raf.readShort(); // Read next two bytes
 if(current == (short)0xFFE1) {
  exifFound = true;
  System.out.println("Found Exif application marker");
  break;
 }
 // Move only one byte per iteration
 raf.seek(raf.getFilePointer() - 1);
}

if(!exifFound) {
 System.err.println("Couldn't find Exif application marker");
 raf.close();
 return ERR;
}
You can probably tell that it's not a terribly efficient search, but it's easy to understand and quite readable.
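
If you'd like something less brute-force, one alternative is to hop from segment to segment using the size field mentioned in step 3, instead of checking every byte. The following is only a sketch under a few assumptions: the class and method names are made up, it works on a byte array rather than a RandomAccessFile to keep it short, and it doesn't check that the segment it finds actually begins with the string Exif.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration
public class SegmentWalker {

    // Returns the index of the first 0xFFE1 marker, or -1 if none is
    // found before the image data starts or the input looks malformed.
    static int findApp1(byte[] jpeg) {
        ByteBuffer buf = ByteBuffer.wrap(jpeg); // big-endian by default
        if (buf.remaining() < 2 || (buf.getShort() & 0xFFFF) != 0xFFD8) {
            return -1; // no Start of Image marker
        }
        while (buf.remaining() >= 4) {
            int pos = buf.position();
            int marker = buf.getShort() & 0xFFFF;
            int length = buf.getShort() & 0xFFFF; // counts its own 2 bytes
            if (marker == 0xFFE1) {
                return pos;
            }
            if ((marker & 0xFF00) != 0xFF00 || marker == 0xFFDA || length < 2) {
                return -1; // not a marker, Start of Scan reached, or bad size
            }
            int next = pos + 2 + length;
            if (next > buf.limit()) {
                return -1; // segment runs past the end of the data
            }
            buf.position(next);
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sample = {
            (byte) 0xFF, (byte) 0xD8,                         // Start of Image
            (byte) 0xFF, (byte) 0xE0, 0x00, 0x04, 0x4A, 0x46, // made-up 0xFFE0 segment
            (byte) 0xFF, (byte) 0xE1, 0x00, 0x08,             // 0xFFE1 marker and size
            0x45, 0x78, 0x69, 0x66, 0x00, 0x00                // "Exif" and two zeroes
        };
        System.out.println(findApp1(sample)); // prints 8
    }
}
```

The trade-off is that this version trusts the size fields, so a damaged file can make it give up early, whereas the byte-by-byte search above keeps going.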

We'll skip the next 8 bytes because we're not interested in the data size, the string "Exif", and the two bytes filled with zeroes.
raf.skipBytes(8); // Skip data size and Exif\0\0
Then we check if the file uses the Intel format, and set the boolean class variable accordingly.
// Check if Intel format
isIntelFormat = raf.readShort() == (short)0x4949;
System.out.println("Format: " + (isIntelFormat?"Intel":"Not Intel"));
You can skip the next two bytes, but I'll use them just to confirm if I've got the right byte order.
// Check tag 0x2a00
if(raf.readShort() == (short)0x2A00) {
 System.out.println("Confirmed Intel format");
}
We must now read the next 4 bytes using the readFully() method. Note that I am using a method called convert() to convert the bytes into the right byte order.
// Get offset of IFD
byte[] offsetBytes = new byte[4];
raf.readFully(offsetBytes);
int offset = convert(offsetBytes) - 8;  
raf.skipBytes(offset);
Let's take a look at the convert() method now.
private int convert(byte... bytes) {
 ByteBuffer buffer = ByteBuffer.wrap(bytes);
 if(isIntelFormat)
  buffer.order(ByteOrder.LITTLE_ENDIAN);
 if(bytes.length == 2)
  return buffer.getShort();
 else
  return buffer.getInt();
}
As you can see in the code above, by simply using the ByteBuffer class, we can quickly change the order of the bytes and then convert the bytes into either a short (two bytes) or an int (four bytes).
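
To see what difference the byte order makes, here is a tiny standalone sketch that reads the same two bytes, 0x2A and 0x00, both ways. The class name ByteOrderDemo is made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration
public class ByteOrderDemo {
    // Interprets two bytes as an unsigned 16-bit number in the given order
    static int readShort(byte[] twoBytes, ByteOrder order) {
        return ByteBuffer.wrap(twoBytes).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] bytes = {0x2A, 0x00};
        System.out.println(readShort(bytes, ByteOrder.LITTLE_ENDIAN)); // prints 42
        System.out.println(readShort(bytes, ByteOrder.BIG_ENDIAN));    // prints 10752
    }
}
```

In the little-endian case the first byte is the least significant one, so the result is 0x002A, or 42. In the big-endian case the result is 0x2A00, or 10752.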

Let's get back to read() now. By fetching the next two bytes and passing them to the convert() method, we can determine the number of entries present in the IFD.
// Get number of directory entries
int nEntries = convert(raf.readByte(), raf.readByte());
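And now we must create a resetPoint, a point we can return to in the future. Its value is 4 bytes before the current position, which is 2 bytes before the start of the IFD. Together with the adjustment of 6 that is applied to each data offset later on, it lets us resolve offsets that are measured from the start of the byte-order bytes, assuming, as is usual, that the first IFD starts 8 bytes after them.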
long resetPoint = raf.getFilePointer() - 4;
To facilitate the storage of the offsets, lengths and tag types we'll be encountering inside each directory entry, let us create a simple class called DirectoryDataPointer.
public class DirectoryDataPointer {
 private int length;
 private String type;
 private int offset; 
 
 public DirectoryDataPointer(int offset, int length, String type) {
  super();
  this.length = length;
  this.type = type;
  this.offset = offset;
 }
 
 public int getLength() {
  return length;
 }
 
 public void setLength(int length) {
  this.length = length;
 }

 public String getType() {
  return type;
 }

 public void setType(String type) {
  this.type = type;
 }

 public int getOffset() {
  return offset;
 }

 public void setOffset(int offset) {
  this.offset = offset;
 } 
}
As you can see, the class has three variables, a constructor to initialize them, and getters and setters for all the variables. If you are using an IDE like Eclipse, you can quickly generate most of the code present in this class. Back inside the Reader class and its read() method, let's create an ArrayList for storing all the directory entries.
List<DirectoryDataPointer> directoryDataPointers = new ArrayList<>();
And now all we need to do is read all the directory entries and add them to the ArrayList. The following code, which has inline comments, shows you how to do that:
for(int i=0;i < nEntries;i++) {
 byte[] entry = new byte[12]; // Each entry is always 12 bytes
 raf.readFully(entry); 
 
 int tag = convert(entry[0], entry[1]);
 int length = convert(entry[4], entry[5], entry[6], entry[7]);
 int dataOffset = convert(entry[8], entry[9], entry[10], entry[11]) - 6;
 
 // Make
 if(tag == (short)0x010f) {    
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Make"));
 }
 
 // Model
 if(tag == (short)0x0110) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Model"));
 }
 
 // Software
 if(tag == (short)0x0131) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Software"));
 }
 
 // Copyright
 if(tag == (short)0x8298) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Copyright"));
 }
 
 // Date/Time
 if(tag == (short)0x0132) {
  directoryDataPointers.add(
            new DirectoryDataPointer(dataOffset, length, "Date/Time"));
 }
}
At this point, we have a lot of information about the tags, but we still don't know their actual values. To fetch those values, we must go to the offsets we determined, and read bytes from there. The seek() and skipBytes() methods of the RandomAccessFile class make it really easy to do that.
System.out.println("\n===START EXIF DATA===");

for(DirectoryDataPointer ddp:directoryDataPointers) {
 raf.seek(resetPoint);
 raf.skipBytes(ddp.getOffset());
 byte[] data = new byte[ddp.getLength()];
 raf.readFully(data);
 // trim() drops the trailing NUL byte that terminates each string
 System.out.println(ddp.getType() + ": " + new String(data).trim());
}

System.out.println("===END EXIF DATA===");
Now that we have read all the data, we must close the file and return from the read() method.
raf.close();
return 0;
Note that this code is very basic and not very fault-tolerant. But if you have a typical well-formed JPEG file, and you pass it as an argument to the Reader class, you'll see output that looks like this:
Looks like a JPEG file
Found Exif application marker
Format: Intel
Confirmed Intel format

===START EXIF DATA===
Make: Canon
Model: Canon EOS REBEL T2i
Software: Adobe Photoshop CS3 Windows
Date/Time: 2016:07:12 21:03:32
===END EXIF DATA===
I hope you now have a better understanding of how JPEG files store Exif data. If you found this tutorial useful, please share it with your friends.

Sunday, 17 July 2016

How to Install and Use SyntaxNet and Parsey McParseface


POS tagging.
Earlier this year, Google open sourced a machine learning project called SyntaxNet. That means anyone with the right skills can use it, build on it, and hopefully improve it. Google also released a pre-trained parser model for SyntaxNet, called Parsey McParseface. It is said to be one of the most accurate such models available today. What can you do with it? Well, you can use it to determine how words are being used in English sentences. You can also use it to determine the relationships between various words in a sentence. In other words, it is both a part-of-speech tagger and a dependency parser.

Last week, I spent some time trying to learn how to install and use SyntaxNet and Parsey McParseface. The hardest part was following the installation steps mentioned on SyntaxNet's GitHub repository. Getting the right versions of Java, Bazel, Python, protobuf, asciitree, numpy and swig was definitely exhausting. What's more, on my modest laptop that has 6 GB of RAM, the installation ran for well over 6 hours. If you don't want to spend so much time just to experiment with Parsey McParseface, using a pre-built Docker image is the way to go. In this tutorial, I'll show you how.

Prerequisites:
  • A 64-bit computer with at least 2 GB of RAM
  • The latest version of Docker
  • Ubuntu
1. Pulling the Docker Image

There could be other SyntaxNet images, but in this tutorial, we'll be pulling an image created by brianlow.
docker pull brianlow/syntaxnet-docker
Depending on the speed of your network connection, you might have to wait for a while because the image is about 1GB.

2. Using SyntaxNet

Now that you have the SyntaxNet image, you need to create a new container using it and run a Bash shell on it. The following command shows you how:
docker run --name mcparseface --rm -i -t brianlow/syntaxnet-docker bash
Using Parsey McParseface directly is a little complicated. Thankfully, it comes with a very handy shell script called demo.sh. All you need to do is pass an English sentence to it.
echo "Bob, a resident of Yorkshire, loves his wife and children" \
| syntaxnet/demo.sh
The output of demo.sh is a tree.
Input: Bob , a resident of Yorkshire , loves his wife and children
Parse:
loves VBZ ROOT
 +-- Bob NNP nsubj
 |   +-- , , punct
 |   +-- resident NN appos
 |       +-- a DT det
 |       +-- of IN prep
 |           +-- Yorkshire NNP pobj
 +-- , , punct
 +-- wife NN dobj
     +-- his PRP$ poss
     +-- and CC cc
     +-- children NNS conj
As you can see, in the tree, each word is associated with tags. For example, the word “loves” has a tag “VBZ”, which means present-tense-third-person-verb. You can also see that Parsey McParseface understands that it is the root of the sentence. You can probably tell that “NN” means noun-singular and “NNS” means noun-plural. Then there are less intuitive tags such as “appos”, which is short for appositional modifier, and “amod”, which is short for adjectival modifier. You can find the meanings of all the tags on UniversalDependencies.

I experimented with a few more complicated sentences, and Parsey McParseface had no trouble parsing them. Here’s one example:
Input: Bob 's wife , a grumpy old woman , asked him to sleep in the barn
Parse:
asked VBD ROOT
 +-- wife NN nsubj
 |   +-- Bob NNP poss
 |   |   +-- 's POS possessive
 |   +-- , , punct
 |   +-- woman NN appos
 |       +-- a DT det
 |       +-- grumpy JJ amod
 |       +-- old JJ amod
 +-- him PRP dobj
 +-- sleep VB xcomp
     +-- to TO aux
     +-- in IN prep
         +-- barn NN pobj
             +-- the DT det
Here’s another example, which is slightly ambiguous:
Input: Say hello to my little friend , Bob
Parse:
Say VB ROOT
 +-- hello UH discourse
 +-- to IN prep
     +-- friend NN pobj
         +-- my PRP$ poss
         +-- little JJ amod
         +-- , , punct
         +-- Bob NNP appos
The actual output of the SyntaxNet parser is a CoNLL table. demo.sh passes that table to a Python script called conll2tree to generate the tree. If you are interested in looking at the CoNLL table, all you need to do is comment out the call to conll2tree. Here’s a sample CoNLL table:
The CoNLL format is obviously less intuitive, but is a little easier to work with in a program. For example, I can quickly determine all the nouns present in a sentence using a simple awk program:
awk -F'\t' '$4 == "NOUN" {print $2}' output.conll
With slightly more complex programs, you can determine details such as who the subject is, what adjectives are associated with it, who or what the dative object is, and so on. I hope you are now beginning to understand the significance of this powerful parser.
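
If you'd rather do this kind of processing in Java than in awk, the following sketch shows one way to go about it. It assumes the same tab-separated layout the awk program relies on, with the word in column 2 and the coarse tag in column 4. The class name ConllDemo and the three sample lines are hand-written for this illustration; they are not actual parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration
public class ConllDemo {
    // Collects the words (column 2) whose coarse tag (column 4) matches
    static List<String> wordsWithCoarseTag(List<String> lines, String tag) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length >= 4 && cols[3].equals(tag)) {
                words.add(cols[1]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample rows for the sentence "Bob loves children"
        List<String> lines = Arrays.asList(
                "1\tBob\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
                "2\tloves\t_\tVERB\tVBZ\t_\t0\tROOT\t_\t_",
                "3\tchildren\t_\tNOUN\tNNS\t_\t2\tdobj\t_\t_");
        System.out.println(wordsWithCoarseTag(lines, "NOUN")); // prints [Bob, children]
    }
}
```

Filtering on a different column works the same way. For instance, comparing the eighth column with nsubj would pick out the subject instead.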

That’s all for now. Thanks for reading. If you found this tutorial useful, please do share it.