Sunday, 17 July 2016

How to Install and Use SyntaxNet and Parsey McParseface

Earlier this year, Google open sourced a machine learning project called SyntaxNet, so that anyone with the right skills can use it, build on it, and hopefully improve it. Google also released a pre-trained parser model for SyntaxNet, called Parsey McParseface. It is said to be one of the most accurate such models available today. What can you do with it? Well, you can use it to determine how words are being used in English sentences. You can also use it to determine the relationships between various words in a sentence. In other words, it is a part-of-speech tagger and a dependency parser.

Last week, I spent some time trying to learn how to install and use SyntaxNet and Parsey McParseface. The hardest part was following the installation steps mentioned on SyntaxNet's GitHub repository. Getting the right versions of Java, Bazel, Python, protobuf, asciitree, numpy and swig was definitely exhausting. What's more, on my modest laptop that has 6GB of RAM, the installation ran for well over 6 hours. If you don't want to spend so much time just to experiment with Parsey McParseface, using a pre-built Docker image is the way to go. In this tutorial, I'll show you how.

Prerequisites:
  • A 64-bit computer with at least 2 GB of RAM
  • The latest version of Docker
  • Ubuntu
1. Pulling the Docker Image

There could be other SyntaxNet images, but in this tutorial, we'll be pulling an image created by brianlow.
docker pull brianlow/syntaxnet-docker
Depending on the speed of your network connection, you might have to wait for a while because the image is about 1GB.

2. Using SyntaxNet

Now that you have the SyntaxNet image, you need to create a new container from it and run a Bash shell in it. The following command shows you how:
docker run --name mcparseface --rm -i -t brianlow/syntaxnet-docker bash

Using Parsey McParseface directly is a little complicated. Thankfully, it comes with a very handy shell script called demo.sh. All you need to do is pass an English sentence to it.
echo "Bob, a resident of Yorkshire, loves his wife and children" \
| syntaxnet/demo.sh

The output of demo.sh is a tree.
Input: Bob , a resident of Yorkshire , loves his wife and children
Parse:
loves VBZ ROOT
 +-- Bob NNP nsubj
 |   +-- , , punct
 |   +-- resident NN appos
 |       +-- a DT det
 |       +-- of IN prep
 |           +-- Yorkshire NNP pobj
 +-- , , punct
 +-- wife NN dobj
     +-- his PRP$ poss
     +-- and CC cc
     +-- children NNS conj

As you can see, each word in the tree is followed by two tags: a part-of-speech tag and a label that describes how the word relates to its parent in the tree. For example, the word “loves” has the tag “VBZ”, which means a verb in the third person singular present tense. You can also see that Parsey McParseface understands that it is the root of the sentence. You can probably tell that “NN” means a singular noun and “NNS” means a plural noun. Then there are less intuitive tags such as “appos”, which is short for appositional modifier, and “amod” (it shows up in the next example), which is short for adjectival modifier. You can look up the meanings of the tags on the Universal Dependencies website.
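
To keep the tags used in this tutorial straight, a small lookup helper can be handy. This is only a sketch: describe_tag is a name I made up, and it covers just a handful of the tags that appear in the trees here, not the full tag set.

```shell
# Hypothetical helper: prints a short description for a few of the tags
# that appear in the parse trees in this tutorial.
describe_tag() {
  case "$1" in
    # Part-of-speech tags
    NN)    echo "noun, singular" ;;
    NNS)   echo "noun, plural" ;;
    NNP)   echo "proper noun, singular" ;;
    VBZ)   echo "verb, third person singular present" ;;
    VBD)   echo "verb, past tense" ;;
    JJ)    echo "adjective" ;;
    # Dependency labels
    nsubj) echo "nominal subject" ;;
    dobj)  echo "direct object" ;;
    appos) echo "appositional modifier" ;;
    amod)  echo "adjectival modifier" ;;
    pobj)  echo "object of a preposition" ;;
    *)     echo "unknown tag: $1" ;;
  esac
}

describe_tag VBZ
describe_tag appos
```

The last two lines print the descriptions for “VBZ” and “appos”. Any tag that is not in the list falls through to the "unknown tag" branch, so it is easy to see what still needs to be added.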

I experimented with a few more complicated sentences, and Parsey McParseface had no trouble parsing them. Here’s one example:
Input: Bob 's wife , a grumpy old woman , asked him to sleep in the barn
Parse:
asked VBD ROOT
 +-- wife NN nsubj
 |   +-- Bob NNP poss
 |   |   +-- 's POS possessive
 |   +-- , , punct
 |   +-- woman NN appos
 |       +-- a DT det
 |       +-- grumpy JJ amod
 |       +-- old JJ amod
 +-- him PRP dobj
 +-- sleep VB xcomp
     +-- to TO aux
     +-- in IN prep
         +-- barn NN pobj
             +-- the DT det

Here’s another example, which is slightly ambiguous:
Input: Say hello to my little friend , Bob
Parse:
Say VB ROOT
 +-- hello UH discourse
 +-- to IN prep
     +-- friend NN pobj
         +-- my PRP$ poss
         +-- little JJ amod
         +-- , , punct
         +-- Bob NNP appos
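
If you want to parse many sentences like these without keeping a shell open inside the container, you could wrap the two steps in one function on the host. This is an untested sketch: parse_sentence is a name I made up, and it assumes that the image starts in the same directory as the Bash shell used above (so that the relative path syntaxnet/demo.sh resolves) and that the image does not define its own entrypoint.

```shell
# Hypothetical wrapper: sends one sentence to demo.sh in a throwaway container.
# Assumes the image's default working directory contains syntaxnet/demo.sh.
parse_sentence() {
  # -i keeps stdin open so the piped sentence reaches demo.sh;
  # --rm removes the container once the parse has finished.
  printf '%s\n' "$1" \
    | docker run --rm -i brianlow/syntaxnet-docker syntaxnet/demo.sh
}

# Usage: parse_sentence "Say hello to my little friend, Bob"
```

Each call starts a fresh container, so it is slower than working inside one long-running shell, but nothing is left behind afterwards.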
The actual output of the SyntaxNet parser is a CoNLL table, with one row per word and tab-separated columns. demo.sh passes that table to a Python script called conll2tree to generate the tree. If you are interested in looking at the CoNLL table, all you need to do is comment out the call to conll2tree. Here’s a sample CoNLL table:

The CoNLL format is obviously less intuitive, but is a little easier to work with in a program. For example, I can quickly determine all the nouns present in a sentence using a simple awk program:
awk -F'\t' '$4 == "NOUN" {print $2}' output.conll
With slightly more complex programs, you can determine details such as who the subject is, what adjectives are associated with it, who or what the dative object is, and so on. I hope you are now beginning to understand the significance of this powerful parser.
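
As a rough sketch of what such a slightly more complex program might look like, the following finds the subject of a sentence and the adjectives attached to it. The file name sample.conll and the function name subject_with_adjectives are made up for illustration, and the rows are hand-written rather than real parser output. The script assumes the standard CoNLL column order (word in column 2, head index in column 7, dependency label in column 8) and a single sentence per file.

```shell
# Hand-written sample rows (NOT real parser output) in a 10-column,
# tab-separated layout: ID, word, _, coarse tag, fine tag, _, head, label, _, _
printf '%s\t%s\t_\t%s\t%s\t_\t%s\t%s\t_\t_\n' \
  1 The    DET  DT  4 det \
  2 grumpy ADJ  JJ  4 amod \
  3 old    ADJ  JJ  4 amod \
  4 woman  NOUN NN  5 nsubj \
  5 asked  VERB VBD 0 ROOT \
  6 him    PRON PRP 5 dobj > sample.conll

# Hypothetical helper: prints the subject of the root verb and its adjectives.
subject_with_adjectives() {
  awk -F'\t' '
    NF >= 8 {
      # Remember the word, head index and dependency label of every token.
      id = $1 + 0
      word[id] = $2; head[id] = $7 + 0; label[id] = $8
      if (id > n) n = id
    }
    END {
      # The subject is the nsubj token whose head is the ROOT token.
      for (i = 1; i <= n; i++)
        if (label[i] == "nsubj" && label[head[i]] == "ROOT") subj = i
      if (!subj) exit 1
      print "subject: " word[subj]
      # Its adjectives are the amod tokens whose head is the subject.
      for (i = 1; i <= n; i++)
        if (label[i] == "amod" && head[i] == subj) print "adjective: " word[i]
    }
  ' "$1"
}

subject_with_adjectives sample.conll
```

For the sample rows above, this prints “subject: woman” followed by one line each for “grumpy” and “old”. The same pattern (store every row, then follow the head indexes) should extend to objects and other relations.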

That’s all for now. Thanks for reading. If you found this tutorial useful, please do share it.

Sunday, 31 August 2014

Are leaks really leaks?

Yet another iPhone
I think it is time to stop using the word leak and simply say press release instead. With so many technology blogs that thrive on leaks, and with so many leaks, it is becoming glaringly obvious that all these companies with leaks are either quite incompetent, or simply have extremely competent PR folks.

The iPhone 6 seems to hold the record, with the highest number of leaks and rumours. In fact, these days, some blogs are actually writing round-ups of all the leaks about the iPhone 6. With its release coming up in the near future, you get to read some leaked information about it every other day. I am sure this does more good than harm; there is always a buzz about the product on the Internet, and more and more people get to know about it. Thus, by the time the product is actually launched, everyone is swooning over it. If this is intentional, it is definitely a nice strategy.

Of course, the word leak is appealing to the readers; after all, who doesn't like to know a secret? Therefore, most technology blogs won't hesitate to use it. It is really a win-win-win situation; the readers get their spicy news, the blogs get their views, and the product gets some free advertising.

This is all good and acceptable if leaks are rare, and the leaked information is meaningful. When there are too many of them, the word leak loses its charm and importance. Readers won't care much. One must also realize that a blurry photo or two can arouse interest only once in a while.

Wednesday, 14 May 2014

The typical American programmer

Typical American symbolism
The other day, a young friend asked me what programming language she should learn to get a high-paying job as soon as possible. I asked her to concentrate on data structures, algorithms and problem-solving skills in general, instead of focussing on learning one programming language.

But the question still seemed important. I wanted to know: what does the typical American programmer program in? How much does she earn? Where is the best place for her to live?

So, I did my own little 'research' using Google and other websites. Here is what I found. Note that all this data is based on the past 12 months (May 2013 to May 2014).

I am focussing on 5 very popular programming languages: Java, Python, Ruby, C# and PHP.

I am also focussing on the following 5 states: Florida, North Carolina, California, New York and Maryland.

1. What is the most preferred programming language?
American interest in programming languages
So, that is what Americans want to learn. Java is clearly the winner here, followed rather closely by Python. All the other languages are far behind them. I am sure one of the main reasons for this is Android. You can't be an Android developer without knowing Java. Moreover, a lot of enterprise systems are still written in Java EE. However, a lot of developers from third world countries are Java programmers, and they sure are cheap, so I am assuming the salary of a typical Java programmer won't be very high. We will see next whether that is actually the case.

2. What is the average salary of programmers in these languages in the US?

The salaries shown here are the national averages for that exact designation, so senior and associate programmers are bound to earn more or less than this. As expected, Java developers are paid the least. Note that I am talking about plain Java developers (not Android or Java EE specialists). Python and C# seem to be the languages to learn if you are interested in a high salary.

3. What is the average salary of programmers in these locations?

Here, I am considering the average salary of a "Web Developer". This designation is likely to include most of the languages that I mentioned. So, Florida is the state with the lowest salary, and New York is the one with the highest. This is expected, based on the cost of living in those states.

Based on these, I have come to the conclusion that my young friend should learn Python, and move to New York. The chances of her finding a good job seem to be the best there. But hey, we all should only do what we are interested in doing, and live only where we want to live! So these were merely suggestions.

But I had more questions. What are the salaries in other countries like, compared to the US? Surprisingly, much of the world earns far less than what Americans earn. Here is what I found.
Very interesting indeed. But what are the salaries like in the third world countries, you ask?
Note that those are the yearly salaries of people there. It is more than obvious why companies prefer to outsource projects now.

I have gathered all data from online sources only, so you can always look it up. Some numbers have been rounded off to the nearest 100. That is all for now. Please do leave comments and share.