Sunday, 17 July 2016

How to Install and Use SyntaxNet and Parsey McParseface


Earlier this year, Google open sourced a machine learning project called SyntaxNet. That means anyone with the right skills can use it, build on it, and hopefully improve it. Google also released a pre-trained parser model for SyntaxNet, called Parsey McParseface, which is said to be one of the most accurate such models available today. What can you do with it? Well, you can use it to determine how words are being used in English sentences, and also to work out the relationships between the various words in a sentence. In other words, it is a Part-of-Speech tagger and dependency parser.

Last week, I spent some time learning how to install and use SyntaxNet and Parsey McParseface. The hardest part was following the installation steps listed in SyntaxNet's GitHub repository. Getting the right versions of Java, Bazel, Python, protobuf, asciitree, numpy, and swig was definitely exhausting. What's more, on my modest laptop, which has 6 GB of RAM, the installation ran for well over six hours. If you don't want to spend that much time just to experiment with Parsey McParseface, using a pre-built Docker image is the way to go. In this tutorial, I'll show you how.

Prerequisites:
  • A 64-bit computer with at least 2 GB of RAM
  • The latest version of Docker
  • Ubuntu
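Before moving on, it's worth confirming that Docker is installed and that its daemon is running. The two commands below should print a version string and some details about the Docker host; depending on how Docker was set up on your machine, you may need to prefix them with sudo.
docker --version
docker info
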
1. Pulling the Docker Image

There could be other SyntaxNet images, but in this tutorial, we'll be pulling an image created by brianlow.
docker pull brianlow/syntaxnet-docker
Depending on the speed of your network connection, you might have to wait for a while because the image is about 1GB.
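Once the pull finishes, you can check that the image is now available locally; the listing should show the repository name along with a size of roughly 1GB.
docker images brianlow/syntaxnet-docker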

2. Using SyntaxNet

Now that you have the SyntaxNet image, you need to create a new container from it and run a Bash shell inside it. The following command shows you how:
docker run --name mcparseface --rm -i -t brianlow/syntaxnet-docker bash
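Your shell prompt now belongs to the container. Before moving on, you can confirm that the SyntaxNet files are present; the demo script used in the next step lives at the relative path below (this assumes the container's default working directory; if your image's layout ever differs, find / -name demo.sh will locate it).
ls syntaxnet/demo.sh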
Using Parsey McParseface directly is a little complicated. Thankfully, SyntaxNet comes with a very handy shell script called demo.sh. All you need to do is pipe an English sentence to it:
echo "Bob, a resident of Yorkshire, loves his wife and children" \
| syntaxnet/demo.sh
The output of demo.sh is a dependency tree:
Input: Bob , a resident of Yorkshire , loves his wife and children
Parse:
loves VBZ ROOT
 +-- Bob NNP nsubj
 |   +-- , , punct
 |   +-- resident NN appos
 |       +-- a DT det
 |       +-- of IN prep
 |           +-- Yorkshire NNP pobj
 +-- , , punct
 +-- wife NN dobj
     +-- his PRP$ poss
     +-- and CC cc
     +-- children NNS conj
As you can see, each word in the tree is annotated with tags. For example, the word “loves” has the tag “VBZ”, which stands for verb, third person singular present. You can also see that Parsey McParseface understands that it is the root of the sentence. You can probably tell that “NN” means singular noun and “NNS” means plural noun. Then there are less intuitive labels such as “appos”, which is short for appositional modifier, and “amod”, which is short for adjectival modifier; these describe the relationship a word has with its parent rather than its part of speech. You can find the meanings of all the tags and labels on the Universal Dependencies website.

I experimented with a few more complicated sentences, and Parsey McParseface had no trouble parsing them. Here’s one example:
Input: Bob 's wife , a grumpy old woman , asked him to sleep in the barn
Parse:
asked VBD ROOT
 +-- wife NN nsubj
 |   +-- Bob NNP poss
 |   |   +-- 's POS possessive
 |   +-- , , punct
 |   +-- woman NN appos
 |       +-- a DT det
 |       +-- grumpy JJ amod
 |       +-- old JJ amod
 +-- him PRP dobj
 +-- sleep VB xcomp
     +-- to TO aux
     +-- in IN prep
         +-- barn NN pobj
             +-- the DT det
Here’s another example, which is slightly ambiguous:
Input: Say hello to my little friend , Bob
Parse:
Say VB ROOT
 +-- hello UH discourse
 +-- to IN prep
     +-- friend NN pobj
         +-- my PRP$ poss
         +-- little JJ amod
         +-- , , punct
         +-- Bob NNP appos
The actual output of the SyntaxNet parser is a CoNLL table. demo.sh passes that table to a Python script called conll2tree to generate the tree, so if you are interested in looking at the raw table, all you need to do is comment out the call to conll2tree inside demo.sh. Each row of the table describes one token, with tab-separated columns for the token's index, the word itself, its lemma, its coarse tag, its fine tag, its morphological features, the index of its head, and its dependency label; columns the parser doesn't fill in are left as underscores.
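Here is roughly what that table looks like for the first example sentence. I have reconstructed it from the parse tree shown earlier, so treat it as an approximation; details such as the tags assigned to punctuation may differ slightly depending on the version of the parser bundled in the image.
1   Bob        _   NOUN   NNP    _   8    nsubj   _   _
2   ,          _   .      ,      _   1    punct   _   _
3   a          _   DET    DT     _   4    det     _   _
4   resident   _   NOUN   NN     _   1    appos   _   _
5   of         _   ADP    IN     _   4    prep    _   _
6   Yorkshire  _   NOUN   NNP    _   5    pobj    _   _
7   ,          _   .      ,      _   8    punct   _   _
8   loves      _   VERB   VBZ    _   0    ROOT    _   _
9   his        _   PRON   PRP$   _   10   poss    _   _
10  wife       _   NOUN   NN     _   8    dobj    _   _
11  and        _   CONJ   CC     _   10   cc      _   _
12  children   _   NOUN   NNS    _   10   conj    _   _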
The CoNLL format is obviously less intuitive than the tree, but it is a little easier to work with in a program. For example, after saving the parser's output to a file called output.conll, I can quickly list all the nouns present in a sentence using a simple awk program:
awk -F'\t' '$4 == "NOUN" {print $2}' output.conll
With slightly more complex programs, you can determine details such as who the subject is, what adjectives are associated with it, who or what the dative object is, and so on. I hope you are now beginning to understand the significance of this powerful parser.
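To give you a concrete example of that kind of program, here is a small Python sketch. It is not part of SyntaxNet; it only assumes that you have saved the parser's CoNLL output to output.conll, as in the awk example above, and that the columns follow the layout described earlier. It prints the subject of the sentence along with any adjectives attached directly to it.
# find_subject.py -- print the subject of a parsed sentence and its adjectives.
# Assumes output.conll holds one sentence in the column layout described above:
# index, word, lemma, coarse tag, fine tag, features, head index, label, ...
rows = []
with open("output.conll") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 8:
            rows.append(fields)

for fields in rows:
    if fields[7] == "nsubj":                      # dependency label column
        subject_index, subject = fields[0], fields[1]
        adjectives = [r[1] for r in rows
                      if r[7] == "amod" and r[6] == subject_index]
        print("Subject:", subject)
        if adjectives:
            print("Adjectives:", ", ".join(adjectives))
For the second example sentence, this would print “wife” as the subject. The adjectives “grumpy” and “old” are attached to the appositive “woman” rather than to “wife” itself, so a more thorough version would also want to follow appos links.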

That’s all for now. Thanks for reading. If you found this tutorial useful, please do share it.

2 comments:

  1. Great article
    "The actual output of the SyntaxNet parser is a CoNLL table."
    Would you point me in the right direction on how to get this CoNLL table into a MySQL database?

  2. I want to port the SyntaxNet model to Android natively, but there is no documentation for this model. In the Google TensorFlow examples, they have shared an example of identifying objects using the TensorFlow library. I want to port SyntaxNet to Android in a similar way. Any help in this regard would be much appreciated.

    The basic concept, as I understand it, is that the TensorFlow model needs to be initialized first with the respective model file, the input size defined, and so on; then you run the TensorFlow session and get back the list of POS tags.

    SyntaxNet has lots of Python scripts (I am new to Python). Please guide me.
