
Gary Sieling is a software developer interested in dev-ops, database technologies, and machine learning. He has a computer science degree from the Rochester Institute of Technology. He has worked on many products in the legal and regulatory industries, having worked on and supported several data warehousing applications. Gary is a DZone MVB and is not an employee of DZone and has posted 62 posts at DZone.

Book Review: Natural Language Processing with Python

07.10.2013

“Natural Language Processing with Python” provides a nice overview of NLP techniques and Python, using NLTK (Natural Language Toolkit), a framework maintained by the book’s authors. It’s intended for use as (I assume) an undergraduate textbook (some of their examples of “difficult” bits of code will not appear difficult to more experienced programmers).

Don’t be put off by the use of a specific library, or the idea of reading a textbook – the book is written in an easy-to-read, engaging style, and the library makes it easy to get into NLP. Most of their examples could be reproduced in any preferred framework/language, given access to the right data. The framework exists in part to make it very easy to get started with NLP, and provides an easy mechanism to download some useful datasets, as well as APIs that are thorough enough to get you going in no time. I haven’t used NLTK enough yet to have a feeling one way or the other about whether it is suitable for production use, but clearly it is good for prototyping.

The book’s treatment of Python is interesting: various language structures are introduced throughout the book, mostly sprinkled at the end of each chapter. For someone experienced with the language, these could easily be skipped. Not knowing Python myself, I found the examples sufficient to get me writing code in no time, without needing external references.

I found the exercises at the ends of chapters quite helpful, even though I didn’t complete them all, as they provide food for thought and insight into how techniques are used. Different chapters cover basic analysis, part-of-speech tagging, entity extraction, summarizing text contents, and grammars.

It becomes clear on reading through the book that a lot of NLP is very similar to an ETL data cleaning process, with the caveat that it will likely never be fully “solved.” A lot of the techniques are specific tactics for handling large volumes of English text; a lot of the work gets you most of the way to a solution, at which point you are forced to either correct the remaining errors in the data, feed another process that is ok with errors, or just accept them.
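To make the ETL comparison concrete, here is a minimal sketch of a cleaning step in plain Python. It is not taken from the book and does not use NLTK; the sample sentence, the `clean_tokens` helper, and the regular expression are all assumptions chosen for illustration. It shows how a simple rule gets most tokens right while leaving a few errors behind.

```python
import re
from collections import Counter

# Hypothetical sample text, made up for this sketch (not from the book).
RAW = "The U.S. economy grew 3.2% in Q2. Don't over-read it, analysts said."

def clean_tokens(text):
    """Lowercase the text and pull out word-like tokens.

    A token is a run of letters, optionally joined to further runs
    by a single apostrophe or hyphen (so "don't" and "over-read" survive).
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

tokens = clean_tokens(RAW)
counts = Counter(tokens)  # word frequencies, ready for a later analysis step

print(tokens)
```

Most of the sentence comes through cleanly, but the rule splits "U.S." into two one-letter tokens, truncates "Q2" to "q", and drops "3.2%" entirely. Those leftovers are the kind of error you then have to correct, pass downstream, or accept.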

There are different angles of approach to NLP problems – ranging from specific tricks and tactics, through statistical modelling techniques, to formalized grammars on the more rigid mathematical side. A surprise to me was the coverage that formalized grammars and lambda calculus receive in this book – clearly language does not make formalized grammars easy to develop, and the book covers a series of powerful extensions to concepts I learned in school, like context-free grammars, which make them more attractive.

Perhaps more surprising is that the most accurate NLP results appear to come not from a particular approach, but from combining the results of different types of algorithms. “Natural Language Processing with Python” has numerous passing mentions of real use cases, which I find helpful for seeing the value of the material – text to speech, language translation, entity recognition, text summarization, etc. Overall, a good read.
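As a postscript on combining algorithms, the idea can be sketched as a majority vote between simple part-of-speech taggers. This is a toy illustration in plain Python, not the book's method and not NLTK's API; the three taggers, the tiny lexicon, and the tag names are all hypothetical.

```python
from collections import Counter

def suffix_tagger(word):
    """Guess a tag from the word ending alone; None means no opinion."""
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return None

def lookup_tagger(word):
    """Look the word up in a tiny hand-written lexicon (hypothetical)."""
    lexicon = {"the": "DET", "dog": "NOUN", "quickly": "ADV", "building": "NOUN"}
    return lexicon.get(word)

def default_tagger(word):
    """Fallback that always answers NOUN."""
    return "NOUN"

# Order matters: on a tie, the tag proposed by the earlier tagger wins.
TAGGERS = [suffix_tagger, lookup_tagger, default_tagger]

def vote(word):
    """Return the tag proposed by the most taggers, ignoring abstentions."""
    proposals = [tag for tag in (t(word) for t in TAGGERS) if tag is not None]
    counts = Counter(proposals)
    # max() returns the first maximal key, and Counter preserves insertion
    # order, so ties go to the tag that was proposed first.
    return max(counts, key=counts.get)

for w in ["the", "building", "quickly", "running"]:
    print(w, vote(w))
```

On its own, the suffix rule would call "building" a verb; the lexicon and the fallback outvote it. For "running", which is missing from the lexicon, the suffix rule still carries the tie. Neither tagger is reliable alone, but the combination does better on this small set.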

Published at DZone with permission of Gary Sieling, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)