Summary: Deep learning has yielded amazing advances in natural language processing. Tap into the latest innovations with Explosion, Huggingface, and John Snow Labs.
Natural language processing (NLP) has been a long-standing dream of computer scientists, dating back to the days of ELIZA and even to the foundations of computing itself (Turing Test, anybody?). NLP has undergone a dramatic revolution in the past few years, with the statistical methods of the past giving way to approaches based on deep learning, or neural networks.
Applying deep learning to NLP has led to massive, sophisticated, general-purpose language models, like GPT-3, capable of generating text that is often hard to distinguish from human writing. GPT-3, for example, unlocks features such as those found in Microsoft’s new “no-code” Power Apps platform, where you can enter a natural language description of a query, and the back end will generate the code (a Power Fx expression based on Excel syntax).
NLP has vast potential across the enterprise, and it’s not just giants like Google or Microsoft that are bringing products to the table. In this article, we’ll look at three start-ups that run the gamut from providing AI-powered solutions to offering the building blocks for creating your own custom NLP solutions.
Explosion
Most developers who work in NLP circles will have interacted with spaCy, the popular NLP library for Python, but far fewer will have heard of Explosion, the company founded by Matthew Honnibal and Ines Montani that develops spaCy and the commercial annotation tool Prodigy.
One of the premier NLP toolkits for many years, spaCy is capable of handling massive production workloads without breaking a sweat, which distinguishes it from other libraries of a similar age. If you haven’t used spaCy for a while, you may be surprised to see how well it has kept up with the bleeding edge of modern NLP techniques, with pipelines based on pre-trained Transformer models such as BERT, the ability to integrate custom models from PyTorch or TensorFlow, and support for more than 50 languages out of the box.
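As a rough illustration of how a spaCy pipeline is assembled, the sketch below builds a blank English pipeline and adds a rule-based entity recognizer. It assumes spaCy v3 is installed. The function name `extract_entities` and the patterns are made up for this example, and a blank pipeline is used so that no pre-trained model has to be downloaded. A Transformer-based pipeline would normally be loaded with `spacy.load` and the name of a pre-trained package instead.

```python
def extract_entities(text, patterns):
    """Return (text, label) pairs found by a rule-based spaCy pipeline."""
    # Imported here so this module still loads where spaCy is not installed.
    import spacy

    # A blank pipeline contains only a tokenizer; no pre-trained model is needed.
    nlp = spacy.blank("en")

    # The built-in entity_ruler component matches tokens against the patterns.
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)

    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Hypothetical patterns, chosen only for this illustration.
EXAMPLE_PATTERNS = [
    {"label": "ORG", "pattern": "Explosion"},
    {"label": "PRODUCT", "pattern": "Prodigy"},
]
```

Calling `extract_entities("Explosion develops spaCy and Prodigy.", EXAMPLE_PATTERNS)` should return the two matched spans along with their labels. Swapping the rule-based component for a statistical one changes how entities are found, but the surrounding pipeline code stays much the same.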
While spaCy is open source, Explosion also offers a paid product, Prodigy, which aims to become an invaluable part of the data scientist’s toolkit. It enables expressive, scriptable annotation of data sets, with a tight interaction loop with spaCy as well as comprehensive support for annotating images, audio, and video. Prodigy comes with recipes for building pipelines for classification, transcription, bounding boxes, and much more. These should allow data scientists to take a more active role in efficiently annotating data sets, in turn driving down the cost of building rich input data and creating better models.
Huggingface
It has been quite a journey from the company that produced a PyTorch library of Transformer-based NLP models and the Write With Transformer website to the all-conquering NLP juggernaut that is today’s Huggingface (or 🤗). Not only is Huggingface’s Transformers library the de facto standard for text processing these days, but the turnaround time between a new paper or technique appearing and its arrival in the library is often measured in days rather than weeks.
The Huggingface model zoo has expanded beyond a hub for all sorts of models (spanning different domains, languages, sizes, and so on) to include a hosted inference API that boasts accelerated implementations of many models, plus an easy-to-use API for working with a host of different data sets. You can find Huggingface being used by thousands of companies, ranging from applied usage at the likes of Grammarly to research uses by, yes, Microsoft, Google, and Facebook. On top of all this, Huggingface contributes other, smaller libraries to the machine learning ecosystem, such as the recent Accelerate library, which takes much of the hassle out of training large models across a set of distributed machines.
Huggingface is not slowing down, either. In recent months we’ve seen audio and image models being added to the platform, and it’s likely that Huggingface will be right there at the forefront as the Transformer architecture continues to eat its way through the deep learning space, conquering all in its path.
John Snow Labs
John Snow Labs is the custodian of Spark NLP, an open source NLP framework that, perhaps not surprisingly, runs on top of Apache Spark. Spark NLP is incredibly popular in the enterprise, where you’ll find it powering all sorts of NLP pipelines for applications like named entity recognition (NER), information retrieval, classification, and sentiment analysis. Like spaCy, it has evolved to fit in with the new paradigms in NLP, coming as standard with an enormous number of deep learning models (over 700!) and over 400 pipelines for various applications. It also takes advantage of Apache Spark’s scaling to offer an easier story for distributed deployment than many of its competitors.
Interestingly, John Snow Labs builds upon Spark NLP with three paid products. Two are heavily targeted at the healthcare industry, and the third is aimed primarily at that field too but can be used in other domains. The first is Healthcare AI, a managed platform running on top of Kubernetes for healthcare analysis and research. The second is a set of add-on packages for Spark NLP that enable methods such as clinical entity recognition and linking, extraction of medical concepts, and de-identification of text.
The other paid product is Spark OCR, which claims to be the best-in-class OCR solution available. Its ability to capture regions and output in DICOM format as well as PDF betrays a slight bias towards the healthcare domain, but it has a suite of more generalised pipelines for image processing, denoising, and deskewing. It can also, of course, integrate with Spark NLP to produce easily scalable pipelines that can do end-to-end NER extraction from any given input image.
There’s a lot of embedded knowledge within Spark NLP, and in the healthcare domain, John Snow Labs seems to have an advantage over the other big NLP library providers.
What’s next in NLP
What are we likely to see in the NLP space in the coming months? A lot more of the same, I imagine, but bigger; trillion-parameter models are becoming more common at companies such as Google, Microsoft, and Facebook. While GPT-3 is currently locked away behind OpenAI’s API, expect GPT-NeoX, the open source “re-creation” of GPT-3, to release its 175 billion parameter model sometime this year, bringing the power of GPT-3’s generative capabilities to pretty much anybody on the planet.
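To put those parameter counts in perspective, here is a back-of-the-envelope sketch of how much storage the weights alone would need. The helper name `model_size_gb` is made up for this example, and the default of 2 bytes per parameter (16-bit weights) is an assumption, not a figure published for any of the models mentioned.

```python
def model_size_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Return the storage needed for the weights alone, in decimal gigabytes."""
    # Assumption: every parameter is stored at the same fixed width.
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts taken from the text: 175 billion and one trillion.
for label, count in [("175 billion", 175_000_000_000), ("1 trillion", 10**12)]:
    half = model_size_gb(count)        # 16-bit weights (assumed)
    full = model_size_gb(count, 4)     # 32-bit weights (assumed)
    print(f"{label} parameters: {half:,.0f} GB at 16-bit, {full:,.0f} GB at 32-bit")
```

Even under these simplified assumptions, the weights of a 175 billion parameter model come to hundreds of gigabytes, so an open release makes the model available to anybody but does not make it cheap to run.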
Finally, we can expect researchers to continue chipping away at the other end of the scale, trying to make these architectures run faster and more efficiently for smaller devices and for longer documents. And you can rest assured that the results of all that research will be present in the offerings from Explosion, Huggingface, and John Snow Labs too, probably in a matter of weeks after publication.