Stanford University NLP researchers have built Stanza, a toolkit for many human languages. It is worth a look for anyone working with multilingual text, such as social media. It provides accurate natural language processing tools for 60+ languages and can also access the Java Stanford CoreNLP software from Python.
Its tools, which run as a pipeline, convert a string of human language text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, produce a syntactic dependency parse, and recognize named entities.
Stanza’s modules are built on top of the PyTorch library, using highly accurate neural network components that also enable efficient training and evaluation with your own annotated data. Running Stanza on a GPU-enabled machine yields much faster performance.
- Native Python implementation requiring minimal effort to set up;
- Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological features tagging, dependency parsing, and named entity recognition;
- Pre-trained neural models supporting 66 (human) languages.
Stanza can be installed with pip:

pip install stanza
To see Stanza’s neural pipeline in action, you can launch the Python interactive interpreter, and try the following commands:
>>> import stanza
>>> stanza.download('en')       # download English model
>>> nlp = stanza.Pipeline('en') # initialize English neural pipeline
>>> doc = nlp("Barack Obama was born in Hawaii.") # run annotation over a sentence