
Let’s Begin with the 5 Most Important Natural Language Processing Libraries


Let’s assume you have a good understanding of how to approach natural language processing problems, and that you’ve settled on the path you’ll take in attempting to solve yours. At some point you’ll need to put that plan into code, and there’s a fair chance you’ll use an existing NLP library to assist you.

If you’re programming in Python (I can’t help you if you’re not), you have plenty of options. While this article is not an endorsement of any specific set of such solutions, it does provide an overview of five popular libraries you can turn to for help with your problems.

  1. Hugging Face Datasets

The Datasets library from Hugging Face is essentially a bundled array of publicly accessible NLP datasets with a standard set of APIs and data formats, as well as some ancillary features.

It is the largest hub of ready-to-use NLP datasets for ML models, and it ships with quick, easy-to-use, and powerful data manipulation tools.

One can easily install Datasets with:

pip install datasets

Datasets has two main features: one-line dataloaders for many public datasets and fast data pre-processing. Another important feature of the library is a collection of built-in evaluation metrics applicable to common NLP tasks. Additional features include back-end memory management of datasets and interoperability with common Python tools like NumPy and pandas, as well as the machine learning frameworks TensorFlow and PyTorch.

Let’s first look at loading a dataset:

from datasets import load_dataset, list_datasets

print(f"The Hugging Face datasets library contains {len(list_datasets())} datasets")

squad_dataset = load_dataset('squad')

print(squad_dataset['train'][0])
print(squad_dataset)

The Hugging Face datasets library contains 635 datasets

Reusing dataset squad (/home/matt/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
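The fast pre-processing and interoperability features mentioned earlier are just as easy to reach from a loaded dataset. Below is a minimal sketch using the squad_dataset object from above; the lowercasing transform is an arbitrary illustrative choice, not part of the library’s examples:

# A minimal sketch of Datasets' pre-processing and interoperability
# features; the lowercasing transform is purely illustrative.

# Apply a simple transformation to every example in the training split
train = squad_dataset['train'].map(
    lambda example: {'question': example['question'].lower()}
)

# Expose the data as a pandas DataFrame for use with other Python tools
train.set_format(type='pandas')
print(train[:5])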

Loading a metric is just as straightforward:

from datasets import load_metric, list_metrics

print(f"The Hugging Face datasets library contains {len(list_metrics())} metrics")
print(f"Available metrics are: {list_metrics()}")

# Load a metric
squad_metric = load_metric('squad')

The Hugging Face datasets library contains 19 metrics

Available metrics are: ['accuracy', 'bertscore', 'bleu', 'bleurt', 'comet', 'coval', 'f1', 'gleu', 'glue', 'indic_glue', 'meteor', 'precision', 'recall', 'rouge', 'sacrebleu', 'seqeval', 'squad', 'squad_v2', 'xnli']

It’s up to you what you do with them, but it’s never been easier to load a publicly available dataset and a tried-and-true evaluation metric.
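As a rough sketch of how the loaded metric can be used, here is a toy evaluation; the prediction and reference below are made-up values chosen only to illustrate the expected input format:

# Hypothetical prediction/reference pair, for illustration only
predictions = [{'id': '5733be284776f41900661182',
                'prediction_text': 'Saint Bernadette Soubirous'}]
references = [{'id': '5733be284776f41900661182',
               'answers': {'text': ['Saint Bernadette Soubirous'],
                           'answer_start': [515]}}]

# Compute exact-match and F1 for the toy pair
print(squad_metric.compute(predictions=predictions, references=references))

With a perfect match like this, the metric should report exact-match and F1 scores of 100.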

  2. TextHero

In its Github repository, TextHero is succinctly defined as:

From zero to hero: text preprocessing, representation, and visualization.

Though this does a good job of describing, in a few words, what you might use this library for, the repo’s answer to the question "why TextHero?" sheds a little more light on the library’s motivation:

Texthero has a single goal: to free up the developer’s time in a very practical way. Working with text data can be tedious, and in most situations a default pipeline will suffice. There is always time to come back and improve previous work.

Here’s how to get TextHero now that you know why you may want to use it:

pip install texthero

A quick glance at the getting started guide demonstrates what you can do with only a few lines of code. In the example below, taken from the TextHero Github repo, we load a dataset, clean it, construct a TF-IDF representation, perform principal component analysis (PCA), and plot the results of this PCA.

import texthero as hero
import pandas as pd

df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")

df['pca'] = (
    df['text']
        .pipe(hero.clean)
        .pipe(hero.tfidf)
        .pipe(hero.pca)
)

hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

There’s a lot more you can do with TextHero, so read through the rest of the documentation to learn about data cleaning and preprocessing, visualisation, representation, simple NLP tasks, and more.
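For instance, cleaning doesn’t have to mean the default pipeline; a custom one can be assembled from the functions in texthero.preprocessing. The sketch below assumes the same BBC Sport dataset as above, and the particular set of steps is an illustrative choice rather than a recommendation:

import pandas as pd
import texthero as hero
from texthero import preprocessing

df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")

# An illustrative custom cleaning pipeline; pick the steps your data needs
custom_pipeline = [
    preprocessing.fillna,
    preprocessing.lowercase,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    preprocessing.remove_whitespace,
]

df['clean_text'] = hero.clean(df['text'], pipeline=custom_pipeline)
print(df[['text', 'clean_text']].head())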

  3. spaCy

spaCy was created with the aim of becoming a useful library for developing production-ready systems.

spaCy was created to assist you in doing real work, such as building real products or gathering real insights. The library values your time and makes every effort not to waste it. It’s simple to set up, and the API is straightforward and efficient. spaCy has been dubbed the Ruby on Rails of Natural Language Processing.

When you’re ready to get down to business, you’ll need to install spaCy and at least one language model first. We’ll use the English language model in this example. The following lines can be used to install both the library and the language model:

pip install spacy
python -m spacy download en

To get started with spaCy, we will use this sentence of sample text:

sample = u"I can't imagine spending $3000 for a single bedroom apartment in N.Y.C."

Let’s now add spaCy and a list of English stop words to the mix. We also load the English language model as a Language object (called ‘nlp’ in spaCy convention), and then call the nlp object on our sample text, which returns a processed Doc object (called ‘doc’ in spaCy convention).

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en')
doc = nlp(sample)

So... is that all there is to it? The following is taken from the spaCy documentation:

Even after a Doc has been processed (e.g., split into individual words and annotated), it retains all of the original text’s content, including whitespace characters. You can either get a token’s offset in the original string, or you can recreate it by joining the tokens and their trailing whitespace. When using spaCy to process text, you’ll never lose any details this way.
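As a quick sanity check of that claim, here is a minimal sketch using the doc and sample objects created above: each token carries its offset into the original string, and joining tokens with their trailing whitespace reproduces the input exactly.

# Each token knows where it started in the original string
for token in doc:
    print(token.idx, repr(token.text_with_ws))

# Joining tokens and their trailing whitespace recreates the input
reconstructed = "".join(token.text_with_ws for token in doc)
print(reconstructed == sample)  # True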

Now, let’s have a look at that processed sample:

# Print out tokens
print("Tokens:\n=======")
for token in doc:
    print(token)

# Identify stop words
print("Stop words:\n===========")
for word in doc:
    if word.is_stop == True:
        print(word)

# POS tagging
print("POS tagging:\n============")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

# Print out named entities
print("Named entities:\n===============")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Tokens:

I
ca
n't
imagine
spending
$
3000
for
a
single
bedroom
apartment
in
N.Y.C.

Stop words:

ca
for
a
in

POS tagging:

I -PRON- PRON PRP nsubj X True False
ca can VERB MD aux xx True True
n't not ADV RB neg x'x False False
imagine imagine VERB VB ROOT xxxx True False
spending spend VERB VBG xcomp xxxx True False
$ $ SYM $ nmod $ False False
3000 3000 NUM CD dobj dddd False False
for for ADP IN prep xxx True True
a a DET DT det x True True
single single ADJ JJ amod xxxx True False
bedroom bedroom NOUN NN compound xxxx True False
apartment apartment NOUN NN pobj xxxx True False
in in ADP IN prep xx True True
N.Y.C. n.y.c. PROPN NNP pobj X.X.X. False False

Named entities:

3000 26 30 MONEY
N.Y.C. 65 71 GPE
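The same doc object also provides sentence segmentation and noun chunks with no extra work; here is a small sketch (both features rely on the dependency parse that ships with the English model):

# Sentence segmentation
for sent in doc.sents:
    print(sent.text)

# Noun chunks and the dependency relation of each chunk's head token
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_)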

SpaCy is an efficient, opinionated NLP tool that can be used for anything from preprocessing to representation to modelling. To see where you can go from here, look at the spaCy documentation.

  4. Hugging Face Transformers

It’s difficult to overstate how important the Hugging Face Transformers library has become to NLP practice. By giving practitioners easy access to thousands of pre-trained models, it has put state-of-the-art NLP within reach of just about anyone.

The big picture, taken straight from the Github repository:

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0

Transformers offers thousands of pre-trained models to perform tasks on texts in 100+ languages, such as classification, information extraction, question answering, summarization, translation, and text generation. Its aim is to make cutting-edge NLP more accessible to the general public.

Transformers provides APIs that allow you to easily download and use pre-trained models on any document, fine-tune them on your own datasets, and share them with the community on our model hub. Simultaneously, each python module that defines an architecture can be used independently and changed to allow for fast research experiments.

Transformers is backed by the two most common deep learning libraries, PyTorch and TensorFlow, which have a seamless integration that allows you to train your models with one and then load them for inference with the other.
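Here is a minimal sketch of what that interoperability can look like in practice, assuming both TensorFlow and PyTorch are installed; 'my-finetuned-model' is just a hypothetical local directory:

from transformers import TFAutoModel, AutoModel

# Work with a model in TensorFlow (fine-tuning would happen here) ...
tf_model = TFAutoModel.from_pretrained('bert-base-uncased')
tf_model.save_pretrained('my-finetuned-model')

# ... then load the saved TensorFlow weights into a PyTorch model
pt_model = AutoModel.from_pretrained('my-finetuned-model', from_tf=True)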

Write With Transformer, the official demo of what the Transformer library can do, allows you to try out the library online.

Installation of this complex library is simple:

pip install transformers

There’s a lot to the Transformers library, and you might spend a long time studying everything there is to know about it. The pipeline API, on the other hand, enables the use of a model right away, with very little configuration required. Here’s an example of how to classify data using a Transformers pipeline (note that you’ll need either TensorFlow or PyTorch to get started):

from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline('sentiment-analysis')

# Classify text
print(classifier('I am a fan of KDnuggets, its useful content, and its helpful editors!'))

[{'label': 'POSITIVE', 'score': 0.9954679012298584}]

That’s it. It’s ridiculous. The pipeline employs a pre-trained model as well as the model’s preprocessing, and the results can be very impressive even without fine-tuning.

Here’s another example of a pipeline, this time for answering questions:

from transformers import pipeline

# Allocate a pipeline for question-answering
question_answerer = pipeline('question-answering')

# Ask a question
answer = question_answerer({
    'question': 'Where is KDnuggets headquartered?',
    'context': 'KDnuggets was founded in February of 1997 by Gregory Piatetsky in Brookline, Massachusetts.'
})

# Print the answer
print(answer)

{'score': 0.9153624176979065, 'start': 66, 'end': 90, 'answer': 'Brookline, Massachusetts'}

To be sure, these are simple examples, but the pipelines are useful for far more than trivial KDnuggets-related tasks! More information on pipelines can be found here.
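When a pipeline isn’t flexible enough, you can drop down a level and drive the tokenizer and model directly. A rough sketch follows; the checkpoint named here is the one the sentiment-analysis pipeline typically uses by default, but any compatible model from the hub would work:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize, run the model, and map the highest-scoring logit to its label
inputs = tokenizer('I am a fan of KDnuggets!', return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs)[0]

print(model.config.id2label[logits.argmax(dim=-1).item()])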

Transformers makes cutting-edge models accessible to the general public. To learn more, go to the library’s Github repository.

  5. Scattertext

Scattertext is a tool for producing visually appealing representations of how language varies across document types. From the project’s Github page:

A tool for finding distinguishing terms in corpora and presenting them in an interactive HTML scatter plot. Points corresponding to terms are selectively labeled so that they do not overlap with other labels or points.

If you haven’t figured it out yet, installation is done with:

pip install scattertext

The example below is from the Github repo, and it visualises words used at the 2012 US political conventions.

The scatter plot shows the 2,000 most party-associated unigrams as points. Their x- and y-axes are the dense ranks of their usage by Republican and Democratic speakers, respectively.

It’s worth noting that running the example code results in an HTML file that can be accessed and interacted with in your browser.

import scattertext as st

df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)

corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse'
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat', category_name='Democratic', not_category_name='Republican',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank
)
open('./demo_compact.html', 'w').write(html)

The following are the results of viewing the saved HTML file (note that this is a static image that is not interactive):

Scattertext has a specific, well-defined purpose, and it fulfills it admirably. The visualizations are both visually stunning and, more importantly, informative. More information and examples of what else can be done with the library can be found in its Github repository.

