Natural Language Processing with NLTK (Intermediate)
Deepen your understanding of Natural Language Processing (NLP) with the Natural Language Toolkit (NLTK) in Python. This blog post will cover essential NLP concepts such as tokenization, stemming, lemmatization, and part-of-speech tagging. Through practical examples and projects, you will apply these techniques to analyze and process text data. Furthermore, you will explore sentiment analysis and text classification, gaining insights into how machines interpret human language.
Introduction
Natural Language Processing is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human language in a valuable way. NLTK is a leading platform for building Python programs that work with human language data, providing easy-to-use interfaces for a wide variety of NLP tasks.
Tokenization
Tokenization is the process of breaking a text document down into smaller pieces, such as words or sentences, called tokens. It's one of the essential first steps in NLP.
# Importing the necessary library
import nltk
from nltk.tokenize import word_tokenize
# Download the tokenizer models (only needed once)
nltk.download('punkt')
# Example text
text = "This is an example text for NLTK tokenization"
tokens = word_tokenize(text)
print(tokens)
Stemming and Lemmatization
Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing.
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# Download the WordNet data used by the lemmatizer (only needed once)
nltk.download('wordnet')
# Initialize stemmer and lemmatizer
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming example
print("Stemming - trees :", porter.stem("trees"))
# Lemmatization example
print("Lemmatization - trees :", lemmatizer.lemmatize("trees"))
Part-of-Speech Tagging
Part-of-speech (POS) is a grammatical category that describes the role a word plays when used together with other words in a sentence. NLTK can automatically tag words with their parts of speech.
from nltk import pos_tag
# Download the tagger model (only needed once)
nltk.download('averaged_perceptron_tagger')
# Example text
text = "This is an example text for NLTK Part-of-Speech tagging"
tokens = word_tokenize(text)
# POS tagging
tagged = pos_tag(tokens)
print(tagged)
Sentiment Analysis and Text Classification
Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. NLTK allows you to perform sentiment analysis and text classification.
# Importing the necessary library
from nltk.sentiment import SentimentIntensityAnalyzer
# Download the VADER lexicon used by the analyzer (only needed once)
nltk.download('vader_lexicon')
# Initialize the sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()
# Example text
text = "This is an awesome course!"
# Get sentiment score
sentiment = sia.polarity_scores(text)
print(sentiment)
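The sentiment scores above come from VADER's built-in lexicon, but NLTK can also train a classifier of your own. Here is a minimal text-classification sketch using NLTK's NaiveBayesClassifier on a tiny hand-made dataset (the example sentences and labels are invented purely for illustration; real applications need far more training data):

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words features: map each lowercase word to True
    return {word: True for word in text.lower().split()}

# Tiny illustrative training set of (features, label) pairs
train = [
    (features("great fantastic course"), "pos"),
    (features("loved the clear examples"), "pos"),
    (features("boring and confusing"), "neg"),
    (features("terrible waste of time"), "neg"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("fantastic examples")))
```

The same pattern scales up: swap in a labeled corpus and richer features, and classify(...) assigns the most probable label to unseen text.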
Top 10 Key Takeaways
- Natural Language Processing (NLP) is the interaction between computers and humans through natural language.
- NLTK is a leading platform for building Python programs to work with human language data.
- Tokenization is the process of breaking a text document down into smaller pieces called tokens.
- Stemming and Lemmatization are Text Normalization techniques in the field of Natural Language Processing.
- Stemming is the process of reducing inflected words to their word stem, base or root form.
- Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language.
- Part-of-speech is a grammatical category that describes the role a word plays when used together with other words in a sentence.
- NLTK can automatically tag words with their parts of speech (POS).
- Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data.
- NLTK provides the SentimentIntensityAnalyzer for easy sentiment analysis.
Ready to start learning? Start the quest now