
Summarize a Text with Python — Continued | by Leo van der Meulen | Nov, 2022



How to efficiently summarize a text with Python and NLTK and, as a bonus, how to detect the language of a text

Photo by Mel Poole on Unsplash

In last month’s article ‘Summarize a text with Python’ I showed how to create a summary for a given text. Since then, I have been using this code frequently and found some flaws in its usage. The summarize method has been replaced by a class performing this function, making it easier, for example, to reuse the same language and summary length. The previous article was very popular so I would love to share the updates with you!

Improvements made are:

  • Introduced a Summarizer class, storing general data in attributes
  • Used the built-in NLTK corpus stop word lists, while keeping the possibility to use your own list
  • Added auto-detection of the language of a text to load the stop word list for that language
  • Made it possible to call the summary function with a string or a list of strings
  • Added optional sentence weighting on length
  • Incorporated a summary method for text files

The result can be found on my Github. Feel free to use it or adapt it to your own wishes.

The basics of the Summarizer class

So let’s start with the basics of the Summarizer. The class stores the language name, stop words set and the default length for the summaries to generate:
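The full code is on my Github; as a minimal sketch, the skeleton of the class could look like the following. The method names follow this article, but the bodies are illustrative and may differ from the actual implementation:

from nltk.corpus import stopwords


class Summarizer:
    """Summarize texts based on word frequencies (sketch)."""

    def __init__(self, language='english', summary_length=5):
        self.set_language(language)
        self.summary_length = summary_length

    def set_language(self, language):
        # Load the NLTK stop word list for the given language
        self.language = language
        self.stop_words = set(stopwords.words(language))

    def set_stop_words(self, stop_words):
        # Replace the stop word list with a prepared one
        self.stop_words = set(stop_words)

    def read_stopwords_from_file(self, filename):
        # Read stop words from a file, one word per line
        with open(filename, encoding='utf-8') as f:
            self.stop_words = {line.strip() for line in f if line.strip()}

    def set_summary_length(self, summary_length):
        # Change the default number of sentences in a summary
        self.summary_length = summary_length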

The basic idea is to use the stop word lists from NLTK. NLTK supports 24 languages, including English, Dutch, German, Turkish and Greek. It is possible to provide a prepared stop word list (set_stop_words) or a file with the words (read_stopwords_from_file). The set_language method loads the word set from NLTK by specifying the name of the language. The default length of a summary can be changed by calling set_summary_length.

The easiest way to use the class is to use the constructor:

summ = Summarizer(language='dutch', summary_length=3)
# or
summ = Summarizer('dutch', 3)

The language identifier, the stop word list and the summary length are stored in attributes and will be used by the summary methods.

Summarizing a text

The core of the class is the summarize method. This method follows the same logic as the summarize function from the previous article:

1. Count occurrences per word in the text (stop words excluded)
2. Calculate weight per used word
3. Calculate sentence weight by summing the weights per word
4. Find sentences with the highest weight
5. Place these sentences in their original order

The working of this algorithm is explained in the previous article. Only the differences are described here.

Where the previous implementation only accepted a string, the new implementation accepts both single strings and lists of strings.

First, the input is transformed into a list of strings. If the input is a single string, it is converted to a list of sentences by tokenizing it; if the input is one sentence, a list of one sentence is created. The method can then iterate over this list, independent of the input type.

Another change is the option to weight the sentence weight by sentence length: the summed weight of the individual words is divided by the number of words in the sentence. Some excellent feedback on the previous article pointed out that shorter sentences were undervalued by the previous implementation: a short sentence with important words might receive a lower value than a long sentence with more low-importance words. Enabling or disabling this option depends on the input text; some experimentation might be needed to determine the best setting for your usage.
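Continuing the sketch of the class, the summarize method could look roughly like this. The weighted parameter name and the exact tokenizer calls are assumptions on my part, but the five steps are the ones listed above:

from nltk.tokenize import sent_tokenize, word_tokenize


def summarize(self, text, summary_length=None, weighted=False):
    # Sketch of the summarize method of the Summarizer class
    summary_length = summary_length or self.summary_length

    # Accept a single string or a list of sentences
    if isinstance(text, str):
        sentences = sent_tokenize(text, language=self.language)
    else:
        sentences = list(text)

    # 1. Count occurrences per word, stop words excluded
    frequencies = {}
    for sentence in sentences:
        for word in word_tokenize(sentence.lower(), language=self.language):
            if word.isalpha() and word not in self.stop_words:
                frequencies[word] = frequencies.get(word, 0) + 1

    # 2. Calculate a weight per used word
    max_frequency = max(frequencies.values(), default=1)
    weights = {word: count / max_frequency
               for word, count in frequencies.items()}

    # 3. Sentence weight = sum of its word weights,
    #    optionally divided by the sentence length
    sentence_weights = {}
    for index, sentence in enumerate(sentences):
        words = word_tokenize(sentence.lower(), language=self.language)
        weight = sum(weights.get(word, 0) for word in words)
        if weighted and words:
            weight /= len(words)
        sentence_weights[index] = weight

    # 4. Find the sentences with the highest weight and
    # 5. return them in their original order
    top = sorted(sorted(sentence_weights, key=sentence_weights.get,
                        reverse=True)[:summary_length])
    return ' '.join(sentences[index] for index in top)

With this in place, summ.summarize(text, weighted=True) returns the highest-weighted sentences of the text, joined in their original order.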

Summarizing a text file

The challenge of summarizing a large text file was already mentioned; this rewrite adds support for it. The method summarize_file summarizes the content of a file in three steps:

1. Split the text into chunks of n sentences
2. Summarize each chunk of sentences
3. Concatenate these summaries

First, the contents of the file are read into a single string and cleaned of newlines and superfluous spaces. The text is then split into sentences using the NLTK tokenizer, and the sentences are grouped into chunks of split_at sentences.

For each of these chunks the summary is determined using the summarize method above. These separate summaries are concatenated to form the final, full summary of the file.
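A minimal sketch of summarize_file, following the steps above; the default value of split_at is an assumption:

from nltk.tokenize import sent_tokenize


def summarize_file(self, filename, summary_length=None, split_at=50):
    # Sketch of the summarize_file method of the Summarizer class
    # Read the file into a single string, removing newlines
    # and superfluous spaces
    with open(filename, encoding='utf-8') as f:
        text = ' '.join(f.read().split())

    # Split into sentences and group them into chunks of split_at sentences
    sentences = sent_tokenize(text, language=self.language)
    chunks = [sentences[i:i + split_at]
              for i in range(0, len(sentences), split_at)]

    # Summarize each chunk and concatenate the partial summaries
    return ' '.join(self.summarize(chunk, summary_length)
                    for chunk in chunks)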

Auto detect language

The final addition is a method to detect the language of a text. There are several libraries available to perform this function, like spaCy’s LanguageDetector, Pycld, TextBlob and GoogleTrans. But it is always more fun and educational to build your own.

Here, we will use the stop word lists from NLTK to build a language detector, thus limiting it to the languages with stop word lists in NLTK. The idea is that we can count the number of occurrences of stop words in a text. If we do this for each language, the language with the highest count is the language the text is written in. Simple, not the best, but sufficient and fun:
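A minimal sketch of such a detector, as a method of the Summarizer class; the method name detect_language is an assumption:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


def detect_language(self, text):
    # Sketch of a language detector as a method of the Summarizer class
    words = word_tokenize(text.lower())

    # Count the stop word occurrences in the text for every NLTK language
    occurrences = {}
    for language in stopwords.fileids():
        stop_words = set(stopwords.words(language))
        occurrences[language] = sum(1 for word in words if word in stop_words)

    # The language with the most stop word occurrences wins
    language = max(occurrences, key=occurrences.get)
    self.set_language(language)
    return language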

Determining the number of stop word occurrences per language works as follows. The (nltk.corpus.)stopwords.fileids() call returns a list of all available languages in the NLTK corpus. For each of these languages the stop words are obtained and it is counted how often they occur in the given text. The results are stored in a dictionary with the language as key and the number of occurrences as value.

By taking the language with the highest count we obtain the estimated language. The class is then initialized for this language and the language name is returned.

Final words

The code underwent some major changes since the last release, making it easier to use. The language detection is a nice addition, though honestly, better implementations are already available. It is included as an example of how such functionality can be built.

The quality of the summaries still surprises me, despite the relatively simple approach. The big advantage is that the algorithm works for all languages, while NLP implementations usually work for a very limited number of languages, especially English.

The full code is available on Github; feel free to use it and build your own implementation on top of it.

I hope you enjoyed this article. For more inspiration, check some of my other articles:

If you like this story, please hit the Follow button!

Disclaimer: The views and opinions included in this article belong only to the author.


