Formally, NLP is a specialized field of computer science and artificial intelligence with roots in computational linguistics.
It is primarily concerned with designing and building applications and systems that enable interaction between machines and the natural languages that humans have evolved for their own use. Hence, it is often perceived as a niche area to work in, and people usually tend to focus more on machine learning or statistical learning. When I started delving into the world of data science, even I was overwhelmed by the challenges of analyzing and modeling text data.
Thus, there is no pre-requisite to buy any of these books to learn NLP. When building the content and examples for this article, I wondered whether I should focus on a toy dataset to explain things better, or on an existing dataset from one of the main sources of data science datasets. Then I thought, why not build an end-to-end tutorial, where we scrape the web to get some text data and showcase examples based on that! The source data we will be working with is news articles retrieved from inshorts, a website that gives us short news articles on a wide variety of topics, and they even have an app for it!
In this article, we will be working with text data from news articles on technology, sports and world news. I will be covering some basics on how to scrape and retrieve these news articles from their website in the next section.
Typically, any NLP-based problem can be solved by a methodical workflow that has a sequence of steps. The major steps are depicted in the following figure. We usually start with a corpus of text documents and follow standard processes of text wrangling and pre-processing, parsing and basic exploratory data analysis. Based on the initial insights, we usually represent the text using relevant feature engineering techniques.
Depending on the problem at hand, we either focus on building predictive supervised models or unsupervised models, which usually focus more on pattern mining and grouping. Finally, we evaluate the model and the overall success criteria with relevant stakeholders or customers, and deploy the final model for future usage.
We will be scraping the inshorts website, leveraging Python to retrieve news articles. We will be focusing on articles on technology, sports and world affairs. A typical news category landing page is depicted in the following figure, which also highlights the HTML section for the textual content of each article. Thus, we can see the specific HTML tags which contain the textual content of each news article in the landing page mentioned above.
We will be using this information to extract news articles by leveraging the BeautifulSoup and requests libraries.
We will now build a function which will leverage requests to access and get the HTML content from the landing pages of each of the three news categories. Then, we will use BeautifulSoup to parse and extract the news headline and article textual content for all the news articles in each category. We find the content by accessing the specific HTML tags and classes where it is present (a sample of which I depicted in the previous figure). The function extracts the news headline, article text and category, and builds a data frame in which each row corresponds to a specific news article.
We will now invoke this function and build our dataset. We now have a neatly formatted dataset of news articles, and you can quickly check the total number of news articles with the following code. There are usually multiple steps involved in cleaning and pre-processing textual data. We will be leveraging a fair bit of nltk and spacy, both state-of-the-art libraries in NLP. We will remove negation words from the stop word list, since we want to keep them: they might be useful, especially during sentiment analysis.
We leverage a standard set of contractions available in the contractions. Please add it in the same directory you run your code from, else it will not work. Often, unstructured text contains a lot of noise, especially if you use techniques like web or screen scraping. It is quite evident from the above output that we can remove unnecessary HTML tags and retain the useful textual information from any document.
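As a rough illustration of the tag-stripping step described above, here is a minimal sketch that uses only Python's standard-library `html.parser` module instead of BeautifulSoup. The names `_TextExtractor` and `strip_html_tags` are hypothetical helpers introduced for this example, and the sketch assumes that dropping the contents of `<script>` and `<style>` elements is what you want.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores anything inside <script>/<style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes with spaces and collapse repeated whitespace.
        # Note: inline tags inside a word would therefore introduce a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html_tags(html):
    """Return the visible text of an HTML fragment (illustrative helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><h2>Some important text</h2><script>var x = 1;</script><p>More &amp; text</p></html>"
print(strip_html_tags(sample))
```

For heavily malformed pages a dedicated parser such as BeautifulSoup is likely to be more forgiving than this sketch.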
Hence, we need to make sure that these characters are converted and standardized into ASCII characters. The preceding function shows us how we can easily convert accented characters to normal English characters, which helps standardize the words in our corpus. Contractions are shortened versions of words or syllables.
They often exist in either written or spoken forms in the English language. These shortened versions or contractions of words are created by removing specific letters and sounds. In case of English contractions, they are often created by removing one of the vowels from the word. Converting each contraction to its expanded, original form helps with text standardization. We can see how our function helps expand the contractions from the preceding output.
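The two normalization steps just described, converting accented characters to ASCII and expanding contractions, can be sketched with the standard library alone. This is a minimal illustration, not the article's original code: `CONTRACTION_MAP` here is a tiny hypothetical mapping (a usable one needs many more entries), and the function names are assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny mapping used only for illustration.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'd": "i would",
    "it's": "it is",
    "y'all": "you all",
}


def remove_accented_chars(text):
    """Decompose accented characters and drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def expand_contractions(text, mapping=CONTRACTION_MAP):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in mapping) + r")\b",
        flags=re.IGNORECASE,
    )

    def expand(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Keep the capitalisation of the first letter ("I'd" -> "I would").
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(expand, text)


print(remove_accented_chars("Sómě Áccěntěd těxt"))
print(expand_contractions("I'd think it's fine, but we can't go"))
```

A fixed mapping cannot resolve ambiguous contractions: "it's" may stand for "it is" or "it has", and the sketch always picks the single expansion stored in the map. It also only matches the plain ASCII apostrophe.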
Are there better ways of doing this? If we have enough examples, we can even train a deep learning model for better performance. Special characters and symbols are usually non-alphanumeric characters (or even occasionally numeric characters, depending on the problem), which add to the extra noise in unstructured text.
Usually, simple regular expressions (regexes) can be used to remove them. To understand stemming, you need to gain some perspective on what word stems represent. Word stems are also known as the base form of a word, and we can create new words by attaching affixes to them in a process known as inflection.
Consider the word JUMP. In this case, the base word JUMP is the word stem. The figure shows how the word stem is present in all its inflections, since it forms the base on which each inflection is built upon using affixes. The reverse process of obtaining the base form of a word from its inflected form is known as stemming. Stemming helps us in standardizing words to their base or root stem, irrespective of their inflections, which helps many applications like classifying or clustering text, and even in information retrieval.
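To make the idea of stripping affixes concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any library's stemmer; the suffix list and the `naive_stem` name are assumptions chosen only to reproduce the JUMP example.

```python
# Suffixes are tried in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common inflectional suffix, if a long enough stem remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a stem of at least three letters so "sing" is left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for inflected in ["JUMP", "jumps", "jumped", "jumping", "running"]:
    print(inflected, "->", naive_stem(inflected))
```

All four forms of JUMP reduce to `jump`, while `running` becomes `runn`, which shows how purely rule-based stripping can yield a stem that is not a dictionary word.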
The Porter stemmer is based on the algorithm developed by its inventor, Dr. Martin Porter. Originally, the algorithm is said to have had a total of five different phases for reduction of inflections to their stems, where each phase has its own set of rules. Do note that stemming usually has a fixed set of rules; hence, the root stems may not be lexicographically correct. This means the stemmed words may not be semantically correct and may not be present in the dictionary, as is evident from the preceding output. Lemmatization is very similar to stemming, in that we remove word affixes to get to the base form of a word.
However, the base form in this case is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), but the root stem may not be. Thus the root word, also known as the lemma, will always be present in the dictionary. Both nltk and spacy have excellent lemmatizers. We will be using spacy here. You can see that the semantics of the words are not affected by this, yet our text is still standardized.
Do note that the lemmatization process is considerably slower than stemming, because an additional step is involved where the root form or lemma is formed by removing the affix from the word if and only if the lemma is present in the dictionary. Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually words that end up having the maximum frequency if you do a simple term or word frequency in a corpus.
Typically, these can be articles, conjunctions, prepositions and so on. Some examples of stopwords are a, an, the, and the like. There is no universal stopword list, but we use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed. We will first combine the news headline and the news article text together to form a document for each piece of news. Then, we will pre-process them. Thus, you can see how our text pre-processor helps in pre-processing our news articles!
After this, you can save this dataset to disk if needed, so that you can always load it up later for future analysis. For any language, syntax and structure usually go hand in hand, where a set of specific rules, conventions, and principles govern the way words are combined into phrases; phrases get combined into clauses; and clauses get combined into sentences. We will be talking specifically about the English language syntax and structure in this section.
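The combine-then-clean flow described above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `STOPWORDS` is a tiny hypothetical list standing in for the nltk list (negations such as "not" and "no" are intentionally absent so they survive), and `normalize_document` covers just lowercasing, special-character removal and stopword removal rather than every step discussed earlier.

```python
import re

# Hypothetical mini stopword list; negation words are deliberately excluded.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "are", "was",
             "of", "to", "in", "on", "for"}


def remove_special_characters(text, remove_digits=False):
    """Drop everything except letters, whitespace and (optionally) digits."""
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Remove stopwords using simple whitespace tokenization."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


def normalize_document(headline, article):
    """Combine headline and article text, then apply the cleaning steps."""
    doc = f"{headline}. {article}".lower()
    doc = remove_special_characters(doc)
    return remove_stopwords(doc)


print(normalize_document("The Phone Is Not Cheap!",
                         "It was launched in 2018 for $999."))
```

Keeping "not" in the output preserves the polarity of the headline, which is the reason for leaving negations out of the stopword list.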
In English, words usually combine together to form other constituent units. These constituents include words, phrases, clauses, and sentences. Knowledge about the structure and syntax of language is helpful in many areas like text processing, annotation, and parsing for further operations such as text classification or summarization.
Typical parsing techniques for understanding text syntax are mentioned below. We will be looking at all of these techniques in subsequent sections. Thus, a sentence typically follows a hierarchical structure consisting of the following components. Parts of speech (POS) are specific lexical categories to which words are assigned, based on their syntactic context and role.
Usually, words can fall into one of the following major categories. Besides these four major categories of parts of speech, there are other categories that occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others. POS tags are used to annotate words and depict their POS, which is really helpful for performing specific analyses, such as narrowing down on nouns and seeing which ones are the most prominent, word sense disambiguation, and grammar analysis.
We will be leveraging both nltk and spacy which usually use the Penn Treebank notation for POS tagging.
We can see that each of these libraries treats tokens in its own way and assigns specific tags to them. Based on what we see, spacy seems to be doing slightly better than nltk. Based on the hierarchy we depicted earlier, groups of words make up phrases. There are five major categories of phrases. Shallow parsing, also known as light parsing or chunking, is a popular natural language processing technique for analyzing the structure of a sentence: it breaks the sentence down into its smallest constituents (tokens such as words) and groups them together into higher-level phrases.
This includes POS tags as well as phrases from a sentence. We will leverage the conll corpus for training our shallow parser model.
This corpus is available in nltk with chunk annotations and we will be using around 10K records for training our model. A sample annotated sentence is depicted as follows. From the preceding output, you can see that our data points are sentences that are already annotated with phrases and POS tags metadata that will be useful in training our shallow parser model.
We will leverage two chunking utility functions: tree2conlltags, to get triples of word, tag, and chunk tags for each token, and conlltags2tree, to generate a parse tree from these token triples. We will be using these functions to train our parser. A sample is depicted below. The chunk tags use the IOB format. This notation stands for Inside, Outside, and Beginning. The B- prefix before a tag indicates that the token is the beginning of a chunk, and the I- prefix indicates that it is inside a chunk. The O tag indicates that the token does not belong to any chunk.
The B- tag is always used when there are subsequent tags of the same type following it without the presence of O tags between them. We will also define a parse function to perform shallow parsing on new sentences. Thus, you can see it has identified two noun phrases (NP) and one verb phrase (VP) in the news article. We can also visualize this in the form of a tree as follows. You might need to install ghostscript in case nltk throws an error.
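To make the IOB notation concrete, here is a small pure-Python sketch that groups (word, POS tag, chunk tag) triples back into phrases. The `iob_to_chunks` helper and the sample sentence are hypothetical; they only illustrate how B-, I- and O tags delimit chunks and are not the parser trained in this article.

```python
def iob_to_chunks(tagged_tokens):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    current_type, current_words = None, []

    def flush():
        # Store the chunk collected so far (if any) and start a new one.
        nonlocal current_type, current_words
        if current_words:
            chunks.append((current_type, current_words))
        current_type, current_words = None, []

    for word, _pos, iob in tagged_tokens:
        if iob == "O":  # token is outside any chunk
            flush()
            continue
        prefix, chunk_type = iob.split("-", 1)
        # "B-" always opens a new chunk; a change of type does so as well.
        if prefix == "B" or chunk_type != current_type:
            flush()
            current_type = chunk_type
        current_words.append(word)
    flush()
    return chunks


# Hypothetical sentence annotated by hand for illustration.
sentence = [
    ("The", "DT", "B-NP"), ("quick", "JJ", "I-NP"), ("fox", "NN", "I-NP"),
    ("jumped", "VBD", "B-VP"), ("over", "IN", "B-PP"),
    ("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), (".", ".", "O"),
]
for chunk_type, words in iob_to_chunks(sentence):
    print(chunk_type, " ".join(words))
```

Two adjacent chunks of the same type are kept apart only because the second one starts with a B- tag, which is why that prefix matters when no O tag separates them.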
The preceding output gives a good sense of structure after shallow parsing the news headline. Constituent-based grammars are used to analyze and determine the constituents of a sentence. These grammars can be used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their constituents. Each and every word usually belongs to a specific lexical category and forms the head word of different phrases.
These phrases are formed based on rules called phrase structure rules. Phrase structure rules form the core of constituency grammars, because they talk about syntax and rules that govern the hierarchy and ordering of the various constituents in the sentences.
These rules cater to two things primarily. While there are several rules (refer to Chapter 1 of Text Analytics with Python if you want to dive deeper), the most important rule describes how to divide a sentence or a clause. The parser will process input sentences according to these rules, and help in building a parse tree. We will be using nltk and the StanfordParser here to generate parse trees.
Prerequisites: Download the official Stanford Parser from here, which seems to work quite well. You can try out a later version by going to this website and checking the Release History section. After downloading, unzip it to a known location in your filesystem.
Once done, you are ready to use the parser from nltk, which we will be exploring soon. A PCFG is a context-free grammar that associates a probability with each of its production rules. The probability of a parse tree generated from a PCFG is simply the product of the individual probabilities of the productions used to generate it.
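The product rule for parse tree probabilities can be illustrated with a toy grammar. Everything in this sketch is hypothetical: the rules and probabilities in `RULE_PROBS` are made up for the example (they are not taken from the Stanford Parser), and trees are represented as nested tuples of the form `(label, child, ...)`. It requires Python 3.8 or later for `math.prod`.

```python
from math import prod

# Hypothetical toy PCFG: probability of each production, keyed by
# (left-hand side, right-hand side). Rules with the same LHS sum to 1.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("NN",)): 0.4,
    ("VP", ("VBD", "NP")): 0.7,
    ("VP", ("VBD",)): 0.3,
    ("DT", ("the",)): 1.0,
    ("NN", ("dog",)): 0.5,
    ("NN", ("cat",)): 0.5,
    ("VBD", ("chased",)): 1.0,
}


def tree_probability(tree, rule_probs=RULE_PROBS):
    """Multiply the probabilities of every production used in the tree.

    A tree is (label, child, ...), where each child is either another
    tree tuple or a plain string (a word).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    own = rule_probs[(label, rhs)]
    subtrees = [c for c in children if not isinstance(c, str)]
    return own * prod(tree_probability(c, rule_probs) for c in subtrees)


tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "chased"),
               ("NP", ("DT", "the"), ("NN", "cat"))))
print(tree_probability(tree))
```

With these made-up numbers the tree uses the productions with probabilities 1.0, 0.6, 1.0, 0.5, 0.7, 1.0, 0.6, 1.0 and 0.5, so its probability is their product, roughly 0.063.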
We can see the constituency parse tree for our news headline. We can see the nested hierarchical structure of the constituents in the preceding output as compared to the flat structure in shallow parsing.