textclean

Tools for cleaning and normalizing text data

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

textclean is a collection of tools to clean and normalize text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster. Tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, doi:10.1006/csla.2001.0169) or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms. The `replace_emoticon()` function replaces emoticons with word equivalents.

Other R packages provide some of the same functionality (e.g., english, gsubfn, mgsub, stringi, stringr, qdapRegex). textclean differs from these packages in that it is designed to handle all of the common cleaning and normalization tasks with a single, consistent, pre-configured toolset (note that textclean uses many of these terrific packages as a backend). This means that the researcher spends less time on munging, leading to quicker analysis. This package is meant to be used jointly with the textshape package, which provides text extraction and reshaping functionality. textclean works well with the qdapRegex package, which provides tooling for substring substitution and extraction of pre-canned regular expressions. In addition, the functions of textclean are designed to work within the piping of the tidyverse framework by consistently using the first argument of functions as the data source. The textclean subbing and replacement tools are particularly effective within a `dplyr::mutate` statement.


Functions

The main functions, task category, & descriptions are summarized in the table below:

| Function | Task | Description |
|----------|------|-------------|
| `mgsub` | subbing | Multiple `gsub` |
| `fgsub` | subbing | Functional matching replacement `gsub` |
| `sub_holder` | subbing | Hold a value prior to a strip |
| `swap` | subbing | Simultaneously swap patterns 1 & 2 |
| `strip` | deletion | Remove all non-word characters |
| `drop_empty_row` | filter rows | Remove empty rows |
| `drop_row`/`keep_row` | filter rows | Filter rows matching a regex |
| `drop_NA` | filter rows | Remove `NA` text rows |
| `drop_element`/`keep_element` | filter elements | Filter matching elements from a vector |
| `match_tokens` | filter elements | Filter out tokens from strings that match a regex criteria |
| `replace_contraction` | replacement | Replace contractions with both words |
| `replace_date` | replacement | Replace dates |
| `replace_email` | replacement | Replace emails |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_emoticon` | replacement | Replace emoticons with word equivalent |
| `replace_grade` | replacement | Replace grades (e.g., "A+") with word equivalent |
| `replace_hash` | replacement | Replace Twitter style hash tags (e.g., #rstats) |
| `replace_html` | replacement | Replace HTML tags and symbols |
| `replace_incomplete` | replacement | Replace incomplete sentence end-marks |
| `replace_internet_slang` | replacement | Replace Internet slang with word equivalents |
| `replace_kern` | replacement | Remove the spaces from >2 letter, all-cap words with spaces between the letters |
| `replace_misspelling` | replacement | Replace misspelled words with their most likely replacement |
| `replace_money` | replacement | Replace money in the form of $\d+.?\d{0,2} |
| `replace_names` | replacement | Replace common first/last names |
| `replace_non_ascii` | replacement | Replace non-ASCII with equivalent or remove |
| `replace_number` | replacement | Replace common numbers |
| `replace_ordinal` | replacement | Replace common ordinal number forms |
| `replace_rating` | replacement | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
| `replace_symbol` | replacement | Replace common symbols |
| `replace_tag` | replacement | Replace Twitter style handle tags (e.g., @trinker) |
| `replace_time` | replacement | Replace time stamps |
| `replace_to`/`replace_from` | replacement | Remove text from the beginning/end of a string to/from a character(s) |
| `replace_tokens` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_url` | replacement | Replace URLs |
| `replace_white` | replacement | Replace regex white space characters |
| `replace_word_elongation` | replacement | Replace word elongations with shortened form |
| `add_comma_space` | replacement | Add a space after a comma that lacks one |
| `add_missing_endmark` | replacement | Replace missing endmarks with desired symbol |
| `make_plural` | replacement | Add plural endings to singular noun forms |
| `check_text` | check | Text report of potential issues |
| `has_endmark` | check | Check if an element has an end-mark |

Installation

To download the development version of textclean:

Download the zip ball or tar ball, decompress and run `R CMD INSTALL` on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/lexicon",    
    "trinker/textclean"
)

Contact

You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/textclean/issues

Contributing

Contributions are welcome from anyone subject to the following rules:

  • Abide by the code of conduct.
  • Follow the style conventions of the package (indentation, function & argument naming, commenting, etc.)
  • All contributions must be consistent with the package license (GPL-2)
  • Submit contributions as a pull request. Clearly state what the changes are and try to keep the number of changes per pull request as low as possible.
  • If you make big changes, add your name to the ‘Author’ field.

Demonstration

Load the Packages/Data

library(dplyr)
library(textshape)
library(lexicon)
library(textclean)

Check Text

One of the most useful tools in textclean is `check_text`, which scans text variables and reports potential problems. Not every potential problem is a definite problem for analysis, but the report provides an overview of what may need further preparation. The report also suggests functions for each reported problem. It provides information on the following:
  1. contraction - Text elements that contain contractions
  2. date - Text elements that contain dates
  3. digit - Text elements that contain digits/numbers
  4. email - Text elements that contain email addresses
  5. emoticon - Text elements that contain emoticons
  6. empty - Text elements that contain empty text cells (all white space)
  7. escaped - Text elements that contain escaped back spaced characters
  8. hash - Text elements that contain Twitter style hash tags (e.g., #rstats)
  9. html - Text elements that contain HTML markup
  10. incomplete - Text elements that contain incomplete sentences (e.g., uses ending punctuation like ‘…’)
  11. kern - Text elements that contain kerning (e.g., ‘The B O M B!’)
  12. list_column - Text variable that is a list column
  13. missing_value - Text elements that contain missing values
  14. misspelled - Text elements that contain potentially misspelled words
  15. no_alpha - Text elements that contain elements with no alphabetic (a-z) letters
  16. no_endmark - Text elements that contain elements with missing ending punctuation
  17. no_space_after_comma - Text elements that contain commas with no space afterwards
  18. non_ascii - Text elements that contain non-ASCII text
  19. non_character - Text variable that is not a character column (likely a factor)
  20. non_split_sentence - Text elements that contain unsplit sentences (more than one sentence per element)
  21. tag - Text elements that contain Twitter style handle tags (e.g., @trinker)
  22. time - Text elements that contain timestamps
  23. url - Text elements that contain URLs

Note that `check_text` runs multiple checks and may be slow on larger texts. The user may provide a sample of text for review or use the `checks` argument to specify the exact checks to conduct and thus limit the compute time.

Here is an example:

x <- c("i like", "i want. . thet them ther .", "I am ! that|", "", NA, 
    "&quot;they&quot; they,were there", ".", "   ", "?", "3;", "I like goud eggs!", 
    "bi\xdfchen Z\xfcrcher", "i 4like...", "\\tgreat",  "She said \"yes\"")
Encoding(x) <- "latin1"
check_text(x)

## ====
## HTML
## ====
## 
## The following observations contain HTML markup:
## 
## 6
## 
## This issue affected the following text:
## 
## 6: &quot;they&quot; they,were there
## 
## *Suggestion: Consider running `replace_html`
## 
## 
## ==========
## INCOMPLETE
## ==========
## 
## The following observations contain incomplete sentences (e.g., uses ending punctuation like '...'):
## 
## 13
## 
## This issue affected the following text:
## 
## 13: i 4like...
## 
## *Suggestion: Consider using `replace_incomplete`
## 
## 
## =============
## MISSING VALUE
## =============
## 
## The following observations contain missing values:
## 
## 5
## 
## *Suggestion: Consider running `drop_NA`
## 
## 
## ========
## NO ALPHA
## ========
## 
## The following observations contain elements with no alphabetic (a-z) letters:
## 
## 4, 7, 8, 9, 10
## 
## This issue affected the following text:
## 
## 4: 
## 7: .
## 8:    
## 9: ?
## 10: 3;
## 
## *Suggestion: Consider cleaning the raw text or running `filter_row`
## 
## 
## ==========
## NO ENDMARK
## ==========
## 
## The following observations contain elements with missing ending punctuation:
## 
## 1, 3, 4, 6, 8, 10, 12, 14, 15
## 
## This issue affected the following text:
## 
## 1: i like
## 3: I am ! that|
## 4: 
## 6: &quot;they&quot; they,were there
## 8:    
## 10: 3;
## 12: bißchen Zürcher
## 14: \tgreat
## 15: She said "yes"
## 
## *Suggestion: Consider cleaning the raw text or running `add_missing_endmark`
## 
## 
## ====================
## NO SPACE AFTER COMMA
## ====================
## 
## The following observations contain commas with no space afterwards:
## 
## 6
## 
## This issue affected the following text:
## 
## 6: &quot;they&quot; they,were there
## 
## *Suggestion: Consider running `add_comma_space`
## 
## 
## =========
## NON ASCII
## =========
## 
## The following observations contain non-ASCII text:
## 
## 12
## 
## This issue affected the following text:
## 
## 12: bißchen Zürcher
## 
## *Suggestion: Consider running `replace_non_ascii`
## 
## 
## ==================
## NON SPLIT SENTENCE
## ==================
## 
## The following observations contain unsplit sentences (more than one sentence per element):
## 
## 2, 3
## 
## This issue affected the following text:
## 
## 2: i want. . thet them ther .
## 3: I am ! that|
## 
## *Suggestion: Consider running `textshape::split_sentence`

And if all is well the user should be greeted by a cow:

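For example, a problem-free vector (an illustrative stand-in for the original example) should pass all checks:

## illustrative, problem-free text
y <- c("A valid sentence.", "Yet another sentence!")
check_text(y)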

Row Filtering

It is useful to drop/remove empty rows or unwanted rows (for example, the researcher dialogue from a transcript). The `drop_empty_row` & `drop_row` functions do just this. First I'll demo the removal of empty rows.

## create a data set with empty rows
(dat <- rbind.data.frame(
    DATA[, c("person", "state")],
    data.frame(person = c("", " "), state = c("", " "))
))

drop_empty_row(dat)

Next we drop out rows. The `drop_row` function takes a data set, a column (named or numeric position) and regex terms to search for. The `terms` argument takes regex(es), allowing for partial matching. `terms` is case sensitive but this can be changed via the `ignore.case` argument.
drop_row(dataframe = DATA, column = "person", terms = c("sam", "greg"))

##       person sex adult                        state code
## 1    teacher   m     1           What should we do?   K3
## 2      sally   f     0       How can we be certain?   K6
## 3      sally   f     0  What are you talking about?   K9
## 4 researcher   f     1 Shall we move on? Good then.  K10

drop_row(DATA, 1, c("sam", "greg"))

##       person sex adult                        state code
## 1    teacher   m     1           What should we do?   K3
## 2      sally   f     0       How can we be certain?   K6
## 3      sally   f     0  What are you talking about?   K9
## 4 researcher   f     1 Shall we move on? Good then.  K10

keep_row(DATA, 1, c("sam", "greg"))

##   person sex adult                               state code
## 1    sam   m     0       Computer is fun. Not too fun.   K1
## 2   greg   m     0             No it's not, it's dumb.   K2
## 3    sam   m     0                You liar, it stinks!   K4
## 4   greg   m     0             I am telling the truth!   K5
## 5   greg   m     0                    There is no way.   K7
## 6    sam   m     0                     I distrust you.   K8
## 7   greg   m     0 I'm hungry. Let's eat. You already?  K11

drop_row(DATA, "state", c("Comp"))

##        person sex adult                               state code
## 1        greg   m     0             No it's not, it's dumb.   K2
## 2     teacher   m     1                  What should we do?   K3
## 3         sam   m     0                You liar, it stinks!   K4
## 4        greg   m     0             I am telling the truth!   K5
## 5       sally   f     0              How can we be certain?   K6
## 6        greg   m     0                    There is no way.   K7
## 7         sam   m     0                     I distrust you.   K8
## 8       sally   f     0         What are you talking about?   K9
## 9  researcher   f     1        Shall we move on? Good then.  K10
## 10       greg   m     0 I'm hungry. Let's eat. You already?  K11

drop_row(DATA, "state", c("I "))

##       person sex adult                               state code
## 1        sam   m     0       Computer is fun. Not too fun.   K1
## 2       greg   m     0             No it's not, it's dumb.   K2
## 3    teacher   m     1                  What should we do?   K3
## 4        sam   m     0                You liar, it stinks!   K4
## 5      sally   f     0              How can we be certain?   K6
## 6       greg   m     0                    There is no way.   K7
## 7      sally   f     0         What are you talking about?   K9
## 8 researcher   f     1        Shall we move on? Good then.  K10
## 9       greg   m     0 I'm hungry. Let's eat. You already?  K11

drop_row(DATA, "state", c("you"), ignore.case = TRUE)

##       person sex adult                         state code
## 1        sam   m     0 Computer is fun. Not too fun.   K1
## 2       greg   m     0       No it's not, it's dumb.   K2
## 3    teacher   m     1            What should we do?   K3
## 4       greg   m     0       I am telling the truth!   K5
## 5      sally   f     0        How can we be certain?   K6
## 6       greg   m     0              There is no way.   K7
## 7 researcher   f     1  Shall we move on? Good then.  K10

Stripping

Often it is useful to remove all non-relevant symbols and case from a text (letters, spaces, and apostrophes are retained). The `strip` function accomplishes this. The `char.keep` argument allows the user to retain additional characters.

strip(DATA$state)

## [1] "computer is fun not too fun" "no it's not it's dumb" "what should we do" "you liar it stinks"
## [5] "i am telling the truth" "how can we be certain" "there is no way" "i distrust you"
## [9] "what are you talking about" "shall we move on good then" "i'm hungry let's eat you already"

strip(DATA$state, apostrophe.remove = TRUE)

## [1] "computer is fun not too fun" "no its not its dumb" "what should we do" "you liar it stinks" "i am telling the truth"
## [6] "how can we be certain" "there is no way" "i distrust you" "what are you talking about" "shall we move on good then"
## [11] "im hungry lets eat you already"

strip(DATA$state, char.keep = c("?", "."))

## [1] "computer is fun. not too fun." "no it's not it's dumb." "what should we do?" "you liar it stinks"
## [5] "i am telling the truth" "how can we be certain?" "there is no way." "i distrust you."
## [9] "what are you talking about?" "shall we move on? good then." "i'm hungry. let's eat. you already?"

Subbing

Multiple Subs

`gsub` is a great tool but often the user wants to replace a vector of elements with another vector. `mgsub` allows for a vector of patterns and replacements. Note that the first argument of `mgsub` is the data, not the `pattern` as is standard with base R's `gsub`. This allows `mgsub` to be used in a magrittr pipeline more easily. Also note that by default `fixed = TRUE`. This means the search `pattern` is not a regex per se. This makes the replacement much faster when a regex search is not needed. `mgsub` also reorders the patterns to ensure patterns contained within patterns don't overwrite the longer pattern. For example, if the pattern `c('i', 'it')` is given, the longer `'it'` is replaced first (though `order.pattern = FALSE` can be used to negate this feature).
mgsub(DATA$state, c("it's", "I'm"), c("<<it's>>", "<<I'm>>"))

## [1] "Computer is fun. Not too fun." "No <<it's>> not, <<it's>> dumb." "What should we do?"
## [4] "You liar, it stinks!" "I am telling the truth!" "How can we be certain?"
## [7] "There is no way." "I distrust you." "What are you talking about?"
## [10] "Shall we move on? Good then." "<<I'm>> hungry. Let's eat. You already?"

mgsub(DATA$state, "[[:punct:]]", "<>", fixed = FALSE)

## [1] "Computer is fun<> Not too fun<>" "No it<>s not<> it<>s dumb<>"
## [3] "What should we do<>" "You liar<> it stinks<>"
## [5] "I am telling the truth<>" "How can we be certain<>"
## [7] "There is no way<>" "I distrust you<>"
## [9] "What are you talking about<>" "Shall we move on<> Good then<>"
## [11] "I<>m hungry<> Let<>s eat<> You already<>"

mgsub(DATA$state, c("i", "it"), c("<>", "[[IT]]"))

## [1] "Computer <>s fun. Not too fun." "No [[IT]]'s not, [[IT]]'s dumb." "What should we do?" "You l<>ar, [[IT]] st<>nks!"
## [5] "I am tell<>ng the truth!" "How can we be certa<>n?" "There <>s no way." "I d<>strust you."
## [9] "What are you talk<>ng about?" "Shall we move on? Good then." "I'm hungry. Let's eat. You already?"

mgsub(DATA$state, c("i", "it"), c("<>", "[[IT]]"), order.pattern = FALSE)

## [1] "Computer <>s fun. Not too fun." "No <>t's not, <>t's dumb." "What should we do?" "You l<>ar, <>t st<>nks!"
## [5] "I am tell<>ng the truth!" "How can we be certa<>n?" "There <>s no way." "I d<>strust you."
## [9] "What are you talk<>ng about?" "Shall we move on? Good then." "I'm hungry. Let's eat. You already?"

Safe Substitutions

The default behavior of `mgsub` is optimized for speed. This means that it is very fast at multiple substitutions and in most cases works efficiently. However, it is not what Mark Ewing describes as "safe" substitution. In his vignette for the mgsub package, Mark defines "safe" as:

  1. Longer matches are preferred over shorter matches for substitution first
  2. No placeholders are used so accidental string collisions don't occur

Because safety is sometimes required, `textclean::mgsub` provides a `safe` argument that uses the mgsub package as the backend. In addition to the `safe` argument, the `mgsub_regex_safe` function is available to make the usage more explicit. The safe mode comes at the cost of speed.
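A minimal sketch of invoking the safe mode (the strings and patterns here are illustrative, not the original demo):

## illustrative data and patterns
x <- "hey, hey there"
pattern <- c("hey", "hey there")
replacement <- c("hi", "hello")

## default fast mode (placeholder based)
mgsub(x, pattern, replacement)

## safe mode (longest-match-first, no placeholders) via the mgsub package backend
mgsub(x, pattern, replacement, safe = TRUE)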

Match, Extract, Operate, Replacement Subs

Again, `gsub` is a great tool but sometimes the user wants to match a pattern, extract that pattern, operate a function over that pattern, and then replace the original match. The `fgsub` function allows the user to perform this operation. It is a stripped down version of `gsubfn` from the gsubfn package. For more versatile needs please see the gsubfn package.

In this example the regex looks for words that contain a lower case letter followed by the same letter at least 2 more times. It then extracts these words, splits them apart into letters, reverses the string, pastes them back together, wraps them with double angle braces, and then puts them back at the original locations.

fgsub(
    x = c(NA, 'df dft sdf', 'sd fdggg sd dfhhh d', 'ddd'),
    pattern = "\\b\\w*([a-z])(\\1{2,})\\w*\\b",
    fun = function(x) {paste0('<<', paste(rev(strsplit(x, '')[[1]]), collapse = ''), '>>')}
)

## [1] NA "df dft sdf" "sd <<gggdf>> sd <<hhhfd>> d" "<<ddd>>"

In this example we extract numbers, strip out non-digits, coerce them to numeric, cut them in half, round up to the closest integer, add the commas back, and replace back into the original locations.

fgsub(
    x = c(NA, 'I want 32 grapes', 'he wants 4 ice creams', 'they want 1,234,567 dollars'),
    pattern = "[\\d,]+",
    fun = function(x) {prettyNum(ceiling(as.numeric(gsub('[^0-9]', '', x))/2), big.mark = ',')}
)

## [1] NA "I want 16 grapes" "he wants 2 ice creams" "they want 617,284 dollars"

Stashing Character Pre-Sub

There are times the user may want to stash a set of characters before subbing out and then return the stashed characters. An example of this is when a researcher wants to remove punctuation but not emoticons. The `sub_holder` function provides tooling to stash the emoticons, allow a punctuation stripping, and then return the emoticons. First I'll create some fake text data with emoticons, then stash the emoticons (using a unique text key to hold their place), then strip out the punctuation, and last put the stashed emoticons back.
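A minimal sketch (the two sentences are invented; the emoticons come from the lexicon package's `hash_emoticons` data):

## fake text data with leading emoticons
fake_dat <- paste(lexicon::hash_emoticons[[1]][1:2], c("I like it!", "Do you?"))

## stash the emoticons behind unique alphanumeric keys
m <- sub_holder(fake_dat, lexicon::hash_emoticons[[1]])

## strip punctuation (the keys survive the strip)
m_stripped <- strip(m$output)

## return the stashed emoticons
m$unhold(m_stripped)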

Of course with clever regexes you can achieve the same thing directly, as sketched below.
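For instance (a hypothetical pattern covering just two emoticons; the original demo built its pattern from the full emoticon data):

x <- c("I like it! :-)", "Do you? :-(")

## keep the emoticons, strip all other punctuation (PCRE backtracking control verbs)
gsub("(:-\\)|:-\\()(*SKIP)(*FAIL)|[[:punct:]]", "", x, perl = TRUE)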

The pure regex approach can be a bit trickier (less safe) and more difficult to reason about. It also relies on the less general `(*SKIP)(*FAIL)` backtracking control verbs that are only implemented in a few regex engines, such as Perl & PCRE. Still, it's nice to see an alternative regex approach for comparison.

Replacement

textclean contains tools to replace substrings within text with other substrings that may be easier to analyze. This section outlines the uses of these tools.

Contractions

Some analysis techniques require contractions to be replaced with their multi-word forms (e.g., "I'll" -> "I will"). `replace_contraction` provides this functionality.
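A minimal sketch (illustrative sentences):

x <- c("Mr. Jones isn't going.", "Check it out what's going on.", "He's here but didn't go.")
replace_contraction(x)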

Dates

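A minimal sketch (the dates are illustrative; the `<<DATE>>` marker mirrors the replacement visible in the original output):

x <- c(NA, "11/16/1980 and 11-16-1980", "and 2/22/2017 but then there's 2/3/2013 too")

## word-form replacement
replace_date(x)

## marker replacement
replace_date(x, replacement = '<<DATE>>')

## [1] NA
## [2] "<<DATE>> and <<DATE>>"
## [3] "and <<DATE>> but then there's <<DATE>> too"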

Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer readable form. `replace_emoji` replaces emojis with their word form equivalents.
x <- "A gift to my fellow nfl loving #rstats folks this package is 💥💥"

replace_emoji(x)

## [1] "A gift to my fellow nfl loving #rstats folks this package is collision collision "

Emoticons

Some analysis techniques examine words, meaning emoticons may be ignored. `replace_emoticon` replaces emoticons with their word form equivalents.
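A minimal sketch (illustrative text; the word chosen for each emoticon comes from the package's emoticon mapping):

x <- c("text from: http://www.example.com is fun :)", "you smell :(")
replace_emoticon(x)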

Grades

In analyses where grades may be discussed, it may be useful to convert the letter forms into word meanings. The `replace_grade` function can be used for this task.
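A minimal sketch (illustrative sentences):

text <- c("I give an A+", "He deserves an F", "It's C+ work")
replace_grade(text)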

HTML

Sometimes HTML tags and symbols stick around like pesky gnats. The `replace_html` function makes light work of them.

x <- c(
    "<bold>Random text with symbols: &nbsp; &lt; &gt; &amp; &quot; &apos;</bold>",
    "<p>More text</p> &cent; &pound; &yen; &euro; &copy; &reg;"
)

replace_html(x)

## [1] " Random text with symbols:   < > & \" '"
## [2] " More text  cents pounds yen euro (c) (r)"

Incomplete Sentences

Sometimes an incomplete sentence is denoted with multiple end marks or no punctuation at all. `replace_incomplete` standardizes these sentences with a pipe (`|`) endmark (or one of the user's choice).
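A minimal sketch (illustrative fragments):

x <- c("the...", "I.?", "you.", "threw..")
replace_incomplete(x)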

Internet Slang

Often in informal written and spoken communication (e.g., Twitter, texting, Facebook, etc.) people use Internet slang (shorter abbreviations and acronyms) to replace longer word sequences. These replacements may obfuscate the meaning when the machine attempts to analyze the text. The `replace_internet_slang` function replaces the slang with longer word equivalents that are more easily analyzed by machines.
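A minimal sketch (illustrative sentences):

x <- c("TGIF and a big thx for the help", "Hey ur 2 kind... LOL")
replace_internet_slang(x)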

Kerning

In typography, kerning is the adjustment of spacing. Often, in informal writing, adding manual spaces (a form of kerning) coupled with all capital letters is used for emphasis (e.g., "She's the B O M B!"). These word forms would look like noise in most analyses and would likely be removed as stopwords when in fact they likely carry a great deal of meaning. The `replace_kern` function looks for 3 or more consecutive capital letters with spaces in between and removes the spaces.
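A minimal sketch (illustrative sentence):

x <- "She's the B O M B!"
replace_kern(x)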

Money

There are times one may want to replace money mentions with text or normalized versions. The `replace_money` function is designed to complete this task.

x <- c(NA, '$3.16 into "three dollars, sixteen cents"', "-$20,333.18 too", 'fff')

replace_money(x)

replace_money(x, replacement = '<<MONEY>>')

## [1] NA
## [2] "<<MONEY>> into \"three dollars, sixteen cents\""
## [3] "<<MONEY>> too"
## [4] "fff"

Names

Often one will want to standardize text by removing first and last names. The `replace_names` function quickly removes/replaces common first and last names. This can be made more targeted by feeding a vector of names extracted via a named entity extractor.

x <- c("Mary Smith is not here", "Karen is not a nice person", "Will will do it", NA)

replace_names(x, replacement = '<<NAME>>')

## [1] "<<NAME>> <<NAME>> is not here" "<<NAME>> is not a nice person" "<<NAME>> will do it" NA

Non-ASCII Characters

R can choke on non-ASCII characters. They can be re-encoded, but the new encoding may lack interpretability (e.g., ¢ may be converted to `\xA2`, which is not easily understood or likely to be matched in a hash look-up). `replace_non_ascii` attempts to replace common non-ASCII characters with a text representation (e.g., ¢ becomes "cent"). Non-recognized non-ASCII characters are simply removed (unless `remove.nonconverted = FALSE`).
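A minimal sketch (illustrative strings; the "mu" and "30 cent" conversions echo fragments of the original output):

## the micro sign and the cent sign, written as unicode escapes
x <- c("The \u00b5 sign", "30\u00a2 each")

replace_non_ascii(x)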

Numbers

Some analyses require numbers to be converted to text form. `replace_number` attempts to perform this task. `replace_number` handles comma separated numbers as well.
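A minimal sketch (illustrative sentences):

x <- c("I like 346,457 ice cream cones.", "They are 99 percent good.")
replace_number(x)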

Ratings

Some texts use ratings to convey satisfaction with a particular object. The `replace_rating` function replaces the more abstract rating with word equivalents.
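A minimal sketch (illustrative reviews):

x <- c("This place receives 5 stars for their appetizers!", "I give it 10 out of 10.")
replace_rating(x)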

Ordinal Numbers

Again, some analyses require numbers, including ordinal numbers, to be converted to text form. `replace_ordinal` attempts to perform this task for ordinal numbers 1 through 100 (i.e., 1st - 100th).
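A minimal sketch (illustrative sentences):

x <- c("I like the 1st one not the 22nd one.", "For the 100th time stop!")
replace_ordinal(x)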

Symbols

Text often contains short-hand representations of words/phrases. These symbols may contain analyzable information but in the symbolic form they cannot be parsed. The `replace_symbol` function attempts to replace the symbols `c("$", "%", "#", "@", "&", "w/")` with their word equivalents.
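A minimal sketch (illustrative sentences):

x <- c("I am @ Jon's & Jim's w/ Marry", "I owe $41 for food", "two is 10% of a #")
replace_symbol(x)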

Time Stamps

Often times the researcher will want to replace times with a text or normalized version. The `replace_time` function works well for this task. Notice that `replacement` can take a function that operates on the extracted pattern, as sketched below.
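A minimal sketch (illustrative times; `<<TIME>>` is an assumed marker, and the final call shows a function operating on each matched time stamp):

x <- c(NA, "The flight leaves at 6:45:32 pm.", "They arrived at 12:01.")

## word-form replacement
replace_time(x)

## marker replacement
replace_time(x, replacement = '<<TIME>>')

## functional replacement: operate on each extracted time stamp
replace_time(x, replacement = function(y) gsub(':', '.', y))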
