gregextr: Extract Pattern Occurrences

Description

regextr2 and gregextr2 extract, respectively, the first occurrence and all (i.e., global) occurrences of a pattern. Their replacement versions substitute the matching substrings with new content.

Usage

regextr2(
  x,
  pattern,
  ...,
  ignore_case = FALSE,
  fixed = FALSE,
  capture_groups = FALSE
)

gregextr2(
  x,
  pattern,
  ...,
  ignore_case = FALSE,
  fixed = FALSE,
  capture_groups = FALSE
)

regextr2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE) <- value

gregextr2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE) <- value

Arguments

x

character vector whose elements are to be examined

pattern

character vector of nonempty search patterns

...

further arguments to stri_locate, e.g., omit_empty, locale, dotall

ignore_case

single logical value; indicates whether matching should be case-insensitive

fixed

single logical value; FALSE for matching with regular expressions (see about_search_regex); TRUE for fixed pattern matching (about_search_fixed); NA for the Unicode collation algorithm (about_search_coll)

capture_groups

single logical value; whether matches to individual capture groups should be extracted separately

value

character vector (for regextr2) or list of character vectors (for gregextr2) defining the replacement strings

Details

Convenience functions based on gregexpr2 and gsubstrl (amongst others). Provided as pipe operator-friendly alternatives to [DEPRECATED] regmatches and [DEPRECATED] strcapture.
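
For comparison, the base R idiom that these functions stand in for can be sketched as below. The snippet runs in base R only; the stringx calls appear as comments and are not executed, and the simplified pattern "acac?" is chosen purely for illustration.

```r
# Base R only; the stringx equivalents are shown in comments for comparison.
x <- c("acacaca", "gaca", "actgggca")
pattern <- "acac?"

# All matches per element: regmatches() needs the match data from gregexpr(),
# so x has to be passed twice, which does not chain well in a pipeline.
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
# roughly: x |> gregextr2(pattern)

# First match per element: regmatches() drops elements without a match,
# so the result can be shorter than x.
first_matches <- regmatches(x, regexpr(pattern, x, perl = TRUE))
# roughly: x |> regextr2(pattern), which keeps an NA for the no-match

print(all_matches)
print(first_matches)
```

Because the base R version of the first-match extraction omits non-matching elements, its result no longer aligns with x; keeping one result per input element is one reason to prefer regextr2 here.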

They are fully vectorised with respect to x, pattern, and value.

Note that, unlike in gsub2, each substituted chunk can be replaced with different content. However, references to matches to capture groups cannot be made.

Value

If capture_groups is FALSE, regextr2 returns a character vector and gregextr2 gives a list of character vectors.

Otherwise, regextr2 returns a list of character vectors, each giving the whole match as well as matches to the individual capture groups. gregextr2 returns a list of character matrices, each with as many columns as there are matches.

Missing values in the inputs are propagated consistently. In regextr2, a no-match is always denoted with NA (or series thereof). In gregextr2, the corresponding result is empty (unless we mean a no-match to an optional capture group within a matching substring). Note that this function distinguishes between a missing input and a no-match.

Their replacement versions return a character vector.

These functions preserve the attributes of the longest inputs (unless they are dropped due to coercion).

Author(s)

Marek Gagolewski

See Also

The official online manual of stringx at https://stringx.gagolewski.com/

Related function(s): paste, nchar, strsplit, gsub2, grepl2, gregexpr2, gsubstrl

Examples

x <- c(aca1="acacaca", aca2="gaca", noaca="actgggca", na=NA)
regextr2(x, "(?<x>a)(?<y>cac?)")
##   aca1   aca2  noaca     na 
## "acac"  "aca"     NA     NA
gregextr2(x, "(?<x>a)(?<y>cac?)")
## $aca1
## [1] "acac" "aca" 
## 
## $aca2
## [1] "aca"
## 
## $noaca
## character(0)
## 
## $na
## [1] NA
regextr2(x, "(?<x>a)(?<y>cac?)", capture_groups=TRUE)
## $aca1
##             x      y 
## "acac"    "a"  "cac" 
## 
## $aca2
##           x     y 
## "aca"   "a"  "ca" 
## 
## $noaca
##     x  y 
## NA NA NA 
## 
## $na
##     x  y 
## NA NA NA
gregextr2(x, "(?<x>a)(?<y>cac?)", capture_groups=TRUE)
## $aca1
##   [,1]   [,2] 
##   "acac" "aca"
## x "a"    "a"  
## y "cac"  "ca" 
## 
## $aca2
##   [,1] 
##   "aca"
## x "a"  
## y "ca" 
## 
## $noaca
##  
##  
## x
## y
## 
## $na
##   [,1]
##   NA  
## x NA  
## y NA
# substitution - note the different replacement strings:
`gregextr2<-`(x, "(?<x>a)(?<y>cac?)", value=list(c("!", "?"), "#"))
##       aca1       aca2      noaca         na 
##       "!?"       "g#" "actgggca"         NA
# references to capture groups can only be used in gsub2 and sub2:
gsub2(x, "(?<x>a)(?<y>cac?)", "{$1}{$2}")
##              aca1              aca2             noaca                na 
## "{a}{cac}{a}{ca}"        "g{a}{ca}"        "actgggca"                NA
regextr2(x, "(?<x>a)(?<y>cac?)") <- "\U0001D554\U0001F4A9"
print(x)  # x was modified 'in-place'
##       aca1       aca2      noaca         na 
##   "𝕔💩aca"     "g𝕔💩" "actgggca"         NA