gregexpr: Locate Pattern Occurrences
Description
regexpr2 and gregexpr2 locate, respectively, the first and all (i.e., globally) occurrences of a pattern. regexec2 and gregexec2 can additionally pinpoint the matches to parenthesised subexpressions (regex capture groups).
Usage
regexpr2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE)
gregexpr2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE)
regexec2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE)
gregexec2(x, pattern, ..., ignore_case = FALSE, fixed = FALSE)
regexpr(
    pattern,
    x = text,
    ...,
    ignore.case = FALSE,
    fixed = FALSE,
    perl = FALSE,
    useBytes = FALSE,
    text
)
gregexpr(
    pattern,
    x = text,
    ...,
    ignore.case = FALSE,
    fixed = FALSE,
    perl = FALSE,
    useBytes = FALSE,
    text
)
regexec(
    pattern,
    x = text,
    ...,
    ignore.case = FALSE,
    fixed = FALSE,
    perl = FALSE,
    useBytes = FALSE,
    text
)
gregexec(
    pattern,
    x = text,
    ...,
    ignore.case = FALSE,
    fixed = FALSE,
    perl = FALSE,
    useBytes = FALSE,
    text
)
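Arguments
| Argument | Description |
|---|---|
| x | character vector whose elements are to be examined |
| pattern | character vector of nonempty search patterns |
| ... | further arguments to stri_locate |
| ignore_case, ignore.case | single logical value; indicates whether matching should be case-insensitive |
| fixed | single logical value; FALSE for matching with regular expressions, TRUE for fixed pattern matching, NA for the Unicode collation algorithm |
| perl, useBytes | not used (with a warning if attempting to do so) [DEPRECATED] |
| text | alias to the x argument [DEPRECATED] |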
Details
These functions are fully vectorised with respect to both x and pattern.
Use substrl and gsubstrl to extract or replace the identified chunks. Also, consider using regextr2 and gregextr2 directly instead.
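
To make the vectorisation concrete, here is a minimal sketch. It assumes the stringx package is installed; the input vectors hay and pat are made up for illustration. Each haystack is paired with the pattern at the same index, and a single pattern is recycled across all haystacks.

```r
library(stringx)  # assumption: the stringx package is installed

# Illustrative inputs: haystacks and needles are paired elementwise
hay <- c("banana", "cherry", "apple")
pat <- c("an", "r+", "z")

pos <- regexpr2(hay, pat)
starts <- as.integer(pos)           # start of the first match, -1 if none
lens <- attr(pos, "match.length")   # corresponding match lengths

# A length-1 pattern is recycled over all the haystacks
starts_a <- as.integer(regexpr2(hay, "a"))
```

Here "banana" is searched for "an", "cherry" for "r+", and "apple" for "z", so only the last pair yields a no-match.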
Value
regexpr2 and [DEPRECATED] regexpr return an integer vector which gives the start positions of the first substrings matching a pattern. The match.length attribute gives the corresponding match lengths. If there is no match, both values are set to -1.
gregexpr2 and [DEPRECATED] gregexpr yield a list whose elements are integer vectors with match.length attributes, giving the positions of all the matches. For consistency with regexpr2, a no-match is denoted with a single -1, hence the output is guaranteed to consist of non-empty integer vectors.
regexec2 and [DEPRECATED] regexec return a list of integer vectors giving the positions of the first matches and the locations of matches to the consecutive parenthesised subexpressions (which can only be recognised if fixed=FALSE). Each vector is equipped with the match.length attribute.
gregexec2 and [DEPRECATED] gregexec generate a list of matrices, where each column corresponds to a separate match; the first row is the start index of the match, the second row gives the position of the first captured group, and so forth. Their match.length attributes are matrices of corresponding sizes.
These functions preserve the attributes of the longest inputs (unless they are dropped due to coercion). Missing values in the inputs are propagated consistently.
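
The start positions and the match.length attribute together determine each matched chunk. The following is a minimal sketch, assuming the stringx package is installed, that turns the output of regexpr2 into substrings by hand with base R's substring. The helper variables found and first_match are illustrative names only; in practice substrl or regextr2 do this directly.

```r
library(stringx)  # assumption: the stringx package is installed

x <- c(aca1="acacaca", aca2="gaca", noaca="actgggca", na=NA)
m <- regexpr2(x, "aca")

start <- as.integer(m)            # start positions; -1 marks a no-match
len <- attr(m, "match.length")    # lengths of the corresponding matches

# Skip no-matches (-1) and missing inputs (NA); both are left as NA below
found <- !is.na(start) & start > 0
first_match <- rep(NA_character_, length(x))
first_match[found] <- base::substring(
    x[found], start[found], start[found] + len[found] - 1L
)
```

The last matched character sits at start + match.length - 1, which is why the sketch subtracts one when computing the end index.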
Differences from Base R
Replacements for base gregexpr (and others) implemented with stri_locate.
- there are inconsistencies between the argument order and naming in grepl, strsplit, and startsWith (amongst others); e.g., where the needle can precede the haystack, the use of the forward pipe operator, |>, is less convenient [fixed here]
- base R implementation is not portable as it is based on the system PCRE or TRE library (e.g., some Unicode classes may not be available or matching thereof can depend on the current LC_CTYPE category) [fixed here]
- not suitable for natural language processing [fixed here; use fixed=NA]
- two different regular expression libraries are used (and historically, ERE was used in place of TRE) [here, only the ICU Java-like regular expression engine is available, hence the perl argument has no meaning]
- not vectorised w.r.t. pattern [fixed here]
- ignore.case=TRUE cannot be used with fixed=TRUE [fixed here]
- no attributes are preserved [fixed here; see Value]
- in regexec, the match.length attribute is unnamed even if the capture groups are (but gregexec sets dimnames of both start positions and lengths) [fixed here]
- regexec and gregexec with fixed other than FALSE make little sense [this argument is [DEPRECATED] in regexec2 and gregexec2]
- gregexec does not always yield a list of matrices [fixed here]
- a no-match to a conditional capture group is assigned length 0 [fixed here]
- no-matches result in a single -1, even if capture groups are defined in the pattern [fixed here]
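
Two of the points above can be illustrated together: the haystack-first argument order, which suits the forward pipe, and the combination of fixed=TRUE with ignore_case=TRUE. This is a minimal sketch, assuming the stringx package is installed and R 4.1 or later (needed for |>); the vector dna is made up for illustration.

```r
library(stringx)  # assumption: the stringx package is installed

# Illustrative input with mixed letter case
dna <- c("acacaca", "gACa", "actgggca")

# The haystack comes first, so it can be piped straight in;
# fixed=TRUE and ignore_case=TRUE are used together
first_hit <- dna |> regexpr2("aca", fixed=TRUE, ignore_case=TRUE)

starts <- as.integer(first_hit)   # -1 where the pattern does not occur
```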
See Also
The official online manual of stringx at https://stringx.gagolewski.com/
Related function(s): paste, nchar, strsplit, gsub2, grepl2, gregextr2, gsubstrl
Examples
x <- c(aca1="acacaca", aca2="gaca", noaca="actgggca", na=NA)
regexpr2(x, "(A)[ACTG]\\1", ignore_case=TRUE)
##  aca1  aca2 noaca    na
##     1     2    -1    NA
## attr(,"match.length")
## [1]  3  3 -1 NA
regexpr2(x, "aca") >= 0 # like grepl2
##  aca1  aca2 noaca    na
##  TRUE  TRUE FALSE    NA
gregexpr2(x, "aca", fixed=TRUE, overlap=TRUE)
## $aca1
## [1] 1 3 5
## attr(,"match.length")
## [1] 3 3 3
##
## $aca2
## [1] 2
## attr(,"match.length")
## [1] 3
##
## $noaca
## [1] -1
## attr(,"match.length")
## [1] -1
##
## $na
## [1] NA
## attr(,"match.length")
## [1] NA
# two named capture groups:
regexec2(x, "(?<x>a)(?<y>cac?)")
## $aca1
##   x y
## 1 1 2
## attr(,"match.length")
##   x y
## 4 1 3
##
## $aca2
##   x y
## 2 2 3
## attr(,"match.length")
##   x y
## 3 1 2
##
## $noaca
##     x  y
## -1 -1 -1
## attr(,"match.length")
##     x  y
## -1 -1 -1
##
## $na
##     x  y
## NA NA NA
## attr(,"match.length")
##     x  y
## NA NA NA
gregexec2(x, "(?<x>a)(?<y>cac?)")
## $aca1
##   [,1] [,2]
##      1    5
## x    1    5
## y    2    6
## attr(,"match.length")
##   [,1] [,2]
##      4    3
## x    1    1
## y    3    2
##
## $aca2
##   [,1]
##      2
## x    2
## y    3
## attr(,"match.length")
##   [,1]
##      3
## x    1
## y    2
##
## $noaca
##   [,1]
##     -1
## x   -1
## y   -1
## attr(,"match.length")
##   [,1]
##     -1
## x   -1
## y   -1
##
## $na
##   [,1]
##     NA
## x   NA
## y   NA
## attr(,"match.length")
##   [,1]
##     NA
## x   NA
## y   NA
# extraction:
gsubstrl(x, gregexpr2(x, "(A)[ACTG]\\1", ignore_case=TRUE))
## $aca1
## [1] "aca" "aca"
##
## $aca2
## [1] "aca"
##
## $noaca
## character(0)
##
## $na
## [1] NA
gregextr2(x, "(A)[ACTG]\\1", ignore_case=TRUE) # equivalent
## $aca1
## [1] "aca" "aca"
##
## $aca2
## [1] "aca"
##
## $noaca
## character(0)
##
## $na
## [1] NA