sort: Sort Strings

Description

The sort method for objects of class character (sort.character) uses the locale-sensitive Unicode collation algorithm to arrange the strings in a vector with respect to a chosen lexicographic order.

xtfrm2 and [DEPRECATED] xtfrm generate an integer vector that sorts in the same way as its input, and hence can be used in conjunction with order or rank.

Usage

xtfrm2(x, ...)

## Default S3 method:
xtfrm2(x, ...)

## S3 method for class 'character'
xtfrm2(
  x,
  ...,
  locale = NULL,
  strength = 3L,
  alternate_shifted = FALSE,
  french = FALSE,
  uppercase_first = NA,
  case_level = FALSE,
  normalisation = FALSE,
  numeric = FALSE
)

xtfrm(x)

## Default S3 method:
xtfrm(x)

## S3 method for class 'character'
xtfrm(x)

## S3 method for class 'character'
sort(
  x,
  ...,
  decreasing = FALSE,
  na.last = NA,
  locale = NULL,
  strength = 3L,
  alternate_shifted = FALSE,
  french = FALSE,
  uppercase_first = NA,
  case_level = FALSE,
  normalisation = FALSE,
  numeric = FALSE
)

Arguments

x

character vector whose elements are to be sorted

...

further arguments passed to other methods

locale

NULL or "" for the default locale (see stri_locale_get), or a single string with a locale identifier (see stri_locale_list)

strength

see stri_opts_collator

alternate_shifted

see stri_opts_collator

french

see stri_opts_collator

uppercase_first

see stri_opts_collator

case_level

see stri_opts_collator

normalisation

see stri_opts_collator

numeric

see stri_opts_collator

decreasing

single logical value; if FALSE, the ordering is nondecreasing (weakly increasing); if TRUE, it is nonincreasing

na.last

single logical value; if TRUE, missing values are placed at the end; if FALSE, they are put at the beginning; if NA, they are removed from the output altogether.
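The effect of decreasing, na.last, and locale can be illustrated with a minimal sketch. It assumes that stringx is installed and attached, so that sort dispatches to sort.character. The input vectors are made up for illustration, and the locale-specific results assume that the ICU build in use ships collation data for sk_SK and pl_PL.

```r
library(stringx)  # assumption: the package is installed and attached

y <- c("b", NA, "a")
s_drop  <- sort(y)                   # na.last=NA (default): missing values are dropped
s_last  <- sort(y, na.last=TRUE)     # missing values are placed at the end
s_first <- sort(y, na.last=FALSE)    # missing values are placed at the beginning
s_desc  <- sort(y, decreasing=TRUE)  # nonincreasing order; NA is still dropped

# In Slovak, the digraph "ch" collates after "h"; in Polish, it does not.
z <- c("hladny", "chladny")
s_sk <- sort(z, locale="sk_SK")
s_pl <- sort(z, locale="pl_PL")
```

The same collator options are accepted by xtfrm2, so an ordering obtained via order(xtfrm2(...)) can be kept consistent with sort.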

Details

The current author does not know what ‘xtfrm’ stands for, and would appreciate being enlightened.

Value

sort.character returns a character vector, with only the names attribute preserved. Note that the output vector may be shorter than the input one, because missing values are removed when na.last is NA.

xtfrm2.character and xtfrm.character return an integer vector; most attributes are preserved.
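As an illustration of the attribute handling described in this section, the sketch below sorts and ranks a small named vector. The vector v is hypothetical, and stringx is assumed to be attached.

```r
library(stringx)  # assumption: the package is installed and attached

v <- c(first="pear", second=NA, third="apple")

sv <- sort(v)    # NA is dropped by default, so the result has two elements;
                 # each name travels with its value
rv <- xtfrm2(v)  # ranks of the same length as the input; names are kept
                 # and the missing value yields a missing rank
```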

Differences from Base R

Replacements for the default S3 methods sort and xtfrm for character vectors, implemented with stri_sort and stri_rank.

  • Collation in different locales is difficult and non-portable across platforms [fixed here – using services provided by ICU]

  • Overloading xtfrm.character has no effect in R, because S3 method dispatch is done internally with hard-coded support for character arguments. Thus, we needed to replace the generic xtfrm with one that calls UseMethod [fixed here]

  • xtfrm does not support customisation of the linear ordering relation it is based upon [fixed by introducing the ... argument to the new generic, xtfrm2]

  • Neither order, rank, nor sort.list is a generic; therefore, they would have to be rewritten from scratch to allow the inclusion of our patches. Interestingly, order even calls xtfrm, but only for classed objects [not fixed here – see Examples for a workaround]

  • xtfrm for objects of type character does not preserve the names attribute (but does so for numeric) [fixed here]

  • sort seems to preserve only the names attribute, which makes sense if na.last is NA, because the resulting vector might be shorter [not fixed here as it would break compatibility with other sorting methods]

  • Note that sort by default removes missing values altogether, whereas order has na.last=TRUE [not fixed here as it would break compatibility with other sorting methods]
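The difference in the default treatment of missing values, together with the order workaround based on xtfrm2, can be sketched as follows. The vector w is made up for illustration, and stringx is assumed to be attached.

```r
library(stringx)  # assumption: the package is installed and attached

w <- c("a10", NA, "a9")

# sort drops missing values by default (na.last=NA)
by_sort <- sort(w, numeric=TRUE)

# order is not generic, so the collator options are passed via xtfrm2;
# order defaults to na.last=TRUE, hence the missing value is kept at the end
by_order <- w[order(xtfrm2(w, numeric=TRUE))]
```

To make the two results agree, either pass na.last=TRUE to sort or na.last=NA to order.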

Author(s)

Marek Gagolewski

See Also

The official online manual of stringx at https://stringx.gagolewski.com/

Related function(s): strcoll

Examples

x <- c("a1", "a100", "a101", "a1000", "a10", "a10", "a11", "a99", "a10", "a1")
base::sort.default(x)   # lexicographic sort
##  [1] "a1"    "a1"    "a10"   "a10"   "a10"   "a100"  "a1000" "a101"  "a11"  
## [10] "a99"
sort(x, numeric=TRUE)   # calls stringx:::sort.character
##  [1] "a1"    "a1"    "a10"   "a10"   "a10"   "a11"   "a99"   "a100"  "a101" 
## [10] "a1000"
xtfrm2(x, numeric=TRUE)  # calls stringx:::xtfrm2.character
##  [1]  1  8  9 10  3  3  6  7  3  1
rank(xtfrm2(x, numeric=TRUE), ties.method="average")  # ranks with averaged ties
##  [1]  1.5  8.0  9.0 10.0  4.0  4.0  6.0  7.0  4.0  1.5
order(xtfrm2(x, numeric=TRUE))    # ordering permutation
##  [1]  1 10  5  6  9  7  8  2  3  4
x[order(xtfrm2(x, numeric=TRUE))] # equivalent to sort()
##  [1] "a1"    "a1"    "a10"   "a10"   "a10"   "a11"   "a99"   "a100"  "a101" 
## [10] "a1000"
# order a data frame w.r.t. decreasing ids and increasing vals
d <- data.frame(vals=round(runif(length(x)), 1), ids=x)
d[order(-xtfrm2(d[["ids"]], numeric=TRUE), d[["vals"]]), ]
##    vals   ids
## 4   0.9 a1000
## 3   0.4  a101
## 2   0.8  a100
## 8   0.9   a99
## 7   0.5   a11
## 6   0.0   a10
## 9   0.6   a10
## 5   0.9   a10
## 1   0.3    a1
## 10  0.5    a1