Mryqu's Notes



String Processing in R

Date: 2013-07-13   |   Category: DataScience   |   Length: 911 characters, about 5 minutes

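Character handling
Encoding(x)
Encoding(x) <- value

enc2native(x)
enc2utf8(x)
Read or set the declared encoding of a character vector. enc2native and enc2utf8 convert the elements to the native encoding or to UTF-8.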
> ## x is intended to be in latin1
> x <- "fa\xE7ile"
> Encoding(x)
[1] "latin1"
> Encoding(x) <- "latin1"
> xx <- iconv(x, "latin1", "UTF-8")
> Encoding(c(x, xx))
[1] "latin1" "UTF-8"
> Encoding(xx) <- "bytes" # will be encoded in hex
> cat("xx = ", xx, "\n", sep = "")
xx = fa\xc3\xa7ile
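As a small complement to the transcript above, the following sketch converts the same string with enc2utf8 and compares byte and character counts. The variable names are mine, and the counts assume the literal is first marked as latin1, as in the transcript.

```r
# Assumption: the literal holds the single Latin-1 byte 0xE7 (c with cedilla).
x <- "fa\xE7ile"
Encoding(x) <- "latin1"              # declare how the bytes should be read

y <- enc2utf8(x)                     # re-encode the same text as UTF-8

enc_y   <- Encoding(y)               # "UTF-8"
bytes_x <- nchar(x, type = "bytes")  # 6: one byte per character in Latin-1
bytes_y <- nchar(y, type = "bytes")  # 7: the cedilla needs two bytes in UTF-8
chars_y <- nchar(y, type = "chars")  # 6: still six characters
```

The character count is unchanged by the conversion; only the byte count differs.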
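nchar(x, type = "chars", allowNA = FALSE)
Returns the number of characters in each element. In my tests the allowNA argument had no visible effect; it only matters for strings that are invalid in a multibyte encoding, where allowNA = TRUE makes nchar return NA instead of raising an error.
nzchar(x) tests whether each element is a non-empty string.

For the missing value NA, nchar and nzchar (in the R version used for this post) treat it as the two-character string "NA". Newer R releases (3.3.0 and later) return NA from nchar for a missing string by default, while nzchar still returns TRUE.
It is therefore best to check with is.na() before measuring a string.
NULL is ignored: c() drops it before nchar or nzchar see the vector, and nchar(NULL) and nzchar(NULL) return zero-length results.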
> nchar(c("em","yqu","",NA))
[1] 2 3 0 2
> nzchar(c("em","yqu","",NA))
[1] TRUE TRUE FALSE TRUE
> nzchar(c("em","yqu",NULL,"",NA))
[1] TRUE TRUE FALSE TRUE
> nchar(c("em","yqu",NULL,"",NA))
[1] 2 3 0 2
> nchar(NULL)
integer(0)
> nzchar(NULL)
logical(0)
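The handling of NA can be made explicit with the keepNA argument. This is a minimal sketch that assumes R 3.3.0 or later, where both nchar and nzchar accept keepNA; the variable names are illustrative.

```r
x <- c("em", "yqu", "", NA)

# keepNA = FALSE reproduces the counts in the transcript: NA counts as "NA".
n_old <- nchar(x, keepNA = FALSE)     # 2 3 0 2

# keepNA = TRUE propagates the missing value instead.
n_new <- nchar(x, keepNA = TRUE)      # 2 3 0 NA
z_new <- nzchar(x, keepNA = TRUE)     # TRUE TRUE FALSE NA

# Guarding with is.na() gives the same answer without relying on keepNA defaults.
n_safe <- ifelse(is.na(x), NA_integer_, nchar(x, keepNA = FALSE))
```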
substr(x, start, stop)
substring(text, first, last = 1000000L)
substr(x, start, stop) <- value
substring(text, first, last = 1000000L) <- value
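Extract or replace substrings of a character vector. substring does the same job as substr and is kept for compatibility with S.
When start is greater than stop, extraction returns "" and replacement does nothing.
If x contains NA, the corresponding result is NA.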
> substr("abcdef", 2, 4)
[1] "bcd"
> substr("abcdef", -3, 9)
[1] "abcdef"
> substring("abcdef", 1:6, 1:6)
[1] "a" "b" "c" "d" "e" "f"
> x <-c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
> substring(x, 2, 4:5)
[1] "sfe" "wert" "uio" "" "tuf"
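The transcript above only shows extraction. The replacement form can be sketched as follows (variable names are mine); the replaced span never changes the length of the string.

```r
x <- "abcdef"
substr(x, 2, 4) <- "XY"      # value shorter than the span: only two characters change
# x is now "aXYdef"

y <- "abcdef"
substr(y, 2, 4) <- "WXYZ"    # value longer than the span: only "WXY" is used
# y is now "aWXYef"

z <- "abcdef"
substr(z, 4, 2) <- "ZZ"      # start > stop: no operation
# z is still "abcdef"
```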
strtrim(x, width)
Truncates strings to a given display width.
> x<-c("abcdef",NA,"66")
> strtrim(x,c(2,1,3))
[1] "ab" NA "66"
paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
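Converts the arguments to character and joins them element by element, separated by sep, returning a character vector.
If collapse is set, the elements of that vector are then joined
into a single string, separated by collapse.
paste0(..., collapse) is equivalent to paste(..., sep = "", collapse).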
> paste(1:6) # same as as.character(1:6)
[1] "1" "2" "3" "4" "5" "6"
> paste("A", 1:6, sep = "=")
[1] "A=1" "A=2" "A=3" "A=4" "A=5" "A=6"
> paste("A", 1:6, sep = "=", collapse=";")
[1] "A=1;A=2;A=3;A=4;A=5;A=6"
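The two-step behaviour (join element-wise with sep, then join the results with collapse) can be seen in a small sketch; the keys and values are made up for illustration.

```r
keys <- c("a", "b", "c")
vals <- 1:3

pairs <- paste(keys, vals, sep = "=")              # "a=1" "b=2" "c=3"
query <- paste(pairs, collapse = "&")              # "a=1&b=2&c=3"

# paste0 does both steps at once with an empty separator.
same <- paste0(keys, "=", vals, collapse = "&")    # "a=1&b=2&c=3"

# Shorter arguments are recycled.
ids <- paste0("id", 1:3)                           # "id1" "id2" "id3"
```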
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
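Splits the elements of the character vector x at matches of split.
With fixed = TRUE, split is matched literally;
otherwise it is treated as a regular expression.
Use split = NULL to split into individual characters.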
> x <- c(as = "mfe", qu = "qwerty", "70", "yes")
> strsplit(x, "e")
$as
[1] "mf"

$qu
[1] "qw" "rty"

[[3]]
[1] "70"

[[4]]
[1] "y" "s"

> strsplit("Hello world!", NULL)
[[1]]
[1] "H" "e" "l" "l" "o" " " "w" "o" "r" "l" "d" "!"

> ## Note that 'split' is a regexp!
> unlist(strsplit("a.b.c", "."))
[1] "" "" "" "" ""
> ## If you really want to split on '.', use
> unlist(strsplit("a.b.c", "[.]"))
[1] "a" "b" "c"
> unlist(strsplit("a.b.c", ".", TRUE))
[1] "a" "b" "c"
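Because strsplit returns a list with one element per input string, a typical follow-up is to pick pieces out of that list. A minimal sketch with made-up dotted names:

```r
paths <- c("usr.local.bin", "etc.hosts")

# Two equivalent ways to split on a literal dot.
parts_fixed <- strsplit(paths, ".", fixed = TRUE)
parts_regex <- strsplit(paths, "\\.")

# First component of each element, and the number of components.
first <- vapply(parts_fixed, function(p) p[1], character(1))   # "usr" "etc"
n     <- vapply(parts_fixed, length, integer(1))               # 3 2
```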

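Character translation and case conversion
chartr(old, new, x)
Translates each character of x that appears in old into the corresponding character of new.
> x <- "MiXeD cAsE 123"
> chartr("iXs", "why", x)
[1] "MwheD cAyE 123"
> chartr("a-cX", "D-Fw", x)
[1] "MiweD FAsE 123"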
tolower(x)
toupper(x)
casefold(x, upper = FALSE)
casefold is a wrapper around tolower and toupper, provided for compatibility with S-PLUS.
> x <- "MiXeD cAsE 123"
> tolower(x)
[1] "mixed case 123"
> toupper(x)
[1] "MIXED CASE 123"
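A short sketch of how casefold relates to the other two functions, plus one more chartr use. The DNA string is just an illustrative input.

```r
x <- "MiXeD cAsE 123"

up  <- casefold(x, upper = TRUE)    # same result as toupper(x)
low <- casefold(x)                  # upper = FALSE by default, same as tolower(x)

# chartr maps characters one-to-one, here to complement a DNA sequence.
dna <- chartr("ACGT", "TGCA", "GATTACA")   # "CTAATGT"
```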
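Formatted output
sprintf(fmt, ...) a wrapper around the C library function sprintf
> sprintf("%s is %f feet tall\n", "Sven", 7.1)
[1] "Sven is 7.100000 feet tall\n"
format formats an object for pretty printing
formatC formats numbers using C-style format specifications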
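A few common patterns for the three functions, as a hedged sketch; the values and variable names are illustrative only.

```r
fixed2 <- sprintf("%5.2f", pi)                    # " 3.14": width 5, two decimals
padded <- sprintf("%03d", 7L)                     # "007": zero-padded integer
pairs  <- sprintf("%s=%d", c("a", "b"), 1:2)      # vectorised: "a=1" "b=2"
pct    <- sprintf("%d%%", 10L)                    # "10%": %% gives a literal percent sign

two_dp <- formatC(3.14159, digits = 2, format = "f")   # "3.14"
zeros  <- formatC(42L, width = 6, flag = "0")          # "000042"

wide   <- format("ab", width = 5)                 # "ab   ": strings are padded on the right
commas <- format(1234567, big.mark = ",")         # "1,234,567"
```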
strwrap(x, width = 0.9 * getOption("width"),
indent = 0, exdent = 0, prefix = "",
simplify = TRUE, initial = prefix)
Wraps character strings into formatted paragraphs.
> str <- "Now is the time "
> strwrap(str, width=60,indent=1)
[1] " Now is the time"
> strwrap(str, width=60,indent=2)
[1] "  Now is the time"
> strwrap(str, width=60,indent=3)
[1] "   Now is the time"
> strwrap(str, prefix="kx>")
[1] "kx>Now is the time"
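The examples above fit on one line, so they do not show any wrapping. A minimal sketch with a longer, made-up sentence and a narrow width:

```r
txt <- "The quick brown fox jumps over the lazy dog and keeps on running"

# indent applies to the first line, exdent to all following lines.
wrapped <- strwrap(txt, width = 20, indent = 2, exdent = 4)

line_widths <- nchar(wrapped)                          # every line is shorter than width
rejoined    <- paste(trimws(wrapped), collapse = " ")  # wrapping loses no words

cat(wrapped, sep = "\n")
```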
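String matching
pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)
Partial string matching; returns the indices of the matches in table.

The behaviour of pmatch depends on the duplicates.ok argument.
When duplicates.ok is TRUE, an element of x that has an exact match gets the index of the first exact match; failing that, if it has exactly one partial match it gets the index of that match; otherwise it gets the nomatch value.
An empty string matches nothing, not even another empty string.
When duplicates.ok is FALSE, a value of table that has been matched is excluded from later matches. This mirrors R's argument matching, except for the handling of empty strings.
NA is treated as the character constant "NA".
> pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = FALSE)
[1] NA 2 1
> pmatch(c("", "ab", "ab"), c("abc", "ab"), dup = TRUE)
[1] NA 2 2
> pmatch("m", c("mean", "median", "mode")) # returns NA
[1] NA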
charmatch(x, table, nomatch = NA_integer_)
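Partial string matching; returns the indices of the matches in table.

charmatch resembles pmatch with duplicates.ok = TRUE: if there is a single exact match, its index is returned; otherwise, if there is a unique partial match, the index of that match is returned; multiple exact matches or multiple partial matches give 0; and no match gives the nomatch value.
charmatch allows an empty string to be matched.
NA is treated as the character constant "NA".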
> charmatch(c("", "ab", "ab"), c("abc","ab"))
[1] 0 2 2
> charmatch("m", c("mean", "median", "mode")) # returns 0
[1] 0
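The differences between pmatch and charmatch show up on ambiguous and empty inputs. A side-by-side sketch (the table is the one used above; the variable names are mine):

```r
tbl <- c("mean", "median", "mode")

# Ambiguous prefix: pmatch reports no match, charmatch reports 0.
p_amb <- pmatch("me", tbl)       # NA
c_amb <- charmatch("me", tbl)    # 0

# Unique prefix: both return the index.
p_uni <- pmatch("mo", tbl)       # 3
c_uni <- charmatch("mo", tbl)    # 3

# Empty string: pmatch never matches it, charmatch does.
p_empty <- pmatch("", "")        # NA
c_empty <- charmatch("", "")     # 1
```

The 0 returned by charmatch makes it possible to tell "ambiguous" apart from "not found", which pmatch cannot do.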
match(x, table, nomatch = NA_integer_, incomparables = NULL)
x %in% table
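Value matching, not limited to strings. match returns the position of the first match in table for each element of x; %in% returns a logical vector saying whether each element has a match.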
> sstr <-c("e","ab","M",NA,"@","bla","P","%")
> sstr[sstr %in% c(letters, LETTERS)]
[1] "e" "M" "P"
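A minimal sketch of match and %in% on a table with a duplicate, using made-up values:

```r
tbl <- c("a", "b", "c", "b")
x   <- c("b", "z", "a")

pos    <- match(x, tbl)                 # 2 NA 1: position of the first match only
found  <- x %in% tbl                    # TRUE FALSE TRUE
pos0   <- match(x, tbl, nomatch = 0L)   # 2 0 1: custom value for "no match"

# Works for non-character values as well.
num_pos <- match(2, c(1L, 2L, 3L))      # 2
```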
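Pattern matching and replacement
grep(pattern,x,ignore.case=FALSE,
perl=FALSE,value=FALSE,fixed=FALSE,
useBytes=FALSE,invert=FALSE)
Returns the indices of the matching elements (or the elements themselves when value = TRUE).
grepl(pattern,x,ignore.case=FALSE,
perl=FALSE,fixed=FALSE,useBytes=FALSE)
Returns a logical vector indicating which elements match.
sub(pattern,replacement,x,ignore.case=FALSE,
perl=FALSE,fixed=FALSE,useBytes=FALSE)
Replaces the first match in each string.
gsub(pattern,replacement,x,ignore.case=FALSE,
perl=FALSE,fixed=FALSE,useBytes=FALSE)
Replaces all matches in each string.
regexpr(pattern,text,ignore.case=FALSE,
perl=FALSE,fixed=FALSE,useBytes=FALSE)
Returns the position and length of the first match in each string.
gregexpr(pattern,text,ignore.case=FALSE,
perl=FALSE,fixed=FALSE,useBytes=FALSE)
Returns the positions and lengths of all matches in each string.
regexec(pattern,text,ignore.case=FALSE,
fixed=FALSE,useBytes=FALSE)
Returns the position and length of the first match in each string, along with those of any parenthesized subexpressions.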

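These functions (except regexec, which in the R version used here does not support Perl-style regular expressions) work in one of three modes:
  1. fixed = TRUE: exact (literal) matching
  2. perl = TRUE: Perl-style regular expressions
  3. fixed = FALSE and perl = FALSE: POSIX 1003.2 extended regular expressions
With useBytes = TRUE matching is done byte by byte; otherwise it is done character by character.
Its main effect is to avoid errors and warnings about invalid inputs and spurious matches in multibyte locales, but for regexpr it also changes how the output is interpreted.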
It also prevents inputs with a marked encoding from being converted, and is forced on whenever any input is marked as "bytes".
> str<-c("Now is ","the"," time  ")
> grep(" +", str)
[1] 1 3
> grepl(" +", str)
[1] TRUE FALSE TRUE
> sub(" +", "", str)
[1] "Nowis " "the" "time  "
> sub("[[:space:]]+", "", str) ## white space, POSIX-style
[1] "Nowis " "the" "time  "
> sub("\\s+", "", str, perl = TRUE) ## Perl-style white space
[1] "Nowis " "the" "time  "
> gsub(" +", "", str)
[1] "Nowis" "the" "time"
> regexpr(" +", str)
[1] 4 -1 1
attr(,"match.length")
[1] 1 -1 1
attr(,"useBytes")
[1] TRUE
> gregexpr(" +", str)
[[1]]
[1] 4 7
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1 6
attr(,"match.length")
[1] 1 2
attr(,"useBytes")
[1] TRUE

> regexec(" +", str)
[[1]]
[1] 4
attr(,"match.length")
[1] 1

[[2]]
[1] -1
attr(,"match.length")
[1] -1

[[3]]
[1] 1
attr(,"match.length")
[1] 1
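The three matching modes listed earlier (fixed, perl, and the default POSIX extended syntax) can be compared on the same inputs. A minimal sketch with made-up strings:

```r
x <- c("a.b", "axb", "ab")

# fixed = TRUE: "." is a literal dot.
lit <- grepl(".", x, fixed = TRUE)              # TRUE FALSE FALSE

# Default mode: "." matches any single character unless escaped.
any_chr <- grepl("a.b", x)                      # TRUE TRUE FALSE
esc     <- grepl("a\\.b", x)                    # TRUE FALSE FALSE

# perl = TRUE enables Perl-only features such as lookbehind.
look <- gsub("(?<=a)b", "X", "abab", perl = TRUE)    # "aXaX"

# POSIX character classes work in the default mode.
squeezed <- gsub("[[:space:]]+", " ", "a  b \t c")   # "a b c"
```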

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value
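Extract or replace the substrings matched by a regular expression.
With invert = TRUE, the non-matched substrings are extracted or replaced instead.
> str <- c("Now is ", "the", " time ")
> m <- regexpr(" +", str)
> regmatches(str, m) <- "kx"
> str
[1] "Nowkxis " "the" "kxtime "
> str <- c("Now is ", "the", " time ")
> m <- gregexpr(" +", str)
> regmatches(str, m, invert = TRUE) <- "kx"
> str
[1] "kx kx kx" "kx" "kx kx kx"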
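The examples above use the replacement form. The extraction form pairs naturally with regexpr, gregexpr and regexec, as in this sketch (inputs and names are illustrative):

```r
s <- c("a1b22c333", "none", "x9")

# gregexpr: all matches, one list element per input string.
all_nums <- regmatches(s, gregexpr("[0-9]+", s))

# regexpr: first match only; strings without a match are dropped.
first_nums <- regmatches(s, regexpr("[0-9]+", s))     # "1" "9"

# regexec: full match followed by the parenthesized groups.
kv     <- "key=value"
groups <- regmatches(kv, regexec("(\\w+)=(\\w+)", kv))[[1]]
# "key=value" "key" "value"
```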

agrep(pattern, x, max.distance = 0.1,
costs = NULL, ignore.case = FALSE,
value = FALSE, fixed = TRUE,
useBytes = FALSE)

agrepl(pattern, x, max.distance = 0.1,
costs = NULL, ignore.case = FALSE,
fixed = TRUE, useBytes = FALSE)

Approximate string matching using the generalized Levenshtein edit distance. To be studied further.
> str <- c("1 lazy", "1", "1 LAZY")
> agrep("laysy", str, max = 2)
[1] 1
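A few more calls help show what max.distance and ignore.case do. This sketch also uses utils::adist, which is not mentioned in the post, to print an edit distance directly.

```r
str <- c("1 lazy", "1", "1 LAZY")

# Default max.distance = 0.1 still allows one edit for a short pattern.
one_edit <- agrep("lasy", "1 lazy 2")                               # 1

# Case-insensitive matching finds both spellings.
both <- agrep("laysy", str, max.distance = 2, ignore.case = TRUE)   # 1 3

# value = TRUE returns the matching strings instead of their indices.
vals <- agrep("laysy", str, max.distance = 2, value = TRUE)         # "1 lazy"

# adist computes the generalized Levenshtein distance between whole strings.
d <- utils::adist("kitten", "sitting")[1, 1]                        # 3
```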
grepRaw(pattern, x, offset = 1L,
ignore.case = FALSE, value = FALSE,
fixed = FALSE, all = FALSE,
invert = FALSE)
Pattern matching on raw vectors.
> raws <- charToRaw("Now is the time ")
> raws
 [1] 4e 6f 77 20 69 73 20 74 68 65 20 74 69 6d 65 20
> grepRaw(charToRaw(" +"), raws)
[1] 4
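By default grepRaw reports only the first match. The all and value arguments change that, as in this minimal sketch on the same bytes:

```r
raws <- charToRaw("Now is the time ")

# all = TRUE returns the offset of every match (here: every space byte).
space_pos <- grepRaw(" ", raws, fixed = TRUE, all = TRUE)   # 4 7 11 16

# value = TRUE returns the matched bytes instead of their offset.
first_run <- grepRaw("[a-z]+", raws, value = TRUE)
first_txt <- rawToChar(first_run)                           # "ow"
```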
glob2rx(pattern, trim.head = FALSE, trim.tail = TRUE)
Converts a wildcard (glob) pattern into a regular expression.
> glob2rx("abc.*")
[1] "^abc\\."
> glob2rx("a?b.*")
[1] "^a.b\\."
> glob2rx("a?b.*", trim.tail = FALSE)
[1] "^a.b\\..*$"
> glob2rx("*.doc")
[1] "^.*\\.doc$"
> glob2rx("*.doc", trim.head = TRUE)
[1] "\\.doc$"
> glob2rx("*.t*")
[1] "^.*\\.t"
> glob2rx("*.t??")
[1] "^.*\\.t..$"
> glob2rx("*[*")
[1] "^.*\\["
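The usual reason to call glob2rx is to pass its result to grep or grepl. A minimal sketch with made-up file names:

```r
files <- c("report.doc", "notes.txt", "draft.doc.bak", "a.doc")

rx   <- utils::glob2rx("*.doc")         # "^.*\\.doc$"
docs <- files[grepl(rx, files)]         # "report.doc" "a.doc"

# "?" in a glob stands for exactly one character.
notes <- grep(utils::glob2rx("n?tes.*"), files, value = TRUE)   # "notes.txt"
```

Because the generated pattern is anchored at the end, "draft.doc.bak" is not treated as a .doc file.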

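Title: String Processing in R
Author: mryqu
License: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 3.0 CN. Please credit the source when reposting!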
