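Data formats
In general, the elements in each row of a data file are separated by a tab (txt files), a comma (csv files), spaces, and so on.
The data also has row names and column names, which may or may not be saved along with it.
Several common forms are shown below. The txt files use a tab as the delimiter (tabs may render as spaces in the listings) and the csv files use a comma. The content is the same in every file: the row names are ("a1", "a2") and the column names are ("A", "B", "C").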
a.txt:
    A B C
    a1 0.1 0.2 1
    a2 a "a" 1
b.txt:
    A B C
    0.1 0.2 1
    a "a" 1
c.txt:
    0.1 0.2 1
    a "a" 1
a.csv:
    ,A,B,C
    a1,0.1,0.2,1
    a2,a,"a",1
b.csv:
    A,B,C
    0.1,0.2,1
    a,"a",1
c.csv:
    0.1,0.2,1
    a,"a",1
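To follow along, the six files can be generated from R itself. The sketch below is only one way to do it: the output directory (`out_dir`, a temporary directory here) and the helper variable names are assumptions, not part of the original text.

```r
# Write the six example files into a temporary directory
# (an assumed location; any writable directory works).
out_dir <- tempdir()

row_ids  <- c("a1", "a2")
txt_body <- c("0.1\t0.2\t1", "a\t\"a\"\t1")   # tab-separated data rows
csv_body <- c("0.1,0.2,1", "a,\"a\",1")       # comma-separated data rows

# a.*: row names and column names
writeLines(c("A\tB\tC", paste(row_ids, txt_body, sep = "\t")),
           file.path(out_dir, "a.txt"))
writeLines(c(",A,B,C", paste(row_ids, csv_body, sep = ",")),
           file.path(out_dir, "a.csv"))

# b.*: column names only
writeLines(c("A\tB\tC", txt_body), file.path(out_dir, "b.txt"))
writeLines(c("A,B,C", csv_body), file.path(out_dir, "b.csv"))

# c.*: neither row names nor column names
writeLines(txt_body, file.path(out_dir, "c.txt"))
writeLines(csv_body, file.path(out_dir, "c.csv"))

# Number of fields on each line of a.txt: the header line has one fewer
# field than the data lines.
a_txt_fields <- count.fields(file.path(out_dir, "a.txt"), sep = "\t", quote = "")
print(a_txt_fields)
```

Note that the header of a.txt has three fields while the header of a.csv has four (the first one empty); this difference matters when the files are read back.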
Reading data
The read.table function is declared as follows:
    read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
               row.names, col.names, as.is = !stringsAsFactors,
               na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
               check.names = TRUE, fill = !blank.lines.skip,
               strip.white = FALSE, blank.lines.skip = TRUE,
               comment.char = "#", allowEscapes = FALSE, flush = FALSE,
               stringsAsFactors = default.stringsAsFactors(),
               fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
This signature comes from an older R release; since R 4.0.0 the default is stringsAsFactors = FALSE.
The following only illustrates the meaning of some read.table arguments by reading the files above. For the full details, see the help page (?read.table).
    a.txt <- read.table("a.txt", quote = "")
    # equivalent: a.txt <- read.table("a.txt", header = TRUE, quote = "", row.names = 1)
    b.txt <- read.table("b.txt", header = TRUE, quote = "")
    c.txt <- read.table("c.txt", quote = "")
    a.csv <- read.table("a.csv", header = TRUE, sep = ",", quote = "", row.names = 1)
    b.csv <- read.table("b.csv", header = TRUE, sep = ",", quote = "")
    c.csv <- read.table("c.csv", sep = ",", quote = "")
    # the same csv files read with read.csv
    a.csv <- read.csv("a.csv", quote = "", row.names = 1)
    b.csv <- read.csv("b.csv", quote = "")
    c.csv <- read.csv("c.csv", header = FALSE, quote = "")
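Here read.csv is a variant of read.table with defaults suited to csv files.
The code above shows the following:
1. header specifies whether the file contains column names. The default is FALSE for read.table and TRUE for read.csv. Note that a.txt contains column names yet is read without setting header. This works because, when header is not specified and the first line has one fewer element than the other lines, header is automatically set to TRUE.
2. sep specifies the delimiter. The default "" means white space: one or more spaces, tabs, newlines, or carriage returns. The default for read.csv is ",".
3. quote specifies the characters that enclose character values; they are stripped on input. The default is "\"'" for read.table and "\"" for read.csv, so quote is set to "" here to preserve the difference between a and "a".
4. row.names specifies the row names. If it is a number k, column k is used as the row names and removed from the data. Note that a.txt contains row names yet is read without setting row.names. This works because, when row.names is not specified, there is a header, and the first line has one fewer element than the other lines, the first column is automatically used as the row names.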
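The header/row.names inference and the effect of quote can be checked without any files by passing the lines through read.table's `text` argument. This is a small sketch; the variable names are illustrative, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# The contents of a.txt, one element per line (tab-separated).
a_lines <- c("A\tB\tC",
             "a1\t0.1\t0.2\t1",
             "a2\ta\t\"a\"\t1")

# header and row.names are both inferred, because the first line has
# one fewer field than the remaining lines.
auto <- read.table(text = a_lines, quote = "", stringsAsFactors = FALSE)
print(rownames(auto))
print(colnames(auto))

# quote = "" keeps the double quotes as part of the value ...
kept <- read.table(text = a_lines, quote = "", stringsAsFactors = FALSE)
# ... while the default quote setting strips them.
stripped <- read.table(text = a_lines, stringsAsFactors = FALSE)

print(kept$B)
print(stripped$B)
```

With the default quote setting, the values a and "a" in the second data row become indistinguishable after reading.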
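Writing data
The write.table function is declared as follows:
    write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
                eol = "\n", na = "NA", dec = ".", row.names = TRUE,
                col.names = TRUE, qmethod = c("escape", "double"),
                fileEncoding = "")
The following only illustrates the meaning of some write.table arguments by writing the data back out so that the files are identical to the ones that were read. For the full details, see the help page (?write.table).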
    # write.table is called for its side effect and returns NULL, so its
    # result is not assigned (that would overwrite the data frame).
    write.table(a.txt, "a.txt", quote = FALSE, sep = "\t")
    write.table(b.txt, "b.txt", quote = FALSE, sep = "\t", row.names = FALSE)
    write.table(c.txt, "c.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
    write.table(a.csv, "a.csv", quote = FALSE, sep = ",", col.names = NA)
    write.table(b.csv, "b.csv", quote = FALSE, sep = ",", row.names = FALSE)
    write.table(c.csv, "c.csv", quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE)
    # the same csv files written with write.csv
    write.csv(a.csv, "a.csv", quote = FALSE)
    write.csv(b.csv, "b.csv", quote = FALSE, row.names = FALSE)
    # write.csv ignores col.names (with a warning), so c.csv, which has no
    # header line, has to be written with write.table as above.
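Here write.csv is a variant of write.table with defaults suited to csv files.
The code above shows the following:
1. quote specifies whether character values are surrounded by double quotes.
2. sep specifies the delimiter. The default is " " for write.table; for write.csv it is "," and cannot be changed.
3. row.names specifies whether the row names are written.
4. col.names specifies whether the column names are written. In write.table it can be set to NA, which leaves an empty field in front of the column names so that the header lines up with the row-name column. In write.csv col.names cannot be changed: it is NA when row.names = TRUE (so the empty field is always written) and TRUE otherwise, and attempts to set it are ignored with a warning.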
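The difference between col.names = TRUE and col.names = NA is easiest to see in the header line itself. The sketch below relies on write.table's default file = "", which prints to the console, and collects the lines with capture.output(); the data frame `df` is a hand-built stand-in for the example data, not something defined in the original text.

```r
# A stand-in for the example data, with the quotes kept in column B.
df <- data.frame(A = c("0.1", "a"),
                 B = c("0.2", "\"a\""),
                 C = c(1L, 1L),
                 row.names = c("a1", "a2"),
                 stringsAsFactors = FALSE)

# col.names = TRUE (default): the header has one fewer field than the rows.
plain_header <- capture.output(write.table(df, quote = FALSE, sep = "\t"))

# col.names = NA: an empty field is written in front of the column names.
padded_header <- capture.output(write.table(df, quote = FALSE, sep = "\t",
                                            col.names = NA))

# write.csv with row names always writes the empty leading field.
csv_lines <- capture.output(write.csv(df, quote = FALSE))

print(plain_header[1])
print(padded_header[1])
print(csv_lines)
```

The first form matches a.txt and the csv form matches a.csv from the listings at the top.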
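Further reading
1. The default form of write.table output
default.table
    "A" "B" "C"
    "a1" "0.1" "0.2" 1
    "a2" "a" "\"a\"" 1
Running the following code leaves the default.table file unchanged:
    input <- read.table("default.table")
    write.table(input, "default.table")
2. write.csv has no such default form. Reading the default output of write.csv back with read.csv and default arguments turns the row names into an ordinary column (named X), so the file does not survive a round trip unchanged.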
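The two round trips can be compared directly. The following is a minimal sketch using temporary files and a small hand-built data frame (both are assumptions made for illustration); it avoids embedded quotes to keep the comparison simple.

```r
df <- data.frame(A = c("0.1", "a"), C = c(1L, 1L),
                 row.names = c("a1", "a2"),
                 stringsAsFactors = FALSE)

# write.table / read.table with default arguments: the file is reproduced.
first_pass  <- tempfile()
second_pass <- tempfile()
write.table(df, first_pass)
write.table(read.table(first_pass), second_pass)
table_round_trip <- identical(readLines(first_pass), readLines(second_pass))
print(table_round_trip)

# write.csv / read.csv with default arguments: the row names come back
# as a regular column, whose empty header is repaired to "X".
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path)
back <- read.csv(csv_path)
print(colnames(back))
```

Passing row.names = 1 to read.csv, as in the reading examples earlier, restores the row names.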
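3. R is mainly a data-analysis tool, and text processing is better left to Perl, so it helps to keep data files in a form that Perl can handle easily.
default.table
    A B C
    a1 0.1 0.2 1
    a2 a "a" 1
The code for reading and writing is as follows:
    # Normally quote does not need to be specified. The data may contain
    # characters such as spaces, so the separator should be given explicitly.
    # header = TRUE and row.names = 1 are given explicitly so that the file can
    # still be read after it has been written with col.names = NA, which puts
    # an empty field in front of the column names.
    default.table <- read.table("default.table", header = TRUE, sep = "\t", quote = "", row.names = 1)
    write.table(default.table, "default.table", quote = FALSE, sep = "\t", col.names = NA)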
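4. Using append to save intermediate results
For example, suppose each row should hold 100 random numbers drawn from U[0, 1]. The code below saves to the file every time 100 random numbers have been generated.
    # header line: an empty field for the row-name column, then X1 ... X100
    header <- t(c("", paste("X", 1:100, sep = "")))
    write.table(header, "random.table", quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
    i <- 1
    # runs until interrupted; each iteration appends one row to the file
    while (TRUE) {
        write.table(t(c(i, runif(100))), "random.table", append = TRUE, quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
        i <- i + 1
    }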
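A bounded variant makes it easier to confirm that the appended file reads back as expected. This is a sketch only: the row count `n_rows`, the temporary file path, and the variable names are assumptions chosen for illustration.

```r
path   <- tempfile()
n_rows <- 5      # hypothetical number of rows to generate
n_cols <- 100

# Header line: an empty field for the row-name column, then X1 ... X100.
header <- t(c("", paste0("X", seq_len(n_cols))))
write.table(header, path, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)

# Append one row per iteration, so earlier rows are already on disk
# if a later iteration fails.
for (i in seq_len(n_rows)) {
  write.table(t(c(i, runif(n_cols))), path, append = TRUE,
              quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = FALSE)
}

# Read the result back; the first column holds the row index.
result <- read.table(path, header = TRUE, sep = "\t", row.names = 1)
print(dim(result))
```

Because col.names = FALSE is used on every append, no header is repeated inside the file.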
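5. Reading data with rows of unequal length
awful.table
    A B C
    a1 0.1 0.2 1
    a2 a "a" 1
    a3 3 3 3
    a4 4 4 4 4
    a5 5 5 5 5 5
    a6 6 6 6 6 6
Some rows clearly have more elements than there are column names. As far as I can tell, read.table cannot fill in the missing column names automatically (corrections are welcome), so header is set to FALSE. To give every row the same number of elements, set fill to TRUE. To turn empty cells into NA, set na.strings to "". To have fields that contain only white space between separators treated as empty, set strip.white to TRUE. To keep columns with character elements from becoming factors, set stringsAsFactors to FALSE (the default since R 4.0.0).
Note that the column names must be given explicitly in this case. Otherwise R determines the number of columns from the first 5 lines only, so col.names = paste("X", 1:7, sep = "") is set here.
    awful.table <- read.table("awful.table", header = FALSE, sep = "\t", quote = "", col.names = paste("X", 1:7, sep = ""), na.strings = "", fill = TRUE, strip.white = TRUE, stringsAsFactors = FALSE)
    # the first line of the file holds the column labels
    labels <- unlist(awful.table[1, ], use.names = FALSE)
    awful.table <- awful.table[-1, ]
    # the first column holds the row names
    rownames(awful.table) <- awful.table[, 1]
    awful.table <- awful.table[, -1]
    # the header line has no field for the row-name column, so its labels
    # belong to the remaining data columns; unlabelled columns keep X5, X6, ...
    labels <- labels[seq_len(ncol(awful.table))]
    colnames(awful.table) <- ifelse(is.na(labels), colnames(awful.table), labels)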
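Rather than hard-coding the number of columns, it can be measured first with count.fields() from the utils package, which reports the number of fields on each line. The sketch below recreates the file shown above in a temporary location (an assumption made so the example is self-contained) and uses the maximum field count to build col.names.

```r
# Recreate awful.table as listed above (tab-separated).
path <- tempfile()
writeLines(c("A\tB\tC",
             "a1\t0.1\t0.2\t1",
             "a2\ta\t\"a\"\t1",
             "a3\t3\t3\t3",
             "a4\t4\t4\t4\t4",
             "a5\t5\t5\t5\t5\t5",
             "a6\t6\t6\t6\t6\t6"), path)

# Fields per line, using the same sep and quote as the later read.table call.
n_fields <- count.fields(path, sep = "\t", quote = "")
max_cols <- max(n_fields)
print(n_fields)

# Supplying max_cols names makes read.table allocate enough columns,
# instead of guessing from the first five lines.
wide <- read.table(path, header = FALSE, sep = "\t", quote = "",
                   col.names = paste0("X", seq_len(max_cols)),
                   na.strings = "", fill = TRUE,
                   stringsAsFactors = FALSE)
print(dim(wide))
```

The row-name and column-label handling from the code above can then be applied to `wide` unchanged.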