本教材之智慧財產權, 屬木刻思股份有限公司所有。
如果有朋友,覺得此教材很棒,希望能分享給朋友,或是拿此教材開課。非常歡迎大家來信至 course@agilearning.io 請求教材的使用授權唷!
Extract
[T] Transform
[L] Load
packages
與函式 functions
readLine
| read.table
?readLines
?read.table
XLConnect
| xlsx
library(XLConnect)
## Loading required package: XLConnectJars
## XLConnect 0.2-11 by Mirai Solutions GmbH [aut],
## Martin Studer [cre],
## The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
## Codec),
## Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
##
## Attaching package: 'xlsx'
##
## The following objects are masked from 'package:XLConnect':
##
## createFreezePane, createSheet, createSplitPane, getCellStyle,
## getSheets, loadWorkbook, removeSheet, saveWorkbook,
## setCellStyle, setColumnWidth, setRowHeight
readxl
library(readxl)
magrittr
| dplyr
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
RSQLite
library(RSQLite)
## Loading required package: DBI
shell
輸入指令下載資料集git clone https://github.com/suensummit/RCourseBankSinoPac.git
setwd("RCourseBankSinoPac")
dir()
## [1] "data" "img" "index.html"
## [4] "index.Rmd" "installPackages.R" "LICENSE"
## [7] "main.css" "mops.html" "README.md"
## [10] "temp.Rmd" "tryCodes" "week2handouts.html"
## [13] "week2handouts.Rmd"
rJava
安裝在 Windows 系統裡使用 R,有時候會遇到一些棘手的 error;其中除了跟系統編碼(顯示亂碼)相關的問題以外,有一大宗都跟 rJava
這個套件有關。
今天先帶領大家操作一次,當遇到問題時,可以如何去解決。
library(rJava)
getwd
setwd
dir
tab
control
+ enter
shift
+ control
+ c
library(Lahman)
## View Data
head(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0
## SO IBB HBP SH SF GIDP
## 1 0 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 5 NA NA NA NA NA
## 4 2 NA NA NA NA NA
## 5 1 NA NA NA NA NA
## 6 1 NA NA NA NA NA
tail(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB
## 99841 zeidjo01 2014 1 HOU AL 23 1 0 0 0 0 0 0 0
## 99842 zieglbr01 2014 1 ARI NL 68 1 0 0 0 0 0 0 0
## 99843 zimmejo02 2014 1 WAS NL 32 55 3 10 1 0 0 1 0
## 99844 zimmery01 2014 1 WAS NL 61 214 26 60 19 1 5 38 0
## 99845 zobribe01 2014 1 TBA AL 146 570 83 155 34 3 10 52 10
## 99846 zuninmi01 2014 1 SEA AL 131 438 51 87 20 2 22 60 0
## CS BB SO IBB HBP SH SF GIDP
## 99841 0 0 1 0 0 0 0 0
## 99842 0 0 1 0 0 0 0 0
## 99843 0 2 21 0 0 9 1 0
## 99844 0 22 37 0 0 0 4 6
## 99845 5 75 84 4 1 2 6 8
## 99846 3 17 158 1 17 0 4 12
## View(Batting)
summary(Batting)
## playerID yearID stint teamID
## Length:99846 Min. :1871 Min. :1.000 CHN : 4768
## Class :character 1st Qu.:1932 1st Qu.:1.000 PHI : 4671
## Mode :character Median :1971 Median :1.000 PIT : 4621
## Mean :1963 Mean :1.077 SLN : 4581
## 3rd Qu.:1996 3rd Qu.:1.000 CIN : 4438
## Max. :2014 Max. :5.000 CLE : 4402
## (Other):72365
## lgID G AB R
## AA: 1890 Min. : 0.00 Min. : 0.0 Min. : 0.00
## AL:45631 1st Qu.: 13.00 1st Qu.: 7.0 1st Qu.: 0.00
## FL: 470 Median : 35.00 Median : 58.0 Median : 5.00
## NA: 737 Mean : 51.46 Mean :150.6 Mean : 19.98
## NL:50639 3rd Qu.: 81.00 3rd Qu.:252.0 3rd Qu.: 30.00
## PL: 147 Max. :165.00 Max. :716.0 Max. :192.00
## UA: 332 NA's :5149 NA's :5149
## H X2B X3B HR
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 11.00 Median : 2.000 Median : 0.000 Median : 0.000
## Mean : 39.43 Mean : 6.654 Mean : 1.385 Mean : 2.944
## 3rd Qu.: 64.00 3rd Qu.:10.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :262.00 Max. :67.000 Max. :36.000 Max. :73.000
## NA's :5149 NA's :5149 NA's :5149 NA's :5149
## RBI SB CS BB
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 4.00 Median : 0.000 Median : 0.000 Median : 3.00
## Mean : 18.04 Mean : 3.182 Mean : 1.337 Mean : 13.88
## 3rd Qu.: 27.00 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 20.00
## Max. :191.00 Max. :138.000 Max. :42.000 Max. :232.00
## NA's :5573 NA's :6449 NA's :28603 NA's :5149
## SO IBB HBP SH
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 10.00 Median : 0.00 Median : 0.000 Median : 1.000
## Mean : 21.57 Mean : 1.23 Mean : 1.114 Mean : 2.486
## 3rd Qu.: 30.00 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :223.00 Max. :120.00 Max. :51.000 Max. :67.000
## NA's :12987 NA's :41712 NA's :7959 NA's :11487
## SF GIDP
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 1.000
## Mean : 1.16 Mean : 3.225
## 3rd Qu.: 2.00 3rd Qu.: 5.000
## Max. :19.00 Max. :36.000
## NA's :41181 NA's :31257
## explore data structure
str(Batting)
## 'data.frame': 99846 obs. of 22 variables:
## $ playerID: chr "abercda01" "addybo01" "allisar01" "allisdo01" ...
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1871 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 136 111 39 142 111 56 111 24 56 24 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ G : int 1 25 29 27 25 12 1 31 1 18 ...
## $ AB : int 4 118 137 133 120 49 4 157 5 86 ...
## $ R : int 0 30 28 28 29 9 0 66 1 13 ...
## $ H : int 0 32 40 44 39 11 1 63 1 13 ...
## $ X2B : int 0 6 4 10 11 2 0 10 1 2 ...
## $ X3B : int 0 0 5 2 3 1 0 9 0 1 ...
## $ HR : int 0 0 0 2 0 0 0 0 0 0 ...
## $ RBI : int 0 13 19 27 16 5 2 34 1 11 ...
## $ SB : int 0 8 3 1 6 0 0 11 0 1 ...
## $ CS : int 0 1 1 1 2 1 0 6 0 0 ...
## $ BB : int 0 4 2 0 2 0 1 13 0 0 ...
## $ SO : int 0 0 5 2 1 1 0 1 0 0 ...
## $ IBB : int NA NA NA NA NA NA NA NA NA NA ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SH : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ GIDP : int NA NA NA NA NA NA NA NA NA NA ...
str(attributes(Batting))
## List of 3
## $ names : chr [1:22] "playerID" "yearID" "stint" "teamID" ...
## $ row.names: int [1:99846] 1 2 3 4 5 6 7 8 9 10 ...
## $ class : chr "data.frame"
class(Batting)
## [1] "data.frame"
## data shape
dim(Batting)
## [1] 99846 22
ncol(Batting)
## [1] 22
nrow(Batting)
## [1] 99846
library(httr)
library(rvest)
## Loading required package: xml2
## Connector
Target_URL = "http://tw.stock.yahoo.com/d/s/major_2330.html"
res <- GET(Target_URL)
doc_str <- content(res, type = "text", encoding = "big5")
## Parser
if (.Platform$OS.type == "windows"){
Sys.setlocale(category='LC_ALL', locale='C')
data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
Sys.setlocale(category='LC_ALL', locale='cht')
data_table <- apply(data_table[[1]],2,function(x) iconv(x,from = "utf8"))
colnames(data_table) <- iconv(colnames(data_table), from = "utf8")
} else{
data_table <- doc_str %>% read_html(encoding = "big-5") %>% html_nodes(xpath = "//table[1]//table[2]") %>% html_table(header=TRUE)
data_table <- data_table[[1]]
}
# View(data_table)
data_table
## 買超券商 買進 賣出 買超 賣超券商 買進 賣出 賣超
## 1 台灣東方匯 3318 184 3134 凱基證券 2088 9677 -7589
## 2 摩根大通 2277 186 2091 康和證券 116 4069 -3953
## 3 花旗環球 2571 1124 1447 港商麥格里 1272 5051 -3779
## 4 永豐金證券 2772 1350 1422 國泰證券 255 2377 -2122
## 5 宏遠證券 1296 30 1266 新加坡商瑞 1954 2838 -884
## 6 台灣摩根士 2667 1502 1165 香港上海匯 36 405 -369
## 7 美商高盛 2723 1639 1084 光隆 2 3 -1
## 8 元大證券 2292 1397 895 豐德證券 5 5 0
## 9 美林 2075 1205 870 港商聯昌 0 0 0
## 10 玉山證券 667 28 639 富隆證券 2 2 0
## 11 日盛證券 796 178 618 萬泰 0 0 0
## 12 元富 883 298 585 大鼎證券 0 0 0
## 13 富邦證券 804 303 501 福邦證券 0 0 0
## 14 群益金鼎 537 130 407 全泰證券 0 0 0
## 15 兆豐證券 575 171 404 大和國泰 0 0 0
head(data_table)
## 買超券商 買進 賣出 買超 賣超券商 買進 賣出 賣超
## 1 台灣東方匯 3318 184 3134 凱基證券 2088 9677 -7589
## 2 摩根大通 2277 186 2091 康和證券 116 4069 -3953
## 3 花旗環球 2571 1124 1447 港商麥格里 1272 5051 -3779
## 4 永豐金證券 2772 1350 1422 國泰證券 255 2377 -2122
## 5 宏遠證券 1296 30 1266 新加坡商瑞 1954 2838 -884
## 6 台灣摩根士 2667 1502 1165 香港上海匯 36 405 -369
colnames(data_table) <- c("券商", "買進", "賣出", "差距", "券商", "買進", "賣出", "差距")
stockMajor <- rbind(data_table[,1:4], data_table[,5:8])
# stockMajor <- cbind(日期 = c(Sys.Date()), 股票代號 = c("2330"), 股票名稱 = c("台積電"), stockMajor)
stockMajor <- cbind(日期 = c("2015-12-31"), 股票代號 = c("2330"), 股票名稱 = c("台積電"), stockMajor)
# View(stockMajor)
head(stockMajor)
## 日期 股票代號 股票名稱 券商 買進 賣出 差距
## 1 2015-12-31 2330 台積電 台灣東方匯 3318 184 3134
## 2 2015-12-31 2330 台積電 摩根大通 2277 186 2091
## 3 2015-12-31 2330 台積電 花旗環球 2571 1124 1447
## 4 2015-12-31 2330 台積電 永豐金證券 2772 1350 1422
## 5 2015-12-31 2330 台積電 宏遠證券 1296 30 1266
## 6 2015-12-31 2330 台積電 台灣摩根士 2667 1502 1165
readLines
,先把前幾筆資料讀進來觀察res <- readLines(con = "data/cl_info_other.csv", n = 10L, encoding = "BIG-5")
iconv(res,from='utf8',to='big5')
## [1] "etl_dt,bank_code,data_dt,bank_nm,mortgage_cnt,mortgage_bal,decorator_hse_cnt,decorator_hse_bal,ln_car_cnt,ln_car_bal,ln_worker_cnt,ln_worker_bal,other_cl_cnt,other_cl_bal"
## [2] "2013-11-26 22:30:07.971327,020 ,2006-01-01 00:00:00,\xa4\xe9\xb0\u04f7\xe7\xc1J\xb9\xea\xb7~\xbbȦ\xe6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"
## [3] "2013-11-26 22:30:07.974241,N000 ,2006-01-01 00:00:00,\xa5x\xa5_\xbf\xa4\xb2H\xa4\xf4\xb2Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,9924.0,16873000000.0,173.0,183000000.0,2.0,1000000.0,953.0,372000000.0,1101.0,309000000.0"
## [4] "2013-11-26 22:30:07.979319,809 ,2006-01-01 00:00:00,\xb8U\xae\xf5\xb0\u04f7~\xbbȦ\xe6,4051.0,5624000000.0,1329.0,1471000000.0,3128.0,926000000.0,0.0,0.0,1697098.0,76111000000.0"
## [5] "2013-11-26 22:30:07.995118,146 ,2006-01-01 00:00:00,\xa5x\xa4\xa4\xa5\xab\xb2ĤG\xabH\xa5ΦX\xa7@\xaa\xc0,11167.0,18949000000.0,118.0,97000000.0,11.0,0.0,0.0,0.0,1027.0,399000000.0"
## [6] "2013-11-26 22:30:08.015108,115 ,2006-01-01 00:00:00,\xb0\U000b6a65\xab\xb2ĤG\xabH\xa5ΦX\xa7@\xaa\xc0,1551.0,3773000000.0,336.0,470000000.0,178.0,47000000.0,0.0,0.0,2295.0,515000000.0"
## [7] "2013-11-26 22:30:08.052642,N005 ,2006-01-01 00:00:00,\xb0\xaa\xb6\xaf\xb2ĤT\xabH\xa5ΦX\xa7@\xaa\xc0,1903.0,2476000000.0,7828.0,11257000000.0,0.0,0.0,984.0,580000000.0,2866.0,946000000.0"
## [8] "2013-11-26 22:30:08.229513,803 ,2006-01-01 00:00:00,\xc1p\xa8\xb9\xb0\u04f7~\xbbȦ\xe6,35924.0,64191000000.0,11682.0,21458000000.0,49817.0,15844000000.0,277.0,61000000.0,377762.0,13496000000.0"
## [9] "2013-11-26 22:30:08.231682,114 ,2006-01-01 00:00:00,\xb0\U000b6a65\xab\xb2Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,5351.0,7501000000.0,1192.0,881000000.0,0.0,0.0,0.0,0.0,5281.0,2681000000.0"
## [10] "2013-11-26 22:30:08.056627,222 ,2006-01-01 00:00:00,\xbc\xea\xb4\U000bf932Ĥ@\xabH\xa5ΦX\xa7@\xaa\xc0,62.0,144000000.0,22.0,39000000.0,0.0,0.0,0.0,0.0,172.0,79000000.0"
head(res)
## [1] "etl_dt,bank_code,data_dt,bank_nm,mortgage_cnt,mortgage_bal,decorator_hse_cnt,decorator_hse_bal,ln_car_cnt,ln_car_bal,ln_worker_cnt,ln_worker_bal,other_cl_cnt,other_cl_bal"
## [2] "2013-11-26 22:30:07.971327,020 ,2006-01-01 00:00:00,日商瑞穗實業銀行,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"
## [3] "2013-11-26 22:30:07.974241,N000 ,2006-01-01 00:00:00,台北縣淡水第一信用合作社,9924.0,16873000000.0,173.0,183000000.0,2.0,1000000.0,953.0,372000000.0,1101.0,309000000.0"
## [4] "2013-11-26 22:30:07.979319,809 ,2006-01-01 00:00:00,萬泰商業銀行,4051.0,5624000000.0,1329.0,1471000000.0,3128.0,926000000.0,0.0,0.0,1697098.0,76111000000.0"
## [5] "2013-11-26 22:30:07.995118,146 ,2006-01-01 00:00:00,台中市第二信用合作社,11167.0,18949000000.0,118.0,97000000.0,11.0,0.0,0.0,0.0,1027.0,399000000.0"
## [6] "2013-11-26 22:30:08.015108,115 ,2006-01-01 00:00:00,基隆市第二信用合作社,1551.0,3773000000.0,336.0,470000000.0,178.0,47000000.0,0.0,0.0,2295.0,515000000.0"
read.table
colClasses
指定資料中欄位的大小及資料型態,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)# library(RCurl)
# Cl_info = read.table(sep=",", header=TRUE, stringsAsFactors=F, file=textConnection(getURL("https://raw.githubusercontent.com/suensummit/RCourseBankSinoPac/gh-pages/data/cl_info_other.csv")))
Cl_info = read.table(file = "data/cl_info_other.csv", sep = ",", stringsAsFactors = F, header = T, colClasses = c("character", "character", "character", "character", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric"), encoding = "utf8")
head(Cl_info)
## etl_dt bank_code data_dt
## 1 2013-11-26 22:30:07.971327 020 2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.974241 N000 2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.979319 809 2006-01-01 00:00:00
## 4 2013-11-26 22:30:07.995118 146 2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.015108 115 2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.052642 N005 2006-01-01 00:00:00
## bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1 日商瑞穗實業銀行 0 0.0000e+00 0
## 2 台北縣淡水第一信用合作社 9924 1.6873e+10 173
## 3 萬泰商業銀行 4051 5.6240e+09 1329
## 4 台中市第二信用合作社 11167 1.8949e+10 118
## 5 基隆市第二信用合作社 1551 3.7730e+09 336
## 6 高雄第三信用合作社 1903 2.4760e+09 7828
## decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1 0.0000e+00 0 0.00e+00 0 0.00e+00
## 2 1.8300e+08 2 1.00e+06 953 3.72e+08
## 3 1.4710e+09 3128 9.26e+08 0 0.00e+00
## 4 9.7000e+07 11 0.00e+00 0 0.00e+00
## 5 4.7000e+08 178 4.70e+07 0 0.00e+00
## 6 1.1257e+10 0 0.00e+00 984 5.80e+08
## other_cl_cnt other_cl_bal
## 1 0 0.0000e+00
## 2 1101 3.0900e+08
## 3 1697098 7.6111e+10
## 4 1027 3.9900e+08
## 5 2295 5.1500e+08
## 6 2866 9.4600e+08
str(Cl_info)
## 'data.frame': 9041 obs. of 14 variables:
## $ etl_dt : chr "2013-11-26 22:30:07.971327" "2013-11-26 22:30:07.974241" "2013-11-26 22:30:07.979319" "2013-11-26 22:30:07.995118" ...
## $ bank_code : chr "020 " "N000 " "809 " "146 " ...
## $ data_dt : chr "2006-01-01 00:00:00" "2006-01-01 00:00:00" "2006-01-01 00:00:00" "2006-01-01 00:00:00" ...
## $ bank_nm : chr "日商瑞穗實業銀行" "台北縣淡水第一信用合作社" "萬泰商業銀行" "台中市第二信用合作社" ...
## $ mortgage_cnt : num 0 9924 4051 11167 1551 ...
## $ mortgage_bal : num 0.00 1.69e+10 5.62e+09 1.89e+10 3.77e+09 ...
## $ decorator_hse_cnt: num 0 173 1329 118 336 ...
## $ decorator_hse_bal: num 0.00 1.83e+08 1.47e+09 9.70e+07 4.70e+08 ...
## $ ln_car_cnt : num 0 2 3128 11 178 ...
## $ ln_car_bal : num 0.00 1.00e+06 9.26e+08 0.00 4.70e+07 ...
## $ ln_worker_cnt : num 0 953 0 0 0 984 277 0 0 0 ...
## $ ln_worker_bal : num 0.00 3.72e+08 0.00 0.00 0.00 5.80e+08 6.10e+07 0.00 0.00 0.00 ...
## $ other_cl_cnt : num 0 1101 1697098 1027 2295 ...
## $ other_cl_bal : num 0.00 3.09e+08 7.61e+10 3.99e+08 5.15e+08 ...
file
就是指讀入的檔案路徑sep
指的是欄位分割用的符號,通常 csv 檔案格式是透過,做分割True
, 讓讀入的字串用 factor
儲存,那麼資料就會轉為整數儲存與對照表False
,表示第一行是不是表格標頭,輸出的 data.frame 欄位名的 colnames
write.table
寫資料write.table(x = Cl_info, file = "data/cl_info_other_update.csv", sep = ",", quote = FALSE)
gdata
## Need Perl on Windows
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
##
## The following objects are masked from 'package:dplyr':
##
## combine, first, last
##
## The following object is masked from 'package:stats':
##
## nobs
##
## The following object is masked from 'package:utils':
##
## object.size
XLConnect
and xlsx
rJava
,另外在處理中文編碼時需要使用 iconv
輔助。## Need Java/rJava
library(XLConnect)
library(xlsx)
套件::函式(public)
套件:::函式(private)
readxl::excel_sheets
取得 Excel 檔案的表單資訊(sheet)# library(readxl)
filename <- "data/10401信用卡重要資訊揭露.xlsx"
sheetNames <- readxl::excel_sheets(filename)
sheetNames
## [1] "10401"
readxl:::xlsx_col_types
觀察資料中欄位的大小及資料型態,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)colTypes <- readxl:::xlsx_col_types(path = filename, nskip = 10, n = 1)
colTypes
## [1] "text" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
readxl::read_excel
讀檔、xlsx::write.xlsx2
寫檔resxl <- readxl::read_excel(filename, sheet = sheetNames[1], col_names = FALSE, skip = 8, col_types = colTypes)
colnames(resxl) <- c("金融機構名稱", "流通卡數", "有效卡數", "當月發卡數", "當月停卡數", "循環信用餘額", "未到期分期付款餘額", "當月簽帳金額", "當月預借現金金額", "逾期三個月以上帳款占應收帳款餘額含催收款之比率", "逾期六個月以上帳款占應收帳款餘額含催收款之比率", "備抵呆帳提足率", "當月轉銷呆帳金額", "當年度轉銷呆帳金額累計至資料月份")
dataLength <- sum(1 - is.na(resxl$流通卡數))
resxl <- resxl[1:dataLength,]
# colnames(resxl) <- iconv(colnames(resxl), 'utf8', 'big5')
str(resxl)
## Classes 'tbl_df' and 'data.frame': 38 obs. of 14 variables:
## $ 金融機構名稱 : chr "臺灣銀行" "臺灣土地銀行" "合作金庫商業銀行" "第一商業銀行" ...
## $ 流通卡數 : num 225744 123416 400723 813420 749151 ...
## $ 有效卡數 : num 115423 54717 237647 555037 517241 ...
## $ 當月發卡數 : num 502 654 2606 15215 12108 ...
## $ 當月停卡數 : num 1591 817 7719 8392 9691 ...
## $ 循環信用餘額 : num 241782 204883 602344 1309186 665331 ...
## $ 未到期分期付款餘額 : num 14702 28732 193005 915441 807054 ...
## $ 當月簽帳金額 : num 653624 337772 1697919 3554150 3140069 ...
## $ 當月預借現金金額 : num 1489 570 4136 14732 1005 ...
## $ 逾期三個月以上帳款占應收帳款餘額含催收款之比率: num 0.326 0.3382 0.9299 0.2323 0.0819 ...
## $ 逾期六個月以上帳款占應收帳款餘額含催收款之比率: num 0.165 0.248 0.898 0 0 ...
## $ 備抵呆帳提足率 : num 518 1696 251 1484 1368 ...
## $ 當月轉銷呆帳金額 : num 318 1895 0 5498 0 ...
## $ 當年度轉銷呆帳金額累計至資料月份 : num 318 1895 0 5498 0 ...
# View(resxl)
function
批次執行。setwd("./data/")
filenames <- dir(pattern = "*.xlsx")
readCreditExcel <- function(filename) {
sheetNames <- readxl::excel_sheets(filename)
colTypes <- readxl:::xlsx_col_types(path = filename, nskip = 10, n = 1)
resxl <- readxl::read_excel(filename, sheet = sheetNames[1], col_names = FALSE, skip = 8, col_types = colTypes)
colnames(resxl) <- c("金融機構名稱", "流通卡數", "有效卡數", "當月發卡數", "當月停卡數", "循環信用餘額", "未到期分期付款餘額", "當月簽帳金額", "當月預借現金金額", "逾期三個月以上帳款占應收帳款餘額含催收款之比率", "逾期六個月以上帳款占應收帳款餘額含催收款之比率", "備抵呆帳提足率", "當月轉銷呆帳金額", "當年度轉銷呆帳金額累計至資料月份")
dataLength <- sum(1 - is.na(resxl$流通卡數))
resxl <- resxl[1:dataLength,]
resxl <- cbind(月份 = substr(filename, 1, 5), resxl)
return(resxl)
}
filenames
## [1] "10401信用卡重要資訊揭露.xlsx" "10402信用卡重要資訊揭露.xlsx"
## [3] "10403信用卡重要資訊揭露.xlsx" "10404信用卡重要資訊揭露.xlsx"
## [5] "10405信用卡重要資訊揭露.xlsx" "10406信用卡重要資訊揭露.xlsx"
## [7] "10407信用卡重要資訊揭露.xlsx" "10408信用卡重要資訊揭露.xlsx"
## [9] "10409信用卡重要資訊揭露.xlsx" "10410信用卡重要資訊揭露.xlsx"
creditData <- lapply(filenames, readCreditExcel)
creditData <- do.call("rbind", creditData)
# View(creditData)
setwd("../")
dplyr
dplyr
vignette
library(dplyr)
vignette(all = TRUE, package = "dplyr")
vignette("introduction", package = "dplyr")
data.frame
或資料庫中的表格 tbl
)select
對欄位做篩選filter
對資料列做篩選mutate
變更或新增欄位arrange
排序、排列group_by
+ summarise
分類bind
合併資料表select
Cl_demo1 = select(資料表, 欄位1, 欄位2, 欄位3)
select data_dt, bank_nm, mortgage_bal from Cl_info ;
# Example 1: select
Cl_demo1 = select(Cl_info, data_dt, bank_nm, mortgage_bal)
head(Cl_demo1)
## data_dt bank_nm mortgage_bal
## 1 2006-01-01 00:00:00 日商瑞穗實業銀行 0.0000e+00
## 2 2006-01-01 00:00:00 台北縣淡水第一信用合作社 1.6873e+10
## 3 2006-01-01 00:00:00 萬泰商業銀行 5.6240e+09
## 4 2006-01-01 00:00:00 台中市第二信用合作社 1.8949e+10
## 5 2006-01-01 00:00:00 基隆市第二信用合作社 3.7730e+09
## 6 2006-01-01 00:00:00 高雄第三信用合作社 2.4760e+09
filter
Cl_demo2 = filter(Cl_info, mortgage_bal > 1000000)
select * from Cl_info where mortgage > 1000000 ;
# Example 2: filter
Cl_demo2 = filter(Cl_info, mortgage_bal > 1000000)
head(Cl_demo2)
## etl_dt bank_code data_dt
## 1 2013-11-26 22:30:07.974241 N000 2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.979319 809 2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.995118 146 2006-01-01 00:00:00
## 4 2013-11-26 22:30:08.015108 115 2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.052642 N005 2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.229513 803 2006-01-01 00:00:00
## bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1 台北縣淡水第一信用合作社 9924 1.6873e+10 173
## 2 萬泰商業銀行 4051 5.6240e+09 1329
## 3 台中市第二信用合作社 11167 1.8949e+10 118
## 4 基隆市第二信用合作社 1551 3.7730e+09 336
## 5 高雄第三信用合作社 1903 2.4760e+09 7828
## 6 聯邦商業銀行 35924 6.4191e+10 11682
## decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1 1.8300e+08 2 1.0000e+06 953 3.72e+08
## 2 1.4710e+09 3128 9.2600e+08 0 0.00e+00
## 3 9.7000e+07 11 0.0000e+00 0 0.00e+00
## 4 4.7000e+08 178 4.7000e+07 0 0.00e+00
## 5 1.1257e+10 0 0.0000e+00 984 5.80e+08
## 6 2.1458e+10 49817 1.5844e+10 277 6.10e+07
## other_cl_cnt other_cl_bal
## 1 1101 3.0900e+08
## 2 1697098 7.6111e+10
## 3 1027 3.9900e+08
## 4 2295 5.1500e+08
## 5 2866 9.4600e+08
## 6 377762 1.3496e+10
mutate
Cl_demo3 = mutate(資料表, 新欄位名 = 運算式)
select mortgage_bal/1000000 as mortage from Cl_info ;
# Example 3: mutate
Cl_demo3 = mutate(Cl_info, mortage = mortgage_bal/1000000)
head(Cl_demo3)
## etl_dt bank_code data_dt
## 1 2013-11-26 22:30:07.971327 020 2006-01-01 00:00:00
## 2 2013-11-26 22:30:07.974241 N000 2006-01-01 00:00:00
## 3 2013-11-26 22:30:07.979319 809 2006-01-01 00:00:00
## 4 2013-11-26 22:30:07.995118 146 2006-01-01 00:00:00
## 5 2013-11-26 22:30:08.015108 115 2006-01-01 00:00:00
## 6 2013-11-26 22:30:08.052642 N005 2006-01-01 00:00:00
## bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1 日商瑞穗實業銀行 0 0.0000e+00 0
## 2 台北縣淡水第一信用合作社 9924 1.6873e+10 173
## 3 萬泰商業銀行 4051 5.6240e+09 1329
## 4 台中市第二信用合作社 11167 1.8949e+10 118
## 5 基隆市第二信用合作社 1551 3.7730e+09 336
## 6 高雄第三信用合作社 1903 2.4760e+09 7828
## decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1 0.0000e+00 0 0.00e+00 0 0.00e+00
## 2 1.8300e+08 2 1.00e+06 953 3.72e+08
## 3 1.4710e+09 3128 9.26e+08 0 0.00e+00
## 4 9.7000e+07 11 0.00e+00 0 0.00e+00
## 5 4.7000e+08 178 4.70e+07 0 0.00e+00
## 6 1.1257e+10 0 0.00e+00 984 5.80e+08
## other_cl_cnt other_cl_bal mortage
## 1 0 0.0000e+00 0
## 2 1101 3.0900e+08 16873
## 3 1697098 7.6111e+10 5624
## 4 1027 3.9900e+08 18949
## 5 2295 5.1500e+08 3773
## 6 2866 9.4600e+08 2476
arrange
Cl_demo4 = arrange(資料表, 欄位1, desc(欄位2)))
select * from Cl_info order by mortage, data_dt desc ;
# Example 4: arrange
Cl_demo4 = arrange(Cl_info, mortgage_bal, desc(data_dt))
head(Cl_demo4)
## etl_dt bank_code data_dt
## 1 2014-04-19 07:59:29.921756 093 2014-02-01 00:00:00
## 2 2014-04-19 07:59:29.824336 040 2014-02-01 00:00:00
## 3 2014-04-19 07:59:30.146603 090 2014-02-01 00:00:00
## 4 2014-04-19 07:59:29.907231 N026 2014-02-01 00:00:00
## 5 2014-04-19 07:59:30.024388 023 2014-02-01 00:00:00
## 6 2014-04-19 07:59:30.088385 N023 2014-02-01 00:00:00
## bank_nm mortgage_cnt mortgage_bal decorator_hse_cnt
## 1 荷蘭商安智銀行 0 0 0
## 2 中華開發工業銀行 0 0 0
## 3 加拿大商豐業銀行 0 0 0
## 4 日商瑞穗銀行 0 0 0
## 5 泰國盤谷銀行 0 0 0
## 6 大陸商交通銀行 0 0 0
## decorator_hse_bal ln_car_cnt ln_car_bal ln_worker_cnt ln_worker_bal
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## other_cl_cnt other_cl_bal
## 1 0 0e+00
## 2 8 2e+06
## 3 0 0e+00
## 4 0 0e+00
## 5 0 0e+00
## 6 0 0e+00
group_by
+ summarise
Cl_demo5 = summarise(group_by(資料表, 用以分組的欄位), 彙總欄位)
select mean(mortgage_cnt) from Cl_info group by bank_nm ;
# Example 5: group_by + summarise
Cl_demo5 = summarise(group_by(Cl_info, bank_nm), mean(mortgage_cnt))
head(Cl_demo5)
## Source: local data frame [6 x 2]
##
## bank_nm mean(mortgage_cnt)
## (chr) (dbl)
## 1 三信商業銀行 6.069959e+03
## 2 上海商業儲蓄銀行 3.268605e+04
## 3 中國信託商業銀行 1.427962e+05
## 4 中國國際商業銀行 3.420471e+04
## 5 中國輸出入銀行 0.000000e+00
## 6 中華開發工業銀行 1.704082e+00
# Example:
creditDataCompete <- select(creditData, 金融機構名稱, 月份, 流通卡數, 有效卡數, 當月發卡數, 當月停卡數, 逾期三個月以上帳款占應收帳款餘額含催收款之比率) %>% mutate(使用率 = (有效卡數/流通卡數)*100, 當月增減 = 當月發卡數 - 當月停卡數) %>% arrange(金融機構名稱)
head(creditDataCompete)
## 金融機構名稱 月份 流通卡數 有效卡數 當月發卡數 當月停卡數
## 1 三信商業銀行 10401 16466 11127 494 73
## 2 三信商業銀行 10402 16556 11162 202 58
## 3 三信商業銀行 10403 16702 11241 304 72
## 4 三信商業銀行 10404 16750 11326 187 60
## 5 三信商業銀行 10405 16804 11376 223 81
## 6 三信商業銀行 10406 16797 11383 241 109
## 逾期三個月以上帳款占應收帳款餘額含催收款之比率 使用率 當月增減
## 1 0.3446950 67.57561 421
## 2 0.3056296 67.41967 144
## 3 0.1372363 67.30332 232
## 4 0.1709332 67.61791 127
## 5 0.2034689 67.69817 142
## 6 0.4154962 67.76805 132
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbListTables(con)
## character(0)
dbWriteTable(con, "Batting", Batting)
## [1] TRUE
dbListTables(con)
## [1] "Batting"
dbListFields(con, "Batting")
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
## [7] "AB" "R" "H" "X2B" "X3B" "HR"
## [13] "RBI" "SB" "CS" "BB" "SO" "IBB"
## [19] "HBP" "SH" "SF" "GIDP"
# dbReadTable(con, "Batting")
# You can fetch all results:
res <- dbSendQuery(con, "SELECT * FROM Batting WHERE playerID = 'jeterde01'")
dbFetch(res)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 1 jeterde01 1995 1 NYA AL 15 48 5 12 4 1 0 7 0 0
## 2 jeterde01 1996 1 NYA AL 157 582 104 183 25 6 10 78 14 7
## 3 jeterde01 1997 1 NYA AL 159 654 116 190 31 7 10 70 23 12
## 4 jeterde01 1998 1 NYA AL 149 626 127 203 25 8 19 84 30 6
## 5 jeterde01 1999 1 NYA AL 158 627 134 219 37 9 24 102 19 8
## 6 jeterde01 2000 1 NYA AL 148 593 119 201 31 4 15 73 22 4
## 7 jeterde01 2001 1 NYA AL 150 614 110 191 35 3 21 74 27 3
## 8 jeterde01 2002 1 NYA AL 157 644 124 191 26 0 18 75 32 3
## 9 jeterde01 2003 1 NYA AL 119 482 87 156 25 3 10 52 11 5
## 10 jeterde01 2004 1 NYA AL 154 643 111 188 44 1 23 78 23 4
## 11 jeterde01 2005 1 NYA AL 159 654 122 202 25 5 19 70 14 5
## 12 jeterde01 2006 1 NYA AL 154 623 118 214 39 3 14 97 34 5
## 13 jeterde01 2007 1 NYA AL 156 639 102 206 39 4 12 73 15 8
## 14 jeterde01 2008 1 NYA AL 150 596 88 179 25 3 11 69 11 5
## 15 jeterde01 2009 1 NYA AL 153 634 107 212 27 1 18 66 30 5
## 16 jeterde01 2010 1 NYA AL 157 663 111 179 30 3 10 67 18 5
## 17 jeterde01 2011 1 NYA AL 131 546 84 162 24 4 6 61 16 6
## 18 jeterde01 2012 1 NYA AL 159 683 99 216 32 0 15 58 9 4
## 19 jeterde01 2013 1 NYA AL 17 63 8 12 1 0 1 7 0 0
## 20 jeterde01 2014 1 NYA AL 145 581 47 149 19 1 4 50 10 2
## BB SO IBB HBP SH SF GIDP
## 1 3 11 0 0 0 0 0
## 2 48 102 1 9 6 9 13
## 3 74 125 0 10 8 2 14
## 4 57 119 1 5 3 3 13
## 5 91 116 5 12 3 6 12
## 6 68 99 4 12 3 3 14
## 7 56 99 3 10 5 1 13
## 8 73 114 2 7 3 3 14
## 9 43 88 2 13 3 1 10
## 10 46 99 1 14 16 2 19
## 11 77 117 3 11 7 3 15
## 12 69 102 4 12 7 4 13
## 13 56 100 3 14 3 2 21
## 14 52 85 0 9 7 4 24
## 15 72 90 4 5 4 1 18
## 16 63 106 4 9 1 3 22
## 17 46 81 0 6 4 5 10
## 18 45 90 1 5 6 1 24
## 19 8 10 1 1 0 1 3
## 20 35 87 0 6 8 4 15
dbClearResult(res)
## [1] TRUE
# Or a chunk at a time
res <- dbSendQuery(con, "SELECT * FROM Batting WHERE playerID = 'jeterde01'")
while(!dbHasCompleted(res)){
chunk <- dbFetch(res, n = 5)
print(nrow(chunk))
}
## [1] 5
## [1] 5
## [1] 5
## [1] 5
## [1] 0
# Clear the result
dbClearResult(res)
## [1] TRUE
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE
?src_sqlite
res <- read.csv("data/bank/bank-full.csv", header = TRUE, sep = ";")
head(res)
## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue-collar married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## 6 35 management married tertiary no 231 yes no unknown
## day month duration campaign pdays previous poutcome y
## 1 5 may 261 1 -1 0 unknown no
## 2 5 may 151 1 -1 0 unknown no
## 3 5 may 76 1 -1 0 unknown no
## 4 5 may 92 1 -1 0 unknown no
## 5 5 may 198 1 -1 0 unknown no
## 6 5 may 139 1 -1 0 unknown no
str(res)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
head
+ tail
+ summary
readLines
+ read.table
readxl:::xlsx_col_types
gdata
XLConnect
+ xlsx
readxl
dplyr
<-> SQLXLConnect::loadWorkbook
取得 Excel 檔案的表單資訊(sheet)xlsx::read.xlsx2
,先把前幾行資料讀進來觀察,預先設定欄位的大小及資料型態 colClasses
,讓讀檔速度加快(執行時不用去猜資料的大小、格式與資料型態)xlsx::read.xlsx2
讀檔(較快),xlsx::write.xlsx2
寫檔