
Wednesday, 6 September 2017

Reading files into R

Below are some of the options available for importing files into R.
 

For CSV files, use read.csv() as below.
 
  
read.csv("test.csv")
 
  V1         V2
1  F -0.5786439
2  E  0.2472908
3  U  0.2748309
4  R  1.1791559
5  K -0.1258598
6  X -0.8898289
7  L  0.4627274
8  C -0.7088007
 
To read only the first 3 rows, set nrows:

read.csv("test.csv", nrows = 3)
 
   V1         V2
1  F -0.5786439
2  E  0.2472908
3  U  0.2748309

 
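To skip the first 2 lines and read the next 3 rows, use skip together with nrows. Note that skip counts lines from the top of the file, so the header line is skipped too and the third line of the file is then used as the header, as the output below shows.

read.csv("test.csv", nrows = 3, skip = 2)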
 
  E X0.247290774914572
1 U          0.2748309
2 R          1.1791559
3 K         -0.1258598
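
If the original column names should be kept while skipping data rows, one option is to read the header line separately and pass it back in through col.names. The following is a minimal sketch, assuming a small temporary file standing in for test.csv; the file contents and the names tmp, hdr and res are illustrative only.

```r
# Build a small stand-in for test.csv (illustrative data, not the file used above)
tmp <- tempfile(fileext = ".csv")
writeLines(c('"V1","V2"', '"F",1.5', '"E",2.5', '"U",3.5', '"R",4.5', '"K",5.5'), tmp)

# Read the header plus one row, only to capture the column names
hdr <- names(read.csv(tmp, nrows = 1))

# Skip the header line and the first data row, then read the next 3 rows.
# header = FALSE stops read.csv from promoting the first remaining line to a header.
res <- read.csv(tmp, header = FALSE, skip = 2, nrows = 3, col.names = hdr)
print(res)
```

With this approach the skipped lines no longer affect the column names, at the cost of opening the file twice.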
 
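If the file is not comma-separated, use read.table() and specify the separator/delimiter with sep. Unlike read.csv(), read.table() defaults to header = FALSE, so the first line read is kept as data.

read.table("test.csv", sep = ",", nrows = 3, skip = 2)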
 
  V1        V2
1  E 0.2472908
2  U 0.2748309
3  R 1.1791559
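
Since the example above still uses a comma-separated file, here is a minimal sketch of the same call on a tab-delimited file. The temporary file and its contents are assumptions made up for illustration.

```r
# Create a small tab-delimited file (illustrative data)
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(c("V1\tV2", "F\t1.5", "E\t2.5", "U\t3.5"), tmp_tsv)

# sep = "\t" tells read.table to split on tabs; header = TRUE uses the first line as names
dat_tsv <- read.table(tmp_tsv, sep = "\t", header = TRUE)
print(dat_tsv)
```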
 
For large files, use the fread() function in the data.table package, which imports data faster.
 

data.table::fread("test.csv", sep = ",")
 
   V1         V2
1:  F -0.5786439
2:  E  0.2472908
3:  U  0.2748309
4:  R  1.1791559
5:  K -0.1258598
6:  X -0.8898289
7:  L  0.4627274
8:  C -0.7088007

 
To read the first 3 rows,

data.table::fread("test.csv", sep = ",", nrows = 3)
 
   V1         V2
1:  F -0.5786439
2:  E  0.2472908
3:  U  0.2748309
 

To skip the first 2 lines and read the next 3 rows, do as below. Note that fread() counts the header as a line when skipping.

data.table::fread("test.csv", sep = ",", nrows = 3, skip = 2)

   V1        V2
1:  E 0.2472908
2:  U 0.2748309
3:  R 1.1791559
 
readLines() is useful for checking the contents and delimiter of a file before importing it, as it returns each line as a plain character string without splitting it into columns.
 
readLines("test.csv")
 
[1] "\"V1\",\"V2\""            "\"F\",-0.578643919152124"
[3] "\"E\",0.247290774914572"  "\"U\",0.274830888945797"
[5] "\"R\",1.179155856395"     "\"K\",-0.125859842900427"
[7] "\"X\",-0.889828858494609" "\"L\",0.462727351834403"
[9] "\"C\",-0.708800746374982"
 
To read the first 4 lines,
 
readLines("test.csv", n = 4)
 
[1] "\"V1\",\"V2\""            "\"F\",-0.578643919152124"
[3] "\"E\",0.247290774914572"  "\"U\",0.274830888945797" 
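
One way to act on such an inspection is to count candidate delimiters in the first few lines. The sketch below is only a rough heuristic under assumed inputs: the temporary file, the list of candidates and the variable names are all illustrative, and quoted fields that contain a delimiter would mislead it.

```r
# Create a small semicolon-delimited file (illustrative data)
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("V1;V2;V3", "a;1;2.5", "b;2;3.5"), tmp_txt)

first_lines <- readLines(tmp_txt, n = 3)

# Count how often each candidate delimiter occurs on every line
candidates <- c(comma = ",", semicolon = ";", tab = "\t", pipe = "|")
counts <- sapply(candidates, function(d) {
  lengths(regmatches(first_lines, gregexpr(d, first_lines, fixed = TRUE)))
})
print(counts)

# A plausible delimiter occurs at least once and equally often on every line
consistent <- apply(counts, 2, function(x) x[1] > 0 && all(x == x[1]))
delim <- candidates[consistent]

# Use the detected delimiter for the actual import
dat_txt <- read.table(tmp_txt, sep = delim[[1]], header = TRUE)
```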
 

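scan() is similar to readLines() but treats each field as a separate item, so the output is not grouped by row. Because sep is not set in the calls below, scan() does not split on commas, which is why they remain attached to some items.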
 

scan("test.csv", what = "list", nlines = 4)

Read 8 items
[1] "V1"                  ",\"V2\""             "F"                
[4] ",-0.578643919152124" "E"                   ",0.247290774914572"
[7] "U"                   ",0.274830888945797" 
 
To skip the first 2 lines and read the next 4 lines (note that the header is treated as line 1 when skipping),
 
scan("test.csv", what = "list", nlines = 4, skip = 2)
 
Read 8 items
[1] "E"                   ",0.247290774914572"  "U"                
[4] ",0.274830888945797"  "R"                   ",1.179155856395"  
[7] "K"                   ",-0.125859842900427"
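
For comparison, setting sep = "," makes scan() split on commas, and passing a list to what returns one vector per column. This is a minimal sketch on an assumed temporary file; the data and variable names are illustrative.

```r
# Create a small comma-separated file (illustrative data)
tmp_scan <- tempfile(fileext = ".csv")
writeLines(c('"V1","V2"', '"F",-0.5', '"E",0.25'), tmp_scan)

# Split on commas; every field comes back as a character item
items <- scan(tmp_scan, what = "character", sep = ",", nlines = 2, quiet = TRUE)
print(items)

# A list in 'what' gives one typed vector per column; skip = 1 drops the header line
cols <- scan(tmp_scan, what = list(V1 = "", V2 = 0), sep = ",", skip = 1, quiet = TRUE)
print(cols)
```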

 
If the file is compressed, e.g. with gzip, use gzfile().

To read the file in,
 
read.csv(gzfile("test.csv.gz", "r"))

 
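To write into the gz file, open it with "w" (which overwrites any existing file) and close the connection afterwards so that the contents are flushed to disk,

a <- gzfile("test.csv.gz", "w")

cat("New1, 1111 \n New2, 22222\n", file = a)

close(a)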
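
A quick round trip can confirm that what was written is readable again. The sketch below writes to a temporary file rather than test.csv.gz; the file name and contents are illustrative assumptions.

```r
tmp_gz <- tempfile(fileext = ".csv.gz")

# Open the compressed file for writing, write two lines, then close the connection
con <- gzfile(tmp_gz, "w")
cat("New1,1111\nNew2,22222\n", file = con)
close(con)

# Read it back; header = FALSE because the lines written above have no header row
back <- read.csv(gzfile(tmp_gz), header = FALSE)
print(back)
```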








Tuesday, 15 August 2017

Creating combinations of elements using expand.grid

To generate all combinations of elements from two or more vectors, use expand.grid().
 
 
expand.grid(c(1:3), LETTERS[1:3]) 
 

  Var1 Var2
1    1    A
2    2    A
3    3    A
4    1    B
5    2    B
6    3    B
7    1    C
8    2    C
9    3    C
 
   
 
 
expand.grid(c(1:3),LETTERS[1:3],letters[1:2])
 
 
   Var1 Var2 Var3
1     1    A    a
2     2    A    a
3     3    A    a
4     1    B    a
5     2    B    a
6     3    B    a
7     1    C    a
8     2    C    a
9     3    C    a
10    1    A    b
11    2    A    b
12    3    A    b
13    1    B    b
14    2    B    b
15    3    B    b
16    1    C    b
17    2    C    b
18    3    C    b
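
The arguments can also be named, which replaces the default Var1, Var2, ... column names. A minimal sketch follows; the column names num, upper and lower are chosen here purely for illustration.

```r
# Named arguments become column names; stringsAsFactors = FALSE keeps letters as character
grid <- expand.grid(num = 1:3, upper = LETTERS[1:3], lower = letters[1:2],
                    stringsAsFactors = FALSE)

# The number of rows equals the product of the input lengths: 3 * 3 * 2
print(nrow(grid))

# The first argument varies fastest, as in the outputs above
print(head(grid, 4))
```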

 
 


Tuesday, 29 November 2016

data.table

The data.table package allows R to handle very large data sets, typically tens or hundreds of millions of rows, efficiently. This includes both loading/importing and aggregating the data.

To import a flat file with a very large number of rows, data.table provides the fread() function.

library(data.table)
Data <- fread("data.csv", sep = ",", header = TRUE)

  
To aggregate the data set: 
  
Agg <- as.data.table(iris)[, list(Avg_Sepal_Length = mean(Sepal.Length)), by = "Species"]
 
When aggregating multiple columns at the same time:
 
AggMC <- as.data.table(iris)[, list(Avg_Sepal_Length = mean(Sepal.Length), Avg_Petal_Length = mean(Petal.Length)), by = "Species"]
 
When aggregating all columns other than the grouping column:
 
AggAC <- as.data.table(iris)[, lapply(.SD, mean), by = "Species"]
 
   
When aggregating by multiple grouping columns:

AggMCMG <- as.data.table(CO2)[, list(Avg_Conc = mean(conc), Total_Uptake = sum(uptake)), by = c("Plant", "Type")]





Tuesday, 25 October 2016

Passing parameters to R script from command line


To pass parameters to an R script that is run from the command line, use commandArgs().

Example: 

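Save the script below in a file called 'DateRange.r':

# With Rscript, the user-supplied arguments typically start at position 6
Para <- commandArgs()
DATE <- as.Date(as.character(Para[6]), format = "%Y%m%d")
N <- as.numeric(Para[7])
DateRge <- data.frame(Date = seq(from = DATE, length.out = N, by = 1), Value = rnorm(N))
# Print the result so that it appears in the console
print(DateRge, row.names = FALSE)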

Then run the command below, with the parameters appended at the end.


For Windows

If the folder containing Rscript.exe is included in your PATH environment variable, you can simply use 'Rscript' without writing out the full path.

"C:\Program Files\R\R-3.2.3\bin\Rscript.exe" DateRange.r [date in yyyymmdd format (DATE)] [length of sequence (N)] 

"C:\Program Files\R\R-3.2.3\bin\Rscript.exe" DateRange.r 20161005 5
will return:
       Date       Value
 2016-10-05  1.61637011
 2016-10-06 -0.08534756
 2016-10-07 -2.24108808
 2016-10-08  0.05773242
 2016-10-09  0.73725642 


For Linux
 
As on Windows, you can use the Rscript command:

Rscript DateRange.r yyyymmdd N 

Rscript DateRange.r 20161005 5 
        Date      Value
 2016-10-05 -0.7931385
 2016-10-06 -0.4229764
 2016-10-07 -0.3338677
 2016-10-08 -1.0844999
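
The script above relies on the parameters sitting at fixed positions 6 and 7 of commandArgs(). A variant that avoids hard-coded positions is to call commandArgs(trailingOnly = TRUE), which returns only the arguments supplied after the script name. The following is a minimal sketch; parse_date_args is a hypothetical helper introduced here for illustration, not part of the original script.

```r
# Hypothetical helper: turns the trailing arguments into a start date and a count
parse_date_args <- function(args) {
  if (length(args) < 2) {
    stop("Usage: Rscript DateRange.r yyyymmdd N")
  }
  list(
    date = as.Date(args[1], format = "%Y%m%d"),
    n    = as.integer(args[2])
  )
}

# trailingOnly = TRUE drops the R executable and its own options
args <- commandArgs(trailingOnly = TRUE)

# Only build and print the sequence when both parameters were supplied
if (length(args) >= 2) {
  p <- parse_date_args(args)
  print(data.frame(Date = seq(from = p$date, length.out = p$n, by = 1),
                   Value = rnorm(p$n)), row.names = FALSE)
}
```

It is invoked in the same way as before, e.g. Rscript DateRange.r 20161005 5.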








Friday, 20 May 2016

Send emails from R through Outlook


This assumes that Outlook is installed and that your account is already set up.

Also, if you get an error like 'Error: Exception occurred.', you may need to restart Outlook after installing the RDCOMClient package in R.


library(RDCOMClient)

OutApp <- COMCreate("Outlook.Application")  
outMail = OutApp$CreateItem(0) 

outMail[["To"]] = "recipient's email address" 
outMail[["subject"]] = "subject" 
outMail[["body"]] = "body text" 

outMail$Send()



To send emails to multiple recipients, use semicolon (;) to separate email addresses:

OutApp <- COMCreate("Outlook.Application") 
outMail = OutApp$CreateItem(0)

outMail[["To"]] = "recipient's email address 1; recipient's email address 2"
outMail[["subject"]] = "subject" 
outMail[["body"]] = "body text" 

outMail$Send()
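
If the addresses are held in a character vector, the semicolon-separated string can be built with paste(). This is a minimal sketch; the addresses and the names recipients and to_field are placeholders for illustration.

```r
# Placeholder addresses used purely for illustration
recipients <- c("first.person@example.com", "second.person@example.com")

# Collapse the vector into the single semicolon-separated string used for the "To" field
to_field <- paste(recipients, collapse = "; ")
print(to_field)
```

The result can then be assigned in the same way as above, i.e. outMail[["To"]] = to_field.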



To send emails with attachment(s):

OutApp <- COMCreate("Outlook.Application") 
outMail = OutApp$CreateItem(0)

outMail[["To"]] = "recipient's email address"
outMail[["subject"]] = "subject" 
outMail[["body"]] = "body text" 

outMail[["Attachments"]]$Add("full path to file")     
# e.g. "C:/Users/Documents/someFile.txt"
# Note: use forward slashes rather than backslashes in the attachment path, as you normally would in R

outMail$Send()



To embed a table within the body of the email:
library(pander) 

panderOptions('table.split.table', Inf)

OutApp <- COMCreate("Outlook.Application") 
outMail = OutApp$CreateItem(0)

outMail[["To"]] = "recipient's email address"
outMail[["subject"]] = "subject" 
outMail[["body"]] = paste("Hello!", "", "The below summarises xxx:", pandoc.table.return(data.frame(V1 = 1:5, V2 = LETTERS[1:5])), sep = "\n")

outMail$Send()