R Efficiency > Import and Export Your Data 124x Faster!

In this section of R Efficiency, we will go over how you can import and export your data at lightning speed with just a few lines of code.

In the experiment below, I compare two methods of importing and exporting data. The first is R's built-in read.csv() and write.csv() functions. The second is fread() and fwrite() from the data.table package. As the results show, the built-in functions are no match for data.table.

The data.table package's "fast and friendly file finagler", fread(), takes most of the thinking out of importing your data. It reads a sample of rows from the file, detects the separator, the header, and the type of each column, and then applies those types to the rest of the dataset. Getting the types right up front is part of what makes the import fast.
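To make the type detection concrete, here is a minimal sketch. The tiny CSV and its column names (id, length, species) are made up for illustration and are not part of the experiment. It shows fread() inferring column types on its own, and the colClasses argument for cases where you already know the types.

```r
library(data.table)

# Write a tiny, hypothetical CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,length,species",
  "1,5.1,setosa",
  "2,4.9,setosa",
  "3,6.3,virginica"
), tmp)

# fread() detects the separator, header, and column types without any hints.
dt <- fread(tmp)
print(sapply(dt, class))

# If you already know the types, colClasses states them explicitly.
dt_typed <- fread(
  tmp,
  colClasses = c(id = "integer", length = "numeric", species = "character")
)
print(sapply(dt_typed, class))
```

Both calls should report the same classes here. The explicit form is mainly useful when the detected type is not the one you want.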

The data.table package also has an fwrite() function, which writes data files much faster than write.csv(). The built-in function converts all of the data to strings before writing it to a file, which costs both RAM and time.
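As a rough illustration of the two export calls side by side, the sketch below writes the small built-in iris data frame with each function and reads both files back. The temporary file names are placeholders, and row.names = FALSE is my own choice so that both files hold the same columns.

```r
library(data.table)

# Placeholder output paths in the session's temporary directory.
out_fwrite <- tempfile(fileext = ".csv")
out_base <- tempfile(fileext = ".csv")

# Same data, two writers.
fwrite(iris, out_fwrite)
write.csv(iris, out_base, row.names = FALSE)

# Read both files back to confirm they hold the same table shape.
back_fwrite <- fread(out_fwrite)
back_base <- fread(out_base)
print(dim(back_fwrite))
print(dim(back_base))
```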

We test the two methods of import and export on iris, a dataset commonly used in R. I duplicated the iris dataset many times over to get a dataset with about 19 million rows and 5 columns, which is about 750 MB of data. I ran each function 10 times on the same dataset to find its mean duration.
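The post does not show how the duplication was done, so the following is only one possible way to build such a dataset, using rbindlist() from data.table. The replication factor n_copies is an assumption. Since iris has 150 rows, roughly 126,667 copies would reach about 19 million rows; a much smaller factor is used here so the sketch runs quickly.

```r
library(data.table)

# Hypothetical replication factor; raise it toward ~126667 for ~19 million rows.
n_copies <- 1000L

# Stack n_copies copies of iris on top of each other.
big_iris <- rbindlist(rep(list(iris), n_copies))

print(dim(big_iris))
```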

Statistics on data:

The dataset is about 19 million rows by 5 columns, and the total data size is about 750 MB.
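Figures like these can be checked with a few base R calls. The sketch below uses a small stand-in dataset (100 copies of iris, an arbitrary choice) and reports its dimensions, its size in memory, and its size on disk as a CSV. The post does not say which of the two sizes the 750 MB refers to, so both are shown.

```r
# Small stand-in for the large dataset: 100 stacked copies of iris.
dat <- iris[rep(seq_len(nrow(iris)), times = 100), ]

# Rows and columns.
print(dim(dat))

# Size of the object in memory.
print(format(object.size(dat), units = "MB"))

# Size of the same data as a CSV file on disk, in MB.
tmp <- tempfile(fileext = ".csv")
write.csv(dat, tmp, row.names = FALSE)
print(file.size(tmp) / 1024^2)
```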


Benchmark:

Load the data.table package (you can install it with install.packages("data.table")) as well as the microbenchmark package, which helps us capture the run time of each function we are testing.
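A benchmark along these lines could look like the sketch below. It is not the exact script behind the numbers in this post: the dataset is a small stand-in (200 copies of iris), the file paths are temporary placeholders, and row.names = FALSE is my own choice. Only times = 10 mirrors the 10 repetitions described above.

```r
library(data.table)
library(microbenchmark)

# Small stand-in dataset so the sketch finishes quickly.
dat <- rbindlist(rep(list(iris), 200L))

# Placeholder file to import from.
csv_path <- tempfile(fileext = ".csv")
fwrite(dat, csv_path)

# Import benchmark: each expression is run 10 times.
read_times <- microbenchmark(
  read.csv = read.csv(csv_path),
  fread = fread(csv_path),
  times = 10L
)

# Export benchmark: each expression is run 10 times.
out_path <- tempfile(fileext = ".csv")
write_times <- microbenchmark(
  write.csv = write.csv(dat, out_path, row.names = FALSE),
  fwrite = fwrite(dat, out_path),
  times = 10L
)

# Mean duration of each function, in seconds.
print(summary(read_times, unit = "s")[, c("expr", "mean")])
print(summary(write_times, unit = "s")[, c("expr", "mean")])
```

On a dataset this small the gap will be much narrower than on 19 million rows, so treat the output as a smoke test rather than a reproduction of the results.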


Comparing read.csv() to fread(): the mean time to import the dataset is 18.61 seconds versus 0.95 seconds. Comparing write.csv() to fwrite(): the mean time to export the dataset is 88.6 seconds versus 0.72 seconds.


Conclusion:

fread() is about 20 times faster than read.csv().

fwrite() is about 124 times faster than write.csv().
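The speedups are simply the ratios of the mean times reported above, as this short calculation shows. The variable names are illustrative. Note that the rounded times give an export ratio of roughly 123, so the 124x figure presumably comes from the unrounded timings.

```r
# Mean durations in seconds, as reported in the benchmark.
read_csv_s <- 18.61
fread_s <- 0.95
write_csv_s <- 88.6
fwrite_s <- 0.72

# Speedup = slow time / fast time.
import_speedup <- read_csv_s / fread_s
export_speedup <- write_csv_s / fwrite_s

print(round(import_speedup, 1))
print(round(export_speedup, 1))
```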

Read more about the data.table package here.