Our Data team is responsible for crunching, reporting, and serving data. The team also does data integrations with other systems and creates machine and deep learning models.
With this post, we want to share our favorite tools, which have proven themselves running against billions of records. Scaling processes in real-world scenarios is a hot topic among newcomers to data science.
R or Python?
Well… both!
RStudio is an open-source IDE that lets you browse the data and objects created during a session, create plots, and debug code, among many other features. It also comes in an enterprise-ready edition.
Jupyter is also an open-source IDE, aimed at interfacing with Julia, Python, and R. Today, it is widely used by data scientists to share their analyses. Recently, Google created "Colab," a Jupyter Notebook environment that runs in Google Drive.
Is R Capable of Running In Production?
Yes. We run several heavy data preparations and predictive models every day, every hour, and every few minutes with R.
These jobs are launched from the command line with:

Rscript my_awesome_script.R

Airflow is a Python-based task scheduler that allows us to run chained processes with many complex dependencies, monitoring the current state of each one and firing alerts to Slack if anything goes wrong. This is ideal for running the import jobs that populate a data warehouse with fresh data every day.
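Such a chain of R jobs can be expressed as an Airflow DAG. Below is a minimal sketch assuming Airflow 2.x; the DAG name, script paths, and schedule are illustrative, not our actual setup:

```python
# Hypothetical Airflow DAG: run two chained R scripts once a day.
# Script names and dag_id are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_data_import",      # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # fresh data every day
    catchup=False,
) as dag:
    prepare = BashOperator(
        task_id="prepare_data",
        bash_command="Rscript prepare_data.R",      # any R job runs via Rscript
    )
    load = BashOperator(
        task_id="load_to_warehouse",
        bash_command="Rscript load_to_warehouse.R",
    )

    # Chained dependency: the load step runs only after preparation succeeds.
    prepare >> load
```

Slack alerting is typically wired in through a failure callback on the DAG or task, so a broken step pings the team instead of failing silently.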
Do We Have a Data Warehouse?
Yes, and it’s huge! It’s mounted on Amazon Redshift, a suitable option when scaling is a priority. Visit their website to learn more about it.
To upload data from R to Redshift, we use the redshiftTools package. This data can be either plain files or data frames created during the R session.

Data Preparation Using R
We have two scenarios: data preparation for data engineering, and data preparation for machine learning/AI.
The Tidyverse, especially the dplyr package, contains a set of functions that make exploratory data analysis and data preparation quite comfortable.
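A minimal sketch of the kind of dplyr pipeline this enables (the data here is made up for illustration, not one of our production tables):

```r
# Illustrative dplyr pipeline: filter, group, and aggregate a small table.
library(dplyr)

sales <- tibble(
  country = c("AR", "AR", "BR", "BR"),
  amount  = c(10, 20, 30, 40)
)

sales %>%
  filter(amount > 10) %>%           # keep only the relevant rows
  group_by(country) %>%             # one group per country
  summarise(total = sum(amount))    # aggregate within each group
```

The same verb-by-verb style carries over to almost any preparation task, which is what makes it comfortable for exploration.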
For certain data-crunching, preparation, and visualization tasks, we use the funModeling package. It was the seed for an open-source book I published some time ago: the Data Science Live Book.
It covers some good practices we follow related to deploying models in production, dealing with missing data, handling outliers, and more.
Does R Scale?
One of the key advantages of dplyr is that it can be run on databases, thanks to another package with a pretty similar name: dbplyr.
We write standard dplyr syntax, and it is “automagically” converted to SQL that then runs in production. There are some cases in which this conversion from R to SQL cannot be made automatically; for those, we can still mix raw SQL into the R code. This way, dbplyr makes it transparent to the R user whether they are working with objects in RAM or in a remote database.
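To make the translation concrete, here is a sketch using an in-memory SQLite database so the snippet is self-contained; in production, the connection would point at Redshift instead, and the table is illustrative:

```r
# Sketch: the same dplyr verbs, run against a database via dbplyr.
library(dplyr)
library(dbplyr)
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "sales", data.frame(country = c("AR", "BR"),
                                           amount  = c(10, 30)))

q <- tbl(con, "sales") %>%
  group_by(country) %>%
  summarise(total = sum(amount, na.rm = TRUE))

show_query(q)   # prints the SQL that dbplyr generated from the verbs above
collect(q)      # executes it in the database and returns a local data frame
```

Nothing is computed until `collect()` (or a print) forces it, so the heavy lifting stays inside the database engine.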
Not many people know this, but many key pieces of the R ecosystem are written in C++, thanks in large part to the Rcpp package.
How Do We Share the Results?
Mostly in Tableau. We have some integrations with Salesforce. In addition, we have some reports deployed in Shiny, especially those that require complex user interaction.
For ad hoc (HTML) reports, we use R Markdown, which shares some functionality with Jupyter Notebooks. It allows a single script containing an analysis to end up as a dashboard, a PDF report, a web-based report, or even a book!
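Rendering is a one-liner; the file name below is illustrative, and the same source file can target a different output just by changing the format argument (or the `output` field in the document's YAML header):

```r
# Render an R Markdown source file to HTML.
# Swap the output_format for "pdf_document" or similar to retarget the report.
rmarkdown::render("monthly_report.Rmd", output_format = "html_document")
```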
Machine Learning/AI
We use both R and Python. For machine learning projects, we mainly use the caret package in R. It provides a high-level interface to many machine learning algorithms, as well as to common tasks in data preparation, model evaluation, and hyperparameter tuning.
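As a sketch of that high-level interface, here is a caret workflow on a built-in dataset (not one of our production models; the random-forest method also requires the randomForest package to be installed):

```r
# Illustrative caret workflow: cross-validated model training in a few lines.
library(caret)

set.seed(42)
fit <- train(
  Species ~ .,                    # predict species from all other columns
  data      = iris,
  method    = "rf",               # random forest, one of many supported models
  trControl = trainControl(method = "cv", number = 5)  # 5-fold cross-validation
)

fit$results   # accuracy for each hyperparameter value caret tried
```

Swapping `method` for another algorithm keeps the rest of the code unchanged, which is what makes caret convenient for comparing models.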
Keras is an API for building complex neural networks. These can easily scale by training them in the cloud, on services like AWS.
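A minimal sketch using the R interface to Keras (layer sizes, input shape, and task are illustrative; the keras package needs a working TensorFlow backend):

```r
# Illustrative Keras model definition from R: a tiny binary classifier.
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")   # single binary output

model %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)
```

The same model definition trains locally or on a cloud GPU instance without changes, which is where the easy scaling comes from.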
Summing up!
Open-source languages are leading the way in data. R and Python both have strong communities, and there are free, top-notch resources for learning each of them.