This blog grew out of my desire to share methodologies and software for bringing R into production.
Why “R in Production”
R is one of the most widely used open-source tools in the world for Data Science, an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract value from data.
By “production systems” we mean systems that are used to produce value. They are tools that have ceased to be prototypes and are actually used to provide a service. Consequently, once they are in operation, the business relies on them. Examples include critical systems that must work 24/7, systems used by non-technical users where no expert is on hand to fix things if something goes wrong, and systems that must deliver a result quickly according to a well-tested protocol.
In general, they are systems that require high quality in terms of security, reliability and maintainability.
Security is the process of taking precautionary measures to avoid events such as malfunctions, data loss, tampering or theft. It is often neglected, yet it is of fundamental importance both as a form of practical protection and because it is a sine qua non requirement in the specifications of certain business environments.
Reliability is a fundamental requirement for a product or service to become part of a production chain. At least two essential characteristics can be highlighted: correctness, meaning it must carry out its work correctly, and robustness, meaning it must not interrupt the service. Correctness should be verified, for example with tests. A robust product always remains responsive and never ends up in a dead end that would temporarily interrupt operations. It must therefore take into account all possible user actions, tolerate variability in the input data, and detect malfunctions early enough for a maintenance technician to intervene.
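As a small illustration of the two characteristics above, here is a minimal sketch in base R. The function names `safe_mean` and `run_safely` are hypothetical and chosen only for this example; it assumes a simple numeric summary as the "service". The first function validates its input and tolerates empty or missing data, the second keeps the process alive when something unexpected happens, and the final `stopifnot()` call plays the role of a very small test suite.

```r
# Hypothetical helper: average a numeric vector, failing early on bad input.
safe_mean <- function(x) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric, got: ", class(x)[1])
  }
  # Tolerate variability in the input: empty or all-missing data yields NA.
  if (length(x) == 0 || all(is.na(x))) {
    return(NA_real_)
  }
  mean(x, na.rm = TRUE)
}

# Hypothetical wrapper: report the error and return NA instead of
# letting the error stop the calling process.
run_safely <- function(x) {
  tryCatch(
    safe_mean(x),
    error = function(e) {
      message("safe_mean failed: ", conditionMessage(e))
      NA_real_
    }
  )
}

# Correctness checks in the spirit of unit tests.
stopifnot(
  safe_mean(c(1, 2, 3)) == 2,
  safe_mean(c(1, NA, 3)) == 2,
  is.na(safe_mean(numeric(0))),
  is.na(run_safely("not a number"))
)
```

In a real project these checks would more likely live in a dedicated test suite than next to the function, but the idea is the same: correctness is asserted explicitly, and failures are caught and reported rather than left to interrupt the service.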
Maintainability is the ease with which you can work on a product that is running in production. Hardly any product remains perfectly unchanged over time. In the most stable cases, the work consists of routine maintenance, or of occasional interventions to correct a defect or repair a malfunction. In more dynamic situations it is necessary to release new functionality or, more generally, to fulfil new requirements. This becomes a real necessity in environments where evolving the product is standard practice and a team is constantly working on new features to be released.
In recent times, these characteristics have been put under stress by projects that need to reduce time to market (TTM), to iteratively develop a Minimum Viable Product (MVP), and to release it continuously at regular intervals.
Team and cooperation
So far we have generally addressed business issues, but there are equally important issues that affect the dynamics and workflows within the company providing the service.
Sharing and collaboration have solid foundations in how the project is structured, and they are among the objectives of the methodologies I would like to discuss. By “collaboration” I mean how the work of the different team members is integrated. By “sharing” I mean the idea that a team benefits when all its members know the objectives of the project and the specifics of the chosen solution, and are able to optimize their own work in light of the work of the team. I want to emphasize again that this is achieved through the correct workflow and proper software tools.
These issues are not specific to the R or Data Science world; they are software engineering issues. They have therefore already been extensively addressed, and there are widely used solutions of proven efficacy. It is a matter of knowing the practices and tools and applying them to the project.
The spectrum of topics that the blog intends to address is wider than what is described here. There are in fact many tasks that are usually entrusted to other specialists, such as systems engineers or other developers. It is good for data scientists to be aware of the existence and meaning of these tools as well, because this lets them create prototypes that are production-ready or easy to bring into production. Examples of these topics are the DevOps mindset, which includes Continuous Integration (CI) and Continuous Delivery or Continuous Deployment (CD); “programmable infrastructure”, or Infrastructure as Code (IaC); and IT security, as mentioned above.
Another topic relevant to the Data Scientist, and crucial for the use of R in some projects, is the integration between R and other systems. These can be databases, including databases for “Big Data”, other software, libraries written in other programming languages, management systems, Cloud tools and, in general, other company software platforms. Application integration is a set of techniques through which different types of software can be made to communicate. There are well-defined protocols with different pros and cons, so it is good to have an overview of them in order to choose the most appropriate solution each time.
But can it really be achieved with R?
Yes! Executing R instructions runs a process very similar to that of other languages and software, and the interaction of this process with the system follows the same protocols. So what I can do in languages like Java, Scala or Python I can do with R. Essentially there are two main limitations: the availability of libraries and tools that facilitate the implementation of the chosen solution, and the limited availability of information and guidelines on the matter. In recent years the R community and software companies have taken steps to fill the gap by producing adequate solutions, and lately the community has moved quickly in this direction. The time is ripe to use R in Production.
Is your mouth watering yet? Good! It's the same feeling I had when I started, and since then things have only become simpler. I cannot wait to share them with you!