DATA SCIENCE

Python vs. R — What to Choose for Data Science?

Detailed comparison of Python and R languages for use in Data Science

Artwork by the author. Photo: Pexels

🚀Python 🚀

Python’s thirtieth anniversary has crept up almost unnoticed. Over its long history, Python has been reinvented several times, breaking backward compatibility along the way (most notably in the move from Python 2 to Python 3), yet it has always remained popular both among developers in general and among data scientists in particular. There are several reasons for this.

Benefits of Python in Data Science

  • Simple yet expressive syntax. If you can read English at a first-grade level, you are already most of the way to mastering Python's basics, and it does not get much harder from there. If you are already familiar with Java, for example, you will be pleasantly surprised at how easy it is to say hello to the world.
  • A rich selection of libraries. And it’s not only about machine learning libraries: cloud storage, streaming services, and even games are developed in Python (although these are sometimes slow, and the slowness has to be passed off as a feature, not a bug).
  • High documentation culture. Python itself is well documented, and its libraries usually continue this tradition. For instance, Python supports optional type annotations, which make code easier to read and functions easier to debug.
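To make the points about syntax and type annotations concrete, here is a minimal sketch. The `summarize` function and the sample data are hypothetical, invented only for this illustration, and the `list[float]` style of annotation assumes Python 3.9 or newer.

```python
from statistics import mean


def summarize(values: list[float]) -> dict[str, float]:
    """Return the minimum, maximum, and mean of a non-empty list of numbers."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


# Saying hello to the world takes a single line.
print("Hello, world!")

# Hypothetical sample data, made up for this illustration.
temperatures = [21.5, 19.0, 23.5]
summary = summarize(temperatures)
print(summary)
```

The annotations are not enforced at run time. They document what the function expects and returns, and external tools such as type checkers and editors can use them to catch mistakes early.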

Python tools for the data scientist

As mentioned earlier, Python is notable for its extensive set of libraries and tools. When talking about data science, the following should be mentioned first:

  • Scikit-learn is a large library of machine learning and data processing algorithms. A considerable share of Kaggle competitions has been won using nothing more than scikit-learn in tandem with Pandas.
  • Keras and PyTorch are libraries for training deep neural networks. They are well suited to tasks involving images, audio, and video.
  • Jupyter Notebook (formerly IPython Notebook) deserves a mention in any discussion of Python. A standard development environment is not well suited to exploratory data work. A data scientist needs a format that allows, for example, running a costly algorithm once and then playing with the results: exploring them and building graphs. This is where the notebook format comes in. It is a graphical interface that opens in a regular browser and is essentially a sequence of cells in which you write and execute code, with all cells sharing the same in-memory state.

🚀R Language 🚀

In 2020, R remains one of the most popular languages for data science and statistics, and its share of views in the relevant sections of Stack Overflow keeps growing. Questions of an academic nature dominate by a wide margin: R is first and foremost a language with a rich set of libraries for machine learning and statistics, which matters most for research purposes.

Benefits of R in Data Science

  • A rich ML ecosystem with a huge number of libraries of statistical methods. As noted earlier, R is especially popular in academia, so new methods are often implemented in it first.
  • A fairly convenient dedicated development environment. RStudio will be easy to pick up, especially if you have prior experience with MATLAB, and VS Code can also be used.
  • Unusual syntax tailored to the needs of statistics. An experienced programmer coming from another language may struggle to adjust, but users with a mathematical background will find the logic of the language natural.
  • Native support for vector computing. As a bonus, you can write reasonably fast implementations of mathematical methods in R using vector and matrix operations.

R tools for the data scientist

Let’s look at the wealth of R libraries mentioned above. Here are some basic but powerful libraries that can equip you for serious research or a good placing on Kaggle:

  • ggplot2 and esquisse are powerful plotting tools; esquisse provides an interactive interface on top of ggplot2.
  • Shiny is the most useful library for creating web applications that present research through interactive visualizations.
  • caret, randomForest, mlr, and others: dozens of libraries of machine learning methods. One of them will certainly fit your task.

▶ Python vs. R in Data Science: Which is Better? ◀

Both languages have their own advantages and disadvantages. Either can be suitable; it all depends on your tasks. Here are some questions that can help you choose:

  1. Are you planning to work in a scientific field, or do you lean toward practice? Python is closer to production and is more often used in commercial projects, while R is more popular in academic circles.
  2. Do you want to broaden your horizons in machine learning methods? Or will it be enough to get familiar with a few of the most popular methods and devote more time to, for example, big data processing algorithms? In the first case, you definitely need R; in the second, Python has more to offer.
  3. Do you want to put your own work into production and program things other than predictors? If so, Python is the better choice, but you will most likely need something else as well (like Java, Scala, or C++).


Written by

Bioinformatician at Oncobox Inc. (@oncobox). Research Associate at Moscow Institute of Physics and Technology (@mipt_eng).
