
R and Python. Biased opinion



Recently I published a post on Medium about how friendly the Read-Evaluate-Print-Loop (REPL) mode is in different IDEs for data science. Even though I like R, I tried to be as unbiased as possible while writing it (I must admit I was really surprised by how friendly the new Spyder 4.0 is for statistical analysis).
In my Medium post I mentioned that there are tons of articles about the R vs. Python fight. Some of them are really comprehensive and useful, but they rarely consider the REPL. Of course, the IDE is important for an efficient working process, but the features of the language itself are crucial as well. Today I’m going to touch on a few other things that are typically not covered in R vs. Python comparisons. It may well be that today’s post is subjective and not that unbiased. Let’s see 😊
First of all, it’s widely said that R has “weird” syntax. In contrast, Python syntax is claimed to be so clear and obvious that it makes the language much easier to learn. Well, it’s true that R has specific features which are unexpected for people used to coding in general-purpose programming languages. In R you can use a dot inside a variable name instead of using it to access a class method or field. We can also mention using $ to get an object’s fields and using ‘<-’ instead of ‘=’ for assignment (even though ‘=’ works as well). Also, indexing in R starts from 1, not from zero. Yes, at the beginning it’s a bit frustrating.
OK, let’s look at what Python offers for data science. Is it really so straightforward and more convenient? I’d say the answer is debatable. Python for general purposes is really as clear as other high-level languages. But Python for even very basic data analysis is a combination of Python and at least two packages: NumPy and Pandas. And each of these components introduces its own details. For instance, Python lists and dictionaries, NumPy arrays and Pandas data frames are slightly different things, each with its own methods. You always have to keep in mind what you’re working with.
Meanwhile, base R has all these data structures built in, which makes getting started with data analysis in R quite a lot simpler.
Another thing is accessing the elements of data frames. In Python/Pandas it’s not so clear, in my opinion. Here is an example. While the first option works as expected, the second one doesn’t (frankly speaking, I have no idea what’s wrong):
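To make the point about different containers concrete, here is a minimal sketch. The variable names are mine and purely illustrative; it only assumes NumPy and Pandas are installed.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

as_list = values                 # plain Python list
as_array = np.array(values)      # NumPy array
as_series = pd.Series(values)    # Pandas Series

# The same operator means different things depending on the container.
doubled_list = as_list * 2       # repetition: the list is concatenated with itself
doubled_array = as_array * 2     # element-wise multiplication
doubled_series = as_series * 2   # element-wise multiplication, index preserved

# Computing a mean also differs: a list has no .mean() method.
mean_list = sum(as_list) / len(as_list)
mean_array = as_array.mean()
mean_series = as_series.mean()

print(doubled_list)              # [1, 2, 3, 1, 2, 3]
print(doubled_array.tolist())    # [2, 4, 6]
print(doubled_series.tolist())   # [2, 4, 6]
```

So the same three numbers behave differently depending on which of the three containers happens to hold them.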

By the way, slicing a data frame in Pandas is slightly different from slicing in base Python (this is again about the previous point).
In R, accessing the elements of data structures is as straightforward as possible. It’s always like
df[i, j]
where for i and j you can use indexes, names and conditions, with predictable output:
df[1, 'test_col']
df[1:10, ]
df[1:10, 'test_col']
df[df$test_col > 0, 1]  etc.
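One such difference, as far as I can tell, is whether the end of a slice is included. The sketch below uses a made-up data frame with letter labels as its index; the names are illustrative only.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
df = pd.DataFrame({"test_col": [10, 20, 30, 40, 50]}, index=letters)

# Base Python: the end position is excluded.
list_slice = letters[1:3]        # ['b', 'c']

# Pandas, position-based: behaves like base Python, end excluded.
iloc_slice = df.iloc[1:3]        # rows 'b' and 'c'

# Pandas, label-based: the end label is INCLUDED.
loc_slice = df.loc["b":"d"]      # rows 'b', 'c' and 'd'

print(list_slice)
print(list(iloc_slice.index))
print(list(loc_slice.index))
```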
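For comparison, here is how I would sketch rough Pandas counterparts of these R patterns. This is only one possible way to write them; the data frame is invented for the illustration and relies on the default integer index, where labels and positions coincide.

```python
import pandas as pd

df = pd.DataFrame({
    "test_col": [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "other_col": list("abcdefghijkl"),
})

# First row, one named column (R counts rows from 1, Pandas from 0).
first_value = df.loc[0, "test_col"]

# First ten rows, all columns (positions 0..9, end excluded).
first_ten = df.iloc[0:10]

# First ten rows of one named column.
first_ten_col = df["test_col"].iloc[0:10]

# Rows that satisfy a condition on test_col, then the first column by position.
positive_first_col = df.loc[df["test_col"] > 0].iloc[:, 0]

print(first_value)
print(first_ten.shape)
print(positive_first_col.tolist())
```

Note that mixing a label-based condition with a positional column needs two steps here (.loc and then .iloc), while R keeps everything inside one pair of brackets.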
Another thing in Python which complicates researchers’ lives a bit is the combination of a functional-like style with an object-oriented-like style. This is probably also a consequence of having to use base Python together with Pandas, NumPy, etc.
Just a few examples:
  • A function for differencing a vector is a method, as in OOP: df['column'].diff()
  • At the same time, the standard function for length is in functional-like style: len()
  • But to check whether a data frame is empty you use an object property: df.empty
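The three styles from the list above, side by side in a minimal sketch (the data frame is made up for the illustration):

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 4, 9, 16]})

differences = df["column"].diff()   # method call; the first element is NaN
n_rows = len(df)                    # built-in function applied to the object
is_empty = df.empty                 # property: note there are no parentheses

print(differences.tolist())
print(n_rows)
print(is_empty)
```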

Even within one framework you can notice a confusing difference. For instance, in Pandas, getting the unique values of a Pandas Series is a method call, as in OOP:
pdseries.unique()
while a data frame has no unique() method of its own, so the same task is often written in functional style:
pd.unique(df['test_col'])
¯\_(ツ)_/¯
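A small sketch of the two spellings, using an invented column. As far as I can see, both operate on a Series (a single data frame column is a Series) and return the same values in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({"test_col": [3, 1, 3, 2, 1]})
pdseries = df["test_col"]                    # a single column is a Series

from_method = pdseries.unique()              # method on the Series
from_function = pd.unique(df["test_col"])    # top-level Pandas function

print(from_method.tolist())      # [3, 1, 2]
print(from_function.tolist())    # [3, 1, 2]
```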

In R everything is done in functional-like style:
diff()
length()
nrow(df) == 0  (base R has no is.empty(); the check is built from nrow() or length())
unique()
Finally, sometimes I really get frustrated while looking at the documentation 😊 Recently I had to switch from R to Python and back within one project using the same data: I saved it in R, loaded it in Python, and then went back the other way. I ran into an issue with quoting, and since I’m not that familiar with the small details of Pandas, I opened the documentation.
Well, first of all I was scared by how the parameters are listed 😊 There is no highlighting by color or font to tell the parameters from their values:
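For readers who hit a similar problem, here is a minimal sketch of how quoting can be controlled on the Pandas side. It assumes the data is exchanged as CSV, and it is not necessarily the exact issue I had; the data and names are invented, and an in-memory buffer stands in for a file.

```python
import csv
import io

import pandas as pd

# Text fields containing a comma and embedded double quotes.
df = pd.DataFrame({"name": ["a,b", 'say "hi"'], "value": [1.5, 2.0]})

buffer = io.StringIO()
# Quote every non-numeric field; embedded quotes are doubled by default.
df.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)

buffer.seek(0)
# Read it back, stating the quote character explicitly.
restored = pd.read_csv(buffer, quotechar='"')

print(buffer.getvalue())
print(restored["name"].tolist())
```

The quoting argument takes the constants from Python’s standard csv module, which is one more place where base Python and Pandas meet.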

Let’s compare with R:

Ok, it’s a joke of course. It’s a matter of what you are used to. I don’t want to say: “Hey, that’s why R is better” 😊
Coming back to more serious things, I believe R is a friendlier ecosystem for data science tasks such as exploratory data analysis, hypothesis testing and data visualization, even if it may look weird at first glance.
As Hadley Wickham said, “R is a weird language but it is weird for good reasons, and it's a really good fit for data science. It's not a general purpose programming language, but there are good reasons for a lot of the things it does.”



