I recently published a post on Medium about how friendly the Read-Evaluate-Print-Loop (REPL) mode is in different IDEs for data science. Even though I like R, I tried to be as unbiased as possible while writing it (I must admit I was really surprised by how friendly the new Spyder 4.0 is for statistical analysis).
In my Medium post I mentioned that there are tons of articles about the R vs. Python fight. Some of them are really comprehensive and useful, but they rarely consider the REPL. Of course, an IDE is important for an efficient workflow, but the features of the language itself are crucial as well. Today I’m going to touch on a few other things that are typically not covered in R vs. Python comparisons. It may turn out that today’s post is subjective and not that unbiased. Let’s see 😊
First of all, it’s widely said that R has “weird” syntax, while Python syntax is claimed to be so clear and obvious that the language is much easier to learn. Well, it’s true that R has specific features that are unexpected for people who are used to coding in general-purpose programming languages. In R a dot can be part of a variable name instead of being the operator that accesses a class method or field. We can also mention using ‘$’ to get an object’s fields and using ‘<-’ instead of ‘=’ for assignment (although ‘=’ works as well). Also, indexing in R starts from 1, not from zero. Yes, at the beginning it’s a bit frustrating.
OK, let’s look at what Python offers for data science. Is it really so straightforward and more convenient? I’d say the answer is debatable. Python for general purposes really is as clear as other high-level languages. But Python for even very basic data analysis is a combination of Python and at least two packages: NumPy and Pandas. And each of the components introduces its own details. For instance, Python lists and dictionaries, NumPy arrays and Pandas data frames are slightly different things, each with its own methods. You always have to keep in mind what you’re working with.
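To make this concrete, here is a minimal sketch of the same three numbers stored in a list, a NumPy array and a Pandas Series. The variable names are made up for illustration; the point is only that the same operator or task behaves differently depending on the container.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

as_list = values
as_array = np.array(values)
as_series = pd.Series(values)

# The same operator means different things depending on the container.
list_doubled = as_list * 2       # repetition: the list is concatenated with itself
array_doubled = as_array * 2     # element-wise multiplication
series_doubled = as_series * 2   # element-wise multiplication, index is kept

# Computing a mean also depends on the container.
list_mean = sum(as_list) / len(as_list)  # plain lists have no .mean() method
array_mean = as_array.mean()
series_mean = as_series.mean()

print(list_doubled)
print(array_doubled.tolist())
print(series_doubled.tolist())
```

So `* 2` repeats a list but doubles every element of an array or a Series, which is exactly the kind of detail you have to keep in mind.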
Meanwhile, base R has all these data structures built in, which makes it much simpler to get started with data analysis in R.
Another thing is accessing the elements of data frames. In Python/Pandas it’s not so clear, in my opinion. Below is an example. While the first option works as expected, the second one doesn’t (frankly speaking, I have no idea what’s wrong):
By the way, slicing a data frame in Pandas is slightly different from slicing in base Python (this is again about the previous point).
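One difference of this kind, as a minimal sketch: label-based slicing with `.loc` includes the end of the range, while list slicing and position-based `.iloc` slicing exclude it. The `items` list and the `test_col` column are made up for illustration.

```python
import pandas as pd

items = [10, 20, 30, 40]
df = pd.DataFrame({"test_col": items})  # default integer index 0..3

list_slice = items[0:2]    # base Python: end position is excluded
iloc_slice = df.iloc[0:2]  # position-based: end position is excluded
loc_slice = df.loc[0:2]    # label-based: end label is included

print(len(list_slice), len(iloc_slice), len(loc_slice))
```

With the default index the labels happen to look like positions, which makes the extra row returned by `.loc` easy to miss.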
In R, accessing the elements of data structures is as straightforward as it gets. It’s always like
df[i,j]
where i and j can be indexes, names or conditions, with predictable output:
df[1, 'test_col']
df[1:10, ]
df[1:10, 'test_col']
df[df$test_col > 0, 1] etc.
Another thing in Python that complicates researchers’ lives a bit is the mix of a functional-like style with an object-oriented-like style. This is probably also a consequence of having to use base Python together with Pandas, NumPy, etc.
Just a few examples:
- Differencing a vector is done with a method, as in OOP: df['column'].diff()
- At the same time, the standard way to get a length is a function: len()
- But to check whether a data frame is empty you use an object property: df.empty
Even within one framework you can notice a confusing difference. For instance, in Pandas getting the unique values of a Series looks like a method call in OOP:
pdseries.unique()
while for a data frame column you will often see the functional style (a DataFrame itself has no unique() method):
pd.unique(df['test_col'])
¯\_(ツ)_/¯
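Put side by side, the mix of styles looks like this. It is a small sketch on a made-up data frame with a single `test_col` column, so the values themselves mean nothing.

```python
import pandas as pd

df = pd.DataFrame({"test_col": [1, 3, 3, 6]})

differences = df["test_col"].diff()  # method on the Series
n_rows = len(df)                     # built-in function
is_empty = df.empty                  # property, no parentheses

unique_method = df["test_col"].unique()      # method on the Series
unique_function = pd.unique(df["test_col"])  # module-level function

print(differences.tolist())
print(n_rows, is_empty)
print(list(unique_method), list(unique_function))
```

Both ways of getting unique values return the same result here; the difference is only in how you have to call them.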
In R everything is done in functional-like style:
diff()
length()
nrow(df) == 0
unique()
Finally, sometimes I really get frustrated while looking at documentation 😊 Recently I had to switch from R to Python and back in one project using the same data: I saved it in R, loaded it in Python, and then went the other way. I ran into an issue with quoting, and since I’m not that familiar with the small details of Pandas, I opened the documentation.
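For context, quoting in Pandas is controlled by the `quoting` and `quotechar` parameters of `to_csv` and `read_csv`, which take constants from the standard `csv` module. The sketch below is not the exact case from my project; the data frame and its columns are made up, and it only shows a round trip where every non-numeric field is quoted.

```python
import csv
import io

import pandas as pd

# Hypothetical data: one text column containing a comma, one numeric column.
df = pd.DataFrame({"name": ["a, b", "c"], "value": [1.5, 2.0]})

# Write to an in-memory buffer, quoting every non-numeric field.
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)
csv_text = buffer.getvalue()

# Read it back, stating the quote character explicitly.
restored = pd.read_csv(io.StringIO(csv_text), quotechar='"')

print(csv_text)
print(restored)
```

Being explicit about quoting on the Python side makes it easier to line things up with whatever the R side writes or expects.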
Well, first of all I was scared by how the parameters are listed 😊 There is no highlighting by color or font to separate the parameters from their values:
Let’s compare with R:
OK, it’s a joke of course. It’s a matter of what you’re used to. I don’t want to say: “Hey, that’s why R is better” 😊
Coming back to more serious things, I believe R is a friendlier ecosystem for data science tasks such as exploratory data analysis, hypothesis testing and data visualization, even if it may look weird at first glance.
As Hadley Wickham said, “R is a weird language but it is weird for good reasons, and it's a really good fit for data science. It's not a general purpose programming language, but there are good reasons for a lot of the things it does.”