notes-computer-programming-programmingLanguageDesign-prosAndCons-r

haberman 2 days ago

> Just curious -- what patterns bothered you the most with R btw?

It's mostly around the multitude of subtly different types and the ways you convert between them. I think I also remember strange things like lists having named attributes in addition to list members that just seemed totally wrong and confusing to me.

I wish I could give you better specifics but it's been several years since I've done anything with R.
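For what it's worth, a minimal sketch of the kind of name/attribute split being described (my illustration, not from the comment): an R list can carry named elements and, separately, arbitrary named attributes, and the two are looked up through different mechanisms.

    # Named elements vs. named attributes on the same list
    x <- list(a = 1, b = 2)      # "a" and "b" are element names
    attr(x, "a") <- "shadow"     # an attribute that happens to be called "a" too
    x$a                          # 1        -- element lookup
    attr(x, "a")                 # "shadow" -- attribute lookup
    str(attributes(x))           # $names and $a: two unrelated kinds of "a"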

http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

---

http://r.cs.purdue.edu/pub/ecoop12.pdf

" One of the reasons for the success of R is that it caters to the needs of the first group, end users. Many of its features are geared towards speeding up interactive data analysis. The syntax is intended to be concise. Default arguments and partial keyword matches reduce coding effort. The lack of typing lowers the barrier to entry, as users can start working without understanding any of the rules of the language. The calling convention reduces the number of side effects and gives R a functional flavor. But, it is also clear that these very features hamper the development of larger code bases. For robust code, one would like to have less ambiguity and would probably be willing to pay for that by more verbose specifications, perhaps going as far as full-fledged type declarations. So, R is not the ideal language for developing robust packages. Improving R will require increasing encapsulation, providing more static guarantees, while decreasing the number and reach of reflective features. Furthermore, the language specification must be divorced from its implementation and implementation-specific features must be deprecated. The balance between imperative and functional features is fascinating. We agree with the designers of R that a purely functional language whose main job is to manipulate massive numeric arrays is unlikely to be a success. It is simply too useful to be able to perform updates and have a guarantee that they are done in place rather than hope that a smart compiler will be able to optimize them. The current design is a compromise between the functional and the imperative; it allows local side effects, but enforces purity across function boundaries. It is unfortunate that this simple semantics is obscured by exceptions such as the super-assignment operator ( <<- ) which is used as a sneaky way to implement non-local side effects. One of the most glaring shortcomings of R is its lack of concurrency support. Instead, there are only native libraries that provide behind-the-scenes parallel execution. Concurrency is not exposed to R programmers and always requires switching to native code. Adding concurrency would be best done after removing non-local side effects, and requires inventing a suitable concurrent programming model. One intriguing idea would be to push on lazy evaluation, which, as it stands, is too weak to be of much use outside of the base libraries, but could be strengthened to support parallel execution. The object-oriented side of the language feels like an afterthought. The combination of mutable objects without references or cyclic structures is odd and cumbersome. The simplest object system provided by R is mostly used to provide printing methods for different data types. The more powerful object system is struggling to gain acceptance. The current implementation of R is massively inefficient. We believe that this can, in part, be ascribed to the combination of dynamism, lazy evaluation, and copy semantics, 26 Morandat et al. but it also points to major deficiencies in the implementation. Many features come at a cost even if unused. That is the case with promises and most of reflection. Promises could be replaced with special parameter declarations, making lazy evaluation the exception rather than the rule. Reflective features could be restricted to passive introspection which would allow for the dynamism needed for most uses. For the object system, it should be built-in rather than synthesized out of reflective calls. 
Copy semantics can be really costly and force users to use tricks to get around the copies. A limited form of references would be more efficient and lead to better code. This would allow structures like hash maps or trees to be implemented. Finally, since lazy evaluation is only used for language extensions, macro functions ` a la Lisp, which do not create a context and expand inline, would allow the removal of promises "
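The super-assignment operator criticized above is easy to demonstrate. A minimal sketch (the standard closure-counter idiom, not an example from the paper): <<- walks up the enclosing environments and assigns to the first one that defines the variable, so a function can mutate state outside its own frame.

    counter <- function() {
      n <- 0
      function() {
        n <<- n + 1   # non-local side effect: mutates n in the enclosing frame
        n
      }
    }
    tick <- counter()
    tick()   # 1
    tick()   # 2 -- state persists between calls, bypassing by-value semantics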

" Environments. Explicit environment manipulation hinders compiler optimizations. In our benchmarks these functions are called often. But it turns out that they are most often used to short-circuit the by-value semantics of R. We discovered that 87% of the calls to remove , which deletes a local variable from the current frame, are used as part of an implementation of a hash map. R also allows programs to change the nesting of an environment with parent.env . But 69% of these changes are used by the R VM’s lazy load mechanism, and 30% by the proto library which implements prototypes [ 20 ] and uses environments to avoid copies. "

" Laziness. Lazy evaluation is a distinctive feature of R that has the potential for reducing unnecessary work performed by a computation. Our corpus, however, does not bear this out. Fig. 14(a) shows the rate of promise evaluation across all of our data sets. The average rate is 90%. Fig. 14(b) shows that on average 80% of promises are evaluated in the first function they are passed into. In computationally intensive benchmarks the rate of promise evaluation easily reaches 99%. In our own coding, whenever we encountered higher rates of unevaluated promises, finding where this occurred and refactoring the code to avoid those promises led to performance improvements. Promises have a cost even when not evaluated. Their cost in in memory is the same as a pairlist cell, i.e., 56 bytes on a 64-bit architecture. On average, a program allocates 18GB for them, thus increasing pressure on the garbage collector. The time cost of promises is roughly one allocation, a handful of writes to memory. Moreover, it is a data type which has to be dispatched and tested to know if the content was already evaluated Finally, this extra indirection increases the chance of cache misses. An example of how unevaluated promises arise in R code is the assign function, which is heavily used in Bioconductor with 23 million calls and 46 million unevaluated promises. function (x,val,pos=-1,env=as.environment(pos),immediate=TRUE) .Internal(assign(x,val,env)) This function definition is interesting because of its use of dependent parameters. The body of the function never refers to pos , it is only used to modify the default value of env if that parameter is not specified in the call. Less than 0.2% of calls to assign evaluate the promise associated with pos . It is reasonable to ask if it would be valid to simply evaluate all promises eagerly. The answer is unfortunately no. Promises that are passed into a function which provides a language extension may need special processing. For instance in a loop, promises are intended to be evaluated multiple times in an environment set up with the right variables. Evaluating those eagerly will result in a wrong behavior. However, we have not seen any evidence of promises being used to extend the language outside of the base libraries. We infer this from calls to the substitute and assimilate functions. Another possible reason for not switching the evaluation strategy is that promises perform and observe side effects. x <- y <- 0 fun <- function (a, b) if ( runif (1)>.5) a+b else b+a fun(x<-y+1, y<-x+2)

  1. Result is always a+b, but can be either 4 or 5 This code snippet will yield different results depending on the order the two promises passed to fun are going to be evaluated. Taking into account the various oddities of R, such as lookups that force evaluation of all promises in scope, it is reasonable to wonder if relying on a particular evaluation order is a wise choice for programmers. "
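To make the promise mechanics above concrete, a minimal sketch (mine, not the paper's): an argument is not evaluated at the call site, only when first used inside the callee, and the result is then cached so later uses do not re-evaluate it.

    f <- function(x) {
      cat("entered f\n")
      x            # first use forces the promise here...
      x            # ...second use reads the cached value, no re-evaluation
      invisible(NULL)
    }
    f({ cat("evaluating the argument\n"); 1 })
    # entered f
    # evaluating the argument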

" Observations. R is clearly slow and memory inefficient. Much more so than other dynamic languages. This is largely due to the combination of language features (call-by- value, extreme dynamism, lazy evaluation) and the lack of efficient built-in types. We believe that with some effort it should be possible to improve both time and space usage, but this would likely require a full rewrite of the implementation "

" Dynamic evaluation. R allows code to be dynamically evaluated through the eval function. Unevaluated expressions can be created from text with the quote function, and variable substitution (without evaluation) can be done with substitute and partial substitution with bquote . Further, expressions can be reduced back to input strings with the substitute and deparse functions. The R manual [ 16 ] mentions these functions as useful for dynamic generation of chart labels, but they are used for much more. Extending the language. One of the motivations to use lazy evaluation in R is to extend the language without needing macros. But promises are only evaluated once, so implementing constructs like a while loop, which must repeatedly evaluate its body and guard, takes some work. The substitute function can be used to get the source text of a promise, the expression that was passed into the function, and eval to execute it. Consider this implementation of a while loop in user code, mywhile <- function (cond, body) repeat if (! eval.parent ( substitute (cond))) break else eval.parent ( substitute (body)) Not all language extensions require reflection, lazy evaluation can be sufficient. The implementation of tryCatch is roughly, tryCatch <- function (expr, ...) {

  1. set up handlers specified in ... expr } Explicit Environment Manipulation. Beyond these common dynamic features, R’s reflective capabilities also expose its implementation to the user. Environments exist as a data type. Users can create new environments, and view or modify existing ones, including those used in the current call stack. Closures can have their parameters, body, or environment changed after they are defined. The call stack is also open for examination at any time. There are ways to access any closure or frame on the stack, or even return the entire stack as a list. With this rich set of building blocks, user code can implement whatever scoping rules they like. "
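The mywhile definition quoted above works as written; here is a usage sketch (mine, not the paper's), with the body braced for readability and called from the top level. Each iteration re-substitutes the guard and body expressions and re-evaluates them in the caller's frame, so the assignment to i inside the body is seen by the guard.

    mywhile <- function(cond, body) {
      repeat {
        if (!eval.parent(substitute(cond))) break   # re-evaluate guard in caller
        eval.parent(substitute(body))               # re-evaluate body in caller
      }
    }

    i <- 0
    mywhile(i < 3, { cat("i =", i, "\n"); i <- i + 1 })
    # i = 0
    # i = 1
    # i = 2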