What are the best ways to evaluate research? This question has received renewed interest in both the United Kingdom and the Netherlands. The dynamics in research, the increased availability of data about research, and the rise of the web as infrastructure for research lead to a regular revisiting of this question. The UK funding agency HEFCE has installed a Steering Group to evaluate the evidence on the potential role of performance metrics in the next instalment of the Research Excellence Framework. The British ministry of science suspects that metrics may help to alleviate the pressure of the large-scale assessment exercise on the research community. In the Netherlands, a new Standard Evaluation Protocol has been published in which the number of publications is no longer an independent criterion for the performance of research groups. Like the British, the Dutch are putting much more emphasis on societal relevance than in the previous assessment protocols. However, whereas the British are exploring new ways to make the best use of metrics, the Dutch pressure group Science in Transition is calling for an end to the bibliometric measurement of research performance.
In 2010 (three years before Science in Transition was launched), we started to formulate our new CWTS research programme and we took a different perspective on research evaluation. We defined the key issue not as a problem of too many or too few indicators. Neither do we think that using the wrong indicators (either bibliometric or peer review based) is the fundamental problem, although misinterpretation or misuse of indicators certainly does happen. The most important issue is the emergence of a more fundamental gap between, on the one hand, the dominant criteria in scientific quality control (in peer review as well as in metrics approaches) and, on the other hand, the new roles of research in society. This was the main point of departure for the ACUMEN project, which aimed to address this gap (and has recently delivered its report to the European Commission).
This “evaluation gap” results in discrepancies at two levels. First, research has a variety of missions: to produce knowledge for its own sake; to help define and solve economic and social problems; to create the knowledge base for further technological and social innovation; and to give meaning to actual cultural and social developments. These different missions are strongly interrelated and can often be served within one research project. Yet, they do require different forms of communication and articulation work. The work needed to accomplish these missions is certainly not limited to the publication of articles in specialized scientific journals. Yet, it is this type of work that figures most prominently in research evaluations. This has the paradoxical effect that the requirement to be more active in “valorization” and other forms of society-oriented scientific work is piled on top of the requirement to excel in publishing high-impact articles and books. No wonder a lot of Dutch researchers regularly show signs of burnout (Tijdink, Rijcke, Vinkers, Smulders, & Wouters, 2014; Tijdink, Vergouwen, & Smulders, 2012). Hence, there is a need for diversification of quality criteria and a more refined set of evaluation criteria that takes into account the real research mission of the group or institute that is being evaluated (instead of an ideal-typical research mission that is actually not much more than a pipe dream).
Second, research has become a huge enterprise with enormous amounts of research results and an increased complexity of interdisciplinary connections between fields. The current routines in peer review cannot keep up with this vast increase in scale and complexity. Sometimes there are not enough peers to check the quality of the new research. In addition, new forms of peer review of data quality are in increasing demand. A number of experiments with new forms of review have been developed in response to these challenges. A common solution in massive review exercises (such as the REF in the UK or the judgement of large EU programmes) is the bureaucratization of peer review. This effectively turns the substantive orientation of peer expert judgment into a procedure in which the main role of experts is ticking boxes and checking whether the researchers have fulfilled their procedural requirements. Will this in the long run undermine the nature of peer review in science? We do not really know.
A possible way forward would be to re-assess the balance between qualitative and quantitative judgement of quality and impact. The fact that the management of large scientific organizations requires lots of empirical evidence, and therefore also quantitative indicators, does not mean that these indicators should inevitably be leading. The fact that the increased social significance of scientific and scholarly research means that researchers should be evaluated does not mean that evaluation should always be a formalized procedure in which the researchers participate only willy-nilly. According to the Danish economist Peter Dahler-Larsen, the key characteristic of “the evaluation society” is that evaluation has become a profession in itself and has become detached from the primary process that it evaluates (Dahler-Larsen, 2012). We are dealing with “evaluation machines”, he argues. The main operation of these machines is to make everything “fit to be evaluated”. Because of this social technology, individual researchers or individual research groups are not able to evade evaluation without jeopardizing their careers. At the same time, there is also a good non-Foucauldian reason for evaluation: evaluation is part of the democratic accountability of science.
This may be the key point in re-thinking our research evaluation systems. We must solve a dilemma: on the one hand, we need evaluation machines because science has become too important and too complex to do without them; on the other hand, evaluation machines tend to take on a life of their own and reshape the dynamics of research in potentially harmful ways. Therefore, we need to re-establish the connection between evaluation machines in science and the expert evaluation that is already driving the primary process of knowledge creation. In other words, it is fine that research evaluation has become so professionalized that we have specialized experts in addition to the researchers involved. But this evaluation machine should be organized in such a way that the evaluation process becomes a valuable component of the very process of knowledge creation that it wants to evaluate. This, I think, is the key challenge for both the new Dutch evaluation protocol and the British REF.
Would it be possible to enjoy this type of evaluation?
Dahler-Larsen, P. (2012). The Evaluation Society (p. 280). Stanford University Press. Retrieved from http://www.amazon.com/The-Evaluation-Society-Peter-Dahler-Larsen/dp/080477692X
Tijdink, J. K., Rijcke, S. De, Vinkers, C. H., Smulders, Y. M., & Wouters, P. (2014). Publicatiedrang en citatiestress. Nederlands Tijdschrift Voor Geneeskunde, 158, A7147.
Tijdink, J. K., Vergouwen, A. C. M., & Smulders, Y. M. (2012). De gelukkige wetenschapper. Nederlands Tijdschrift Voor Geneeskunde, 156, 1–5.