A key challenge: the evaluation gap

What are the best ways to evaluate research? This question has received renewed interest in both the United Kingdom and the Netherlands. The changing dynamics of research, the increased availability of data about research, and the rise of the web as an infrastructure for research mean that this question is revisited regularly. The UK funding agency HEFCE has installed a Steering Group to evaluate the evidence on the potential role of performance metrics in the next instalment of the Research Excellence Framework. The British ministry of science suspects that metrics may help to alleviate the pressure that this large-scale assessment exercise puts on the research community. In the Netherlands, a new Standard Evaluation Protocol has been published in which the number of publications is no longer an independent criterion for the performance of research groups. Like the British, the Dutch are putting much more emphasis on societal relevance than in previous assessment protocols. However, whereas the British are exploring new ways to make the best use of metrics, the Dutch pressure group Science in Transition is calling for an end to the bibliometric measurement of research performance.

In 2010 (three years before Science in Transition was launched), we started to formulate our new CWTS research programme and took a different perspective on research evaluation. We defined the key issue not as a problem of too many or too few indicators. Nor do we think that using the wrong indicators (whether bibliometric or peer review based) is the fundamental problem, although misinterpretation and misuse of indicators certainly do happen. The most important issue is the emergence of a more fundamental gap between, on the one hand, the dominant criteria in scientific quality control (in peer review as well as in metrics-based approaches) and, on the other hand, the new roles of research in society. This was the main point of departure for the ACUMEN project, which aimed to address this gap (and has recently delivered its report to the European Commission).

This “evaluation gap” results in discrepancies at two levels. First, research has a variety of missions: to produce knowledge for its own sake; to help define and solve economic and social problems; to create the knowledge base for further technological and social innovation; and to give meaning to current cultural and social developments. These different missions are strongly interrelated and can often be served within one research project. Yet they do require different forms of communication and articulation work. The work needed to accomplish these missions is certainly not limited to the publication of articles in specialized scientific journals. Yet it is this type of work that figures most prominently in research evaluations. This has the paradoxical effect that the requirement to be more active in “valorization” and other forms of society-oriented scientific work is piled on top of the requirement to be excellent in publishing high-impact articles and books. No wonder many Dutch researchers regularly show signs of burnout (Tijdink, De Rijcke, Vinkers, Smulders, & Wouters, 2014; Tijdink, Vergouwen, & Smulders, 2012). Hence, there is a need for a diversification of quality criteria and a more refined set of evaluation criteria that take into account the actual research mission of the group or institute being evaluated (instead of an ideal-typical research mission that is not much more than a pipe dream).

Second, research has become a huge enterprise, with enormous amounts of research results and an increased complexity of interdisciplinary connections between fields. The current routines of peer review cannot keep up with this vast increase in scale and complexity. Sometimes there are simply not enough peers available to check the quality of new research. In addition, new forms of peer review of data quality are increasingly in demand. A number of experiments with new forms of review have been developed in response to these challenges. A common solution in massive review exercises (such as the REF in the UK or the judgement of large EU programmes) is the bureaucratization of peer review. This effectively turns the substantive orientation of expert peer judgement into a procedure in which the main role of experts is ticking boxes and checking whether researchers have fulfilled their procedural requirements. Will this undermine the nature of peer review in science in the long run? We do not really know.

A possible way forward would be to reassess the balance between qualitative and quantitative judgement of quality and impact. The fact that the management of large scientific organizations requires a great deal of empirical evidence, and therefore also quantitative indicators, does not mean that these indicators should inevitably take the lead. The fact that the increased social significance of scientific and scholarly research means that researchers should be evaluated does not mean that evaluation should always be a formalized procedure in which researchers participate only willy-nilly. According to the Danish evaluation researcher Peter Dahler-Larsen, the key characteristic of “the evaluation society” is that evaluation has become a profession in itself and has become detached from the primary process that it evaluates (Dahler-Larsen, 2012). We are dealing with “evaluation machines”, he argues. The main operation of these machines is to make everything “fit to be evaluated”. Because of this social technology, individual researchers and research groups cannot evade evaluation without jeopardizing their careers. At the same time, there is also a good non-Foucauldian reason for evaluation: evaluation is part of the democratic accountability of science.

This may be the key point in rethinking our research evaluation systems. We must resolve a dilemma: on the one hand, we need evaluation machines because science has become too important and too complex to do without them; on the other hand, evaluation machines tend to take on a life of their own and reshape the dynamics of research in potentially harmful ways. Therefore, we need to re-establish the connection between evaluation machines in science and the expert evaluation that already drives the primary process of knowledge creation. In other words, it is fine that research evaluation has become so professionalized that we have specialized experts in addition to the researchers involved. But this evaluation machinery should be organized in such a way that the evaluation process becomes a valuable component of the very process of knowledge creation that it aims to evaluate. This, I think, is the key challenge for both the new Dutch evaluation protocol and the British REF.

Would it be possible to enjoy this type of evaluation?

References:

Dahler-Larsen, P. (2012). The Evaluation Society. Stanford, CA: Stanford University Press.

Tijdink, J. K., De Rijcke, S., Vinkers, C. H., Smulders, Y. M., & Wouters, P. (2014). Publicatiedrang en citatiestress. Nederlands Tijdschrift voor Geneeskunde, 158, A7147.

Tijdink, J. K., Vergouwen, A. C. M., & Smulders, Y. M. (2012). De gelukkige wetenschapper. Nederlands Tijdschrift voor Geneeskunde, 156, 1–5.


Tales from the field: On the (not so) secret life of performance indicators

* Guest blog post by Alex Rushforth *

In the coming months, Sarah De Rijcke and I have been accepted to present at conferences in Valencia and Rotterdam on research from CWTS’s nascent EPIC working group. We very much look forward to drawing on collaborative work from our ongoing ‘Impact of indicators’ project on biomedical research in University Medical Centers (UMCs) in the Netherlands. One of our motivations behind the project is that there has been a wealth of social science literature in recent times about the effects of formal evaluation on public sector organisations, including universities. Yet too few studies have taken seriously the presence of indicators in the context of one of the university’s core missions: knowledge creation. Fewer still have applied an ethnographic lens to the dynamics of indicators in the day-to-day context of academic knowledge work. These are deficits we hope to begin addressing through these conferences and beyond.

The puzzle we will be addressing here appears, at least at first glance, straightforward enough: what is the role of bibliometric performance indicators in the biomedical knowledge production process? Yet when comparing provisional findings from two contrasting case studies of research groups from the same UMC – one a molecular biology group and the other a statistics group – it quickly becomes apparent that there can be no general answer to this question. As such, we aim not only to provide an inventory of the different ‘roles’ of indicators in these two cases, but also to pose the more interesting analytical question of which conditions and mechanisms explain the observed variations in the roles indicators come to perform.

Owing to their persistent recurrence in the data so far, the indicators we will analyze are the journal impact factor, the h-index, and ‘advanced’ citation-based bibliometric indicators. It should be stressed that our focus on these particular indicators has emerged inductively from observing first-hand the metrics that research groups attended to in their knowledge-making activities. So what have we found so far?

Dutch UMCs constitute particularly apt sites through which to explore this problem, given how central bibliometric assessments have been to the formal evaluations carried out since their inception in the early 2000s. On one level, it is argued that researchers in both cases encounter such metrics as ‘governance/managerial devices’, that is, as forms of information required of them by external agencies on whom they are reliant for resources and legitimacy. Examples can be seen when funding applications, annual performance appraisals, or job descriptions demand such information about an individual’s or group’s past performance. As the findings will show, the information needed by the two groups to produce their work effectively, and the types of demands made on them by ‘external’ agencies, vary considerably, despite their common location in the same UMC. This is one important reason why the role of indicators differs between the cases.

However, this coercive ‘power over’ account is but one dimension of a satisfying answer to our question about the role of indicators. Emerging analysis also reveals the surprising finding that in fields characterized by particularly integrated forms of coordination and standardization (Whitley, 2000) – like our molecular biologists – indicators in fact have the propensity to function as a core feature of the knowledge-making process. For instance, a performance indicator like the journal impact factor was routinely mobilized informally in researchers’ decision-making as an ad hoc standard against which to evaluate the likely uses of information and resources, and in deciding whether time and resources should be spent pursuing them. By contrast, in the less centralized and integrated field of statistical research, such an indicator was not as indispensable to the routines of knowledge-making activities. In the case of the statisticians, it is possible to speculate that indicators are more likely to emerge intermittently as conditions to be met for gaining social and cultural acceptance from external agencies, but are less likely to inform day-to-day decisions. Through our ongoing analysis we aim to unpack further how disciplinary practices interact with the organisation of Dutch UMCs to produce quite varying engagements with indicators.

The extent to which indicators play central or peripheral roles in research production processes across academic contexts is an important sociological problem, and posing it enhances our understanding of the complex role of performance indicators in academic life. We feel much of the existing literature on the evaluation of public organisations has tended to paint an exaggerated picture of formal evaluation and research metrics as synonymous with empty ritual and legitimacy (e.g. Dahler-Larsen, 2012). Emerging results here show that, at least in the realm of knowledge production, the picture is more subtle. This theoretical insight prompts us to suggest that further empirical studies are needed of scholarly fields with different patterns of work organisation, in order to compare our results and develop middle-range theorizing on the mechanisms through which metrics infiltrate knowledge production processes to fundamental or peripheral degrees. In the future this could mean venturing into fields far outside biomedicine, such as history, literature, or sociology. For now, though, we look forward to expanding the biomedical project by conducting analogous case studies in a second UMC.

Indeed, it is through such theoretical developments that we can not only consider the appropriateness of one-size-fits-all models of performance evaluation, but also unpack and problematize discourses about what constitutes ‘misuse’ of metrics. And how convinced should we be that academic life is now saturated and dominated by deleterious metric indicators?

References

Dahler-Larsen, P. (2012). The Evaluation Society. Stanford, CA: Stanford University Press.

Whitley, R. (2000). The Intellectual and Social Organization of the Sciences. Oxford: Oxford University Press.

Bibliometrics of individual researchers

The demand for measures of individual performance in the management of universities and research institutes has been growing, in particular since the early 2000s. The publication of the Hirsch index in 2005 (Hirsch, 2005) and its popularisation by the journal Nature (Ball, 2005) have given this a strong stimulus. According to Hirsch, his index seemed the perfect indicator to assess the scientific performance of an individual author because “it is transparent, unbiased and very hard to rig”. The h-index balances productivity with citation impact: an author with an h-index of 14 has produced 14 publications that have each been cited at least 14 times. So neither an author with a long list of mediocre publications nor an author with one wonder hit is rewarded by this indicator. Nevertheless, the h-index turned out to have too many disadvantages to wear the crown of “the perfect indicator”. As Hirsch acknowledged himself, it cannot be used for cross-disciplinary comparison. A field in which many citations are exchanged among authors will produce a much higher average h-index than a field with far fewer citations and references per publication. Moreover, the older one gets, the higher one’s h-index will be. And, as my colleagues have shown, the index is mathematically inconsistent, which means that rankings based on the h-index may be influenced in rather counter-intuitive ways (Waltman & Van Eck, 2012). At CWTS, we therefore prefer the use of an indicator like the number (or percentage) of highly cited papers instead of the h-index (Bornmann, 2013).
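To make these definitions concrete, here is a minimal sketch in Python (not CWTS code; the function names, citation counts and field threshold are made-up illustrations) of how an h-index and a share of highly cited papers could be computed from a list of per-publication citation counts. Note that a meaningful “highly cited” threshold has to come from a field-normalized reference set, which is precisely the kind of normalization the raw h-index lacks.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each (Hirsch, 2005)."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


def highly_cited_share(citation_counts, field_threshold):
    """Share of publications at or above a field-specific citation threshold
    (e.g. the field's top-10% boundary). The threshold itself must be derived
    from a field-normalized reference set, which is not computed here."""
    if not citation_counts:
        return 0.0
    hits = sum(1 for c in citation_counts if c >= field_threshold)
    return hits / len(citation_counts)


# Hypothetical citation counts for one author's publications.
citations = [50, 30, 22, 14, 14, 8, 5, 2, 0]
print(h_index(citations))                 # 6: six papers have at least 6 citations each
print(highly_cited_share(citations, 25))  # 0.22..., assuming a top-10% boundary of 25
```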

Still, no bibliometric indicator can claim to be the perfect instrument for assessing the performance of an individual researcher. This raises the question of how bibliometricians and science managers should use statistical information and bibliometric indicators. Should they be avoided, and should the judgement of candidates for a prize or membership of a prestigious academic association be informed only by peer review? Or can numbers play a useful role? What guidance should the bibliometric community then give to users of its information?

This was the key topic at a special plenary at the 14th ISSI Conference two weeks ago in Vienna. The plenary was an initiative taken by Jochen Gläser (Technical University Berlin), Ismael Rafols (SPRU, University of Sussex, and Ingenio, Polytechnical University Valencia), Wolfgang Glänzel (Leuven University) and myself. It aimed to give a new stimulus to the debate on how to apply, and how not to apply, performance indicators of individual scientists and scholars. Although this is not a new debate – the pioneers of bibliometrics already paid attention to this problem – it has become more urgent because of the almost insatiable demand for objective data and indicators in the management of universities and research institutes. For example, many biomedical researchers mention the value of their h-index on their CV. In publication lists, one regularly sees the value of the Journal Impact Factor mentioned after the journal’s name. In some countries, for example Turkey and China, one’s salary can be determined by one’s h-index or by the impact factor of the journals one has published in. The Royal Netherlands Academy of Arts and Sciences also seems to ask for these kinds of statistics in its forms for new members in the medical and natural sciences. Although robust systematic evidence is still lacking (we are working hard on this), the use of performance indicators in the judgement of individual researchers for appointments, funding, and memberships seems widespread, non-transparent and unregulated.

This situation is clearly not desirable. If researchers are being evaluated, they should be aware of the criteria used, and these criteria should be justified for the purpose at hand. This requires that users of performance indicators have clear guidelines. It seems rather obvious that the bibliometric community has an important responsibility to inform users and to provide such guidelines. However, there is as yet no consensus about such guidelines. Individual bibliometric centres do indeed inform their clients about the use and limitations of their indicators. Moreover, all bibliometric centres have the habit of publishing their work in the scientific literature, often including the technical details of their indicators. However, this published work is not easily accessible to non-expert users such as faculty deans and research directors. The literature is too technical and distributed over too many journals and books. It needs synthesizing and translating into plain, easily understandable language.

To initiate a process of more professional guidance for the application of bibliometric indicators in the evaluation of individual researchers, we asked the organizers of the ISSI conference to devote a plenary to this problem, to which they kindly agreed. At the plenary, Wolfgang Glänzel and I presented “The dos and don’ts in individual level bibliometrics”. We do not regard this as a final list, but rather as a good start with ten dos and don’ts. Some examples: “do not reduce individual performance to a single number”, “do not rank scientists according to one indicator”, “always combine quantitative and qualitative methods”, “combine bibliometrics with career analysis”. To prevent misunderstandings: we do not want to set up a bibliometric police force with absolute rules. The context of the evaluation should always determine which indicators and methods to use. Therefore, some don’ts in our list may sometimes be perfectly usable, such as the application of bibliometric indicators to make a first selection among a large number of candidates.

Our presentation was commented on by Henk Moed (Elsevier), with a presentation on “Author Level Bibliometrics”, and by Gunnar Sivertsen (NIFU, Oslo), who drew on his extensive experience in research evaluation. Henk Moed built on the concept of the multi-dimensional research matrix, which was published in 2010 by the European Expert Group on the Assessment of University-Based Research, of which he was a member (Expert Group on Assessment of University-Based Research, 2010). This matrix aims to give global guidance on the use of indicators at various levels of the university organization. However, it does not focus on the problem of how to evaluate individual researchers. Still, the matrix is surely a valuable contribution to the development of more professional standards in the application of performance indicators. Gunnar Sivertsen made clear that the discussion should not be restricted to the bibliometric community itself. On the contrary, the main audience for guidelines should be the researchers themselves and administrators in universities and funding agencies.

The ensuing debate led to a large number of suggestions. They will be included in the full report of the meeting, which will be published in the upcoming issue of ISSI’s professional newsletter in September 2013. A key point was perhaps the issue of responsibility: it is clear that researchers themselves and the evaluating bodies should carry the main responsibility for the use of performance indicators. However, they should be able to rely on clear guidance from the technical experts. How should this balance be struck? Should bibliometricians refuse to deliver indicators when they think their application would be unjustified? Should the association of scientometricians publicly comment on misapplications? Or should this be left to the judgement of the universities themselves? The plenary did not yet resolve these issues. However, a consensus is emerging that more guidance from bibliometricians is required and that researchers should have a clear address to which they can turn with questions about the application of performance indicators, whether by themselves or by their evaluators.

What next? The four initiators of this debate in Vienna have also organized a thematic session on individual-level bibliometrics at the next conference on science & technology indicators, the STI Conference “Translational twists and turns: science as a socio-economic endeavour”, which will take place in Berlin, 4-6 September 2013. There, we will take the next step in specifying guidelines. In parallel, this conference will also host a plenary session on the topic of bibliometric standards in general, organized by iFQ, CWTS and Science-Metrix. In 2014, we will then organize a discussion with key stakeholders such as faculty deans, administrators, and of course the research communities themselves on the best guidelines for evaluating individual researchers.

Stay tuned.

Bibliography:

Expert Group on Assessment of University-Based Research. (2010). Assessing Europe’s University-Based Research. European Commission. doi:10.2777/80193

Ball, P. (2005). Index aims for fair ranking of scientists. Nature, 436(7053), 900. doi:10.1038/436900a

Bornmann, L. (2013). A better alternative to the h index. Journal of Informetrics, 7(1), 100. doi:10.1016/j.joi.2012.09.004

Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–72. doi:10.1073/pnas.0507655102

Waltman, L., & Van Eck, N. J. (2012). The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2), 406–415. doi:10.1002/asi
