Ethics and misconduct – Review of a play organized by the Young Academy (KNAW)

This is a guest blog post by Joost Kosten. Joost is a PhD student at CWTS and a member of the EPIC working group. His research focuses on the use of research indicators from the perspective of public policy. Joost obtained an MSc in Public Administration (Leiden University) and was also trained in Political Science (Stockholm University) and Law (VU University Amsterdam).

Scientific (mis)conduct – The sins, the drama, the identification

On Tuesday November 18th 2014 the Young Academy of the Royal Netherlands Academy of Arts and Sciences organized a performance of the play Gewetenschap by Tony Maples at Leiden University. Pandemonia Science Theater is currently touring the Netherlands to perform this piece at several universities. Gewetenschap was inspired by the occasional problems with ethics and integrity that recently occurred in Dutch science and scholarship. Although these problems concerned grave violations of the scientific code of conduct (i.e., the cardinal sins of fraud, fabrication, and plagiarism), the play focusses on common dilemmas in a researcher’s everyday life. The title Gewetenschap is a non-existent word, which combines the Dutch words geweten (conscience) and wetenschap (science).

The playwright used confidential interviews with members of the Young Academy to gain insight into the most frequently occurring ethical dilemmas researchers have to deal with. Professor Karin de Zwaan is a research group leader who has hardly any time to do research herself. She puts much effort into organizing grants, attracting new students and organizing her research group. Post-doc Jeroen Dreef is a very active researcher who does not have enough time to take organizational responsibilities seriously. A tenure track is all he wants. Given their other important activities, Karin and Jeroen hardly have any time to supervise PhD student Lotte. One could question the type of support they do give her.

At times, judging by the reactions to scenes of the play, the audience clearly recognized the topics presented. Afterwards, prof. Bas Haring presents the dilemmas touched upon during the play. The audience discusses the following topics:

  • Is there a conflict between the research topics a researcher personally likes and what the research group expects her/him to do?
  • In one of the scenes, the researchers were delighted because of the acceptance of a publication. Haring asks if that exhibits “natural behaviour”. Shouldn’t a researcher be happy with good results instead of a publication being accepted? One of the participants replies that a publication functions as a reward.
  • What do you do with your data? Is endless application of a diversity of analysis methods until you find nice results a responsible approach?
  • What about impact factors (IF)? Bas Haring himself says his IF is 0. “Do you think I am an idiot?” Which role do numbers such as the IF play in your opinion about colleagues? There seems to be quite a diversity of opinions. An early career researcher says everyone knows these numbers are nonsense. An experienced scientist points out that there is a correlation between scores and quality. Someone else expresses his optimism since he expects that this focus on numbers will be over within ten years. This causes another to respond that in the past there was competition too, but in a different way.
  • When is someone a co-author? This question results in a lively debate. Apparently, there are considerable differences from field to field. In the medical fields, a co-authorship can be a way to express gratitude to people who have played a vital role in a research project, such as those who could arrange experimental subjects. In this way, a co-authorship becomes a tradeable commodity. A professor of medicine points out that in his field, co-authorships can be used to compare a curriculum vitae with the development of status as a researcher. Thus, they can be used as a criterion to judge grant proposals. A good researcher should start with first-position co-authorships, later on should have co-authorships somewhere in between the first and last author, and should end his career with papers in which he has co-authorships in the last position. Thus, the further the career has developed, the closer the author’s name should be to the end of the author list. Another participant states that one can deal with co-authorships in three different ways: 1. Co-authors should always have full responsibility for everything in the paper. 2. Similar to the credits given at the end of a movie, co-authors should clarify what each co-author’s contribution was. 3. Only those who really contributed to writing a paper can be a co-author. The participant admits that this last proposal works in his own field but might not work in other fields.
  • Can a researcher exaggerate his findings if he presents them to journalists? Should you keep control over a journalist’s work to prevent him from presenting things differently? Is it allowed to present untrue information to support your case, simply because a proper scientific argument would be too complex for the man in the street?
  • Is it allowed to present your work as having more societal relevance than you really expect? One of the reactions is that researchers are forced to express the societal relevance of their work when they apply for a grant. By the very nature of scientific research, it is hardly possible to clearly indicate what society will gain from the results.
  • What does a good relationship between a PhD student and a supervisor look like? What is a good balance between serving the interests of PhD students, serving organizational interests (e.g. the future of the organization by attracting new students and grants), and the researcher’s own interests?

The discussion did not concentrate on the following dilemmas presented in Gewetenschap:

  • To what extent are requirements for grant proposals contradictory? On the one hand, researchers are expected to think ‘out-of-the-box’ while on the other hand they should meet a large number of requirements. Moreover, should one propose new ideas including the risks which come along, or is it better to stay on the beaten path in order to guarantee successes?
  • Should colleagues who did not show respect be repaid in kind if you have a chance to review their work? Should you always judge scientific work on its merits? Are there any principles of ‘due process’ which should guide peer review?
  • Who owns the data if someone contributed to them but moves to another research group or institute?

 

Quality in the age of the impact factor

ISIS, the most prestigious journal in the history of science, moved house last September and its central office is now located at the Descartes Centre for the History and Philosophy of the Sciences and Humanities at Utrecht University. The Dutch science historian H. Floris Cohen took up the position of editor in chief of the journal. No doubt this underlines the international reputation of the community of historians of science in the Netherlands. Being the editor of the central journal in one’s field is surely a mark of esteem and quality.

The opening of the editorial office in Utrecht was celebrated with a symposium entitled “Quality in the age of the impact factor”. Since quality of research in history is intimately intertwined with the quality of writing, it seemed particularly apt to call attention to the role of impact factors in humanities fields. I used the occasion to pose the question how we actually define scientific and scholarly quality. How do we recognize quality in our daily practices? And how can this variety of practices be understood theoretically? Which approaches in the field of science and technology studies are most relevant?

In the same month, Pleun van Arensbergen defended a very interesting PhD dissertation which dealt with some of these issues, “Talent Proof. Selection Processes in Research Funding and Careers”. Van Arensbergen did her thesis work at the Rathenau Institute in The Hague. The quality of research is increasingly seen as mainly the result of the quality of the people involved. Hence, universities “have openly made it one of their main goals to attract scientific talent” (van Arensbergen, 2014, p. 121). A specific characteristic of this “war for talent” in the academic world is that there is an oversupply of talents and a relative lack of career opportunities, leading to a “war between talents”. The dissertation is a thorough analysis of success factors in academic careers. It is an empirical analysis of how the Dutch science foundation NWO selects early career talent in its Innovational Research Incentives Scheme. The study surveyed researchers about their definitions of quality and talent. It combines this with an analysis of both the outcome and the process of this talent selection. Van Arensbergen paid specific attention to the gender distribution and to the difference between successful and unsuccessful applicants.

Her results point to a discrepancy between the common notion among researchers that talent is immediately recognizable (“you know it when you see it”) and the fact that there are very small differences between candidates that get funded and those that do not. The top and the bottom of the distribution of quality among proposals and candidates are relatively easy to detect. But the group of “good” and “very good” proposals is still too large to be funded. Van Arensbergen and her colleagues did not find a “natural threshold” above which the successful talents can be placed. On the contrary, in one of her chapters they find that researchers who leave the academic system due to lack of career possibilities regularly score higher on a number of quality indicators than those who are able to continue a research career. “This study does not confirm that the university system always preserves the highly productive researchers, as leavers were even found to outperform the stayers in the final career phase” (van Arensbergen, 2014, p. 125).

Based on the survey, her case studies and her interviews, Van Arensbergen also concludes that productivity and publication records have become rather important for academic careers. “Quality nowadays seems to a large extent to be defined as productivity. Universities seem to have internalized the performance culture and rhetoric to such an extent that academics even define and regulate themselves in terms of dominant performance indicators like numbers of publications, citations or the H-index. (…) Publishing seems to have become the goal of academic labour.” (van Arensbergen, 2014, p. 125). This does not mean, however, that these indicators determine the success of a career. The study questions “the overpowering significance assigned to these performance measures in the debate, as they were not found to be entirely decisive.” (van Arensbergen, 2014, p. 126) An extensive publication record is a condition but not a guarantee for success.

This relates to another finding: the group processes of panel discussions are also very important. With a variety of examples, Van Arensbergen shows how the organization of the selection process shapes the outcome. The face-to-face interview of the candidate with the panel is, for example, crucial for the final decision. In addition, the influence of the external peer reports was found to be modest.

A third finding in the talent dissertation is that success in obtaining grants feeds back into one’s scientific and scholarly career. This creates a self-reinforcing mechanism, which the sociologist of science Robert Merton coined the Matthew effect after the quote from the Bible: “For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken even that which he hath.” (Merton, 1968). Van Arensbergen concludes that this means that differences between scholars may initially be small but will increase in the course of time as a result of funding decisions. “Panel decisions convert minor differences in quality into enlarged differences in recognition.”

Combining these three findings leads to some interesting conclusions regarding how we actually define and shape quality in academia. Although panel decisions about who to fund are strongly shaped by the organization of the selection process as well as by a host of other contextual factors (including chance), and although all researchers are aware of the uncertainties in these decisions, this does not mean that these decisions are given less weight. On the contrary, obtaining external grants has become a cornerstone for successful academic careers. Universities even devote considerable resources to making their researchers better able to acquire prestigious grants as well as external funding in general. Although this is clearly instrumental for the organization, Van Arensbergen thinks that grants have become part of the symbolic capital of a researcher and research group and she refers to Pierre Bourdieu’s theory of symbolic capital to better understand the implications.

This brings me to my short lecture at the opening of the editorial office of ISIS in Utrecht. Although the experts on bibliometric indicators don’t generally see the Journal Impact Factor as an indicator of quality, socially it seems to partly function as one. But indicators are not alone in shaping how we in practice identify, and thereby define, talent and quality. They flow together with the way quality assurance and measurement processes are organized, the social psychology of panel discussions, the extent to which researchers are visible in their networks, etc. In these complex contextual interactions, indicators do not determine outcomes; rather, they are ascribed meaning depending on the situation in which the researchers find themselves. A good way to think about this, in my view, has been developed in the field of material semiotics. This approach, which has its roots in the French actor network theory of Bruno Latour and Michel Callon, does not accept a fundamental rupture in reality between the material and the symbolic. Reality as such is the result of complex and interacting translation processes. This is an excellent philosophical basis to understand how scientific and scholarly quality emerge. I see quality not as an attribute of an academic persona or of a particular piece of work, but as the result of the interaction between a researcher (or a manuscript) and the already existing scientific or scholarly infrastructure (e.g. the body of published studies). If this interaction creates a productive friction (meaning that there is enough novelty in the contribution but not so much that it is incompatible with the already existing body of work), we see the work or scholar as of high quality. In other words, quality simply does not (yet) exist outside of the systems of quality measurement. The implication of this is that quality itself is a historical category. It is not an invariant but a culturally and historically specific concept that changes and morphs over time. In fact, the history of science is the history of quality. I hope historians of science will take up the challenge to map this history with more empirical and theoretical sophistication than has been done so far.

Literature:

Merton, R. K. (1968). The Matthew Effect in Science. Science, 159, 56–62.

Van Arensbergen, P. (2014). Talent proof: Selection processes in research funding and careers. The Hague, Netherlands: Rathenau Institute. Retrieved from http://www.worldcat.org/title/talent-proof-selection-processes-in-research-funding-and-careers/oclc/890766139&referer=brief_results

 

Developing guiding principles and standards in the field of evaluation – lessons learned

This is a guest blog post by professor Peter Dahler-Larsen. The reflections below are a follow-up to his keynote at the STI conference in Leiden (3-5 September 2014) and the special session at STI on the development of quality standards for science & technology indicators. Dahler-Larsen holds a chair at the Department of Political Science, University of Copenhagen. He is a former president of the European Evaluation Society and author of The Evaluation Society (Stanford University Press, 2012).

Lessons learned about the development of guiding principles and standards in the field of evaluation – A personal reflection

Professor Peter Dahler-Larsen, 5 October 2014

Guidelines are symbolic, not regulatory

The limited institutional status of guiding principles and standards should be understood as a starting point for the debate. In the initial phases of development of such standards and guidelines, people often have very strong views. But only the state can enforce laws. To the extent that guidelines and standards merely express some official views of a professional association that has no institutional power to enforce them, standards and guidelines will have limited direct consequences for practitioners. The discussion becomes clearer once it is recognized that standards and guidelines thus primarily have a symbolic and communicative function, not a regulatory one. Practitioners will remain free to practise however they like, even after guidelines have been adopted.

Design a process of debate and involvement

All members of a professional association should have a possibility to comment on a draft version of guidelines/standards. An important component in the adoption of guidelines/standards is the design of a proper organizational process that involves the composition of a draft by a select group of recognized experts, an open debate among members, and an official procedure for the adoption of standards/guidelines as organizational policy.

Acknowledge the difference between minimum and maximum standards

Minimum standards must be complied with in all situations. Maximum standards are ideal principles worth striving for, although they will not be fully accomplished in any particular situation. It often turns out that there will be many maximum principles in a set of guidelines, although that is not what most people understand by “standards.” For that reason I personally prefer the term guidelines or guiding principles rather than “standards.”

Think carefully about guidelines and methodological pluralism

Advocates of a particular method often think that the methodological rules connected to their own method define quality as such in the whole field. For that reason, they are likely to insert their own methodological rules into the set of guidelines. As a consequence, guidelines can be used politically to promote one set of methods or one particular paradigm rather than another. Great care should be exercised in the formulation of guidelines to make sure that pluralism remains protected. For example, in evaluation the rule is that if you subscribe to a particular method, you should have high competence in the chosen method. But that goes for all methods.

Get beyond the “but that’s obvious” argument

Some argue that it is futile to formulate a set of guidelines because at that level of generality, it is only possible to state some very broad and obvious principles with which every sensible person must agree. The argument sounds plausible when you hear it, but my experience suggests otherwise for a number of reasons. First, some people have just not thought about a very bad practice (for example, doing evaluation without written Terms of Reference). Once you see that someone has formulated a guideline against this, you are likely to start paying attention to the problem. Just because a principle is obvious to some does not mean that it is obvious to all. Second, although there may be general agreement about a principle (such as “do no unnecessary harm” or “take general social welfare into account”), there can be strong disagreement about the interpretations and implications of the principle in practice. Third, a good set of guiding principles will often comprise at least two principles that are somewhat in tension with each other, for example the principle of being quick and useful versus the principle of being scientifically rigorous. To sort out exactly which kind of tension between these two principles one can live with in a concrete case turns out to be a matter of complicated professional judgment. So, get beyond the “that’s obvious” argument.

Recognize the fruitful uses of guidelines

Among the most important uses of guidelines in evaluation are:

– In application situations, good evaluators can explain their practice with reference to broader principles

– In conferences, guidelines can stimulate insightful professional discussions about how to handle complicated cases

– Books and journals can make use of guidelines as inspiration for the development of an ethical awareness among practitioners. For example, google Michael Morris’s work in the field of evaluation.

– There is great use of guidelines in teaching and in other forms of socialization of evaluators.

Respect the multiplicity of organizations

If, say, the European Evaluation Society wants to adopt a set of guidelines, it should be respected that, say, the German and the Swiss association already have their own guidelines. Furthermore, some professional associations (say, psychologists) also have guidelines. A professional association should take such overlaps seriously and find ways to exchange views and experiences with guidelines across national and organizational borders.

Professionals are not alone, but relations can be described in guidelines, too

It is often argued that one of the major problems in bad evaluation practice is the behavior of commissioners. Some therefore think that guidelines describing good evaluation practice are in vain until the behavior of commissioners (and perhaps other users of evaluation) is included in the guidelines, too. However, there is no particular reason why the guidelines cannot describe a good relation and a good interaction between commissioners and evaluators. Remember, guidelines have no regulatory power. They merely express the official norms of the professional association. Evaluators are allowed to express what they think a good commissioner should do or not do. In fact, explicit guidelines can help clarify mutual and reciprocal role expectations.

Allow for regular reflection, evaluation and revision of guidelines

At regular intervals, guidelines should be debated, evaluated and revised. The AEA guidelines, for example, have been revised and now reflect values regarding culturally competent evaluation that were not in earlier versions. Guidelines are organic and reflect a particular socio-historical situation.

Sources:

Michael Morris (2008). Evaluation Ethics for Best Practice. Guilford Press.

American Evaluation Association Guiding principles

A key challenge: the evaluation gap

What are the best ways to evaluate research? This question has received renewed interest in both the United Kingdom and the Netherlands. The dynamics in research, the increased availability of data about research, as well as the rise of the web as infrastructure for research lead to this question being revisited regularly. The UK funding agency HEFCE has installed a Steering Group to evaluate the evidence on the potential role of performance metrics in the next instalment of the Research Excellence Framework. The British ministry of science suspects that metrics may help to alleviate the pressure of the large-scale assessment exercise on the research community. In the Netherlands, a new Standard Evaluation Protocol has been published in which the number of publications is no longer an independent criterion for the performance of research groups. Like the British, the Dutch are putting much more emphasis on societal relevance than in the previous assessment protocols. However, whereas the British are exploring new ways to make the best use of metrics, the Dutch pressure group Science in Transition is calling for an end to the bibliometric measurement of research performance.

In 2010 (three years before Science in Transition was launched), we started to formulate our new CWTS research programme and we took a different perspective on research evaluation. We defined the key issue not as a problem of too many or too few indicators. Neither do we think that using the wrong indicators (either bibliometric or peer review based) is the fundamental problem, although misinterpretation or misuse of indicators certainly does happen. The most important issue is the emergence of a more fundamental gap between on the one hand the dominant criteria in scientific quality control (in peer review as well as in metrics approaches), and on the other hand the new roles of research in society. This was the main point of departure for the ACUMEN project, which aimed to address this gap (and has recently delivered its report to the European Commission).

This “evaluation gap” results in discrepancies at two levels. First, research has a variety of missions: to produce knowledge for its own sake; to help define and solve economic and social problems; to create the knowledge base for further technological and social innovation; and to give meaning to actual cultural and social developments. These different missions are strongly interrelated and can often be served within one research project. Yet, they do require different forms of communication and articulation work. The work needed to accomplish these missions is certainly not limited to the publication of articles in specialized scientific journals. Yet, it is this type of work that figures most prominently in research evaluations. This has the paradoxical effect that the requirement to be more active in “valorization” and other forms of society-oriented scientific work is piled on top of the requirement to be excellent in publishing high impact articles and books. No wonder a lot of Dutch researchers regularly show signs of burnout (Tijdink, Rijcke, Vinkers, Smulders, & Wouters, 2014; Tijdink, Vergouwen, & Smulders, 2012). Hence, there is a need for diversification of quality criteria and a more refined set of evaluation criteria that take into account the real research mission of the group or institute that is being evaluated (instead of an ideal-typical research mission that is actually not much more than a pipe dream).

Second, research has become a huge enterprise with enormous amounts of research results and an increased complexity of interdisciplinary connections between fields. The current routines in peer review cannot keep up with this vast increase in scale and complexity. Sometimes there is a lack of sufficient numbers of peers to check the quality of the new research. In addition, new forms of peer review of data quality are in increasing demand. A number of experiments with new forms of review have been developed in response to these challenges. A common solution in massive review exercises (such as the REF in the UK or the judgement of large EU programmes) is the bureaucratization of peer review. This effectively turns the substantive orientation of peer expert judgment into a procedure in which the main role of experts is ticking boxes and checking whether the researchers have fulfilled their procedural requirements. Will this in the long run undermine the nature of peer review in science? We do not really know.

A possible way forward would be to re-assess the balance between qualitative and quantitative judgement of quality and impact. The fact that the management of large scientific organizations requires lots of empirical evidence and therefore also quantitative indicators, does not mean that these indicators should inevitably take the lead. The fact that the increased social significance of scientific and scholarly research means that researchers should be evaluated, does not mean that evaluation should always be a formalized procedure in which the researchers participate only willy-nilly. According to the Danish political scientist Peter Dahler-Larsen, the key characteristic of “the evaluation society” is that evaluation has become a profession in itself and has become detached from the primary process that it evaluates (Dahler-Larsen, 2012). We are dealing with “evaluation machines”, he argues. The main operation of these machines is to make everything “fit to be evaluated”. Because of this social technology, individual researchers or individual research groups are not able to evade evaluation without jeopardizing their career. At the same time, there is also a good non-Foucauldian reason for evaluation: evaluation is part of the democratic accountability of science.

This may be the key point in re-thinking our research evaluation systems. We must solve a dilemma: on the one hand we need evaluation machines because science has become too important and too complex to do without, and on the other hand evaluation machines tend to start to lead a life of their own and reshape the dynamics of research in potentially harmful ways. Therefore, we need to re-establish the connection between evaluation machines in science and the expert evaluation that is already driving the primary process of knowledge creation. In other words, it is fine that research evaluation has become so professionalized that we have specialized experts in addition to the researchers involved. But this evaluation machine should be organized in such a way that the evaluation process becomes a valuable component of the very process of knowledge creation that it wants to evaluate. This, I think, is the key challenge for both the new Dutch evaluation protocol and the British REF.

Would it be possible to enjoy this type of evaluation?

References:

Dahler-Larsen, P. (2012). The Evaluation Society (p. 280). Stanford University Press. Retrieved from http://www.amazon.com/The-Evaluation-Society-Peter-Dahler-Larsen/dp/080477692X

Tijdink, J. K., Rijcke, S. De, Vinkers, C. H., Smulders, Y. M., & Wouters, P. (2014). Publicatiedrang en citatiestress. Nederlands Tijdschrift Voor Geneeskunde, 158, A7147.

Tijdink, J. K., Vergouwen, A. C. M., & Smulders, Y. M. (2012). De gelukkige wetenschapper. Nederlands Tijdschrift Voor Geneeskunde, 156, 1–5.

The new Dutch research evaluation protocol

From 2015 onwards, the societal impact of research will be a more prominent measure of success in the evaluation of research in the Netherlands. Less emphasis will be put on the number of publications, while the vigilance about research integrity will be increased. These are the main elements of the new Dutch Standard Evaluation Protocol which was published a few weeks ago.

The new protocol aims to guarantee, improve, and make visible the quality and relevance of scientific research at Dutch universities and institutes. Three aspects are central: scientific quality; societal relevance; and feasibility of the research strategy of the research groups involved. As is already the case in the current protocol, research assessments are organized by institution, and the institutional board is responsible. Nationwide comparative evaluations by discipline are possible, but the institutions involved have to agree explicitly to organize their assessments in a coordinated way to realize this. In contrast to performance based funding systems, the Dutch system does not have a tight coupling between assessment outcomes and funding for research.

This does not mean, however, that research assessments in the Netherlands do not have consequences. On the contrary, these may be quite severe but they will usually be implemented by the university management with considerable leeway for interpretation of the assessment results. The main channel through which Dutch research assessments have implications is the reputation gained or lost by the research leaders involved. The effectiveness of the assessments is often decided by the way the international committee that performs the evaluation works. If they see it as their main mission to celebrate their nice Dutch colleagues (as has happened in the recent past), the results will be complimentary but not necessarily very informative. On the other hand, they may also punish groups by using criteria that are actually not valid for those specific groups although they may be standard for the discipline as a whole (and this has also happened, for example when book-oriented groups work in a journal-oriented discipline).

The protocol does not include a uniform set of requirements or indicators. The specific mission of the research institutes or university departments under assessment is leading. As a result, research that is mainly aimed at having practical impact may be evaluated with different criteria from a group that aims to work on the international frontier of basic research. The protocol is not unified around substance but around procedure. Each group has to be evaluated every six years. A new element in the protocol is also that the scale for assessment has been changed from a five-point to a four-point scale, ranging from “unsatisfactory”, via “good” and “very good” to “excellent”. This scale will be applied to all three dimensions: scientific quality, societal relevance, and feasibility.

The considerable freedom that the peer committees have in evaluating Dutch research has been maintained in the new protocol. Therefore, it remains to be seen what the effects will be of the novel elements in the protocol. In assessing the societal relevance of research, the Dutch are following their British peers. Research groups will have to construct “narratives” which explain the impact their research has had on society, understood broadly. It is not yet clear how these narratives will be judged according to the scale. The criteria for feasibility are even less clear: according to the protocol a group has an “excellent” feasibility if it is “excellently equipped for the future”. Well, we’ll see how this works out.

With less emphasis on the number of publications in the new protocol, the Dutch universities, the funding agency NWO and the academy of science KNAW (which are collectively responsible for the protocol) have also responded to the increased anxiety about “perverse effects” in the research system triggered by the ‘Science in Transition’ group and to recent cases of scientific fraud. The Dutch minister of education, culture and the sciences Jet Bussemaker welcomed this change. “Productivity and speed should not be leading considerations for researchers”, she said at the reception of the new protocol. I fully agree with this statement, yet this aspect of the protocol will also have to stand the test of practice. In many ways, the number of publications is still a basic building block of scientific or scholarly careers. For example, the h-index is very popular in the medical sciences (Tijdink, Rijcke, Vinkers, Smulders, & Wouters, 2014). This index is a combination of the number of publications of a researcher and the citation impact of these articles in such a way that the h-index can never be higher than the total number of publications. This means that if researchers are compared according to the h-index, the most productive ones will prevail. We will have to wait and see whether the new evaluation protocol will be able to withstand this type of reward for high levels of article production.
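To make the point about the h-index concrete, here is a minimal Python sketch of the standard definition: the largest number h such that a researcher has h publications with at least h citations each. The function name `h_index` and the two citation lists are hypothetical and invented purely for illustration; they do not come from the cited study.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # later papers are cited even less, so h cannot grow
    return h


# Hypothetical citation counts, invented for illustration only.
focused = [90, 80, 70]                                # 3 highly cited papers
prolific = [12, 11, 10, 9, 9, 8, 8, 7, 7, 6, 5, 4]    # 12 moderately cited papers

print(h_index(focused), sum(focused))
print(h_index(prolific), sum(prolific))
```

In this made-up comparison the researcher with three papers is capped at an h-index of 3, however often those papers are cited, while the researcher with twelve papers reaches a higher h-index with far fewer citations in total. This is the sense in which the index rewards a high volume of articles.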

Reference: Tijdink, J. K., Rijcke, S. De, Vinkers, C. H., Smulders, Y. M., & Wouters, P. (2014). Publicatiedrang en citatiestress. Nederlands Tijdschrift Voor Geneeskunde, 158, A7147.

Metrics in research assessment under review

This week the Higher Education Funding Council for England (HEFCE) published a call to gather “views and evidence relating to the use of metrics in research assessment and management” http://www.hefce.ac.uk/news/newsarchive/2014/news87111.html. The council has established an international steering group which will perform an independent review of the role of metrics in research assessment. The review is supposed to contribute to the next installment of the Research Excellence Framework (REF) and will be completed in spring 2015.

Interestingly, two members of the European ACUMEN project http://research-acumen.eu/ are members of the 12-person steering group – Mike Thelwall (professor of cybermetrics at Wolverhampton University http://cybermetrics.wlv.ac.uk/index.html) and myself – and it is led by James Wilsdon, professor of Science and Democracy at the Science Policy Research Unit (SPRU) at the University of Sussex. The London School of Economics scholar Jane Tinkler, co-author of the book The Impact of the Social Sciences, is also a member and has put together some reading material on their blog http://blogs.lse.ac.uk/impactofsocialsciences/2014/04/03/reading-list-for-hefcemetrics/. So there will be ample input from the social sciences to analyze both the promises and the pitfalls of using metrics in the British research assessment procedures. The British clearly see this as an important issue. The creation of the steering group was announced by the British minister for universities and science, David Willetts, at the Universities UK conference on April 3 https://www.gov.uk/government/speeches/contribution-of-uk-universities-to-national-and-local-economic-growth. In addition to science & technology studies experts, the steering group consists of scientists from the most important stakeholders in the British science system.

At CWTS, we responded enthusiastically to the invitation by HEFCE to contribute to this work, because this approach resonates so well with the CWTS research programme http://www.cwts.nl/pdf/cwts_research_programme_2012-2015.pdf. The review will focus on: identifying useful metrics for research assessment; how metrics should be used in research assessment; ‘gaming’ and strategic use of metrics; and the international perspective.

All the important questions about metrics have been put on the table by the steering group, among others:

– What empirical evidence (qualitative or quantitative) is needed for the evaluation of research, research outputs and career decisions?

– What metric indicators are useful for the assessment of research outputs, research impacts and research environments?

– What are the implications of the disciplinary differences in practices and norms of research culture for the use of metrics?

– What evidence supports the use of metrics as good indicators of research quality?

– Is there evidence for the move to more open access to the research literature to enable new metrics to be used or enhance the usefulness of existing metrics?

– What evidence exists around the strategic behaviour of researchers, research managers and publishers responding to specific metrics?

– Has strategic behaviour invalidated the use of metrics and/or led to unacceptable effects?

– What are the risks that some groups within the academic community might be disproportionately disadvantaged by the use of metrics for research assessment and management?

– What can be done to minimise ‘gaming’ and ensure the use of metrics is as objective and fit-for-purpose as possible?

The steering group also calls for evidence on these issues from other countries. If you wish to contribute evidence to the HEFCE review, please make it clear in your response whether you are responding as an individual or on behalf of a group or organisation. Responses should be sent to metrics@hefce.ac.uk by noon on Monday 30 June 2014. The steering group will consider all responses received by this deadline.

Tales from the field: On the (not so) secret life of performance indicators

* Guest blog post by Alex Rushforth *

Sarah De Rijcke and I have been accepted to present in the coming months at conferences in Valencia and Rotterdam on research from CWTS’s nascent EPIC working group. We very much look forward to drawing on collaborative work from our ongoing ‘Impact of indicators’ project on biomedical research in University Medical Centers (UMCs) in the Netherlands. One of our motivations behind the project is that there has been a wealth of social science literature in recent times about the effects of formal evaluation in public sector organisations, including universities. Yet too few studies have taken seriously the presence of indicators in the context of one of the university’s core missions: knowledge creation. Fewer still have applied an ethnographic lens to the dynamics of indicators in the day-to-day work context of academic knowledge. These are deficits we hope to begin addressing through these conferences and beyond.

The puzzle we will be addressing here appears – at least at first glance – straightforward enough: what is the role of bibliometric performance indicators in the biomedical knowledge production process? Yet when comparing provisional findings from two contrasting case studies of research groups from the same UMC – one a molecular biology group and the other a statistics group – it quickly becomes apparent that there can be no general answer to this question. As such we aim to provide not only an inventory of different ‘roles’ of indicators in these two cases, but also to pose the more interesting analytical question of what conditions and mechanisms explain the observed variations in the roles indicators come to perform.

Owing to their persistent recurrence in the data so far, the indicators we will analyze are journal impact factor, H-index, and ‘advanced’ citation-based bibliometric indicators. It should be stressed that our focus on these particular indicators has emerged inductively from observing first-hand the metrics that research groups attended to in their knowledge-making activities. So what have we found so far?

Dutch UMCs constitute particularly apt sites through which to explore this problem, given how bibliometric assessments have been central to the formal evaluations carried out since their inception in the early 2000s. On one level it is argued that researchers in both cases encounter such metrics as ‘governance/managerial devices’, that is, as forms of information required of them by external agencies on whom they are reliant for resources and legitimacy. Such examples can be seen when funding applications, annual performance appraisals, or job descriptions demand such information about an individual’s or group’s past performance. As the findings will show, the information needed by the two groups to produce their work effectively and the types of demands made on them by ‘external’ agencies vary considerably, despite their common location in the same UMC. This is one important reason why the role of indicators differs between cases.

However, this coercive ‘power over’ account is but one dimension of a satisfying answer to our question about the role of indicators. Emerging analysis also reveals, surprisingly, that in fields characterized by particularly integrated forms of coordination and standardization (Whitley, 2000) – like our molecular biologists – indicators in fact have the propensity to function as a core feature of the knowledge-making process. For instance, a performance indicator like the journal impact factor was routinely mobilized informally in researchers’ decision-making as an ad hoc standard against which to evaluate the likely uses of information and resources, and in deciding whether time and resources should be spent pursuing them. By contrast, in the less centralized and integrated field of statistical research such an indicator was not so indispensable to the routines of knowledge-making activities. In the case of the statisticians it is possible to speculate that indicators are more likely to emerge intermittently as conditions to be met for gaining social and cultural acceptance by external agencies, but are less likely to inform day-to-day decisions. Through our ongoing analysis we aim to unpack further how disciplinary practices interact with the organisation of Dutch UMCs to produce quite varying engagements with indicators.

The extent to which indicators play central or peripheral roles in research production processes across academic contexts is an important sociological problem to be posed in order to enhance understanding of the complex role of performance indicators in academic life. We feel much of the existing literature on evaluation of public organisations has tended to paint an exaggerated picture of formal evaluation and research metrics as synonymous with empty ritual and legitimacy (e.g. Dahler-Larsen, 2012). Emerging results here show that – at least in the realm of knowledge production – the picture is more subtle. This theoretical insight will prompt us to suggest that further empirical studies are needed of scholarly fields with different patterns of work organisation in order to compare our results and develop middle-range theorizing on the mechanisms through which metrics infiltrate knowledge production processes to fundamental or peripheral degrees. In future this could mean venturing into fields far outside of biomedicine, such as history, literature, or sociology. For now though we look forward to expanding the biomedical project, by conducting analogous case studies from a second UMC.

Indeed it is through such theoretical developments that we can consider not only the appropriateness of one-size-fits-all models of performance evaluation, but also unpack and problematize discourses about what constitutes ‘misuse’ of metrics. And indeed how convinced should we be that academic life is now saturated and dominated by deleterious metric indicators? 

References

DAHLER-LARSEN, P. 2012. The evaluation society, Stanford, California, Stanford Business Books, an imprint of Stanford University Press.

WHITLEY, R. 2000. The intellectual and social organization of the sciences, Oxford, England; New York, Oxford University Press.
