The OA debate between an “archivangelist” and an OA researcher
Gunther Eysenbach's responses to Stevan Harnad's rebuttal to my response to his initial e-letter in PLoS, commenting on an editorial on my recent PLoS paper “Citation Advantage of Open Access”, which, among other things, shows a "gold-over-green" advantage
(note: Gold-OA = publishing in an OA journal, Green-OA = self-archiving on the Internet)
I have divided my response into three sections, addressing what I think are the main discussion points here
1) Solid versus non-solid evidence: Were the PLoS editors and PLoS reviewers right in calling this paper more "solid" than previous papers? Harnad disputes this. As not everybody can be expected to completely appreciate the study methodology and the statistical methods used, here is a detailed and easy-to-understand explanation of the methodology, highlighting some of its advantages over previous approaches, and an introduction for beginners to concepts like self-selection, confounding, and multivariate analysis.
2) What is Open Access? A Continuum! In response to my remark that open access publishing is, like publishing itself, a continuum, Harnad reacted with confusion and disagreed - according to him, OA is like pregnancy. Here are some explanations of why I refer to OA publishing as a continuum, with the current definitions of OA being quite arbitrary, and different implementation modes having different impacts on metrics like citations.
3) The gold-versus-green conspiracy theory: Were the PLoS paper and the way it was framed in the editorial "hyped", as Harnad claims, and an attempt by gold-OA editors, in their eternal struggle not to lose authors to self-archiving, to devalue "green-OA", as Harnad implies? Of course not - this is what editors do, promoting their journal and the work that is published in their journal - and this is exactly the reason why gold will always have an advantage over green (Part 3).
In a remarkably polemic and sometimes misleading rebuttal to my response to his initial e-letter, “archivangelist” Stevan Harnad is still decrying some of the statements made by PLoS editors in their editorial on my recent PLoS paper “Citation Advantage of Open Access”.
The PLoS editors apparently hurt Harnad's ego deeply last week when they suggested that previous evidence on the OA advantage was not “solid”. The secondary finding of the PLoS paper, that gold-OA had an advantage over Harnad’s cause, green-OA, in terms of a larger citation advantage, was unsurprising and intuitively logical for the rest of the world, while Harnad tried to discredit it as a “controversial” finding and as part of a “gold-against-green” conspiracy (see Part 3 for a detailed response).
Harnad starts with a misquote, asking “why [PLoS editors] said such studies were "surprisingly hard to find"” (suggesting the PLoS editors were foolish enough not to be aware of prior evidence) and wondering why I started my rebuttal with hints on why previous evidence is weak. He misses the point here that the editors wrote that “SOLID evidence” is surprisingly hard to find. The emphasis is on SOLID, which is exactly why I started my e-letter response with the two “hints” as to why OA critics (and anybody else who has some basic understanding of research methodology) might regard the previous studies as unsolid. So the debate here is not whether or not previous papers have been out there, but to what degree they are valid or “solid”.
The notion of unsolid studies was exactly what motivated me to embark on this research project in the first place. On a rainy April afternoon one year ago, I sat in my office and had nothing better to do than read Aronson's BMJ editorial on OA, in which among other things he stated that there was “little evidence” that OA will boost citations. I was pretty upset by this statement, because I knew that he (Aronson) was wrong, and I decided to produce robust evidence which had a maximum of credibility and defensibility.
Being trained in epidemiology, I had a pretty good idea of how to tackle this scientifically in the most rigorous way possible. I designed a cohort study which addresses some (not all!) of the criticisms of previous studies, and these criticisms are 1) confounders (tackled through multivariate analysis) and 2) the arrow of causation problem (“what comes first – open access or citations” - addressed through a cohort study design).
A cohort study is in fact the study design which epidemiologists use in order to identify a possible causal relationship between an exposure (e.g. “asbestos”) and an outcome (such as “cancer”). In a cohort study the researcher starts with a population, of which one group is exposed and the other one is unexposed, and follows up the groups over several years ("longitudinally"), measuring the incidence of an outcome in both groups. My exposure was “open access status”, and my outcome was “citation”. Thus, Harnad's dismissive statement “all of this is as far away from rigorous oncological research as it is from rocket science” shows his limited understanding and lack of appreciation for the methodology used. In fact, a cohort study with a multivariate analysis is exactly the methodology a serious cancer epidemiologist uses to show the association between exposure and outcome in situations where an experimental exposure (randomized trial) is impossible or unethical (e.g. "does smoking cause cancer"). Harnad’s utterance that “there's precious little science involved here” (in reaction to my request to respond with good science rather than polemics) unfortunately just further demonstrates his inability to engage in serious scientific dialog.
Harnad’s attempt to respond “scientifically” is a mess: He not only mixes up concepts like self-selection and confounding (which are related, but not the same - see below), he even mingles together the two distinct study designs described in the PLoS paper: (1) a (longitudinal) cohort study of PNAS papers and (2) a (cross-sectional) survey among authors. With statements like “Eysenbach's author self-report data certainly don't constitute such a longitudinal cohort study” he either tries to deliberately confuse the reader, or shows that he really hasn’t read or understood the study. He even asks – now seemingly completely confused and again intermingling the survey with the cohort study - “Is Eysenbach suggesting that his failure to find any significant difference among author self-reports […] is an objective test of the arrow of causation?”. [The answer of course is “no” – the fact that we look at an immediate (gold-)OA article population in a longitudinal cohort study design takes care of the “arrow of causation” problem, because it makes sure that open access status comes first and the citations come afterwards, not the other way round. The author survey has nothing to do with this.]
As to his continued confusion over concepts like self-selection and confounding (which he treats like synonyms), here are some “fundamentals in epidemiology for beginners”:
Self-selection is a threat to the validity of any observational, non-randomized study, because it leads to unequal distribution of variables (characteristics of articles) between the groups. Some of these imbalanced variables may be so-called confounders, i.e. variables which are strongly associated with the outcome of interest (in this case citations). In this case, number of authors on a paper may be a strong confounder (see below for an explanation for why this is a confounder). So self-selection causes confounding. The best strategy to get rid of confounding would be to get rid of self-selection, but if this is not possible, the second best strategy is to statistically eliminate the effect of confounders using a technique called multivariate analysis.
A cohort study by itself obviously does not take care of self-selection, and nobody ever claimed it would. To get rid of self-selection one would have to do a randomized trial, deciding through a coin toss for the authors whether or not a paper should become OA. However, in a cohort study (and again, this has nothing to do with the author survey! – just forget the survey for a while) with multivariate analysis, researchers can statistically eliminate the effect of known confounders, which minimizes the validity threat created by self-selection. In other words, if I know or suspect beforehand that for example the number of authors could be a threat created by self-selection, I can include this in my model and statistically eliminate its effect. So yes, self-selection is still present, but we statistically correct for the known or suspected effects of self-selection.
Thus, it is this cohort study design with multivariate adjustment for confounders, and not the “author survey” (which is an additional “bonus” of the paper, but has nothing to do with the cohort study) or a “gold-against-green conspiracy”, which is the reason why editors, reviewers and basically the rest of the world view the PLoS study as providing more SOLID evidence than Harnad’s studies.
Harnad meanwhile keeps discrediting himself by babbling about confounders not being present in his samples (admitting that he hasn’t even compared his two groups to detect potentially imbalanced variables between the groups), or confounders somehow magically going away because his samples of thousands of articles across different disciplines are so large and show a huge, consistent effect.
Let’s examine this last argument in more detail, using an example to make intuitively clear even to the not statistically inclined reader how unscientific and absurd (even counterintuitive) this line of argumentation is. Let’s say there is a strong correlation between “carrying matches” and “getting cancer”. Does this strong association mean that carrying matches causes cancer? No, because smoking is a confounder, linking “carrying matches” to “cancer”. Hardly anybody carries matches without being a smoker, thus there is a "huge" relation between carrying matches and cancer. Now increase your sample size: Look at a population of 1000, 100 thousand or 10 million people, and you will still see a strong, universal association between carrying matches and cancer. This artificial association due to confounding does not just go away because of the “huge” sample size or because one looks at different countries. So to answer Harnad's question “What confounding effects does Eysenbach expect from controlling for number of authors in a sample of over a million articles across a dozen disciplines and a dozen years all showing the very same, sizeable OA advantage? Does he seriously think that partialling out the variance in the number of authors would make a dent in that huge, consistent effect?” – the answer is “absolutely”. There is a “huge, consistent effect” between carrying matches and cancer, but the “true” unconfounded, independent effect of carrying matches on cancer can only be isolated if we adjust (control) for the confounder “smoking”. In this example, the “huge effect” of carrying matches would be reduced to zero if we adjust for “smoking”.
“Confounded” associations between two variables which falsely suggest causality can ONLY be ruled out if one controls for the confounder, no matter how “strong, consistent” the effect appears – period, end of story. Talk to your local statistician. Which is something Harnad obviously never did (“it is not at all clear that controls for those ‘multiple confounders’ are necessary in order to demonstrate the reality, magnitude and universality of the OA advantage”).
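The matches example can be played through as a small simulation. The following is a minimal sketch with invented numbers, not data from any study: the 30% smoking rate, the match-carrying rates, and the cancer risks are all hypothetical assumptions, and carrying matches is deliberately given no effect on cancer at all. It compares the crude risk ratio with the risk ratios inside each smoking stratum.

```python
import random

# All probabilities below are hypothetical values chosen for illustration.
P_SMOKER = 0.30
P_MATCHES = {True: 0.80, False: 0.05}   # chance of carrying matches, by smoking status
P_CANCER = {True: 0.20, False: 0.02}    # cancer risk depends on smoking ONLY


def simulate(n, seed=0):
    """Return counts[(smoker, matches)] = [people, cancer_cases]."""
    rng = random.Random(seed)
    counts = {(s, m): [0, 0] for s in (False, True) for m in (False, True)}
    for _ in range(n):
        smoker = rng.random() < P_SMOKER
        matches = rng.random() < P_MATCHES[smoker]
        cancer = rng.random() < P_CANCER[smoker]  # matches play no causal role
        cell = counts[(smoker, matches)]
        cell[0] += 1
        cell[1] += int(cancer)
    return counts


def risk(counts, keys):
    """Cancer risk pooled over the given (smoker, matches) cells."""
    people = sum(counts[k][0] for k in keys)
    cases = sum(counts[k][1] for k in keys)
    return cases / people


def risk_ratios(counts):
    """Crude risk ratio, then the ratio within non-smokers and within smokers."""
    crude = (risk(counts, [(False, True), (True, True)])
             / risk(counts, [(False, False), (True, False)]))
    in_non_smokers = risk(counts, [(False, True)]) / risk(counts, [(False, False)])
    in_smokers = risk(counts, [(True, True)]) / risk(counts, [(True, False)])
    return crude, in_non_smokers, in_smokers


# The crude association does not shrink as the sample grows.
for size in (10_000, 100_000, 1_000_000):
    crude, _, _ = risk_ratios(simulate(size))
    print(f"n={size:>9,}  crude risk ratio = {crude:.2f}")

crude, in_non_smokers, in_smokers = risk_ratios(simulate(1_000_000))
print(f"within non-smokers: {in_non_smokers:.2f}, within smokers: {in_smokers:.2f}")
```

Under these assumptions the crude ratio stays far above 1 at every sample size, while the ratios inside each smoking stratum hover around 1. Stratification is used here only because it is the simplest form of adjustment for a single confounder.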
The best way to control for multiple potential confounders at the same time is multivariate analysis. Multivariate analysis is a tool for determining the relative contributions of different causes to a single event.
Our “events” (citations) are determined by many different “causes”, of which “access” is only one variable – many other variables, including confounders, have to be taken into account.
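How including a confounder in a model changes the estimate can be sketched on synthetic data. The example below is an illustration only and is not the model from the PLoS paper: it uses ordinary least squares as a stand-in for multivariate analysis, and the data-generating values (a true OA effect of 1 citation, 2 citations per additional author, and a self-archiving probability that rises with the number of authors) are hypothetical assumptions.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def mean(values):
    return sum(values) / len(values)


# Hypothetical data-generating process (assumptions, not empirical values).
TRUE_OA_EFFECT = 1.0   # extra citations caused by OA
AUTHOR_EFFECT = 2.0    # extra citations per additional author

rng = random.Random(1)
rows, citations = [], []
for _ in range(5000):
    authors = rng.randint(1, 10)
    # Self-selection: papers with more authors are more likely to be OA.
    oa = 1 if rng.random() < authors / 12 else 0
    cites = 3.0 + TRUE_OA_EFFECT * oa + AUTHOR_EFFECT * authors + rng.gauss(0, 1)
    rows.append([1.0, float(oa), float(authors)])  # intercept, OA, authors
    citations.append(cites)

# Crude comparison: difference in mean citations, OA versus non-OA.
crude_difference = (mean([c for r, c in zip(rows, citations) if r[1] == 1.0])
                    - mean([c for r, c in zip(rows, citations) if r[1] == 0.0]))

# Adjusted comparison: OA coefficient with number of authors in the model.
intercept, adjusted_oa_effect, per_author = ols(rows, citations)

print(f"crude OA difference:  {crude_difference:.2f}")
print(f"adjusted OA effect:   {adjusted_oa_effect:.2f}")
```

In this synthetic setting the crude difference is several times larger than the effect built into the data, while the adjusted coefficient lands close to it. Real citation counts are skewed, so an actual analysis would likely need a more suitable model than plain least squares.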
Harnad says he can’t think of any confounders in his samples (“What exactly does he think is being confounded in within-journal comparisons of self-archived versus non-self-archived articles?”). This is as if he were saying that access status is the only variable determining citations that differs between the two groups. Let’s help him to think. Which variables are associated with “having higher citations”, and also with “self-archiving”, in other words, which variables are possible confounders? Number of authors, for example. It is easy to imagine that the higher the number of authors, the more likely an article will be self-archived, because it takes only 1 author to self-archive an article. An article with 10 authors is therefore almost 10 times more likely to be self-archived than an article with 1 author. I would therefore not be surprised if – on average – self-archived papers have more authors than non-self-archived papers (Harnad says he never even tested whether there is any imbalance between the groups in terms of “number of authors”). And because a higher number of authors is also an independent predictor for higher citations – be it only through self-citations, or through true quality differences – we have one thick and clear confounder: If high-author papers are overrepresented in self-archived papers, then this confounder alone will contribute to having a greater number of citations. And this confounder does not simply go away by taking “a sample of over a million articles across a dozen disciplines and a dozen years”.
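The "it takes only 1 author" argument can be made concrete with a short calculation. This sketch rests on two simplifying assumptions that are not in the text above: every author self-archives with the same probability p, and authors decide independently of each other. Under those assumptions an article with n authors is self-archived with probability 1 - (1 - p)^n.

```python
def prob_self_archived(n_authors, p_per_author):
    """Probability that at least one of n independent authors self-archives.

    Assumes (hypothetically) equal, independent per-author probabilities.
    """
    return 1.0 - (1.0 - p_per_author) ** n_authors


# Compare a 10-author article with a single-author article for several
# hypothetical per-author probabilities.
for p in (0.01, 0.10, 0.50):
    ratio = prob_self_archived(10, p) / prob_self_archived(1, p)
    print(f"p = {p:.2f}: 10 authors are {ratio:.2f}x as likely as 1 author")
```

The ratio approaches 10 only when the per-author probability is small; as p grows, the advantage of many authors shrinks because the probability cannot exceed 1. Either way, the direction of the imbalance (more authors, more self-archiving) is the same.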
Only if one statistically controls for all these confounders (there are several of them - see PLoS paper), and one STILL sees an open access citation advantage, then (and only then) one has a SOLID, defendable study. Which was the entire point of doing the study in the first place, and the reason for the “hype” in the PLoS editorial. SOLIDness.
The thrust of Harnad’s reply is unfortunately also geared at discrediting me as an impartial researcher with obvious misinformation like “it is Eysenbach (and PLoS) who are focussed on gold-OA journals; the rest of the studies are focussed on OA itself”, while the reality is – for everybody clearly to see – that the PLoS paper is the first study which contains an analysis of both gold and green (and thus focuses on “OA itself”), whereas the rest of the studies are actually focused on “green”.
As explained below, I chose a hybrid gold-OA journal to study open access for scientific reasons, because this was an excellent model to study the impact of open access without the various problems of studying self-archived (“green”) open access papers, where it is often not even clear when they actually became “open”.
Harnad says he focused his research on green-OA “because most of the OA is green”. Which brings me back to my previous point: It is not about quantity or sample size, it is about quality. I chose to study PNAS only because PNAS was an excellent model to study the effect of OA. Only gold-OA allows me to be sure about when an article becomes OA, and more importantly to know that it is immediately open access after publication. I chose to study the PNAS model because it is ideal to study the question at hand in a SOLID manner. The point was never to “focus the research on gold OA” or to prove superiority of gold OA over green OA, or to “market gold-OA journals”. The data happen to show what they show: A relative citation advantage of gold-OA over green-OA, with the green-OA advantage being much smaller than what Harnad's studies show. This is not surprising, because Harnad's studies look at crude, unadjusted citation differences between two groups of articles which may differ in important characteristics.
To make the following clear once and for all: I am interested in research and scientific dialog, not in advocacy and polemic. Unfortunately, Harnad's latest posting shows that he is not open to, or capable of, any sort of dialog on this level.
In my response to Harnad's e-letter concerning an editorial on my recent PLoS paper “Citation Advantage of Open Access” I mentioned that open access is a continuum, much as publishing itself is a continuum (a quote from Richard Smith, former editor of the BMJ), and that green and gold OA are at different ends of the OA spectrum, as are different definitions and implementation modes. Harnad in his rebuttal reacts with confusion: “I have no idea what Eysenbach means about OA being a continuum: Time is certainly a continuum, and access certainly admits of degrees […] but Open Access does not admit of degrees (any more than pregnancy does). OA means immediate, permanent, full-text online access, free for all, now.”. Nice rhetoric, but obviously far away from the realities of OA.
Let’s clarify what I mean by continuum. If this (Harnad links to the Budapest definition) is open access publishing, is this (Bethesda definition) or this (Berlin definition) not open access? The definitions are similar, but not identical. So there is already a continuum between these definitions. These definitions seem quite arbitrary to me and as a scientist they are actually too vague to decide with absolute certainty what OA is and what isn’t – so we have not only a continuum of OA models between the definitions, but even within each definition.
To illustrate the continuum idea further, let's look at the timing issue in detail, as Harnad at least concedes that time is a continuum, and according to him, “OA means immediate [..] online access, free for all, now.” (the Bethesda definition also speaks of “immediate” deposition). If we agree that “open access” publishing implies providing “immediate access”, then we should be able to define exactly (down to the minute) what we mean by “immediate” – where is the cut-off point at which the open door shuts? Presumably, we would agree that we can call an article which is self-archived 1 second after original publication in a toll-access journal a “green-OA” article. On the other hand, we surely do not consider it “open access publishing” if a publisher agrees to self-archive articles after 6 months, because there is no such thing as “delayed OA”. But where exactly on the time scale do we draw the line between, for example, a green open access publication and a non-open access publication? In fact, in Harnad's “research” on the citation advantage of “green-OA”, the timing issue (when was the article actually self-archived in relation to the actual publication date) is completely unaddressed, and it is entirely possible that the articles in his sample (which he refers to as green-OA articles) were not “immediately” self-archived after publication, but 1 month, 6 months, or 12 months after original publication, and are therefore not really what Harnad refers to as green-OA, which implies “immediate” deposition.
Timing of self-archiving is not the only dimension of OA which is less than clear-cut. An OA journal that is not indexed in any bibliographic database is less accessible than another OA journal which is indexed. A gold-OA journal which offers a publication in three different downloadable formats is more open than a journal which asks readers to pay for special formats. An institutional repository which is OAI compliant is (slightly) more accessible than a personal homepage used for self-archiving. These are all different points of accessibility on the OA continuum.
So any statement along the lines of “THIS is open access” (without being very explicit about the exact implementation mode, but at the same time dismissing the idea that a range of possible implementation modes exists which together constitute a continuum of OA models) is pure advocacy rhetoric and does not reflect the reality, which is slightly more (actually – a lot more) subtle.
John Willinsky (Access Principle, p. 28) happens to agree that open access is not a binary issue, door open or door shut, pregnant or not pregnant:
One reason to focus on the variety of open access models is to dispel the idea that greater access to the knowledge represented by scholarly publishing is an all-or-nothing proposition. The term open access may suggest that, like a door, a journal is open or it is not. The still-emerging realities of opening access to this literature are otherwise.
So Harnad seems to be the only one who has the clear-cut pregnancy test for what constitutes open access publishing.
Harnad is so disturbed by my continuum remark that ironically he next fears that “Eysenbach is going to tell us that […]self-archiving [..] is not OA after all” – which is the complete opposite of what I said. I am certainly not going to tell anybody what “true” OA is, first, because this is a question I am not at all interested in, second, because (unlike Harnad) I already admitted - by viewing open access publishing as a continuum and not a clear-cut yes/no or trueOA/not-true issue - that I have no magic pregnancy test to tell anybody where OA starts and where it ends. Rather, different implementation modes of what is considered OA today under the BBB definitions lie on a continuum, with different levels of accessibility and openness.
Green OA and gold OA (but even different OA-journals within the “gold” family, or different kinds of self-archiving within the “green-OA” publishers family) clearly have different accessibility characteristics, and if different levels of accessibility/openness correlate with different levels of citations and benefits to society (Harnad's own argument!), then these different levels of accessibility/openness can be reflected by citations, and different degrees of a citation advantage could indicate different levels of openness, accessibility, findability, and benefits to society. Such differences may be too subtle for the rhetoric of an “immediate, green OA now!” advocate, but from a researcher's point of view these subtle differences exist, are measurable, and should be explored.
A far-fetched possibility? Let’s get a second opinion on this:
Dorothea Salo, That's the stuff, Caveat Lector, May 16, 2006: There’s one joker in the abstract [of the Eysenbach study] for repository-rats, though (added emphasis mine): “Articles published as an immediate OA article on the journal site have higher impact than self-archived or otherwise openly accessible OA articles.” I believe that, actually, especially for newly-published articles. It’s just plain easier to find an article via a publisher’s website than on the open Web....
Harnad's implicit accusation that the continuum remark or the findings in the PLoS paper suggesting a gold-over-green citation advantage are the result of a “gold-against-green” conspiracy by gold-OA editors who are shivering in their pants because they may lose authors to self-archiving, and who are therefore downplaying the advantages of self-archiving, is of course empty rhetoric – see next section.
In an attempt to seek (or make up) evidence that these study findings (or the way they were framed) were a result of a conspiracy “gold against green” to “devalue” self-archiving, Harnad (in his rebuttal to my response to his initial e-letter) does not shy away from putting words into my mouth which I never said or even implied: I talked about an “open access continuum” – in reaction Harnad writes “Eysenbach is going to tell us that making a published journal article accessible online free for all by self-archiving it is not OA after all, or not "full OA".” This of course is pure polemics – as outlined above, unlike Harnad I have no pregnancy test for what is OA. I meant what I said: Open access publishing is a continuum, with different models and implementation modes along this continuum, which will have different impact, measurable through citations and other metrics. This has nothing to do with green OA not being “full OA”, which nobody (and certainly not I!) ever said.
According to Harnad’s view, publishers of (gold) open access journals (including myself and PLoS) “want to give the impression that green OA was not ‘really’ OA or not ‘fully’ OA” because they are terrified of losing authors who could suddenly start self-archiving instead of submitting their work to gold-OA journals. How flawed this logic is should be easily recognizable (and Harnad realizes this himself, speaking of a paradox), because OA-journals are getting manuscripts – some more than they can handle – now, today, although self-archiving is – as Harnad always points out – already an option today. For Harnad, this “author paradox” is the result of “authors’ foolishness” or the result of a gold-OA-against-green-OA propaganda, for which – according to Harnad - the PLoS paper including its editorial is an example and/or provides welcome ammunition by showing a gold-over-green advantage.
For me this “author paradox” is not a paradox, and authors are not fools. Rather, most authors are cleverer than Harnad, realizing that journals (any journal, OA journals included) create added value on different levels (I am not talking about peer-review only, but also such things as community building and delivering the paper to the right target audience), which cannot be replaced by submitting to a toll-access journal with subsequent self-archiving. Most authors (including myself) know this intuitively, and I for example (even though I know more about institutional repositories and the benefits of self-archiving than the average author) made a conscious decision to submit my paper to a gold-OA journal (PLoS) rather than publishing the study in an obscure scientometrics journal and then self-archiving it, as Harnad actually recommends (isn't it ironic that Harnad's publication strategy mainly consists of self-archiving unreviewed work or self-archiving work published in toll-access journals, while at the same time he wonders why - according to him - his previous research is not properly acknowledged – did it ever cross his mind that this might also have something to do with his own publication strategy?).
The visibility of an article published on a properly promoted OA journal site will always be better than that of a paper published on a toll-access journal site, even if it is self-archived. This is exactly why my study shows an advantage of gold-OA over green-OA, this is also why I personally chose the gold route to publish this paper in PLoS, and not the green route, and this is why I issued a call for papers to submit OA research to JMIR, an open access journal (where it will be rigorously peer-reviewed – and yes, even promoted) rather than to an obscure toll-access scientometrics journal. (Yes, editors/publishers of toll-access journals also promote their content, and try to draw readers to their site, but here readers encounter access problems, and the publisher - dependent on selling subscriptions - will certainly make it as difficult as possible for the reader to retrieve the manuscript from an institutional repository.)
Harnad “accuses” PLoS and myself of “promoting” our respective journals, not realizing that this is what editors and publishers do, that this is exactly what contributes to the gold-over-green advantage, and that this is why authors keep submitting to gold-OA journals. Publishers of OA journals also know the value they are creating, and therefore they don’t have to embark on any anti-self-archiving propaganda, as Harnad absurdly suggests.
As to Harnad’s statement that the gold-over-green advantage will wash out as self-archiving repositories become more interoperable, I would also dispute this notion. If the advantage of gold-OA is delivered through community building (building networks of peer-reviewers, networks of users, and promoting the content to the right users) and promoting the journal site and its content (by press releases, participation in conferences to build relations with readers and authors etc.), then this advantage cannot be simply washed out by a vast interdisciplinary repository of articles where no such efforts are undertaken (surely, you could have people doing the same for subject areas in a repository, but then these people can be called editors, and you are reinventing OA journals). And any editor of an open access journal who can be replaced by a toll-access journal editor and/or an interoperable database of self-archived articles deserves to be.
So, in summary, journal publishers/editors are paid not only to solicit and edit manuscripts, but also to promote their journals and their content, to attract readers and authors, and this is exactly the reason for the gold-over-green advantage. Articles published on open access journal sites will always have a higher impact than articles deposited in a distributed database of self-archived articles if the editors of open access journals do their job properly.
Gunther Eysenbach
Citation suggestion:
Eysenbach G. OA Debate between an “Archivangelist” and an OA Researcher. Posted 24/05/2006, WebCited 24/05/2006. WebCite ID:5G8O63tlv