
PREreview of A Measure of Open Data: A Metric and Analysis of Reusable Data Practices in Biomedical Data Resources

DOI: 10.5281/zenodo.7624601
License: CC BY 4.0

Live-streamed Journal Club on "A Measure of Open Data: A Metric and Analysis of Reusable Data Practices in Biomedical Data Resources"

As part of our first Live PREreview Journal Club (#LivePREJC), we discussed the bioRxiv preprint "A Measure of Open Data: A Metric and Analysis of Reusable Data Practices in Biomedical Data Resources" by Seth Carbon, Robin Champieux, Julie McMurry, Lilly Winfree, Letisha R Wyatt, and Melissa Haendel, doi: https://doi.org/10.1101/282830. We were joined by two of the preprint authors: Robin Champieux and Lilly Winfree. Below is a summary of our discussion.

Summary:

The manuscript provides a means to clarify what is quite a complex landscape of data licensing, with the goal of making data reuse terms clearer. The authors developed a clear rubric for determining how visible a resource's licensing information is, how open the licensing terms are, and what this means for users in terms of data reuse restrictions, e.g. what kind of data can be reused and under what conditions. The rubric provides a quantitative metric of open data that can be used more generally in open science to answer the question "How much of my science/data is open?"

The live preprint journal club (Live PREJC), conducted over video call, attracted a broad group of participants from across North America. Given the diverse makeup of the journal club participants, we were interested in why this study was relevant to their discipline/research. Below is a summary of their responses:
  • I'm a theater-making rapper who uses research to put together presentations. I'm not the only one. I don't have the capacity for review, and my counterparts just want to make a hot show, so a lot of mess ends up in the presentations. Also, the lack of diversity doesn't help much either in terms of perspective in the narratives I make.
  • Having been in a lab where we have used thousands of datasets from other labs to build a tool that can be used to analyse genomics data, I understand how frustrating it might be to not know what the reuse policies are for these datasets, and how much time it might waste having to request permissions. This manuscript helps to address where the situation lies right now, with the hope that this might help develop more clarity in future policies.
  • There are many definitions of open data, but they are used interchangeably in science. It's great to have a table defining the different types of data that get classified as open but fall at very different parts of that spectrum.
  • I'm interested in monitoring policies across fields or different organizations. The development of a rubric enables comparisons, a birds-eye view, and advocacy for change. Furthermore, our current MozSprint project, TRANSPOSE, is a journal database built on a similar YAML architecture :) ASAPbio is also working on preprint licensing, and while the options typically available to people in that space are more restricted (to CC licenses), preprint servers offer different levels of clarity and machine access to these licenses. This framework could be extended or modified to categorize information about that space.
  • I'm running Open Data repositories and I want to enable the best reuse of the data we collect.
  • As an OA publisher we have an open data policy; a rubric like this helps us assess whether databases align with our policies, as well as understand the hurdles for researchers that we need to account for when putting a data policy in place or deciding how it should evolve over time.
  • I'm a graduate student, and before we start performing certain experiments, mining publicly available databases yields very valuable information which can contribute greatly to our projects. However, how to reuse this data is not well known by everyone, and publications like this will be helpful in educating other researchers.

What did the participants like about the manuscript?

There was general excitement about this work and an appreciation for the impact it will have on the scientific community. Below are some specific comments from the journal club participants:
  • Table 1 was very helpful for understanding the overall types of licenses.

What could the authors improve in their manuscript?

Collectively, we felt that the following could help improve the clarity of the manuscript: 
  • As this is a less conventional type of manuscript (not a standard wet-lab research manuscript), it would be useful to have a figure that explains the process, for example a flow diagram of the steps you went through.
  • I appreciated how you simplified a complex aspect of the licensing landscape into fairly concise categories, but it may help to make the details of the criteria/rubric clearer. For example, it would be useful to have a shorthand word for each criterion that isn't just a letter, so that the visuals on the website are easier to read at a glance. This was very educational, though, so thank you!
  • Are all of the "stars" of equal weight and importance? We were curious whether this is intended to be used as a quantitative combined score or more as a suite of characteristics (see the sketch after this list). If the former, are some of the categories considered more essential than others?
  • I might have missed this, but is the data behind the analysis available somewhere? Authors' response: yes! In the GitHub repo :)
  • On violations: separate out the definitions of violations, i.e. a violation that prevents further analysis/inclusion versus a category violation.
  • In Figure 3, what does it mean that categories B, D, and E have "violations"? I was under the impression from the text under Figure 2 that these were not scored if there are violations in A and C.
  • I noticed that Figure 1 focuses on percentages whereas in the text you mainly focus on the raw numbers with percentages in parentheses. It might help the flow to always stick to percentages with the raw numbers in parentheses to give context/real numbers.
  • The large amount of whitespace around the figures was a bit distracting. I also agree with the comment about the numbers overlaid on the donut/pie chart; they were not easy on the eye. In Figure 2, it would be helpful to restate the scoring criteria in the figure caption, so that a reader who has forgotten the criteria, or who happens to see the figure on its own, can understand the scoring from the caption alone.
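
To make the question about star weighting concrete, below is a minimal sketch of how a combined rubric score could be computed. It is purely illustrative: the criterion letters, the equal weighting, and the rule that violations in A or C leave the remaining criteria unscored are assumptions drawn from our reading of Figures 2 and 3, not the authors' actual implementation.

    # Hypothetical sketch of a star-based rubric score (Python); not the
    # authors' implementation. The criterion letters, equal weights, and
    # the gating rule are illustrative assumptions.
    GATING = {"A", "C"}  # assumed: a violation here stops further scoring

    def combined_score(stars, violations):
        """Sum earned stars, treating every criterion as equal weight.

        stars: dict mapping a criterion letter to True if the star was earned
        violations: set of criterion letters where a violation was found
        """
        if violations & GATING:
            return None  # gated: the remaining criteria are left unscored
        return sum(1 for earned in stars.values() if earned)

    # Example: a resource earning stars on A, B, and D, with no violations
    stars = {"A": True, "B": True, "C": False, "D": True, "E": False}
    print(combined_score(stars, set()))  # -> 3

If some criteria matter more than others, the equal-weight sum would simply become a weighted one; making that choice explicit in the manuscript would answer the question above.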

Minor comments/typos:

  • I noticed that in the second paragraph of the discussion, the word "that" is repeated in the first sentence.
  • Unlike the other figure captions, the caption for Figure 3 was enclosed within a black frame. It stood out/was different to the eye. I don't know if this was deliberate.

Authors' comments:

One of the novelties of Live PREJCs is that the authors can be on the call, too. We therefore set aside 10 minutes at the end of the call for the authors to comment on any of the points brought up during the journal club. Below are some discussion points raised by Robin Champieux and Lilly Winfree (edited for clarity).
  • The authors wanted to add that while they intended the rubric as a combined quantitative score, they hope the criteria themselves draw out the particular issues (and impacts) associated with each part.
  • What stood out for the authors was aggregated data resources/databases. Early readers found that focus confusing, and the authors hoped they had cleared it up in the manuscript [i.e., the rationale for choosing to focus on these specific databases]. But they have not laid out a rationale for how the rubric and their thinking should apply to datasets not living in an aggregated database.
  • Figures: the authors welcomed general feedback and suggestions for moving away from pie/donut charts.
  • They appreciated our comments about the text being hard to read when overlaid on the pie chart.
  • They were interested in ways to provide more information on what the permissive/restrictive/copyright/copyleft pools mean, e.g. different shades of the same colour for similar licenses.
  • They are considering making a network diagram...a wall showing the barrier to data linkage, maybe just one visual example of this.
  • Are there clusters of categories of licenses, etc.?
  • Many of the data resources were integrated into the Monarch project. They want to explore how these resources don't actually work together as one might intend, because of the tensions/conflicts between licenses.
  • For figures, think about how they will render in print. Having words within the figures is great, but they may be difficult to read depending on the colour selection.

Thank you to everyone who participated in the Live PREJC, and in particular to Robin Champieux and Lilly Winfree for being brave first participants as authors of the preprint. It was really valuable to have two of the authors on the call. If you are interested in hosting a Live PREJC, or if you are a preprint author and would like to arrange a Live PREJC for your preprint, let us know at contact@prereview.org or fill out our form here.