Ten questions to Mercè Crosas
Mercè Crosas is a data technologist and researcher, who just recently took on the role of Secretary of Open Government at the Generalitat de Catalunya (Institutional Government of Catalonia).
Up until then, she held two roles at Harvard University, as the Chief Data Science and Technology Officer at Harvard's Institute for Quantitative Social Science and as University Research Data Management Officer at Harvard University Information Technology (HUIT).
Mercè led several software platforms and tools for research data sharing and analysis, applied to all research fields.
CESSDA asked Mercè Crosas to answer a few questions.
Read the full interview below or tune in on YouTube!
How are you coping with the COVID-19 crisis and how do you make sure that you get work done?
For me part of the problem has been working too hard because, in the current situation, you can be reached by many more people than before and could be available for a meeting anytime. At least that seems to be what people tend to assume. With Zoom, there are no distances. If you don't control it, you end up working a lot and there are no weekends but a continuous flow of meetings.
I think that happened to many of us at the start of the pandemic. Fortunately, at Harvard we had a lot of support and guidance. Efforts have been made to raise awareness about stress and the importance of a good work-life balance.
Now I manage it very well by combining going for walks every day with meetings and sitting down at home to just focus on work. When you have a team or you are running projects, and as work becomes more remote, it is important to figure out what the right hybrid way of working is. I think that working half remote and half in person could be ideal for many people. In that case it is very important to have a very clear understanding of common goals for the projects, what everybody is doing, what we want to achieve and milestones that you can review. Quick Zoom check-ins once in a while are fine, but there is some work that is done better without meetings and on your own. Here I mean strategic work and review work, as well as reading and tasks where you need to think and analyse data or situations.
Thanks to that balance we have been able to continue to do a lot of work throughout the last year and a half.
Before all this, what would a typical working day have looked like? How does it look now?
It is easy to forget how it used to look. I used to go a lot to campus and to be in my office. It was a very friendly environment and most of the team was on the same floor, so it was easy to chat. I guess those are the things that you miss, right? Just being able to check in with someone without having to plan a meeting and having brainstorming sessions about things that you do spontaneously. There were a lot of meetings, sometimes walking from one part of campus to another to meet with different teams or groups and a lot more travelling to conferences.
I am pretty fortunate right now living in a little house in the countryside and being able to go for walks. However, I do miss some of the travelling as it is really nice to meet people that you might work with and just have interesting conversations.
However, the pandemic has helped me realise that I was away too much, and that I value spending time in nature. At Harvard, we have been thinking about how we will restart after the pandemic. We are not saying that we will go back to where we were before. I think that what everybody should think about is not how to go back but what have we learned and what could we do better. Is there a better way to provide a work-life balance? A way that could make work more enjoyable.
Can you highlight three main ways that your work has supported researchers over the last few months?
I can think of several, some of which are ongoing. In my role as research data management officer, I work with all schools and across all the departments and units at Harvard University, and together with colleagues we have created a service catalogue of all the services that Harvard provides. These are mostly data and computing services, but some are also administrative infrastructure services that are useful for improving how you do your research.
Support for researchers, whether for open science, data management or the dissemination of research findings, code and so on, can be very distributed within a university or research institution. This means that often you do not know exactly what is happening in other units. For this reason, doing an inventory of everything that exists and making it available via a common website, so that it can be shared in a unified way, was very helpful for researchers.
This helps us to improve and better coordinate the services, bring together some of them and identify any gaps.
As an example, thanks to electronic lab notebooks, researchers at Harvard can collect and manage in a better way all the data and information that they have connected to a research project. It can thus be integrated with other parts of the repositories and in the Harvard system.
Another way that we have been helping researchers is in the Dataverse project, where we are constantly creating new releases that help researchers to continue sharing data. We hope to provide more user-friendly ways of sharing data and more features that make data management and supporting the FAIR data principles easier for researchers.
We are working on a proposal – this is not yet in place – to create a Harvard Data Commons that helps researchers by integrating the existing repositories, such as the Dataverse repository, the open access repository for publications and the preservation repositories from the Harvard libraries, with research computing so that we can better support workflows.
Lastly, we are also working on data privacy and sensitive data. We are now starting an open-source project called OpenDP, for differential privacy. There is transparency about how we do what we do, but it adds privacy protection. Differential privacy is a mathematical approach to preserving privacy by using algorithms. We are building a library of algorithms and methods that add some amount of noise to a statistical data set so that you can release it publicly. By adding sufficient noise, no individual within the data can be re-identified.
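To make the idea of adding noise concrete, here is a minimal sketch of the Laplace mechanism, one standard way of releasing a count with differential privacy. It is an illustration only and does not use or represent the OpenDP library: the function names `laplace_noise` and `private_count`, the sample ages and the choice of `epsilon` are all assumptions made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with the
    same mean (here: scale) follows a Laplace distribution.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data: ages of survey respondents.
    ages = [23, 35, 61, 47, 70, 29, 66]
    rng = random.Random()  # unseeded, so each run gives a different answer
    noisy = private_count(ages, lambda age: age > 60, epsilon=0.5, rng=rng)
    print(f"Noisy count of respondents over 60: {noisy:.2f}")
```

The released number is close to the true count on average, but any single respondent could be added or removed without changing the distribution of the output by much. A production library has to handle further details that this sketch ignores, such as floating-point effects and tracking the total privacy budget across several releases.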
Your career began in research in astrophysics and then the design and implementation of software for astronomical observations. How did you get to where you are today?
I am often asked a similar question and I usually point out that there are actually many data scientists that are physicists and astrophysicists. In astronomy and physics, a lot of the work that we do is first managing and analysing very large amounts of data and building a lot of the systems that are needed to do so. We do not usually have companies or products that do the analyses of the data that we work on. We must build the tools that are used in research within astronomy and physics.
As a researcher in those fields, you learn a lot by doing this. I remember that there were a lot of opportunities to move over to the private sector and to start-ups working on software development. I did actually do that for a few years and learned more about software development and managed several projects, some on building software for education and others related to biotechnology. In every case it involved building systems that would help with data management and analysis, but very focused on a specific field. After a while, I missed the Harvard academic environment, and I went back and took on the role of setting up a data sharing platform which ended up being Dataverse.
Building a repository for making data more accessible for research was a much more general approach than what I had done before.
Looking back, it seems like it could have been planned, but in life there are a lot of things that have just happened. Life gives you opportunities and there are personal situations that make you think “okay, now it's a good moment to make a change”.
As a European based at Harvard, do you support FC Barcelona or …? And any favourite player?
It is a very easy answer. I am from Barcelona so Barça always, and I guess Messi because we all love him.
For many years, I played soccer in Boston with a women's team. It was a very friendly team. We used to coach our children and at some point we thought “why don't we learn more about it and play ourselves?”. We made a team and played in some tournaments in New England. I was playing just before the COVID-19 crisis, but I haven't played since then unfortunately. I do plan to take it up again. It was wonderful to play in the women's team.
Dataverse is used more and more. What are its major features and major current developments (incl. DDI-CDI metadata standard)?
The community around the Dataverse project has grown both in terms of developers and users. Many of the new features come not only from the Institute for Quantitative Social Science at Harvard, where it was first developed and continues to be developed, but also from groups all over the world. Many of these groups are in Europe.
Some examples of the main new features are support for multiple licenses and a workflow for depositing data and publishing it with a DOI, a persistent identifier. Data citation gives credit to the developers etc. Dataverse collects data sets together with their metadata and files, and recently we have been adding features that support different ways of depositing very large numbers of files.
It supports any type of file, so we think of the platform as a generalised data repository. At the beginning, it aimed to support data in the social sciences and therefore benefits from very strong DDI support. We try to export to all the standards that are useful to the community (e.g. Dublin Core, DDI, etc.). One of the things that I think distinguishes it from other repositories is the extensive variable metadata for tabular data files, which DDI supports very well.
All the detailed elements such as variable labels, names and types help when it comes to exploring the data, analysing it or standardising it. Hopefully, in the future it will be possible to merge one data file with another. That is where DDI-CDI comes in, and Dataverse is very well placed to support that.
We have been identifying a few use cases that could be useful, together with Simon Hodson from CODATA, Steve McEachern from the Australian Data Archive, Arofan Gregory from the Open Data Foundation and Joachim Wackerow from GESIS.
Who are your main users and are there differences between US and European users?
The Harvard repository is open to any researcher from any discipline. So, we have individual researchers of course as users as well as journals that use the repository to deposit all the data from their publications. The repository is open to the whole research community, not limited to Harvard University.
For the Dataverse software platform, which is an open-source platform, the users are from research institutions, organisations, or universities all over the world. Either they want to set up their own institutional repository or one to serve several universities. In Norway for example, several universities are working towards sharing the same instance of the Dataverse software, where each one has its own collection. In Texas there is a consortium of universities that all use the same installation, and the same is true in Canada. The Australian Data Archive would be another example.
I do not think there is necessarily a huge difference between European and American users. I get the impression that Europe is further ahead and more proactive in supporting open science (e.g. EOSC). CESSDA plays a vital role here, together with many other European organisations. Europe is ready to provide solutions for sharing data, whereas in the US, some universities are making good progress, but the trend is not generalised.
What are the main barriers for the reuse of research data and code and the key to improving research data management?
There are several and we keep finding new ones! From the perspective of the researcher, you need to prepare your data in a way that somebody else besides you and your group can use it. You need to provide the proper descriptions, and the metadata needs to be organised and cleaned up, using formats that are easy to reuse.
I think that the main issue is when the description of the data is missing – for example, a table of values on its own means nothing. When you start describing every column, variable or attribute that was collected and how it compares to others, then you can start to better understand the data set. The more standardised the data set is, the easier it is to use. I think that is one of the problems, and it is one of the areas where data curators and research data management professionals can help researchers. Researchers should be working on data management from the beginning, instead of doing it at the end of their project when it is time to publish the data. At that point, you do not necessarily have all the information.
Even if you don't do it for somebody else to reuse it, we often say that if you go back to it after five years, you should be able to understand how that data set was collected and so on.
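As a small illustration of describing every column, the sketch below checks a tabular file against a simple codebook and lists the columns that still lack a description. It is a hypothetical example, not part of Dataverse or of the DDI standard: the function name `undocumented_columns`, the codebook layout and the sample data are assumptions made for this example.

```python
import csv
import io


def undocumented_columns(csv_text: str, codebook: dict[str, dict[str, str]]) -> list[str]:
    """Return the column names in the CSV header that have no label in the codebook."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if not codebook.get(name, {}).get("label")]


# Hypothetical survey extract: the header alone says little about "inc".
SAMPLE_CSV = "id,age,inc\n1,34,52000\n2,58,61000\n"

# Hypothetical codebook: one entry per variable with a label and a type.
SAMPLE_CODEBOOK = {
    "id": {"label": "Respondent identifier", "type": "integer"},
    "age": {"label": "Age in years at time of interview", "type": "integer"},
}

if __name__ == "__main__":
    missing = undocumented_columns(SAMPLE_CSV, SAMPLE_CODEBOOK)
    print("Columns still missing a description:", missing)
```

Running a check like this while the data is being collected, rather than at publication time, is one way to make sure the information needed to describe each variable is still at hand.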
Another important area is data protection. In the social sciences but also in medicine and public health, there is sensitive data or data from industry which is difficult to share for research purposes. You need proper data use agreements to be able to use data more easily. You also need the infrastructure, the technology and all the tools to be able to analyse personal data. Solutions are being developed in this area.
Those are the main two barriers, I think.
The key to improving research data management is education and cultural change. We need to raise awareness around the fact that research data management starts at the beginning of your research project. Another key aspect is providing the right infrastructure and tools to researchers. There is not one key, but rather bringing all researchers and data scientists and professionals closer together from the beginning to ensure better quality data.
Harvard University has a reference guide on research data management. Are you familiar with the CESSDA Data Management Expert Guide (DMEG)? Is there room for collaboration?
There is always room for collaboration! I am not very familiar with it, but I would love to learn more about it. Especially training materials could be shared. It would be interesting for us to see how we could work together.
We have a data management working group with data researchers and managers from across all the schools at Harvard. Maybe CESSDA could join us in a future working group meeting.
What are you most looking forward to achieving in 2021?
I have a number of projects, but since you're asking me to choose one thing, I think that just moving a few steps forward towards open science and data sharing, starting at Harvard, would help to provide higher quality research.
At the end of the day, I think that it is important that all the research that we do has a good application, that it improves society and the world we live in.
Science can improve our knowledge of the world around us. A combination of people trusting more in science and establishing a better dialogue between researchers, scientists and societies would be wonderful.
On a personal note, something I started some time ago is writing a play – or maybe it will end up being a Netflix series, you know – about the cosmic microwave background, combined with a human love story. I have an idea of a plot and I have interviewed several cosmologists to find out more about how the cosmic microwave background was discovered.
I used to have an office just next to one of the Nobel prize winners (Robert Woodrow Wilson) so that inspired me!
It is very low temperature radiation that appeared after the Big Bang and that has been expanding while the universe is expanding. I find it very romantic. As you mentioned, one of the hoped-for achievements is finding more time to write and for more personal goals. Maybe we aim too high with trying to change the world, but maybe we can change some small things!
Mercè Crosas is on Twitter!
About Mercè Crosas (her Harvard.edu website)
Generalitat de Catalunya (Institutional Government of Catalonia)
OpenDP, Developing Open Source Tools for Differential Privacy
The Dataverse project
Harvard Dataverse Repository
Document, Discover and Interoperate
Robert Woodrow Wilson, American radio astronomer who jointly won the 1978 Nobel Prize for Physics for a discovery that supported the big-bang model of creation (Britannica).
See the previous article in this series: CESSDA asks ten questions to Jan Dalsten Sørensen.