“Once we figure out how to collaborate, there are a lot of things we can do that we couldn’t do otherwise.”
Ariel Rokem, PhD Program alum (entering class of 2005)
Ariel Rokem is a Data Scientist at the University of Washington eScience Institute, where he collaborates with researchers from diverse fields to develop and maintain software for the analysis of large data sets. Rokem describes his path from experimental scientist to data scientist as a gradual shift, but there were signs of his interest in computational analysis early in his research career. While studying to get his BS in Biology and Psychology from the Hebrew University of Jerusalem, Rokem wanted to work with Idan Segev, a computational neuroscientist. Segev turned him down, saying he hadn’t taken enough math. Rokem set out to remedy that and took extra classes in math and programming.
Putting this fresh knowledge to use, Rokem spent a year in the lab of Andreas Herz at Humboldt-Universität zu Berlin, engaged in electrophysiological and computational studies of the grasshopper auditory system. He then returned to Hebrew University and completed a master’s degree in Cognitive Psychology, studying the differences in hearing and memory between blind and sighted people in Merav Ahissar’s lab.
Rokem joined the Berkeley Neuroscience PhD Program in 2005 and became Michael Silver’s first graduate student. The Silver lab studies how attention, expectation, and learning affect human visual perception. The visual system can get better at discriminating between visual cues with practice, a process called perceptual learning. Rokem found that boosting levels of the neurotransmitter acetylcholine also boosted improvement in visual task performance. During both his PhD studies at UC Berkeley and postdoc with Brian Wandell at Stanford, Rokem continued doing experimental studies while taking on side projects developing software to analyze human neuroimaging data.
Rokem describes several key motivations for his current work: to create software that can take biomedical imaging data and make inferences that inform clinical decision making, to make research more reproducible and robust, and to put forth new collaborative models for practicing science. Learn more about Rokem’s path and current position as Data Scientist at the eScience Institute in the following Q&A.
Georgeann Sack: What led you to choose UC Berkeley for your PhD studies?
Ariel Rokem: I was interested in pursuing a degree in sensory neuroscience, doing experimental work in humans, and Berkeley has a tremendous wealth of work in cognitive neuroscience and human sensory neuroscience. The vision science community is great – it’s very strong, and very large, with presence in many different departments and programs.
GS: What was your experience like as a Berkeley Neuroscience PhD student?
AR: I was Michael Silver’s first graduate student. I learned a lot from that experience about what it looks like to start this kind of operation. I had the benefit of going into an initially very small lab. I had a lot of interaction with Michael, a lot of attention from my mentor, which is not necessarily typical in larger labs. I think we spent a lot of time hashing through things in a way that was really amazing.
GS: What was it like working with human subjects?
AR: I actually have a basis for comparison with what it is like to work with animals, because that is what I did when I was in Germany. Working with human subjects is very complicated. There are several aspects to that. One is the ethical considerations – what you can and cannot do in experiments with humans. That can constrain the kind of experiments you can do, for obvious reasons. The other thing is that humans are very sophisticated animals. Whenever I had to interpret weird results from humans, my default was to believe that humans are always trying to work as little as possible. We are trying to be as lazy as we can. If you design an experiment in a way that allows people to use less effort, they will do the less effortful thing. Often that results in an effect that you wouldn’t necessarily have anticipated. Sometimes you hope that people will do something that is hard, for example, but if they find a way to cheat, then they will cheat.
GS: Tell me about your transition from experimental researcher to data scientist.
AR: The transition into working in data science didn’t happen in one day; it was gradual. It actually started at Berkeley. While I was in grad school, Fernando Perez arrived at Berkeley as a research scientist in the BIC [Brain Imaging Center]. He was a prototype of what my role is currently – working mostly on software with general scientific use. He created IPython, which is an interactive environment for Python, and then created Jupyter, which is now a very popular project for data science both in academic research and outside of academia.
During grad school I spent some time working with Fernando on open source software for analysis of fMRI data and, more broadly, time series analysis of neuroscience experiments. That gave me my first experience of this kind of work.
During my postdoc in the Wandell lab I started working with diffusion MRI. One of the things that was going on in the lab at the time was a strong emphasis on statistics – modeling diffusion MRI signals and other signals and understanding how we can evaluate those models. I benefitted from interactions with students and faculty in the Statistics Department at Stanford.
Another Berkeley graduate, Kendrick Kay, who was in Jack Gallant’s lab and is now at the University of Minnesota, was my office mate when we were both postdocs in Brian Wandell’s lab. Kendrick was an early riser. I would take the train down from San Francisco, get into the lab, and Kendrick would start the morning with ‘Let’s talk about statistics.’ We would spend maybe an hour or so talking about statistics by the whiteboard and doing more practical things as well. I learned a lot, both from the statistics department across the road from the psychology department and from spending time with other postdocs, especially Kendrick, and of course from Brian.
That is when I got involved with the software project DIPY, Diffusion Imaging in Python. I continued to do a lot of both experimental work and analysis, and continued to shift more and more toward doing and thinking a lot about software, data, and data science. By the time I started my current position I had already been thinking about these things in an intense way for a few years.
GS: Tell me more about your current position.
AR: My current position is a really interesting opportunity. I don’t think there are many positions like this one. The Moore Foundation and Sloan Foundation gave three institutions – UC Berkeley, UW, and NYU – funding to support institutes that focus on data science in an academic environment. At the University of Washington, where I currently work, this funding went to support the eScience Institute.
I am a data scientist at the eScience Institute. A large portion of my work these days is creating, maintaining, and developing software tools for data science in the research context. The foundations that support us observed that researchers in all different fields are dealing with more and more data and there are challenges that come with this type of data intensive research. For example, the focus on creating and maintaining software is much larger when you really need software in order to do any kind of modern research.
GS: What made this position a good fit for you?
AR: By the time I finished my postdoc, I realized I was not going to be a traditional researcher following the traditional faculty path. My focus on software and reproducibility was not leading me down the path of becoming a faculty member. At the time the eScience Institute was getting off the ground with the Moore Sloan grant and their focus was specifically on those kinds of things – software, reproducibility, and collaborative models of working together between disciplines. It is an interesting model, and one that I really thrive within.
That is specifically because all of the things that made me ill-suited to be a focused researcher in the traditional faculty model make me really good at what I am doing now. A lot of what I need to do is this lateral thinking of connecting between ideas in different disciplines, and collaborating with researchers rather than competing.
GS: What would you describe as the central motivation for the work you do?
AR: I think it divides into several different things. One of the reasons I work on human brain biomedical imaging technologies is that I would like to be able to measure things in humans in a way that eventually informs clinical decision making. Part of my work focuses on creating tools and making inferences from these medical imaging technologies about what is going on inside the brain of an individual person. Ultimately I would like those ideas to make their way into clinical decision making.
The second motivation has to do with reproducibility in science. A lot of my work is about how we make research more reproducible and robust. We understand that as the data sets we have grow really, really large, and computation becomes really, really complicated, the computational pipelines that people run on the data become harder to reproduce. A lot of the work focuses on trying to understand what kind of software we need to build so that the work is reproducible, so others can go in and do the same thing.
The third motivation, for me, is really about collaboration. It is about the kind of things you do when you are collaborating, different models for collaboration. I work with researchers from chemical engineering, social sciences, neuroscience, psychology, and medicine. All of those interactions have really interesting challenges, but they also have a lot of promise. Once we figure out how to collaborate, there are a lot of things we can do that we couldn’t do otherwise.
GS: Are you focused on a particular field of research?
AR: A lot of my focus is on software for neuroscience and human neuroscience data in particular. I work on several projects that focus on diffusion MRI. Diffusion MRI is useful for measuring properties of connections between different parts of the brain. One project I mentioned earlier is DIPY, Diffusion Imaging in Python. That project is a big collaborative project between developers across many institutions. It is quite unusual for many groups to collaborate in the creation of software. The model is not only open distribution of the software but open collaboration during development of the software.
GS: What data are you using to test your software?
AR: For a few years now I haven’t been doing experiments myself at all. Neuroscience is going through a big change, actually. In the past, individual labs were doing their own experiments, generating their own data, and addressing scientific hypotheses one by one by tailoring experiments to address them. This experimental model continues to exist, but now there is a new model in which large consortium projects collect large data sets that neuroscientists all over the world can download and do their analysis on.
A few examples of that are the Human Connectome Project and the Allen Institute for Brain Science. They create large data sets and distribute them in a way that other scientists can then download, analyze, compare with their own experimental measurements, or address some hypothesis that the original creators of the data hadn’t thought of. I rely on these projects and their data collection efforts for data.
GS: Do you miss doing the experiments yourself?
AR: Not so much. To be honest, doing experiments always stressed me out a bit. I always enjoyed the analysis part much more.
GS: What is your daily work life like now? Are you embedded within a team and regularly interacting with people or are you largely working alone on your laptop?
AR: I am usually sitting at my laptop with somebody else next to me. It varies a lot day to day, but a lot of my work is sitting side by side with other people and working together through problems. I enjoy the collaboration. It is something that really motivates me. A lot of my work is driven through the interaction with other people, and that is something I really enjoy.
I have interactions with many different researchers. For example, right now I am working with a graduate student from the Chemical Engineering department. eScience has something called Incubator Projects. We solicit proposals from people all around campus for quarter-long, two-days-a-week projects, and this student is working on one of them. I also mentor postdocs, currently three, and graduate students, to think through problems and write software together.
A portion of my work these days is working with Data Science for Social Good. This is a summer internship program, where students from all over apply for a ten-week program where they work in teams of four on problems where we think some work on data may provide social good. One of the outcomes of this is a collaboration, funded by the Gates Foundation, where my interaction is with people from local government on quite a different set of data and problems. For example, we are working together to think about data-driven solutions to Seattle’s homelessness crisis.
GS: That is great that you get to work on such diverse projects. What project are you currently most excited about?
AR: One project that has really been fun is a collaboration with a faculty member at the Institute for Learning and Brain Sciences, Jason Yeatman. Jason and I know each other from Stanford, because he was a graduate student in the lab where I was a postdoc. When I arrived we said ‘Hey, we should collaborate.’ One of the things we did together is a project that focuses on visualization of brain connections.
The story of the project is a little bit fun too, because it started as a class project. Jeff Heer, who is, I would say, one of the leading researchers in data visualization in the world, runs the Interactive Data Lab in the computer science department here. He teaches a course on data visualization. He sends around an email to colleagues and says, ‘Do you have some data that you would like students to work on and visualize? Send a short description and students might contact you.’
I asked Jason if we might propose something about visualizing brain connections. I sent the short description and four students showed up and we worked with them for a quarter on this data visualization project.
When the quarter was over we decided to keep going with it, because it was really compelling and interesting. We created something we called AFQ-Browser. It is a visualization that runs in the web browser. People can go in and explore the data interactively, and see, from the diffusion MRI data, what are brain properties along the length of different connections.
So this is really neat and there are a few cool examples of people doing these kinds of things. What is special about our project is that we designed the software in a way that allows other researchers who use similar data to also create their own website. So, if they use a particular analysis approach, called automated fiber quantification, or tractometry, they can take the data they have analyzed and package it up as a website and publish it immediately. One of the nice consequences of this is that when they package their website and put it online, they are immediately sharing the data from their own study. This is a way to incentivize researchers to share their data. It provides an accompanying website to a study that allows other researchers and members of the public to go and see the data in a way that is much more vivid than you would see in a static figure in a publication.
GS: What other career paths did you consider after your postdoc?
AR: I considered going into industry. There are a lot of opportunities for people with expertise in software, statistics, and handling data. A lot of companies aggregate masses of data and use those data to create products and draw value for a variety of uses. One thing that has happened over the last 20 years is that a lot of companies realized they can use open source software created in academia for their own commercial niche. In recent years, companies like Google and Facebook are starting to distribute software that they are creating for their own data science needs. Now, I as an academic can take their software and use that in my research. That creates a lot of interesting interactions between academia and industry.
GS: Do you have any advice for current PhD students?
AR: Michael was the one who taught me that research of the kind we do is a marathon, not a sprint. So you have to preserve your energy and see the long arc of things. That requires patience, and it requires accepting that a large portion of the things you try will not work out. It also means that people in that stage of their career need to take care of themselves. When I was in grad school, at the end of my first year, I had a personal crisis, and I suffered from a lot of anxiety and depression. One of the things that really helped me was that I eventually found a great therapist. Through that process, I learned that I need to take the time to do other things that are not research – to exercise, to spend time with family or friends. It sounds obvious but I know that it wasn’t entirely obvious to me when I first got started in grad school. I had the tendency to really push myself in a way that was not great. After a while I figured out that I needed to preserve my energy for the long haul that it was going to be. So take care of yourself, and don’t hesitate to seek out professional help when you need it!
Another thing that came out of that was that I could lift my head up from my own desk and see what was around. I feel like especially in Berkeley there were all these things going on that I could benefit from and learn from outside of the lab, just from interactions with other people. I mentioned Fernando. That was an interaction that was outside of the lab. It was related to the work I was doing but it was an opportunity to learn from other interactions.
GS: What kind of things do you do outside of lab now to preserve your energy?
AR: My situation is quite different than it was when I was in grad school. I have a young daughter who is 12 months old now, so I spend time with her.
AR: Thank you. One thing I started doing when I figured things out in grad school is to exercise regularly. I still, despite the young baby, try to exercise regularly. I still try to get outside and go out to nature. We have some really beautiful nature close by to Seattle, so it’s not too hard to do.
GS: Is there anything else you would like to share?
AR: Yes! One of the things that I am excited about right now is that I am organizing this event: neurohackademy.org
It’s a really fun combination of a summer school and hackathon, geared towards early-career researchers in human neuroscience. Another quirk of my current position is that I don’t have to teach regular classes. On the other hand, I enjoy teaching these kinds of things! So, Tal Yarkoni and I started this event (inspired by a similar event that some of my astronomy colleagues organize: astrohackweek.org). I think it’s a really important opportunity. It provides the kind of training in data science that graduate students and postdocs would find hard to get in most graduate programs in cognitive neuroscience.
- Ariel Rokem’s website
- A browser-based tool for visualization and analysis of diffusion MRI data
Nature Communications | March 5th, 2018
- AFQ-Browser on GitHub
- DIPY: Diffusion Imaging In Python
- Pioneering data science tool — Jupyter — receives top software prize
Berkeley News | May 2nd, 2018