The “Big Data” Future of Neuroscience

 

By John McLaughlin

In the scientific world, the increasingly popular trend towards “big data” has overtaken several disciplines, including many fields in biology. What exactly is “big data”? This buzz phrase usually signifies research with one or more key attributes: tackling problems with large, high-throughput data sets; pursuing large-scale, “big-picture” projects that involve collaborations among several labs; and relying heavily on informatics and computational tools for data collection and analysis. Along with the big data revolution has come an exploding number of new “omics”: genomics, proteomics, regulomics, metabolomics, connectomics, and many others which promise to expand and integrate our understanding of biological systems.

 

The field of neuroscience is no exception to this trend, and has the added bonus of capturing the curiosity and enthusiasm of the public. In 2013, the United States’ BRAIN Initiative and the European Union’s Human Brain Project were both announced, each committing hundreds of millions of dollars over the next decade to funding a wide variety of projects, directed toward the ultimate goal of completely mapping the neuronal activity of the human brain. A sizeable portion of the funding will be directed towards informatics and computing projects for analyzing and integrating the collected data. Because grant funding will be distributed among many labs with differing expertise, these projects will be essential for biologists to compare and understand one another’s results.

 

In a recent “Focus on Big Data” issue, Nature Neuroscience featured editorials exploring some of the unique conceptual and technical challenges facing neuroscience today. For one, scientists seek to understand brain function at multiple levels of organization, from individual synapses up to the activity of whole brain regions, and each level of analysis requires its own set of tools with different spatial and temporal resolutions. For example, measuring the voltage inside single neurons will give us very different insights from an fMRI scan of a large brain region. How will data acquired with such disparate techniques be unified into a holistic understanding of the brain? New technologies have also allowed us to observe ever tighter correlations between neural activity and organismal behavior, but correlation is not causation: understanding what actually drives behavior will require manipulating neuronal function, for example with the optogenetic tools that are now part of the big data toolkit.

 

Neuroscience has a relatively long history; the brain and nervous system have been studied in many different model systems that range greatly in complexity, from nematodes and fruit flies to zebrafish, amphibians, mice, and humans. As another commentary points out, big data neuroscience will need to supplement the “vertical” reductionist approaches that have been used so successfully to understand neuronal function by integrating what has been learned across species into a unified account of the brain.

 

We should also wonder: will there be any negative consequences of the big data revolution? Although the costs of data acquisition and sharing are decreasing, putting the data to good use is still very complicated, and may require full-time computational biologists or software engineers in the lab. Will smaller labs, working at a more modest scale, be able to compete for funds in an academic climate dominated by large consortia? From a conceptual angle, the big data approach is sometimes criticized for not being “hypothesis-driven,” because it places emphasis on data collection rather than addressing smaller, individual questions. Will big data neuroscience help clarify the big-picture questions or end up muddling them?

 

If recent years are a reliable indicator, the coming decades in neuroscience promise to be very exciting. Hopefully we can continue navigating towards the big picture of the brain without drowning in a sea of data.



Don't Lose It!

 

By Sally Burn, PhD

 

Last Halloween I wrote a Scizzle piece on lab nightmares; the first terror I dealt with was “Losing your data or samples”. Well, dear reader, I have to report that this nightmare became a reality for me a few weeks ago: I lost all my data. Four years and 400 GB, gone. And it happened with a single click of the mouse button.

Game of Thrones was also involved, to an extent. But try as I might to throw blame at Joffrey and co., the main responsibility lies with my own human error. Here’s how it happened: I have an external drive onto which I back up all data from my lab PC (via daily automatic backup) and microscopes (manually) into a folder rather unimaginatively called “Data”. There is also a redundant lab meetings folder sat just next to Data. In a rush to finish up what I was working on and free up my laptop for Game of Thrones, I deleted what I believed to be the redundant folder, clicked “Yes” when warned it was too big for the recycle bin, briefly wondered why the deletion was taking so long, then finally settled in for some purple wedding action. Next morning, 24 hours before I’m due to give lab meeting, I go to retrieve some images from my drive. Only Data is no longer there. Some mild cold sweats kick in, but I know that there’s a straightforward explanation, right? I must have dragged the folder into another folder. Only I can’t spot it anywhere… and that’s when I notice that my drive has 700 GB free instead of the usual 300 GB. Cue draining of all color, mild sicking up in mouth, and incoherent babbling to lab mates.

How could this possibly happen, especially to me – a known anal retentive? It’s at this juncture I should point out that everything seems to be okay now and the situation was not as dire as it could have been – thanks in no small part to my anxious nature. Three weeks prior to Datageddon I’d taken a flight. Obviously this meant there was a strong chance of me dying in an aviation incident, plus being out of the lab somehow also increased the likelihood of there being a fire or maybe even just the building falling down. So I did one of my not-quite-routine backups to my home drive. The loss was therefore only three weeks’ worth of data. There was no new raw data generated in those three weeks but I had spent an inordinate amount of time converting the data into images, movies, and reconstructions – it was these that were lost.

It was beyond awful. So in an attempt to save my fellow scientists from a similar fate, here is a rundown of what I have learned and what you can do to protect your data:

 

You lost your data… now what?

My data loss was followed by the most mind-numbing two weeks of my life. I downloaded file retrieval software and recovered 550 GB of deleted files. The retrieval took two days and recovered 300,000 files… which were all dumped into a single folder, with all details of their original locations lost. Now I don’t know if you’ve ever tried to open a folder containing 300,000 files, ranging from 1 KB to 35 GB in size, but let me tell you: it takes a LONG time, and the average PC cannot handle ordering the files by date. I transferred operations to the fastest microscope PC and so began a week in a darkened ‘scope room, waiting hours for the folder to open and then slowly, laboriously transferring large handfuls of files (many of them duplicates or partial copies) into more manageable subfolders, such that I could look at and order the files by date. I got there in the end, retrieving the relatively few files I needed, but ultimately it took me longer than it would have done to just reprocess the data from scratch.
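
If you do end up staring at one gigantic flat folder of recovered files, a short script can spare you some of the manual sorting. Below is a minimal Python sketch, with placeholder paths, that bins files into per-day subfolders based on their last-modified time; be aware that recovery tools don’t always preserve the original timestamps, so treat the dates as a rough guide rather than gospel.

    # Sketch: split a flat folder of recovered files into per-day subfolders
    # so they can be browsed and sorted in manageable chunks.
    # RECOVERED and SORTED are placeholder paths -- adjust for your setup.
    import shutil
    from datetime import datetime
    from pathlib import Path

    RECOVERED = Path("E:/recovered")         # flat dump from the recovery tool
    SORTED = Path("E:/recovered_by_date")    # destination for sorted files

    for f in RECOVERED.iterdir():
        if not f.is_file():
            continue
        # Build a YYYY-MM-DD folder name from the file's last-modified time
        day = datetime.fromtimestamp(f.stat().st_mtime).strftime("%Y-%m-%d")
        dest = SORTED / day
        dest.mkdir(parents=True, exist_ok=True)
        # Note: files with identical names landing in the same day folder
        # may clash, so check for duplicates afterwards
        shutil.move(str(f), str(dest / f.name))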

As tedious as it was, file recovery software is your friend in this situation. If your files were too big for the recycle bin and you did an outright delete, it is your only option. I used Recuva, a free and easy-to-use program. You will need a second drive to write the retrieved files to. Try not to write anything to the drive you deleted from before you start the recovery – the files are probably still in there somewhere, but that may no longer be the case once fresh data has been written over them. The process was slow on my geriatric PC, and manually sorting through the files was even slower; I cannot even comprehend how long it would have taken me to sift through the retrieved data had I needed it all back. Which is why I cannot emphasize enough: prevention is better than cure – BACK UP!

 

Backup

There are a number of methods you can use to backup your data. Here is a rundown of a few, in order of reliability, starting with the least dependable:

 

Manual backup:

My hitherto method of choice, and seemingly a popular one among my peers. This technique relies on you arbitrarily remembering to bring another drive into the lab to back up to. It’s better than nothing, but barely – like fighting an angry tiger with only a spoon for protection.

 

Automatic backup software:

OK, now we’re getting a little more reliable. Most external drives come with backup software installed. I have automatic backup from my lab PC to my external drive; unfortunately, my PC data constitutes only a small fraction of my overall data footprint. My take-home message from this experience is that you need to back up all the drives you use, including microscope computers. Which raises the question: who pays for that? It would make sense for the PI or department to arrange for multi-user drives to be backed up automatically; unfortunately this is not the case in the labs of a number of scientists I quizzed. It seems that “each person for themselves” is an unfortunately common tenet in academia.
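
If there is no proper backup software on a shared drive, even a small scheduled script beats relying on memory. Here is a minimal Python sketch – the paths are placeholders, and it does no versioning and ignores deletions, so think of it as a stopgap rather than a real backup system – that mirrors new or updated files from a data drive to a backup drive. Pointed at a microscope computer’s data folder and run nightly via Task Scheduler or cron, it at least guarantees a second copy exists.

    # Stopgap sketch: copy new or updated files from a data drive to a
    # backup drive. No versioning and no handling of deletions -- a
    # placeholder for proper backup software, not a replacement.
    import shutil
    from pathlib import Path

    SOURCE = Path("D:/MicroscopeData")         # e.g. the microscope's data folder
    BACKUP = Path("F:/Backup/MicroscopeData")  # e.g. an external backup drive

    for src in SOURCE.rglob("*"):
        if not src.is_file():
            continue
        dst = BACKUP / src.relative_to(SOURCE)
        # Copy only if the backup copy is missing or older than the source
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)             # copy2 preserves timestamps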

 

Cloud storage

At the suggestion of my PI, post-Datageddon, I paid $100 out of my own pocket for cloud storage. He recommended Carbonite, which offers unlimited storage. There are obviously other systems available, but thus far I have had no issues with Carbonite. The $99.99/year plan allows for backup of all internal drives plus one external drive; you can also create a mirror image of your system in case your computer ever needs to be completely reinstalled. The initial upload of my approximately 500 GB of data took a week (possibly due to my subpar PC and internet connection), but since then it’s been ticking along nicely, backing up any changes in the background. If I delete a file and then realize I need it, I have a 30-day window in which to retrieve it before the deleted file is removed from their server. If you work with clinical data and need HIPAA-compliant storage, there is also a package for that, retailing at $269.99/year. Data can be accessed from anywhere in the world, which could be a great benefit when away at conferences.

 

Server

As I mentioned earlier, a common experience among those I talked to was that there was no central backup provided by their PI or department. Whether this is the norm across universities, across the USA, or just in the labs of the scientists I talked to is unclear. In my previous lab, in a research institute in the UK, all data drives and microscope computers were backed up to an on-site server every night; copies were maintained for a set time period, and off-site backups were also performed regularly. The combined on-site and off-site server approach seems to be the gold standard as far as I can see, protecting even against loss due to building damage. However, even a single on-site server is a great idea. So perhaps float the idea next time your PI has grant money earmarked for purchasing equipment. Don’t think they’ll accept it as a reasonable expense? Try working out how much it would cost to repeat your experiments and replace the lost data. As a ballpark figure, I calculated what it costs for me, a fourth-year postdoc, to run an overnight live imaging organ culture on a multiphoton confocal microscope. My calculation takes into account my wages for the time spent setting up, running, and analyzing the experiment; it also includes the cost of breeding and maintaining transgenic mice for the three months leading up to the experiment in order to get the tissues I need, plus lab consumables (culture media, plates, etc.). To run this experiment and hopefully generate a single movie for the supplementary material of a paper, I’ve calculated that my PI pays around $2,431. No, really. Maybe that server doesn’t look so pricey now…?
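
If you want to make the same argument to your own PI, the arithmetic is easy to adapt. The sketch below uses entirely made-up placeholder numbers, not my actual figures; plug in your own salary, animal, and consumables costs and watch the total climb.

    # Back-of-the-envelope costing for a single experiment. Every number
    # here is a made-up placeholder -- substitute your own figures.
    hours_on_experiment = 20     # setting up, running, and analyzing
    hourly_cost = 30.0           # salary plus benefits, per hour
    animal_costs = 1500.0        # e.g. three months of breeding and maintenance
    consumables = 200.0          # culture media, plates, etc.

    total = hours_on_experiment * hourly_cost + animal_costs + consumables
    print(f"Estimated cost of one overnight imaging run: ${total:,.2f}")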

Whatever data protection route you choose, remember that good anti-virus software is also a necessity for protecting your data. Talk to your PI/department to see if there are any provided backup resources. And if you yourself are a PI, come up with a data protection plan and make sure your employees know about it. It may save you a lot of stress and money further down the line.


Dry Science: The Good, The Bad, and The Possibilities

Celine Cammarata

Recent years have seen a boom in so-called “dry lab” research, centered around mining large data sets to draw new conclusions, Robert Service reports in Science.  The movement is fueled in part by the increased openness of information; while research consortia used to hold rights to many such data banks, new efforts to make them freely available have unleashed a wealth of material for “dry” scientists.  So what are some of the pros and cons of this growing branch of research?

 

Computer-based research on large, publicly available data sets can be a powerful source of information, leading to new insights on disease, drug treatments, plant genetics, and more.  One of the most commonly encountered methods is the Genome-Wide Association Study, or GWAS, whereby researchers scan the genomes of many individuals for genetic variants associated with disease.  Such research is strengthened by the ability to pool huge amounts of data, increasing sample sizes without having to recruit new participants.  Another perk of dry research is the increased mobility it gives researchers to hop among different areas of study; with no investment in maintaining animals or lab equipment specialized to any single line of investigation, researchers can study cancer genetics one year and bowel syndromes the next with little difficulty.
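
To make the idea concrete, here is a toy Python sketch of the statistical core of a GWAS: for each variant, test whether genotype counts differ between cases and controls. The counts below are fabricated for illustration, and real pipelines (such as PLINK) add quality control, corrections for population structure, and adjustment for testing hundreds of thousands of variants at once.

    # Toy illustration of the core GWAS test: is the genotype distribution
    # at one SNP different between cases and controls? Counts are fabricated.
    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: cases, controls; columns: genotype counts (AA, Aa, aa) for one SNP
    snp_counts = np.array([
        [120, 60, 20],   # cases
        [100, 80, 20],   # controls
    ])

    chi2, p_value, dof, _ = chi2_contingency(snp_counts)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
    # A real study repeats this across hundreds of thousands of SNPs, so the
    # p-values must be corrected for multiple testing -- skip that step and
    # false signals are guaranteed.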

 

But getting the large amounts of data that fuel dry research can be more complicated than it seems.  Some investigators are reluctant to make their hard-earned numbers publicly available; others lack the time and manpower to do so.  And slight variations in how the initial studies were conducted can make it challenging to pool data from different sources.  Furthermore, GWAS and similar experiments are themselves deceptively complex.  Most diseases involve intricate combinations of genes turned on and off, making it hard to uncover genetic fingerprints of illness, and comparing the genomes of many subjects frequently throws up false-positive signals.  For dry research to continue growing successfully, significant advances in programming and in the mathematical techniques used to analyze data will be required.  Finally, making data freely open for investigators to delve into raises concerns about subject confidentiality.

 

Beyond these challenges, the increase in data availability raises intriguing questions about the future of research.  Currently, dry research requires complex programs and hefty computing power, but with computer science advancing ever faster, will future generations even need a lab to do science?  Will anyone with a decent computer and some scientific know-how be able to contribute meaningfully to the research community?  And if so, what will this mean for the traditional university-based research world?  Only time will tell.


Piled Higher and Deeper: Bioinformatics

Neeley Remmers

I was perusing the table of contents of the current issue of Clinical Cancer Research and saw an abstract for a paper entitled “Uncovering the Molecular Secrets of Inflammatory Breast Cancer Biology: An Integrated Analysis of Three Distinct Affymetrix Gene Expression Datasets” by Steven J. Van Laere et al. This particular paper looks for molecular signatures distinct to inflammatory breast cancer (IBC) by analyzing Affymetrix microarray data from 137 IBC patients compared to 232 control patients who did not have IBC. After doing a lot of data mining with the help of the PAM50 algorithm, they did find a molecular signature unique to IBC versus normal patients, though they would need to do similar comparisons against other forms of breast cancer to see if there are distinctions that set it apart. My initial reaction after reading this abstract and seeing how many patients they had to analyze and compare was a sense of being overwhelmed. To do clinical or translational research, you have to work with these large data sets to account for all the many variances that come with studying human samples, which means you also need a good understanding of, and willingness to do, bioinformatics.
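
For a sense of what such a comparison involves computationally – and this is emphatically not the paper’s actual PAM50-based pipeline, just a toy example on simulated numbers – here is the bare-bones version: test each gene for a difference in expression between the two patient groups, then correct for having tested thousands of genes at once.

    # Toy differential-expression comparison on simulated data -- not the
    # paper's PAM50 analysis. 137 "IBC" samples vs 232 "control" samples.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_genes = 1000
    ibc = rng.normal(size=(137, n_genes))      # simulated expression values
    control = rng.normal(size=(232, n_genes))
    ibc[:, :50] += 1.0                         # pretend 50 genes are truly up-regulated

    # Per-gene two-sample t-test, then a crude Bonferroni correction for
    # having run 1,000 tests
    _, pvals = ttest_ind(ibc, control, axis=0)
    significant = pvals < 0.05 / n_genes
    print(f"Genes called significant: {significant.sum()}")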

Personally, I think it takes a special kind of person to do bioinformatics and, for that matter, biostatistics. If you are fortunate enough to work in an institution that has a bioinformatics and biostatistics core, consider yourself lucky. I have recently been honing my own bioinformatics skills by analyzing RNA-sequencing data, trying to figure out which activation and chemotaxis pathways in leukocytes are turned on when they are treated with my protein of interest. I already had an appreciation for those who make a living in this field, but after countless hours in front of my computer creating different gene lists and analyzing them with Ingenuity, I have an even greater appreciation for what bioinformaticians and biostatisticians do. My brain was not wired to understand or generate the many algorithms now available to help us perform these complex analyses and generate the statistics needed to validate the findings, but I applaud those who can. Personally, I think there should be a national bioinformatician/biostatistician appreciation day.
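
For the curious, the statistics underneath pathway tools are often less mysterious than they feel. A common ingredient is a simple over-representation test; the sketch below shows the general idea with made-up gene counts, and is not how Ingenuity specifically computes its scores.

    # Generic over-representation (enrichment) test with placeholder numbers --
    # the general idea behind many pathway tools, not Ingenuity's own method.
    from scipy.stats import hypergeom

    background_genes = 20000   # genes measured in the experiment
    pathway_genes = 150        # genes annotated to, say, "leukocyte chemotaxis"
    my_list = 400              # genes in my differentially expressed list
    overlap = 12               # of those, how many fall in the pathway

    # Probability of seeing at least this much overlap by chance
    p = hypergeom.sf(overlap - 1, background_genes, pathway_genes, my_list)
    print(f"Enrichment p-value: {p:.3g}")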