Less Western Blots, more ggplots
For biologists transitioning from the wet lab to bioinformatics.
I’ve spent the last two years in labs: building, learning and breaking stuff. I entered as a wet lab trainee and aspiring cell culture guru but emerged as a sapling bioinformatician. Through the process, I’ve gotten the chance to meet some patient, kick-ass scientific mentors who've taught me almost everything I know about the research process, both at the bench and the command line.
For the incoming bioinformaticians in the crowd: I want to share the most valuable lessons from my journey of transitioning into the dry lab, most of which came through trial, a lot of error, and feedback from my mentors in-lab. This is for any students who are looking to challenge themselves by trying out bioinformatics.
Bioinformatics is programming put to the task of deriving insights from biological data. Beneath that definition, bioinformatics is a combination of logic and teaching: the art of taking the messy data that comes out of your biology experiments and turning it into easy-to-read, understandable lessons.
Everyone should try doing wet lab work at least once; there’s an intuition you build working at a bench that can’t be developed just by talking to biologists.
That doesn’t mean everyone should keep doing assays afterward though.
Regardless of what research niche you find yourself in, usefully interpreting data will always be an important skill worth becoming good at. Bioinformatics opens doors if done right, and just like wet lab work, everyone should try it at least once. So here are the high-level takeaways for becoming a better bioinformatician that I wish I’d known when first starting out.
Think in logical flow-charts
When I was starting out, I thought of bioinformatics pipelines as monolithic, hard-to-understand chunks of code that would generate a smart output from the data I put in. Pipelines were complicated black boxes that would eat my data and spit out a pretty chart, and my job as a “bioinformatician” was to clean up my data so that the complicated black box could properly read it.
In reality, bioinformatics workflows are a series of programs and functions: code blocks that run processes on the data you give them and then pass the output forward. We can think of a pipeline or workflow as a flow chart or graph made up of nodes.
You start with your first task (node “a”), which takes your data input and alters it in some way. Node “a” then passes its output on to other tasks (nodes “b”, “c”, “d”) for processing. The last of these tasks passes your data to the final job: outputting your results (the output node, “e”).
Put together, bioinformatics pipelines can be used to transform, manipulate and analyze your data as it moves through each of these steps.
With this in mind, you can start to look at bioinformatics pipelines not as monolithic black boxes of code that generate fancy figures from your data, but as logical processes that are really made up of a series of “baby step” sub-tasks.
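To make the flow-chart idea concrete, here is a toy pipeline where each node is just a small Python function that transforms the data it receives and passes the result forward. The node names mirror the “a” through “e” example above; the steps themselves (cleaning, filtering, counting) are hypothetical stand-ins for real bioinformatics tools:

```python
# A toy pipeline: each "node" transforms its input and passes it on.
# The step contents are illustrative placeholders, not a real workflow.

def node_a(raw):
    # Ingest: strip whitespace and drop empty lines
    return [line.strip() for line in raw if line.strip()]

def node_b(records):
    # Filter: drop comment/header records
    return [r for r in records if not r.startswith("#")]

def node_c(records):
    # Transform: split each record into fields
    return [r.split(",") for r in records]

def node_d(rows):
    # Summarize: count how often each first field appears
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return counts

def node_e(counts):
    # Output: format a small tab-separated report
    return [f"{name}\t{n}" for name, n in sorted(counts.items())]

raw = ["# header", "geneA,10", "geneB,3", "geneA,7", ""]
report = node_e(node_d(node_c(node_b(node_a(raw)))))
```

Each function is a “baby step” you can inspect, test, and swap out on its own, which is exactly what makes a pipeline debuggable rather than a black box.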
Be picky with your pipelines and workflows.
One summer, during a bioinformatics internship, I came across a dilemma: two bioinformatics workflows did the same job, and I had no idea how to choose between them and no criteria for deciding which was better for my needs.
In the end, I chose a pipeline that wasn’t good for what I needed to do.
You may be told to judge each pipeline by its number of citations and to pick the one with the most. The issue with that approach is that when you judge a pipeline purely on its citations, you risk supporting a legacy system instead of choosing the best tool for your needs.
So what can you do to confidently make a decision?
Pick your models wisely
Let’s imagine that as a part of your analysis, you’re hunting for groups of similar data points in a mid-sized dataset with a lot of different variables to sift through. You want to get a series of data groups to then run more analyses later.
You’re therefore going to need code that clusters your data points into groups based on their similarity to one another.
Sounds straightforward? Not quite.
Each chunk of code in your bioinformatics workflow is built on an algorithm, model architecture, or other mathematical function, and for most tasks there is more than one to choose from. Data clustering alone can be done with several different models.
Continuing with our example, let’s imagine that there are two pipelines that you can choose between for analyzing your data: M3ta and Supergroop.
Both of these pipelines perform pretty much the same tasks on your data and, in fact, give you the same output file type at the end. To get a sense of the differences between the two workflows, you read the papers that describe each pipeline in depth and come across one distinction: the Supergroop paper mentions using Gaussian mixture models (GMMs) for clustering, while the M3ta paper uses K-means for the same task.
This sounds like an awfully specific detail to get worked up about, no?
It turns out that this very specific detail is the difference between a clustering step that is slower to run and one that runs more efficiently. All else being equal, you would be better off using M3ta for your first-pass analysis instead of Supergroop, given that K-means is a computationally lighter approach to clustering than a GMM.
This doesn’t mean that [GMM = bad] && [K-means = good], but it does mean that you need to read up on the bioinformatics jobs you’re looking to run and decide what resources to dedicate accordingly. GMMs are better suited to capturing more complex cluster shapes and soft cluster assignments, but in our example you weren’t looking for that level of granularity.
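To see how small the code difference is relative to the modeling difference, here is a sketch running both approaches on the same toy data. It assumes scikit-learn and NumPy are installed; in a real pipeline these calls are buried deep inside the workflow, but the model choice the M3ta and Supergroop authors made for you boils down to exactly this:

```python
# Hypothetical comparison of K-means vs. a Gaussian mixture on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs in 5 dimensions: the "not super granular" case
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 5)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 5)),
])

# K-means: hard assignments, cheap distance computations
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# GMM: fits a full covariance per component, so each iteration costs more
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
```

On clean, well-separated data like this, both models recover the same two groups; the GMM simply pays extra computation for flexibility (covariance structure, soft assignments) that this job doesn’t need.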
If you’re interested in learning more, I’d check out this ScienceDirect post on GMM vs K-means clustering.
Learning to decide which of the models at your disposal is best for a job is something I’m still getting more proficient at myself. Understanding the underlying math behind these models goes a long way here.
In the meantime, a helpful way of gauging the best models to use for your job is to reach out to any bioinformatics mentors you may have and get their professional opinion on the different statistical approaches that you’re considering.
Less is more: Show what matters in your key figures
Starting out, I had a bad habit of showing every bit of data during my presentations so that I could hide the fact that I wasn’t confident in the job I was doing. This didn’t make the audience trust my work more, so much as it made them want to sleep more.
I soon learned a valuable lesson: to keep things clear and concise.
Your overview figures are an executive summary
When you’re presenting data figures to a professor, postdoc, or committee, you fundamentally want to be in the same headspace as a consultant presenting to corporate clients.
Lab mates and I have been grilled in the past for confusing or unclear display choices on our figures: non-contrasting colors, oddly labeled clusters, and figures over-scaled to the point of being illegible. These all made our data presentations less meaningful and harder to understand.
Even when your figures are nicely formatted, colored, and labeled, you can still cram too much data onto a single figure. It’s helpful to get feedback from your mentor on where to strike the balance between being concise and providing enough information.
The key is to strike a balance between detail and brevity in the figures you end up presenting in a paper or at a lab meeting. Your audience is looking for the closest thing they can get to a smoking gun; anything on your figure that makes it harder for them to identify patterns of interest is not in your interest to include. As you give more presentations, you’ll get important feedback from your mentors on the level of brevity and detail they want.
Even as you make your figures more concise, you should still have your supplementary data and figures on hand. This data often provides the context behind your conclusions and can reinforce your takeaways when it’s consistent with what you present in the main figures.
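The display fixes above (contrasting colors, readable labels, sensible scaling, one message per figure) are cheap to apply in code. Here is a minimal matplotlib sketch; the data and group names are made up for illustration:

```python
# A made-up two-group comparison, styled for clarity rather than volume.
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs headlessly
import matplotlib.pyplot as plt

groups = {"Treated": [1.2, 1.9, 2.4, 2.1], "Control": [0.4, 0.6, 0.5, 0.7]}
colors = {"Treated": "#D55E00", "Control": "#0072B2"}  # colorblind-safe pair

fig, ax = plt.subplots(figsize=(4, 3))  # small and legible, not over-scaled
for name, values in groups.items():
    ax.plot(range(len(values)), values, marker="o",
            color=colors[name], label=name)

ax.set_xlabel("Timepoint")
ax.set_ylabel("Relative expression")
ax.set_title("Expression over time")  # one clear message per figure
ax.legend(frameon=False)
fig.tight_layout()
# fig.savefig("expression.png", dpi=150) to export for a slide or paper
```

Everything else (replicate-level points, alternate normalizations, extra conditions) can live in the supplement rather than on this one panel.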
Be prepared for iteration
It’s funny how both wet lab and dry lab people have their own go-to website for answer hunting whenever things go wrong: Stack Overflow for people fixing code, and ResearchGate for those of us trying to get our gels to work.
Just as a lab deploying summer undergrads full-time to optimize a single wet lab protocol is a testament to the amount of troubleshooting needed at the bench, you’ll need to be ready for the slew of backtracking and error-log reading that is bioinformatics.
Any pretty figure you see in a published article is the product of hours upon hours of work. For each fancy figure, there are hours spent generating variations that never make it to publication. Just like those authors, you’ll need to go through multiple designs of your figures, changing parameters and aesthetics until your audience (namely your mentors) is happy that readers who aren’t involved in your research can still take away a clear message from your work.
That’s not even to mention the debugging process, which, as a growing bioinformatician, you’ll be forced to get good at. The biggest chunks of time I spent on any bioinformatics project were always for debugging and formatting my input data to be compatible with my pipeline of choice.
Getting Started with your first pipeline
At the end of the day, the best way to learn is to have a data problem you’re looking to solve. If you’re starting from a wet lab position, there’s almost certainly a pile of project data sitting around and waiting to be analyzed.
Some important considerations before getting started on a pipeline:
- What kind of data is being outputted?
- What kind of insight do we need from the data?
- What are the pipelines & workflows at our disposal to get this insight?
- What makes these tools better than one another?
As a beginner, you’re generally expected to learn the granular programming and data-handling skills first, then slowly pick up the higher-level takeaways that make you a more competent bioinformatician. When it comes time to actually run your first bioinformatics pipeline, you’ll become well-acquainted with the importance of “basic” tasks like data cleaning, as well as the time-consuming aspects of debugging and iteration.
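To give a small taste of what that “basic” cleaning work looks like in practice, here is a hypothetical pandas sketch. The column names and values are invented, but the problems (stray whitespace, inconsistent casing, missing values) are the ones you’ll actually hit:

```python
# Hypothetical messy export from a wet lab spreadsheet, cleaned with pandas.
import io
import pandas as pd

raw = io.StringIO(
    "sample,gene,count\n"
    "s1, BRCA1 ,12\n"   # stray whitespace around the gene name
    "s1,TP53,\n"        # missing count
    "s2,brca1,8\n"      # inconsistent casing
)

df = pd.read_csv(raw)
df["gene"] = df["gene"].str.strip().str.upper()  # normalize gene names
df = df.dropna(subset=["count"])                 # drop incomplete rows
df["count"] = df["count"].astype(int)            # restore integer counts
```

None of this is glamorous, but a downstream pipeline that expects consistent identifiers and complete rows will fail (often cryptically) without it.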
Building your first pipeline (ft. LatchBio SDK)
Eventually, you’re going to want to start tinkering with pipeline code; maybe you’ll even want to try your hand at building your own bioinformatics workflow. Even if you’re a whiz at picking statistical approaches and an expert troubleshooter, building your own pipeline from start to finish involves much more than what’s covered in this article.
We can talk about picking the right model and making pretty figures all day, but building your own pipeline will bring forward new challenges.
Data science boot camps aren’t going to teach you how to design sleek user interfaces, how to put together the necessary computing and storage infrastructure for people to run their jobs on your pipeline, or how to market your code in the right places so that others will actually use it. Doing all of this requires months of work and sometimes a whole team to support you.
Luckily nowadays, you don’t need the skill set of a whole tech startup just to bring your bioinformatics pipeline to users.
Building pipelines on the SDK
For anyone who wants to get their hands dirty deploying a bioinformatics pipeline for a bigger audience, I’d recommend checking out LatchBio and their SDK developer platform. The SDK handles the infrastructure needs of people building bioinformatics tools while also exposing those tools through a no-code interface!
If you’re wondering what developing on the SDK is like, check out this loom video I’ve made of my experience. For anyone interested in learning more about developing on the SDK, I’d recommend visiting the Latch docs.
What makes Latch so cool is that it’s among the first tools bringing the developer-first model of places like GitHub to the bioinformatics space. If you want to get your pipeline onto the platform, all you need to do is request developer access and write a few tailored Python scripts that will subprocess your pipeline on a new GCP-13 instance.
Final Thoughts: What to prioritize learning when starting out?
A lot of students entering their first bioinformatics gig come in with a basic understanding of data science and programming, only to be thrown into the deep end of debugging and data cleaning.
I’d argue that this learning priority for new bioinformaticians is flipped: it’s more important to focus on learning higher-level bioinformatics best practices before getting wrapped up in granular tasks like file formatting and data cleaning.
Speak to whoever you can, try to build your own bioinformatics project using publicly available data, and do whatever you can to problem-solve outside of a classroom tutorial.
Regardless of how you start your foray into bioinformatics, the most important thing to do, as always, is to just get started.
Hope this helped, happy wrangling!
— Michael Trinh