Student Assignment, BioInformatics Final Project
- Paige Kramer, University of Kansas, School of Professional Studies, Biotechnology
- Jack Treml, University of Kansas, School of Professional Studies, Biotechnology
This final project assignment can be used to implement the accompanying walkthrough in bioinformatics (or other applicable) classes.


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© The Author(s)


Bioinformatics Project Ideas/Topics Collection For Engineering Students
- Predicting Cellular Localization. Eukaryotic cells contain several sub-compartments; the cellular localization problem consists of predicting in which compartment a protein is most likely to be found, on the basis of sequence information alone. The project may consist of a review of the literature and/or a novel analysis (I have access to a data-set that has never been used in a predictive context).
- Regulatory motifs. Review of the literature on algorithms to automatically determine regulatory motifs (short sequence signals) in DNA sequence data. I have a Java library that can be used to implement a prototype application; see suffix tree below.
- SNP (Single Nucleotide Polymorphism). Review the literature on methods for detecting SNPs, as well as their applications. Single nucleotide polymorphisms (SNPs) are common DNA sequence variations among individuals. They promise to significantly advance our ability to understand and treat human disease. (Excerpt from snp.cshl.org). See also Linkage analysis. (S)
- Metabolic Pathways. Proteins interact together to perform specific functions. Such a network of interactions is called a molecular pathway. There are two main aspects to this field: how to infer/determine the connections and how to simulate cellular processes. There exist several computational approaches to model molecular pathways, including Petri nets.
- Molecular micro-arrays. Today's technology (which borrows from inkjet technology) makes it possible to fix tens of thousands of different macromolecules (DNA or protein molecules) onto a small surface. This technology reveals which macromolecules are expressed at different times, within different tissues, or in different cellular states (disease vs. non-disease). In the case of DNA chips, they measure the level of expression of each gene.
- Mass spectrometry (MS) . MS produces a spectrum of all the masses of all the compounds that are present in a sample. When an input protein is cut at specific sites, it will produce a specific spectrum. Such technology can now be used to fingerprint the content of a cell.
- Expression data + motif discovery. DNA micro-arrays make it possible to find genes that are simultaneously expressed. Those genes are most likely co-regulated, i.e. they share a common sequence signal in their promoter region. Daniela Cerna implemented a suffix tree library in Java in the context of her honours project. Here, we would be re-using the library to help find conserved motifs.
- Expression data + cell localization. Can the use of (predicted and experimental) data on cellular localization help distinguish between true and false positives when expression data is analyzed to find activators and inhibitors?
- Genome comparison. Implement a MUMmer-like algorithm using Daniela's suffix tree (Java) library. This involves writing a hybrid algorithm: k-band dynamic programming + suffix trees.
- Genome rearrangements. Genomes are evolving at several scales, from point mutations to large rearrangements. In the late 80s, it became evident that several closely related genomes had genes that were extremely similar (say 99% identity), one to another, but the order of genes along the chromosomes was not preserved. Review and present the main algorithms to compare entire genomes. Topics include: sorting by reversals (Sankoff), the breakpoint graph, and the Hannenhalli-Pevzner algorithm.
- Accurate Phylogenetic Reconstruction from Gene-Order Data.
- Ontologies. What is an ontology? What tools and knowledge representation formalisms (languages) are available to support the development of ontologies? Give examples of ontologies. Expose the problems associated with ontologies. An ontology is a controlled vocabulary (e.g. gene ontology); it helps resolve some of the problems associated with data integration.
- Genome assembly. Because of physical limitations, only relatively short DNA sequences can be read (some 500 nt). For processing a complete genome, one approach, called shotgun sequencing, consists of sampling small reads (500 nt) at random locations along the chromosomes. The total number of reads is chosen so that the likelihood that each nucleotide is included in more than one read is high (typically each nt is part of 3, 5 or 10 reads). Computers are then used to stitch the reads together. One solution to this problem is related to the shortest superstring problem.
- Grammatical frameworks for RNA structure . RNA secondary structure information can be represented using context-free grammars. As with most biological data, the information is better represented within a statistical framework. A Stochastic Context-Free Grammar (SCFG) has probabilities attached to its production rules. The two main issues with SCFGs are the parsing and the induction of the grammar. Review the literature on SCFGs (this includes COVE, infernal and pfold), and build a prototype parser in Java.
- Predicting Gene-Gene (Protein-Protein) interactions. There exists a vast number of algorithms for predicting whether two genes interact. These include: text mining, co-location along the chromosomes, phylogenetic footprinting, etc.
- Lattice models. Predicting the three-dimensional structure of a protein is a notoriously difficult problem, so much so that alternative problems have been devised to circumvent it: secondary structure prediction, the inverse folding problem, etc. Some authors have also been studying simpler systems, such as 2D and 3D lattices. Create your own implementation; this includes an algorithm to efficiently search the folding space and a scoring function. Run some simulations.
- Structure comparison methods. Review the literature on 3D structure comparison. Implement at least one algorithm. Input: 2 three-dimensional structures; output: a measure of distance (typically root-mean-square deviation expressed in Å) and a list of equivalent residues.
- Methods for detecting trans-membrane helices. There is a class of transmembrane proteins whose secondary structure can be reliably predicted. Those proteins are mainly made of helices, such that if the loop connecting helices i and i + 1 is exposed to the inside of the cell, then the next one will be exposed to the outside of the cell. Use a Hidden Markov Model or Neural Network to reproduce this result.
- Secondary Structure Prediction . Implement a secondary structure prediction method and compare its accuracy to known methods. Common choices for your implementation include: Neural Networks, Hidden Markov Models, and possibly decision trees.
- Surface/Interior. Implement an algorithm to predict solvent accessibility. Common choices for your implementation include: Neural Networks, Hidden Markov Models, and possibly decision trees.
- Applications of suffix trees. Use Daniela Cerna's suffix tree library and implement some of the following algorithms: a linear-time algorithm for finding the longest common substring of k strings (interestingly, Knuth had predicted that no linear-time algorithm would be found for this problem), finding all maximal repetitive substrings in linear time, finding all maximal palindromes, the k-mismatch algorithm.
- Bio-Ethics. Bioinformatics deals with biological and medical data; accordingly, there are numerous related ethical issues. Should patenting genes be allowed? How should patient data be handled? How should genomic data be handled? Imagine that the analysis of a dataset allows conclusions to be drawn about a population, a religious group, people who live in a specific region, etc. The consequences can be severe: it could be that this group is more likely to suffer from certain diseases, and such information could be used by insurance companies, employers, etc. to screen candidates.
- Genome motifs viewer. Construct a flexible graphical user interface to visualize shared motifs. Suggestions: make it 3D to ease viewing multiple strings. Motifs would be extracted from a suffix tree.
- Teaching tools: interactive linear-time construction of a suffix tree, showing the suffix links; interactive tools for sequence alignments.
- Expectation-Maximization (EM) algorithm and some of its applications in molecular biology . EM is used for training certain Hidden Markov Models, Covariance Models and building phylogenetic trees. What is it? What are the main applications? Prototype implementation. (S)
- Gibbs sampling . This technique forms the basis for several motif detection tools. What is it? What are the main applications? Prototype implementation. (S)
- Bayesian networks . What are bayesian networks? What is interesting about them? What are the bioinformatics applications of bayesian networks? Carry out a small experiment. (S)
- Predicting Phenotype from Patterns of Annotation, micro-arrays, etc. One of the goals of bioinformatics research is to transform molecular biology into a predictive science. For example, given a certain pattern of gene expression, detected by micro-arrays, what would be the best treatment (personalized medicine)? Survey the literature on the use of bioinformatics techniques to assist medical diagnosis, prognosis and treatment. Where are we heading? When will personalized medicine become a reality? How much data is needed? What problems remain to be solved?
- Statistics behind BLAST. A good candidate for multi-team work, where one team would focus on the statistics of word matching while the other would focus on hashing. Produce a Java implementation of hashing techniques for speeding up the sequence alignment problem. The part on the statistical analysis of hits requires a statistical background (S), but the algorithmic part does not.
- Constructing phylogenetic trees . Read an overview of the construction of phylogenetic trees using a neighbour-joining approach. For this project, you will produce a prototype implementation, in Java, of a modern method such as: quartet method, maximum likelihood or maximum parsimony. (s)
- QSAR. One of the main bioinformatics contributions to drug discovery is Quantitative Structure-Activity Relationship (QSAR) analysis; the other is molecular docking. QSAR analyses take as input a set of compounds and their relative activity/efficacy, then find the commonalities between those molecules. The commonalities are then used to design new and better drugs.
- Molecular docking consists of predicting how two molecules will interact. This can either be two proteins or one protein and a small compound, such as a new drug. The two main factors that are taken into account are the shape and electrostatics of the two molecules.
- BioJava is a large collection of classes for solving bioinformatics problems. See #-Link-Snipped-#.
- Java3D . A protein viewer was developed two years ago in the context of a CSI 4900 project. Extensions of this project could be considered.
- Tandem repeats. Review the literature on tandem repeat detection and implement a prototype application. Tandem repeats are repeats of the form s^n (adjacent copies of a substring s), with n ≥ 2.
- Simultaneous alignment and structure prediction for two RNA sequences. Implement a simplified version of Dynalign, where the secondary structure prediction is calculated using the Nussinov algorithm, i.e., finds the maximum number of base pairs (see the sketch after this list).
- Three-way genome alignments.
- Testing for absence of secondary structure in combinatorial sets of DNA strands.
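For the Nussinov-style structure prediction mentioned in the Dynalign item above, the core recurrence is short enough to sketch. The following is a minimal, illustrative Python implementation of base-pair maximization (no minimum loop length, no traceback); the sequence and function name are made up for the example.

```python
# Minimal Nussinov-style dynamic program: maximize the number of nested,
# complementary base pairs in one RNA sequence. Illustrative only; a real
# implementation would add a minimum loop length and a traceback step.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(seq: str) -> int:
    n = len(seq)
    # dp[i][j] = maximum number of base pairs in the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = dp[i + 1][j]                      # position i left unpaired
            if (seq[i], seq[j]) in PAIRS:            # i pairs with j
                best = max(best, dp[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                # bifurcation point k
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

print(max_base_pairs("GGGAAAUCC"))  # prints 3 for this toy sequence
```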
Akshay Sanap "Resolving Complexity of Bioinformatic Algorithms using Python".
Hi, I'm Umer, currently trying to find a good final project for a master's in bioinformatics. Could you please send me more details on the "Predicting Gene-Gene (Protein-Protein) interactions" topic above, at [email protected]?
Hello, I am a final-year bioinformatics student and I want to select molecular docking as my final-year project. Kindly provide a dataset for this project and its code as well.
Hi, I am a final-year bioinformatics student. I want to do a good programming project using neural networks and deep learning. Please suggest an idea or dataset to work on.
Can I get help with a next-generation sequencing topic on genes related to breast, pancreatic or lung cancer?
Good day! I am an undergraduate and I need project topics on biomedical informatics.
Available Projects in Bioinformatics and Machine Learning
- Discriminative graphical models for protein sequence analysis (joint project with Sanjoy Dasgupta)
- Embedding sequences into Euclidean spaces
- Discovering the genetic basis of human disease
- Statistical and algorithmic aspects of motif discovery
- Promoter discovery in Drosophila
- Promoter modeling in bacteria and yeast
- Regulatory aspects of human disease
Bioinformatics final project, being continued as an independent project. Completed in Python using Kivy, web scraping (Selenium and Beautiful Soup), and regex to edit FASTA and ALN files.
hollant3510/Bioinformatics-project

Final Year Bioinformatics Project Ideas at ElysiumPro
Bioinformatics is a vast, interdisciplinary field that develops methods and software tools for understanding biological data. Many students today choose to build their final-year projects in this domain.
At ElysiumPro, we provide high-quality bioinformatics projects for students. Bioinformatics applies computing to extract information stored in various types of biological data; its main aim is to convert complex data into useful information.

Advances and Applications of Bioinformatics Project Ideas

BIOINFORMATICS PROJECTS
1. Cloning and restriction studies
2. PCR and Primer Design
3. Structure Prediction or Modeling of Proteins
4. Genome analysis and annotation
5. Nucleotide sequence and analysis
6. Structural analysis
7. Phylogenetic analysis
8. SNP Analysis
9. Gene Silencing
10. Protein Sequence Analysis
11. Comparative Genomics
12. Data mining for microarrays
13. Docking and Drug Design
14. RNA Analysis
15. Drug Target Identification
16. Vector Construction
17. Identifying Biomolecular Subgroups Using Attractor Metagenes
18. Predicting Protein Secondary Structure Using a Neural Network
19. Gene Ontology Enrichment in Microarray Data
20. Exploring Genome-wide Differences in DNA Methylation Profiles
21. Identifying Differentially Expressed Genes from RNA-Seq Data
22. Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data
23. Developing MapReduce Algorithms for Next-Generation Sequencing
24. Working with Illumina Solexa Next-Generation Sequencing Data
25. Performing a Metagenomic Analysis of a Sargasso Sea Sample
26. Analyzing the Human Distal Gut Microbiome
27. Calculating and Visualizing Sequence Statistics
28. Aligning Pairs of Sequences
29. Working with Whole Genome Data
30. Comparing Whole Genomes
31. Assessing the Significance of an Alignment
32. Using Scoring Matrices to Measure Evolutionary Distance
33. Using HMMs for Profile Analysis of a Protein Family
34. Building a Phylogenetic Tree for the Hominidae Species
35. Analyzing the Origin of the Human Immunodeficiency Virus
36. Analyzing Synonymous and Nonsynonymous Substitution Rates
37. Investigating the Bird Flu Virus
38. Reconstructing the Origin and the Diffusion of the SARS Epidemic
39. Bootstrapping Phylogenetic Trees
40. Exploring Primer Design
41. Identifying Over-Represented Regulatory Motifs
42. Predicting and Visualizing the Secondary Structure of RNA Sequences
43. Working with Objects for Microarray Experiment Data
44. Analyzing Illumina® Bead Summary Gene Expression Data
45. Detecting DNA Copy Number Alteration in Array-Based CGH Data
46. Analyzing Array-Based CGH Data Using Bayesian Hidden Markov Modeling
47. Visualizing Microarray Data
48. Gene Expression Profile Analysis
49. Working with Affymetrix® Data
50. Preprocessing Affymetrix® Microarray Data at the Probe Level
51. Exploring Gene Expression Data
52. Analyzing Affymetrix® SNP Arrays for DNA Copy Number Variants
53. Working with GEO Series Data
54. Preprocessing Raw Mass Spectrometry Data
55. Visualizing and Preprocessing Hyphenated Mass Spectrometry Data Sets for Metabolite and Protein/Peptide Profiling
56. Identifying Significant Features and Classifying Protein Profiles
57. Differential Analysis of Complex Protein and Metabolite Mixtures using Liquid Chromatography/Mass Spectrometry (LC/MS)
58. Genetic Algorithm Search for Features in Mass Spectrometry Data
59. Batch Processing of Spectra Using Sequential and Parallel Computing
60. Visualizing the Three-Dimensional Structure of a Molecule
- Faculty of Biology, Medicine and Health
Final year project
Your final year project is your opportunity to undertake a research project in an area of your interest, whilst potentially contributing to cutting edge scientific research.
Many of our students work alongside our renowned University of Manchester research scientists during their final-year projects, contributing to their research.
There are a wide range of project types available:
Laboratory based project
Design and carry out a piece of original research in a specialist research laboratory.
Many of our students have had their projects published in the Biological Sciences Review .
Field based project
Design and carry out a piece of original research in the field. A recent project took place in a conservation area in Peru, examining populations of caiman species (reptiles similar to crocodiles).
Science communication projects
eLearning project
Plan, design, develop, and evaluate an electronic resource to support eLearning.
Education project
Work with a school or other educational organisation to design a product such as a practical or website which may be of value in teaching and learning.
A recent project involved creating a display at Manchester Museum to teach children about carnivorous plants.
Science media project
Produce a portfolio of communication materials including articles for scientific magazines, a presentation to a scientific audience and a creative piece such as a video, podcast or poster.
Centre for the History of Science, Technology and Medicine project
Engage in independent and original research on an aspect of the development of modern science, technology and medicine and/or science communication. A recent project investigated the treatment of postnatal depression in 19th century asylums.
Enterprise project
Work in a team to develop a business plan for a real product or service in the area of biological or biomedical sciences. A recent project was a proposal for provision of a type of DNA microchips that allow for rapid screening of food products and preparation areas for food-borne pathogens.
Bioinformatics project
Carry out research using computers. This may be done by running existing software, querying online data resources, or designing and writing your own software. A recent project identified areas of interaction between specific proteins by analysing protein structure data available in a protein database; interactions between proteins govern the majority of biological processes.
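As a rough illustration of how such a structure-based interface analysis might start, here is a minimal sketch using Biopython's Bio.PDB: it lists residue pairs from two chains whose atoms lie within a distance cutoff. The file name, chain identifiers and the 5 Å cutoff are illustrative assumptions, not details of the project described above.

```python
# Sketch: find candidate interface residues between two chains of a PDB
# structure by locating inter-chain atom pairs closer than a cutoff.
# The file name, chain IDs and 5 Å cutoff are illustrative assumptions.
from Bio.PDB import PDBParser, NeighborSearch

structure = PDBParser(QUIET=True).get_structure("complex", "complex.pdb")
model = structure[0]
chain_a, chain_b = model["A"], model["B"]

# Index all atoms of chain B for fast spatial queries.
search = NeighborSearch(list(chain_b.get_atoms()))

interface = set()
for residue in chain_a:
    for atom in residue:
        # Any chain-B atom within 5 Å marks this residue pair as contacting.
        for close_atom in search.search(atom.coord, 5.0):
            partner = close_atom.get_parent()
            interface.add((residue.get_resname(), residue.id[1],
                           partner.get_resname(), partner.id[1]))

for res_a, pos_a, res_b, pos_b in sorted(interface, key=lambda t: t[1]):
    print(f"A:{res_a}{pos_a} -- B:{res_b}{pos_b}")
```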

5 Machine Learning Projects in Bioinformatics For Practice
Explore top machine learning project ideas to understand the applications of machine learning in bioinformatics. Last updated: 02 Feb 2023.
The term "bioinformatics" represents the use of computation and analysis methods to collect and analyze biological data. It's a multidisciplinary field that combines genetics, biology, statistics, mathematics, and computer science. Various branches of bioinformatics, including genomics, proteomics, and microarrays, extensively use machine learning for better outcomes.

Personalized Medicine: Redefining Cancer Treatment
Top 5 Machine Learning Projects in Bioinformatics
Here are five exciting machine learning projects for bioinformatics to help you understand the application of machine learning in healthcare , mainly bioinformatics.

1. Anti-Cancer Drug Efficacy Prediction
Predicting which patients are likely to benefit from a specific therapy is a significant concern in cancer treatment because, generally speaking, not all patients will benefit from a particular medication. Accurate prediction enhances the efficacy of treatment and spares non-responders unnecessary suffering. Thus, there is an immediate need to find reliable biomarkers (i.e., genes or proteins) that can precisely predict which patients respond best to which medications. For this project, you will use fundamental data science techniques, such as data processing, integration, analysis, and visualization, to determine the most effective biomarkers for various cancer types.

2. Autism Mutation Detection
In this machine learning project for bioinformatics, you will develop a deep-learning-based system that predicts the regulatory effects and potentially harmful impacts of genetic variants, addressing the problem of detecting the impact of noncoding mutations on disease. This predictive genomics framework is likely relevant to complex human diseases, illustrates the significance of noncoding mutations in ASD (autism spectrum disorder), and identifies high-impact mutations for further analysis. If you want to add a distinctive project to your machine learning portfolio, try working on this one.
3. Personalized Cancer Medication
This deep learning project predicts how different genetic variations affect a patient's health. You can use the MSKCC (Memorial Sloan Kettering Cancer Center) database, which includes thousands of mutations that have been thoroughly classified by expert scientists and physicians. For this machine learning project, you will build a classifier with the Keras deep learning library and an LSTM that automatically categorizes genetic variants, using this data set as a starting point. The project also entails NLP text processing techniques such as lemmatization, stemming, and tokenization.
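A minimal sketch of the kind of model this description points at is shown below: an embedding + LSTM text classifier over clinical-evidence text. The toy texts, tokenizer settings, sequence length and the nine output classes are illustrative assumptions, not the MSKCC pipeline itself.

```python
# Sketch: classify variants from free-text clinical evidence with an
# embedding + LSTM model in Keras. Toy data, vocabulary size, sequence
# length and the 9-class output are illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

texts = ["missense mutation in the kinase domain, gain of function reported",
         "truncating frameshift variant, likely loss of function"]
labels = np.array([0, 1])            # toy class IDs (real data has 9 classes)

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=200)

model = Sequential([
    Embedding(input_dim=20000, output_dim=64),
    LSTM(64),
    Dense(9, activation="softmax"),   # one output per variant class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, labels, epochs=2, batch_size=2)   # toy run only
```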
4. Human Disease Genetic Basis Identification
Human genomes vary between individuals by about 0.1%. Our genetic inclination to specific disorders, such as hypertension, is encoded within this small degree of variation. We can accurately define which gene variants belong to each disease by comparing populations of healthy and diseased people and their variations in the genes responsible for the diseases. In this bioinformatics, AI and machine learning project, strategies for finding the variation corresponding to disease are developed, along with statistics to support the predictions. Furthermore, the project develops methods for predicting how a gene mutation can alter the structure of the protein or the regulatory structure. You can also estimate the history and evolution of a disease risk factor by reconstructing the genes' phylogeny.
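The statistics mentioned above often start with a simple case/control association test. Below is a minimal sketch using SciPy; the allele counts are made-up illustrative numbers, not real study data.

```python
# Sketch: test whether an allele is associated with disease status using a
# 2x2 case/control contingency table. The counts are made-up example data.
from scipy.stats import chi2_contingency

#            allele A  allele a
cases    = [   412,      188   ]   # allele counts in diseased individuals
controls = [   315,      285   ]   # allele counts in healthy individuals

chi2, p_value, dof, expected = chi2_contingency([cases, controls])
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```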
5. Build a DNA Sequence Classifier
In this project you will build a classification model that predicts a gene's function from the DNA of its coding sequence alone. You will write a function that extracts all overlapping k-mers of a given length from any sequence string, joins each gene's k-mer list into a word-like string, and counts the k-mers using scikit-learn's NLP tools.
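A compact sketch of that k-mer approach is shown below; the sequences, labels and k = 6 are illustrative assumptions.

```python
# Sketch: turn DNA coding sequences into overlapping k-mer "words" and train
# a simple classifier on them. Sequences, labels and k=6 are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def kmers(sequence: str, k: int = 6) -> str:
    """Return all overlapping k-mers of a sequence as one space-joined string."""
    sequence = sequence.upper()
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Toy training data: coding sequences and gene-function class labels
train_seqs = ["ATGGCGTACGTTAGC", "ATGCCCGGGAAATTT", "ATGTTTAAACCCGGG"]
train_labels = [0, 1, 1]

vectorizer = CountVectorizer()                     # bag-of-k-mers representation
X = vectorizer.fit_transform(kmers(s) for s in train_seqs)

clf = MultinomialNB().fit(X, train_labels)

query = "ATGCCCGGGAAATAT"
print(clf.predict(vectorizer.transform([kmers(query)])))
```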

Bachelor’s Degree Final Project
The Bachelor’s Degree in Bioinformatics includes a mandatory Bachelor's Degree Final Project of a scientific-professional nature to be carried out during the last two terms of students’ final year.
The Bachelor's Degree Final Project can either be a comprehensive project in the field of technologies specific to bioinformatics that brings together the competences acquired over the course of the degree or a project that explores an innovative idea (e.g. a computer program, scientific model for a biomedical question or a biological phenomenon).
Students will be overseen by a supervisor throughout the process of preparing their final project, which they will present to a board of examiners made up of lecturers on the degree course at the end of the final academic year.
Outstanding BDFPs
Google Summer of Code 2022 Project Ideas
Shortcut to project ideas:
- Configurable feature visualization to improve the user experience and performance of the feature viewer
- Space- and time-efficient data format for mass spectrometry data (OpenMS)
- Efficient data layout for mass spectrometry data (OpenMS)
- GPU support for toil-cwl-runner (UCSC/CWL project)
- Improving automated wrapping of C++ code in Python (OpenMS/autowrap)
- Migration of Journal Policy Tracker backend to Express + GraphQL
- Journal Policy Tracker: finalise and deploy React front-end
- Genestorian: data refinement
- Citation and databasing functionality in luox
- Cross-project ideas

Cross-project ideas
OBF is an umbrella organization which represents many different programming languages used in bioinformatics. In addition to working with each of the “Bio*” projects (listed below) we also accept “cross-project” ideas that cover multiple programming languages or projects. These collaborative ideas are broadly defined and can be thought of as “unfinished” — interested contributors should adapt the ideas to their own strengths and goals, and are responsible for the quality of the final proposed idea in their application.
Feel free to propose your own entirely new idea.
Project description
Analyzing positional features/annotations in sequences is important in bioinformatics. Visualizing such data is quite a challenging task, considering the large amount of data to be displayed. The feature viewer is an open-source JavaScript library developed to visualize biological data (referred to as features) mapped to a linear sequence (Paladin et al., 2020). For instance, it can be configured to visualize the location of protein domains or amino acid variations in a protein sequence. The feature viewer is used in several popular bioinformatics resources such as neXtProt and COSMIC 3D.
Currently, the feature viewer supports limited configurability options in the features displayed, such as the color, shape and on-click behavior. This is too restrictive for some of the possible use cases of the feature viewer, where more flexibility is required in the display of features. One such instance is when different types of amino acid variants should be displayed in a color-specific manner in the same feature track.
The overall goal of this project is to improve the configurability of the feature viewer, such that it allows greater flexibility in the visualization of detailed biological data. Specific aims of the project are:
- The current version of the feature viewer requires new tracks to be hard-coded. Implementing a solution allowing new tracks to be added, modified, or deleted would ease the use of the viewer.
- Protein sequences can have features/annotations which are numerical, for instance the frequency of amino acid variants observed at a specific position in a population. Such numerical data can be visualized as graphs of different types, such as line graphs or histograms.
- Currently all the features displayed on a single track have the same color and/or shape; interesting features cannot be highlighted using a different color or shape.
- Currently all the data has to be provided before display, which results in slow loading and rendering when there are tens of thousands of features. To improve the user experience, the viewer could initially show a message summarizing the data and only fetch and display the full data on demand.
- The user community has requested a download button that generates a snapshot of the feature viewer, allowing users to include an image of the displayed data in a publication or elsewhere.
Project size
Languages and skills needed
- Optionally: Java
- Medium to Hard
Estimated Length
- Kasun Samarasinghe ([email protected])
- Lydie Lane ([email protected])
Contributor benefits
- Gain experience in developing configurable libraries
- Gain experience in handling large amounts of data in web applications
- Gain experience on web application user experience
- Gain experience in biological data such as gene, protein sequences
How to apply
- For information on how to apply, please contact mentors in the emails provided above.
OpenMS is a framework for computational mass spectrometry. Modern mass spectrometers produce large files (e.g., 100 GB) that can't be easily stored or accessed in the established XML file format mzML. Recently, an update to mzML has been developed that uses HDF5 to store Blosc-compressed spectra in binary format, called mzMLb. In this project, the student will add a reader and writer for the mzMLb file format to OpenMS. To some extent, code from the OpenMS reader and writer for the mzML file format can be reused, and inspiration can be taken from reference implementations by other parties.
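The OpenMS work itself is in C++, but the storage idea behind mzMLb is easy to illustrate in a few lines of Python: spectrum arrays kept as Blosc-compressed binary datasets in an HDF5 file. The dataset names below are illustrative assumptions, and the Blosc filter is assumed to come from the hdf5plugin package.

```python
# Illustration of the mzMLb storage idea (not the OpenMS implementation):
# spectrum arrays stored as Blosc-compressed binary datasets in HDF5.
# Dataset names are illustrative assumptions.
import numpy as np
import h5py
import hdf5plugin   # provides the Blosc compression filter for h5py

mz = np.random.uniform(100, 2000, size=100_000)          # fake m/z values
intensity = np.random.exponential(1000, size=100_000)    # fake intensities

with h5py.File("spectra_demo.h5", "w") as f:
    f.create_dataset("spectrum_0/mz", data=mz, **hdf5plugin.Blosc())
    f.create_dataset("spectrum_0/intensity", data=intensity, **hdf5plugin.Blosc())

with h5py.File("spectra_demo.h5", "r") as f:
    print(f["spectrum_0/mz"].shape, f["spectrum_0/intensity"][:5])
```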
- Optionally: CMake
- Easy for experienced C++ programmers with good CMake and GitHub skills. Medium for experienced C++ programmers without CMake or GitHub skills.
- Timo Sachsenberg, GitHub: https://github.com/timosachsenberg
- Julianus Pfeuffer, GitHub: https://github.com/jpfeuffer
- Samuel Wein, GitHub: https://github.com/poshul
Contributor Benefits
- Gain experience working in a friendly team of developers.
- Get a glimpse into the exciting field of bioinformatics / computational mass spectrometry.
- Obtain detailed knowledge on HDF5, which is widely used to store large data efficiently.
How to Apply
- Please introduce yourself in our Gitter channel: https://gitter.im/OpenMS/OpenMS and get in contact with us prior to writing a proposal.
OpenMS is a framework for computational mass spectrometry. It features a wide range of algorithms and data structures to process and analyze mass spectra. For some very computationally demanding parts, we performed manual code conversion to make the layout of our data better fit the data access patterns of our algorithms. We observed the biggest speedup switching the data layout from an Array of Structs (AoS) to a Structure of Arrays (SoA). In this project, the GSoC contributor will adapt our core data structure for mass spectra to SoA. Ideally, the contributor should use a modern C++ zero-cost abstraction (e.g., building on https://github.com/crosetto/SoAvsAoS) that makes the old code work with no (or minimal) manual changes.
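The project itself targets C++, but the AoS-versus-SoA distinction is easy to see in a small NumPy sketch (field names are illustrative): a structured array interleaves fields in memory, while separate contiguous arrays keep each field together, which favours vectorised per-field access.

```python
# Illustration (in NumPy) of the AoS vs. SoA layouts discussed above.
# Field names are illustrative; the OpenMS work itself is in C++.
import numpy as np

n = 1_000_000

# Array of Structs: one record per peak, fields interleaved in memory.
aos = np.zeros(n, dtype=[("mz", np.float64), ("intensity", np.float32)])

# Structure of Arrays: one contiguous array per field.
soa_mz = np.zeros(n, dtype=np.float64)
soa_intensity = np.zeros(n, dtype=np.float32)

# A per-field scan touches only the needed bytes in the SoA layout,
# but must stride over interleaved records in the AoS layout.
total_aos = aos["intensity"].sum()
total_soa = soa_intensity.sum()
```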
- Medium: requires a good understanding of modern C++.
- Timo Sachsenberg, GitHub: timosachsenberg
- Hannes Roest, GitHub: hroest
- Julianus Pfeuffer, GitHub: jpfeuffer
- Aditya R Rudra, Github: adityaofficial10
- Learn and apply modern C++ features to a real-life project.
Teach the TOIL workflow system how to support GPUs with CWL . Optionally this can be expanded to HPC clusters. This will enable researchers to run specialized workflows that need occasional GPU support efficiently on university computing clusters.
- Either 175 (basic support) or 350 hours (advanced job routing)
- Optional: HPC, Clustering
- Medium, or easier if you have experience with HPC or clustering.
- Michael Crusoe, https://github.com/mr-c
- Lon Blauvelt, https://github.com/DailyDreaming
- Exposure to workflow engines and distributed computing
- To get started, visit https://github.com/DataBiosphere/toil , review the readme and the contributing guidelines: https://toil.readthedocs.io/en/master/contributing/contributing.html#contributing
Autowrap is a Python package for the automated wrapping of whole C++ projects into Python via Cython. C++ developers basically need to provide a Cython header file for each C++ header file to specify what needs to be wrapped and how. Autowrap then analyses the syntax tree generated by the Cython parser for those "header" files and generates Cython source code from it. Cython then creates the necessary source code to be compiled with e.g. CPython to create a Python extension module to be imported by the end user. While the wrappers created by autowrap are rather simple, passing templated and nested STL objects like vectors, maps, or tuples between Cython/Python and C++ with autogenerated code can become rather complex. Autowrap offers recursion for nested vectors but cannot yet handle mixed data structures. It also lacks support for newer STL containers like tuples and only offers simple vector-to-(Python)-list conversions, while NumPy arrays, e.g. via the buffer protocol, would sometimes be more suitable. We are seeking a motivated GSoC contributor proficient in at least Python to tackle these improvements.
- Python (advanced knowledge, for code generation)
- Cython (basic knowledge, potentially possible to be acquired, syntax similar to Python; to be generated)
- C++ (basic knowledge, potentially possible to be acquired; to be wrapped)
- Medium: requires a deep understanding of Python and in the beginning at least some basic knowledge about its differences to C++ (regarding memory management and typing)
- Axel Walter, GitHub: axelwalter
- Get a deep understanding of two very commonly used programming languages and how to interface between them.
- It is also possible to learn how to create your own Python package with C++ extensions.
The Journal Policy Tracker is your go-to place where you can find all the open-source scientific journals and their policies. Currently the backend of this project is on Flask and SQLite3 along with SQLAlchemy as the ORM. This project aims to migrate the backend from Flask and SQL database to Express, GraphQL using express-graphql and a NoSQL database like MongoDB.
At the end of the program, the mentee is expected to do a successful migration of the existing server to an Express & GraphQL based backend.
- Repository: codeisscience/journal-policy-tracker-backend
- Existing API documentation: journal-policy-tracker.herokuapp.com/swagger
Required skills:
- Documentation (Markdown knowledge preferred)
Useful skills:
- Deploying Node.js apps to Heroku
- Familiarity with testing frameworks like Jest, Chai, Mocha in order to write unit tests, integration tests and e2e tests.
- Easy if you have experience with GraphQL
- Medium if you are familiar with Express.js
- Flexible, depending on contributor needs
- Pritish Samal, [email protected]
- Yo Yehudi, [email protected]
- Experience developing in backend js frameworks and GraphQL.
- Please always email all mentors in the same mail if you would like to ask questions or discuss the project.
- You can also join the Code is Science Slack workspace .
The Journal Policy Tracker is your go-to place to find all the open-source journals and their policies. Currently the front-end of this project is built with React and React Bootstrap. This project aims to finalise the front-end after GSoC 2021, add state management and a user dashboard, and decouple the front-end from CSS frameworks by using Grid and Flexbox for layout and presentation instead.
- Study the existing user-interface for the journal policy tracker
- Add functionality to the existing website while developing the components
- Migrate the existing frontend CSS libraries to vanilla CSS
- Work on the user-management dashboard
- Use context/Redux for state management
- Repository: codeisscience/journal-policy-tracker-frontend
- Frontend Preview: journal-policy-tracker.netlify.app
- Familiarity with HTML5 and CSS3 semantics
- Familiarity with UI/UX
- Familiarity with Writing tests
- Easy if you have experience with Grid, Flex box, and CSS page layout
- Medium if you are familiar with CSS page layout
- Isaac Miti, [email protected]
- Experience designing and developing front-end interfaces.
- Please visit the repo: codeisscience/journal-policy-tracker-frontend and make at least one contribution, and email the mentors to discuss your project proposal.
Genestorian data refinement
Genestorian is a web application to manage a collection of model organism strains and recombinant DNA in a life sciences laboratory.
New DNA sequences (inside or outside cells) are always generated by combining existing sequences. Genestorian leverages existing semantic web tools for synthetic biology and libraries for DNA visualisation to provide an intuitive interface where researchers can plan, document and revisit their experiments. Here you can find a short summary of the problem we are addressing, adapted for non-biologists.
An important challenge for the project is to migrate data from spreadsheets, where most labs keep their collections, to the database. In this project, the intern will develop a first version of a tool to perform the data refinement required to migrate from spreadsheet to the database.
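A minimal sketch of the kind of approximate string matching such a refinement tool needs is shown below, using only the Python standard library; the allele vocabulary and spreadsheet values are illustrative assumptions, not the Genestorian data model.

```python
# Sketch: map free-text spreadsheet entries onto controlled-vocabulary terms
# with approximate string matching (standard library only). The vocabulary
# and spreadsheet cell values are illustrative assumptions.
import difflib

known_alleles = ["ase1-GFP", "cdc25-22", "pom1Delta", "kanMX6"]
spreadsheet_cells = ["ase1 GFP", "cdc25_22", "pom1 delta", "unknown-thing"]

for cell in spreadsheet_cells:
    matches = difflib.get_close_matches(cell, known_alleles, n=1, cutoff=0.6)
    best = matches[0] if matches else "NO MATCH (needs manual review)"
    print(f"{cell!r:16} -> {best}")
```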
Required skills
- Good knowledge of text processing in a programming language (preferably Python).
- Willingness to learn the biology concepts that underlie the data models.
Useful skills
- Experience with data refinement and approximate string matching.
- Willingness to interact with the experimental researchers whose data will be refined.
- Manuel Lera Ramirez, [email protected]
- Yo Yehudi, [email protected]
Expected outcome
Development of a first version of a tool to perform the data refinement required to migrate from spreadsheets to the database. The task could focus only on the refinement program, but developing a web interface for the migration is also a possibility. In addition to mentorship, we will organise two half-day sessions with a professional Research Software Engineer to help and advise the contributor and to review code.
Difficulty level
Medium if you have experience with string matching, easier if you know a bit of biology.
luox is a free, open-access and open-source platform, written in JavaScript and React and running directly in the browser, for documenting and reporting light-related quantities derived from a light spectrum. It is targeted at biomedical researchers looking for a convenient way to make their research with light(ing) interventions reproducible. Researchers can request a DOI (digital object identifier) for an uploaded spectrum, which is stored in a compressed/hashed way in the URL. The goal of this project is to add simple database functionality to luox such that the web interface displays the DOI for any spectra that have already been assigned one.
- Platform: https://luox.app/
- Repository: https://github.com/luox-app/luox
- Article describing the platform: https://doi.org/10.12688/wellcomeopenres.16595.2
- Good knowledge of programming in JavaScript
- Good knowledge of web development
- Version control with Git
- Knowledge of DOIs (digital object identifiers)
- Manuel Spitschan, [email protected]
- Functionality to look up DOI from the compressed/hashed spectrum through a table
- Display of the associated DOI in the web interface
MS in Bioinformatics Degree Details and Courses
This Master of Science degree is a blended program offering courses from the Krieger School of Arts and Sciences and the Whiting School of Engineering. The curriculum is designed around 2 Required Core Courses, 3 Customizable Core Courses, and 6 Elective Courses. Based on your course selections, you will earn between 36 and 42 credits.
Additionally, this degree program offers an optional culminating experience of a Thesis as a 12th course, for four additional credits at full tuition.
Courses offered through the School of Arts and Sciences have course numbers which begin with 410. School of Engineering (EN) offerings have course numbers which begin with 605. Any EN.605 course descriptions are available on the Whiting School of Engineering’s website.
Core Courses - Required
Complete both courses.
Molecular Biology - 410.602
Epigenetics, Gene Organization & Expression - 410.610

Core Courses - Customizable
Choose 1 of these two courses: Algorithms for Bioinformatics - EN.605.620 or Foundations of Algorithms - EN.605.621
Choose 1 of these two courses: Biological Databases and Database Tools - EN.605.652 or Introduction to Bioinformatics - 410.633
Choose 1 of these two courses: Principles of Database Systems - EN.605.641 or Practical Computer Concepts for Bioinformatics - 410.634

Elective Courses
Select 6 Electives. At least one of your electives must meet the description under the "Elective – Biotechnology Course" category and another must meet the description of the "Elective – Computer Science Course" category, both detailed below.
- Semantic Web - 605.643
- Principles of Bioinformatics - 605.651
- Computational Genomics - 605.653
- Computational Drug Discovery and Development - 605.656
- Statistics for Bioinformatics - 605.657
- Modeling and Simulation of Complex Systems - 605.716
- Computational Aspects of Molecular Structure - 605.751
- Analysis of Gene Expression and High-Content Biological Data - 605.754
- Systems Biology - 605.755
- Bioinformatics: Tools for Genome Analysis - 410.635
- Protein Bioinformatics - 410.639
- Molecular Phylogenetic Techniques - 410.640
- Next Generation DNA Sequencing and Analysis - 410.666
- Gene Expression Data Analysis and Visualization - 410.671
- BioPerl - 410.698
- Advanced Practical Computer Concepts for Bioinformatics - 410.712
- Advanced Genomics and Genetics Analyses - 410.713
- Practical Introduction to Metagenomics - 410.734
- Genomic and Personalized Medicine - 410.736

Elective – Biotechnology Course
Choose one course from the list of more than 100 general biotechnology electives and science elective courses . Other courses may be considered with adviser approval, including 410. courses in the Elective Courses section listed above.

Elective – Computer Science Course
Choose one course from this representative list. Other courses may be considered with adviser approval, including 605. courses in the Elective Courses section listed above.
- Foundations of Software Engineering – 605.601
- XML Design Paradigms – 605.644
- Data Visualization – 605.662
- Principles of Enterprise Web Development – 605.681
- Agile Development with Ruby on Rails – 605.684
- Mobile Application Development for the Android Platform – 605.686
- Software Systems Engineering – 605.701
- Large Scale Database Systems – 605.741
- Machine Learning – 605.746
- Evolutionary Computation – 605.747
- Web Application Development with Java – 605.782
- Rich Internet Applications with Ajax – 605.787
- Big Data Processing Using Hadoop – 605.788
- Independent Research Project in Bioinformatics – 605.759
Optional 12th Course Culminating Experience
This option extends the 11-course degree program to 12 courses. Also, "Biostatistics" and "Independent Research in Biotechnology" must be taken as Biotechnology electives before this course can be taken.
Biotechnology Thesis - 410.801
Students wishing to complete a thesis may do so by embarking on a two-semester thesis project, which includes the 410.800 Independent Research Project and 410.801 Biotechnology Thesis courses. This project must be a hypothesis-based, original research study. The student must complete 410.800 Independent Research Project and fulfill the requirements of that course, including submission of a project proposal, final paper, and poster presentation, before enrolling in the subsequent thesis course. For the thesis course, students are required to submit a revised proposal (an update of the 410.800 proposal) for review and approval by the faculty adviser and biotechnology program committee one month prior to the beginning of the term. Students must meet with the faculty adviser periodically for discussion of the project’s progress. Graduation with a thesis is subject to approval by the thesis committee and program committee and requires the student to present his/her project to a faculty committee both orally and in writing. Prerequisites: Successful completion of 410.800 Independent Research Project and 410.645 Biostatistics.
STATE-SPECIFIC INFORMATION FOR ONLINE PROGRAMS
Students should be aware of state-specific information for online programs . For more information, please contact an admissions representative.
Computational Genomics and Data Science Program
Extracting knowledge from data is a defining challenge of science.
Computational genomics has been an important area of focus for NHGRI since the beginning of the Human Genome Project. Today, however, advances in tools and techniques for data generation are rapidly increasing the amount of data available to researchers, particularly in genomics. This increase requires researchers to rely ever more heavily on computational and data science tools for the storage, management, analysis, and visualization of data. NHGRI's commitment to computational genomics and data science is a key component of the NHGRI 2020 Strategic Vision and is in alignment with the NIH Strategic Plan for Data Science, which provides a roadmap for modernizing the NIH-funded biomedical data science ecosystem.
Read the Genomic Data Science Fact Sheet .
The NHGRI 2020 Strategic Vision highlights the importance of bioinformatics and computational biology by stating, “all major genomics breakthroughs to date have been accompanied by the development of groundbreaking statistical and computational methods.” Projects involving a substantial element of computational genomics or data science account for over a quarter of NHGRI’s FY2021 budget ; these areas are key components of many NHGRI grants and programs.
NHGRI’s support for computational genomics and data science follows the general principles and priorities identified in the 2022 NHGRI Funding Policy . NHGRI prioritizes funding support on “approaches generalizable across diseases and biological systems of higher order organisms and approaches that inform the development and implementation of genomics in clinical care.” Projects focusing on a single disease are less likely to be relevant to NHGRI than those generalizable across multiple diseases.
The Computational Genomics and Data Science Program (CGDS) supports the development of advanced computational approaches, innovative data analysis tools, and data resources that provide scientific utility across the extramural research programs and divisions. The CGDS program includes a number of managed grants and programs spanning many scientific topics. These grants can be categorized usefully, though neither exhaustively nor perfectly, into three categories: Genome Analysis Tools and Software Resources, Data Management Resources, and Genome Informatics Training and Workforce Development. This structure is illustrated in Figure 1. The program structure described below should be considered as a general and not exclusive framework for organizing grants into broad scientific categories of interest to NHGRI.

Figure 1: CGDS Program Breadth.
Genome Analysis Tools and Software Resources
The links below lead to NIH RePORTER, a database that provides information on NIH funded grants and research activities. Each link associated with a category will display the portfolio of FY2021 grants that received funding from the NHGRI Computational Genomics and Data Science Program.
Genetic variation, clinical and phenotype analysis
- Variation and association analyses : Methods for interpreting genetic variation, associating variation with phenotypes, and analyzing population data.
- Clinical and phenotype analyses : Methods for management and analysis of clinical phenotype and electronic health record (EHR) data.
- Sequencing informatics : Methods for processing, aligning, and formatting sequence reads, performing genome assembly, and extracting sequence features.
- Function analyses : Methods and tools that facilitate the use of gene regulation, gene expression, epigenetic modifications, and methylation data.
- Machine learning for genomics : Methods development and applications using ML to explore genomics data.
- General genome data analysis tools : Development of other tools not covered above.
- Software environments for management, analysis, and visualization of genomic data : Web-hosted environments (PaaS or SaaS) for genomics data storage, analysis, visualization, and sharing.
Data Management Resources
- Human and Model Organism Databases (HMODs) : Highly-curated and broadly-used databases that provide genomic data on humans and a range of significant model organisms.
- Data analysis and coordination centers (DACCs) : Centers that coordinate, manage, and analyze the data generated by large NHGRI consortium projects.
- Informatics solutions for security and privacy of genomic data : Resources for maximizing security in genomic data storage.
- Genomics and Phenomics data standards : Development of standards for easier annotation and sharing of large-scale genomics data and associated metadata.
Genome Informatics Training and Workforce Development
- Online resources for workforce development : Development of online resources (e.g. MOOCs), classroom courses or events for expanding and diversifying the genome informatics workforce.
- Educational resources and community engagement for transitioning genome informatics to the cloud: Non-research funding for programs designed to facilitate usage of cloud computing in genomics.
- Cloud computing resources for genome informatics: Infrastructure-building funding for genomic data science in cloud computing environments.
NHGRI developed a 2020 Strategic Vision that identifies the challenges, discoveries, and opportunities that lie on the horizon for human genomics in the coming decade. As the landscape of data science is rapidly growing, new strategic plans are crucial to guide NHGRI in pushing the forefront of genomics.
See an extensive outline of the 2020 Strategic Vision .
As a result of the rapid changes in biomedical research and information technology, several pressing issues related to the data-resource ecosystem confront NIH and other components of the biomedical research community. To address these challenges, NIH released its first Strategic Plan for Data Science on June 4, 2018, to provide a roadmap for modernizing the NIH-funded biomedical data science ecosystem. In establishing this plan, NIH addresses storing data efficiently and securely; making data usable to as many people as possible (including researchers, institutions, and the public); developing a research workforce poised to capitalize on advances in data science and information technology; and setting policies for productive, efficient, secure, and ethical data use.
- Future Directions of the NHGRI Analysis, Visualization, and Informatics Lab-space (AnVIL), October 29, 2021 (virtual). This workshop aimed to identify gaps, challenges, and future opportunities related to NHGRI’s investment in the AnVIL’s cloud-based infrastructure, tools, and services.
- Genomic Medicine XIII: Developing a Clinical Genomic Informatics Research Agenda, February 9-10, 2021 (virtual). The goal of this meeting was to develop a research strategy on the use of genomic-based clinical informatics resources to improve the detection, treatment, and reporting of genetic disorders in clinical settings.
- NHGRI Extramural Informatics & Data Science Workshop, September 29-30, 2016, Bethesda, MD. The goal of the workshop was to identify and prioritize opportunities of significance to the NHGRI Computational Genomics and Data Science Program over the next 3-5 years. A report outlining the opportunities identified during the workshop was presented to the NHGRI council in May 2017.
Investigators interested in submitting applications to NHGRI are encouraged to contact NHGRI program staff before submission to discuss their specific aims and their choice of Funding Opportunity Announcement (FOA). Contact information for NHGRI program staff is at the bottom of this page.
Investigator-Initiated Research in Computational Genomics and Data Science (R01, R21, and R43/R44): PAR-21-254 and PAR-21-255 invite applications for a broad range of research efforts in computational genomics, data science, statistics, and bioinformatics relevant to basic and/or clinical genomic science and broadly applicable to human health and disease.
Genomic Resource Grants for Community Resource Projects (U24): PAR-20-100 is tightly focused on supporting major genomic resources, including those in informatics. Potential applicants are strongly encouraged to contact NHGRI Program Staff before developing an application.
Trans-NIH Biomedical Knowledgebase (U24): PAR-20-097 is designed to support biomedical knowledgebases. Biomedical knowledgebases under this announcement should have the primary function to extract, accumulate, organize, annotate, and link growing bodies of information related to core datasets.
Trans-NIH Biomedical Data Repository (U24): PAR-20-089 is designed to support biomedical data repositories. Repositories under this announcement should have the primary function to ingest, archive, preserve, manage, distribute, and make accessible the data related to a particular system or systems.
Development and Implementation of Clinical Informatics Tools to Enhance Patients’ Use of Genomic Information (NOSI): NOT-HG-22-011 encourages applications to develop and implement patient-facing genomic-based clinical informatics tools that facilitate or enhance patient-provider electronic communication, patient tracking and registry functions, patient self-management and support, provider electronic prescribing, test tracking, referral tracking, and health care decision-making.
Parent NIH Solicitations: Parent R01 (PA-20-185 and PA-20-183), Parent R21 (PA-20-195 and PA-20-194), and Parent K25 (PA-20-199). These investigator-initiated grants allow researchers to target their specific area of science relevant to NHGRI’s mission (per the NHGRI Funding Policy). Other funding opportunities include PAR-21-075, which focuses on research experiences for students seeking a master’s degree. Additionally, NIH funding opportunities for Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) grants can be found at https://sbir.nih.gov/funding .
Other Relevant NIH Funding Opportunities
NHGRI's Funding Opportunities page links to various NHGRI funding opportunities and provides instructions for signing up for NHGRI's funding opportunities email list.
The webpage of the Office of Data Science Strategy (ODSS) provides resources and links to various informatics-related funding opportunities across the NIH and other Federal agencies.

Last updated: February 16, 2023
Open Access
Peer-reviewed
Research Article
A large-scale analysis of bioinformatics code on GitHub
Authors: Pamela H. Russell, Rachel L. Johnson, Shreyas Ananthan, Benjamin Harnke, and Nichole E. Carlson
Affiliations: Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America; High-Performance Algorithms and Complex Fluids, National Renewable Energy Laboratory, Golden, CO, United States of America; Health Sciences Library, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America

- Published: October 31, 2018
- https://doi.org/10.1371/journal.pone.0205898
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics . Data are available at https://doi.org/10.17605/OSF.IO/UWHX8 .
Citation: Russell PH, Johnson RL, Ananthan S, Harnke B, Carlson NE (2018) A large-scale analysis of bioinformatics code on GitHub. PLoS ONE 13(10): e0205898. https://doi.org/10.1371/journal.pone.0205898
Editor: Zhaohui Qin, Emory University Rollins School of Public Health, UNITED STATES
Received: June 1, 2018; Accepted: October 3, 2018; Published: October 31, 2018
Copyright: © 2018 Russell et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Full metadata on the articles describing each published repository are within the paper and its Supporting Information files. All data extracted from the GitHub API, except file contents, are freely available at https://doi.org/10.17605/OSF.IO/UWHX8 . For file contents, in the absence of explicit open source licenses for the majority of repositories studied, we recorded the Git URL for the specific version of each file so that the exact dataset can be reconstructed using our provided scripts. Additionally, we have removed personal identifying information from commit records, but have included API references for each commit record so that the full records can be reconstructed. Detailed instructions for reconstructing the omitted columns are provided in Supplemental Section 1 in S1 File .
Funding: This work was supported by National Institutes of Health / National Center for Advancing Translational Sciences Colorado Clinical and Translational Science Awards Biostatistics, Epidemiology and Research Design Program, Grant Number UL1 TR001082, received by N.C. ( http://www.ucdenver.edu/research/CCTSI/programsservices/berd/Pages/default.aspx ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Bioinformatics is broadly defined as the application of computational techniques to analyze biological data. Modern bioinformatics can trace its origins to the 1960s, when improved access to digital computers coincided with an expanding collection of amino acid sequences and the recognition that macromolecules encode information [ 1 ]. The field underwent a transformation with the advent of large-scale DNA sequencing technology and the availability of whole genome sequences such as the draft human genome in 2001 [ 2 ]. Since 2001, not only the volume but also the types of available data have expanded dramatically. Today, bioinformaticians routinely incorporate whole genomes or multiple whole genomes, high-throughput DNA and RNA sequencing data, large-scale genetic studies, data addressing macromolecular structure and subcellular organization, and proteomic information [ 3 ].
Some debate has centered around the difference between “bioinformatics” and “computational biology”. One common opinion draws a distinction between bioinformatics as tool development and computational biology as science [ 4 ]. However, no consensus has been reached, nor is it clear whether one is needed. The terms are often used interchangeably, as in the “Computational biology and bioinformatics” subject area of Nature journals, described as “an interdisciplinary field that develops and applies computational methods to analyse large collections of biological data” [ 5 ]. In this article we use the umbrella term “bioinformatics” to refer to the development of computational methods and tools to analyze biological data.
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software [ 6 – 9 ]. Reproducibility requires that authors publish original data and a clear protocol to allow repetition of the analysis in a paper [ 7 ]. Usability refers to ease and transparency of installation and usage. Version control systems such as Git and Subversion, which allow developers to track changes to code and maintain an archive of all old versions, are widely accepted as essential to the effective development of all non-trivial modern software. In particular, transparent version control is important for long-term reproducibility and usability in bioinformatics [ 6 – 9 ].
The dominant version control system today is the open source distributed system Git [ 10 ], used by 87.2% of respondents to the 2018 Stack Overflow Developer Survey [ 11 ]. A Git “repository” is a directory that has been placed under version control, containing files along with all tracked changes. A “commit” is a snapshot of tracked changes that is preserved in the repository; developers create commits each time they wish to preserve a snapshot. Many online sharing sites host Git repositories, allowing developers to share code publicly and collaborate effectively with team members. GitHub [ 12 ] is a tremendously popular hosting service for Git repositories, with 24 million users across 200 countries and 67 million repositories in 2017 [ 13 ]. Since its initial launch in 2008, GitHub has grown in popularity within the bioinformatics field, as demonstrated by the proportion of articles in the journal Bioinformatics mentioning GitHub in the abstract ( Fig 1 ). For an excellent explanation of Git and GitHub including additional definitions, see [ 14 ].
Here the term “repository” refers to online code hosting services. The journal Bioinformatics publishes new developments in bioinformatics and computational biology. If a paper focuses on software development, authors are required to state software availability in the abstract, including the complete URL [ 15 ]. URLs for software hosted on the popular services GitHub, Bitbucket, and SourceForge contain the respective repository name except in rare cases of developers referring to the repository from a different URL or page. The figure shows the results of PubMed searches for the repository names in the title or abstract of papers published in Bioinformatics between 2009 and 2017. The category “Abstracts with none of these” captures all remaining articles published in Bioinformatics for the year, and likely includes many software projects hosted on organization websites or featuring their own domain name, as well as any articles that did not publish software.
https://doi.org/10.1371/journal.pone.0205898.g001
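The trend in Fig 1 can be approximated programmatically. Below is a minimal sketch of a year-by-year PubMed query using Biopython's Entrez module; the library choice, the placeholder e-mail address, and the exact query string are assumptions for illustration, not the search strategy used for the figure.

```python
# Sketch: count Bioinformatics articles whose title/abstract mentions "github",
# year by year, via PubMed E-utilities. Query terms and client are illustrative.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI requires a contact address; placeholder

def count_hits(term: str, year: int) -> int:
    query = f'"Bioinformatics"[Journal] AND {year}[PDAT] AND {term}[Title/Abstract]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

for year in range(2009, 2018):
    print(year, count_hits("github", year))
```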
The bioinformatics field embraces a culture of sharing—for both data and source code—that supports rapid scientific and technical progress. In this paper, we present, to our knowledge, the first large-scale study of bioinformatics source code, taking advantage of the popularity of code sharing on GitHub. Our analysis data include 1,720 GitHub repositories published along with bioinformatics articles in peer-reviewed journals. Additionally, we have identified 23 “high-profile” GitHub repositories containing source code for popular and highly respected bioinformatic tools. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API [ 16 ]. We provide all scripts used to generate the dataset and perform the analysis, along with detailed instructions. We work within the GitHub Terms of Service [ 17 ] to make all data except personal identifying information publicly available, and provide instructions to reconstruct the removed columns if needed. Our main analysis results are provided as a table with over 400 calculated features for each repository.
Although the software engineering literature describes many analyses of GitHub data [ 18 – 24 ], bioinformatics software has not been looked at specifically. These software engineering studies often look only at highly active projects in wide community use, with many contributors utilizing the collaborative features of GitHub. Public bioinformatics software serves a variety of purposes, from analysis code supporting scientific results to polished tools intended for adoption by a wide audience. With exceptions, code bases published along with bioinformatics articles tend to be small, with one or a few contributors, and use GitHub mostly for its version control and public sharing features. Additionally, the interdisciplinary nature of bioinformatics creates a unique culture around programming, with developers bringing experience from diverse backgrounds [ 25 ]. The projects in our dataset treat a variety of scientific topics, use many different programming languages, and show a diverse range of team dynamics.
We describe our dataset from the perspective of the articles announcing the repositories, the source code itself, and the teams of developers. We observe several features that are associated with overall project impact. Our analysis points to simple recommendations for selecting bioinformatic tools from among the thousands available. Our dataset also contributes to and highlights the importance of the ongoing conversation around reproducibility and software quality.
A dataset of 1,740 bioinformatics repositories on GitHub
We curated a set of 1,720 GitHub repositories mentioned in bioinformatics articles in peer-reviewed journals (referred to throughout the paper as the “main” dataset), as well as 23 high-profile repositories that were not necessarily on GitHub at the time of publication or are not published in journals. Three repositories overlapped between the two sets. As a resource for the community, we provide the full pipeline to extract all repository data from the GitHub API, all extracted data except personal identifying information, scripts to perform all analysis, and citations for the articles announcing each repository.
Article topics
We performed topic modeling [ 26 ] on the abstracts of the articles announcing each repository in the main dataset, associating each article with one or more topics. We manually assigned labels to each topic based on top associated terms (Fig A in S1 File ); for example, the topic “Transcription and RNA-seq” is associated with the terms “rna”, “seq”, and “transcript”. We found that the topic “Web and graphical applications” was positively associated with several measures of project size and activity, as were, to a lesser extent, some other topics ( Fig 2 ). We found that code for articles about certain topics was disproportionately written in certain languages; for example, the greatest amount of code for “Assembly and sequence analysis” was in C and C++, while the greatest amount of code for “Web and graphical applications” was in JavaScript (Fig B in S1 File ). Bioinformatics was the most common journal for all topics, probably due in part to the relative ease of finding relevant projects in this journal (Fig C in S1 File ). Fig D in S1 File shows topic distribution by year of initial commit and article publication.
Projects are broken into groups according to whether the accompanying paper abstract is associated with each topic category. Projects that are associated with multiple topics are counted separately for each topic. Topic labels were assigned manually after examining top terms associated with each category. We added one to several variables to facilitate plotting on a log scale; these are noted in the variable name. All variables refer to the GitHub repository except “1 + mean PMC citations / week”, which refers to the paper and looks at citations in PubMed Central per week starting two years after the initial publication of the paper. Commits is the total number of commits to the default branch. Commit authors have created commits but do not necessarily have push access to the main branch; we attempted to collapse individuals with multiple aliases. Forks are individual copies of the repository made by community members. Subscribers are users who have chosen to receive notifications about repository activity. Stargazers are users who have bookmarked the repository as interesting. Megabytes of code and total files include source code only, excluding data file types such as JSON and HTML. The horizontal line at the center of the notch corresponds to the median. The lower and upper limits of the colored box correspond to the first and third quartiles. The whiskers extend beyond the hinges by at most an additional 1.5 times the inter-quartile range. Outliers are plotted individually. The notches correspond to roughly a 95% confidence interval for comparing medians [ 27 ]. The table of repository features is provided as S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g002
Programming languages
We identified a programming language for each source file and analyzed the prevalence of languages along several dimensions including total number of source files, lines of code, and size of source files in bytes. In high-profile repositories, the greatest amount of code in bytes was in Java, followed by C and C++. In the main dataset, two repositories contained entire copies of the large C++ Boost libraries [ 28 ]. Ignoring those copies of Boost, the greatest amount of code in the main dataset was in JavaScript, followed by Java, Python, C++, and C (Fig E in S1 File ).
We analyzed language features including primary execution mode (interpreted or compiled), type system (static or dynamic, strong or weak), and type safety. High-profile repositories tended to emphasize compiled, statically typed languages, with the largest contribution being from Java. The main dataset contained a greater proportion of code written in interpreted or hybrid interpreted/compiled (such as Python) and dynamically typed languages ( Fig 3 , Fig F in S1 File , S6 and S7 Tables). This difference could reflect the fact that interpreted and dynamically typed languages provide a powerful platform to quickly design prototypes for small projects, while static typing provides important safety checks for larger projects. Indeed, there was a relationship between project size (total lines of code) and amount of statically typed code (percentage of bytes in statically typed languages): the Spearman correlation between these variables over the entire dataset was 0.41 (P = 2.2e-16) ( S8 Table ). Our data support the intuition that Java, Python and R are more succinct than lower-level languages such as C and C++, as the former group tended to have fewer lines of code per source file in the presumably sophisticated high-profile repositories ( Fig 3 ).
Languages included in at least 50 main repositories are shown. Each dot corresponds to one repository and indicates the number of files in the language and the mean number of lines of code per file not including comments. The data are provided as S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g003
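The size-versus-typing relationship reported above is straightforward to check from the repository feature table; a minimal sketch follows, with illustrative stand-in values (column names and numbers are hypothetical, not taken from S8 Table).

```python
# Sketch: Spearman correlation between project size and the share of statically
# typed code. The DataFrame is an illustrative stand-in for two S8 Table features.
import pandas as pd
from scipy.stats import spearmanr

features = pd.DataFrame({
    "total_lines_of_code":        [800, 3_200, 12_000, 45_000, 150_000],
    "pct_bytes_statically_typed": [0.0,   5.0,   20.0,    55.0,    80.0],
})

rho, pval = spearmanr(features["total_lines_of_code"],
                      features["pct_bytes_statically_typed"])
print(f"Spearman rho = {rho:.2f}, p = {pval:.3g}")
```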
Developer communities
For version control systems such as Git, “commits” refer to batches of changes contributed by individual users; each commit causes a snapshot of the repository to be saved along with records of all changes. Each GitHub repository has a core team of developers with commit access; these developers can push changes directly to the repository. In addition, GitHub facilitates community collaboration through a system of forks and pull requests. Anyone can create a personal copy of a public repository, called a “fork”, and make arbitrary changes to their fork. If an outside developer feels their changes could benefit the main project, they can create a “pull request”: a request for members of the core team to review and possibly merge their changes into the main project. In that case, the commit records for the main project would show the outside contributor as the commit author and the core team member who merged the changes as the committer. Throughout our analysis, we use the term “outside contributors” to refer to commit authors who are never committers for the repository.
We looked at the size of each developer team (including users with commit access and outside contributors) as well as other measures of community engagement, including number of forks, subscribers, and stargazers. Subscribers are users who have chosen to receive notifications about repository activity. Stargazers are users who have bookmarked the repository as interesting. Neither subscribers nor stargazers necessarily touch any code, though in practice they are likely to include the developer team. Not surprisingly, the size of the developer team (all commit authors) was strongly associated with the number of forks, subscribers, and stargazers. High-profile repositories tended to have larger teams and more community engagement by these measures ( Fig 4 ). The number of outside contributors was also associated with these measures, though less strongly, perhaps because only 14% of main repositories had any outside contributors and these already tended to be within the highly active subset; 70% of high-profile repositories had outside contributors (Fig G in S1 File ).
Various measures of community engagement are plotted against the number of commit authors. Each dot represents one repository or a set of repositories with identical values for the variables. We added one to the vertical axis variables to facilitate plotting on a log scale due to many zero values. The pearson correlation and associated p-value are displayed for each variable versus number of commit authors. Commit authors refers to the number of unique commit authors to the default branch. The high-profile repository with a single contributor is s-andrews/FastQC [ 29 ]. This repository appears to have been created by a single developer importing a previously existing code base to GitHub. The table of repository features is provided as S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g004
Gender distribution of developers and article authorships
We analyzed the gender distribution of developers and article authorships in the dataset as a whole and within teams. Developer and author first names were submitted to the Genderize.io API [ 30 ] and high-confidence gender calls were counted. We found that the proportion of female authors decreased with seniority in author lists and the proportion of female developers was lower in high-profile repositories compared to the main dataset. In the main dataset, 12% of developers were women while only 6% of commits were contributed by women; these numbers were lower in the high-profile dataset (7% and 2%, respectively). In biology articles, it is customary to list the lead author first and the senior author last, with additional authors in the middle. We found that in the articles announcing each repository, middle authors included the greatest proportion of women. Women comprised 22% of all authorships in the main dataset and 21% in the high-profile dataset, compared to 18% and 0% for first authors and 14% and 8% (representing only one person) for the most senior last authors ( Fig 5 ). A separate study of author gender in computational biology articles found a similar trend of decreased representation of women with increased seniority in author lists; the authors additionally identified a pattern of more female authors on papers with a female last author [ 31 ].
“Developers” are unique commit authors or committers over the entire dataset; we attempted to collapse individuals with multiple aliases. “Commits” are individual commits to default branches of repositories. “Paper authors” are individual authorships on papers, not necessarily unique people. For each repository, the one paper announcing the repository is included; we note that some repositories may be developed over multiple publications, while only one publication per repository is included here. Papers were then deduplicated because some papers announced multiple repositories. First and last authors are only counted for papers with at least two authors. Names for which a gender could not be inferred are excluded. Bar height corresponds to the number of female contributors divided by the number of contributors with a gender call; these numbers are labeled above each bar. The features for each repo are provided in S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g005
We analyzed the gender composition of each team of developers and paper authors. The most common type of team in the main dataset was a single male developer and an all-male author list. The most common type of team in the high-profile dataset was a majority-male developer team and an all-male author list. Only ten main repositories and no high-profile repositories had all or majority female developer and author teams; all ten of these developer teams consisted of a single female developer (Fig H in S1 File ).
We quantified gender diversity within teams using the Shannon index of diversity [ 32 ]. A Shannon index of 0 means all members have the same gender, while the maximum value of the Shannon index with two categories is ln(2) = 0.69, achieved with equal representation of both categories. We found that 13% of main repositories and 62% of high-profile repositories had a nonzero Shannon index for the developer team. There were no high-profile repositories with a Shannon index greater than 0.4; the percentage of main repositories with Shannon index greater than 0.4 was 12% (Fig I in S1 File ).
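For reference, the Shannon index over two gender categories is H = -Σ p_i ln(p_i); a short sketch of the calculation is below (the team counts are illustrative).

```python
# Sketch: Shannon diversity index for a team's gender composition.
import math

def shannon_index(counts):
    """H = -sum(p_i * ln(p_i)) over categories with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

print(shannon_index([5, 0]))  # single-gender team -> 0.0
print(shannon_index([3, 3]))  # equal split -> ln(2) ~ 0.693
```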
Commit dynamics
We looked at several measures of commit timing along with total number of commits to each repository. Not surprisingly, the total number of commits was strongly associated with density of activity (commits per month and maximum consecutive months with commits) and overall project duration. High-profile repositories tended to have longer project duration and greater density of commit activity ( Fig 6 ).
Various timing dynamics are plotted versus total commits to the default branch. Each dot represents one repository or a set of repositories with identical values for the variables. For each variable, the total time interval covered by the project is the interval starting with the first commit and ending with the last commit at the time we accessed the data. For example, “Mean new files per month” counts only months from the first to last commit. The high-profile repository with only 16 commits and all files added on a single day is s-andrews/FastQC [ 29 ]. This repository appears to have been created by importing a previously existing code base to GitHub. The data are provided as S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g006
A simple proxy for project impact
We looked at the simple binary feature of whether any commits were contributed to each repository after the associated article appeared in PubMed. We found that this simple feature was associated with several measures of project activity and impact ( Fig 7 ). Not surprisingly, it was strongly associated with the total number of commits and size of the developer team. Presumably, larger projects tend to be those that are useful to many people and for which development continues after the paper is published. The metric was also associated with measures of community engagement such as forks, stargazers, and outside contributors. This could be explained in part by the previous point and in part by outside community members voluntarily becoming involved in the project after reading the paper. However, interestingly, the association with the proportion of commits contributed by outside authors was not statistically significant, suggesting that overall team size may be the principal feature driving the relationship with the number of outside commit authors. Additionally, the metric was associated with frequency of citations in PubMed Central, which could indicate that people are discovering the code through the paper and using it, and the code is therefore being maintained. Interestingly, repositories with commits after the paper was published had longer commit messages (explanations included by commit authors along with their changes to the repository). This could be due to a relationship between both variables and the size of the developer team; perhaps members of larger teams tend to write longer commit messages to meet the increased burden of communication with more team members. Indeed, there was a weak but statistically significant linear relationship (r = 0.14, p = 1.9e-09) between total number of commit authors and mean commit message length in the main dataset.
Each data point contributing to each box plot is one repository in the main dataset. Repositories are separated by whether the last commit timestamp at the time we accessed the data was after the date the corresponding publication appeared in PubMed. Repositories for which we do not have a publication date in PubMed are excluded. See Fig 2 legend for the explanation of “Total commits”, “Commit authors”, “Total forks”, “Total subscribers”, “Total stargazers”, and “PMC citations / week”. “Commit message length” is the mean number of characters in a commit message. “Pct outside commits” is the proportion of commits by authors who submitted code only through pull requests, and can therefore be assumed not to be core members of the development team with commit access. Similarly, “Outside commit authors” is the number of contributors who submitted code through pull requests only. The p-value refers to the two-sided t -test for different means between the two groups. The data used to compute the p-value include zero values, but for the plot, we replaced zeros by the minimum positive value of each variable to facilitate plotting on a log scale. The horizontal line across the box corresponds to the median. The lower and upper limits of the box correspond to the first and third quartiles. The whiskers extend beyond the box by at most an additional 1.5 times the inter-quartile range. Outliers are plotted individually. The table of repository features is provided as S8 Table .
https://doi.org/10.1371/journal.pone.0205898.g007
We have presented the first large-scale analysis of bioinformatics code to our knowledge. Our analysis gives a high-level picture of the current state of software in bioinformatics, summarizing scientific topics, source code features, development practices, community engagement, and team dynamics. The culture of sharing in bioinformatics will continue to enable deeper study of software practices in the field. Our hope is that readers will uncover additional insights in our tables of hundreds of calculated features for each repository ( S8 Table ), many of which were not analyzed in this paper, and that some readers will use or adapt our code to generate data and analyze repositories in unanticipated ways.
Interestingly, despite being made public on GitHub, nearly half of all repositories in our dataset do not feature explicit licenses (Fig J in S1 File ), in most cases likely unintentionally restricting the rights of others to reuse and modify the code. Nonetheless, the type of research described here may proceed under the GitHub Terms of Service [ 17 ] and Privacy Statement [ 33 ].
With the overwhelming variety of public bioinformatics software available, users are constantly faced with the question of which tool to use. Several features of our analysis point to simple heuristics based on information available on GitHub. We observed relationships between community engagement and various measures of project size and activity level ( Fig 4 , Fig 6 , Fig G in S1 File ). Our final analysis looked at the simple question of whether the developers had revisited their code at all after the paper was published; we found that this feature is associated with several measures of impact ( Fig 7 ). Intuitively, these points suggest that users should prioritize software that is being consistently maintained by an active team of developers. The GitHub web interface prominently displays the total number of commits, number of contributors, and time of latest commit on the front page for each repository. Additionally, GitHub provides a full-featured mechanism, called Issues, that allows the developer team or any user to create tracked requests within the project. We did not analyze issues because these are a relatively advanced feature that is rarely used in our dataset; nonetheless, a consistent flow of issues can help identify sophisticated projects under active development.
Bioinformatics is a hybrid discipline combining biology and computer science. There are three major paths into the field: (1) computer scientists and programmers can become familiar with the relevant biology, (2) biologists can learn programming and data analysis, or (3) students can train specifically in increasingly popular bioinformatics programs [ 25 ]. Our dataset likely includes developers from all three major paths. However, our analysis of developer gender demonstrates that the gender distribution in bioinformatics more closely resembles that of computer science than biology. Indeed, the underrepresentation of women in our dataset was more extreme than among students awarded PhDs in computer science in the United States in 2016 [ 34 ]. A possible reason for this could be that, despite relatively high numbers of women in biology, biologists who make the transition to bioinformatics tend to be male. Another possible explanation could be that the subset of bioinformaticians who publish code on GitHub are disproportionately those from the computer science side. Importantly, our analysis does not address other intersections of identity and demographics that affect individuals’ experience throughout the academic life cycle. Beyond simply pushing for fair treatment of all scientists, researchers have argued that team diversity leads to increased productivity of software development and higher quality science [ 35 – 37 ].
Limitations
Our dataset represents a large cross section of bioinformatics code bases, but many projects are excluded for various reasons. First of all, due to the challenges of full-text literature search, we did not identify all articles in the biomedical literature that mention GitHub. In particular, we did not use the open access set of articles in PubMed Central because these included too many mentions of GitHub to manually curate for both bioinformatics topics and code being announced with the respective articles, and efforts to train automated classifiers left too many false positives that tended to skew the picture of repository properties compared to true announcements of bioinformatics code. We therefore selected a search strategy that was limited enough to generate a high-quality hand-curated set and could include papers that were not open access. Second, we are missing repositories that were not on GitHub at the time of publication or are primarily described on a main project website other than GitHub, with the exception of the high-profile repositories we added manually. Third, the high-profile dataset excludes the popular projects Bioconductor [ 38 ], BioJava [ 39 ], and Biopython [ 40 ], because we used the criterion of standalone tools, and in the case of Bioconductor because of its conglomerative substructure. Finally, our dataset could be biased due to our use of GitHub itself: it is possible that developers with certain backgrounds are disproportionately likely to host code on GitHub, while we have not analyzed any code not hosted on GitHub.
The spirit of sharing has led to an increase in popularity of preprints: advance versions of articles that have not yet been published in peer-reviewed journals. Preprints can allow scientific progress to continue during the sometimes extensive review process. However, we chose not to include preprints in our literature search for three main reasons. First, we believed that successful peer review was a fair criterion on which to identify serious code bases. Second, we wanted to analyze article metadata that would only be available from databases such as PubMed. Third, the most popular preprint server for biology, bioRxiv [ 41 ], does not currently provide an API, putting programmatic access out of reach.
Future research
Several interesting future analyses are possible with our dataset or extensions to it. First, we did not examine the important topic of software documentation, either within source code or for users. The myriad forms of user documentation (README files, help menus, wikis, web pages, forums, and so on) make this a difficult but important topic to study. Second, static code analysis would provide deep insight into software quality and style. While impractical for a large heterogeneous set of code bases written in many different languages, future studies could uncover valuable insights through focused static analysis of repositories sharing common features. Third, we did not study the behavior of individual developers in depth. Future studies could analyze the social and coding behavior of individuals across all their projects and interests on GitHub. Finally, our analysis does not address the important question of software validity: whether a program correctly implements its stated specification and produces the expected results. The complexity of bioinformatic analysis makes validity testing a very challenging problem. Nevertheless, progress has been made in this area [ 42 – 44 ]. Our hope is that others will leverage our work to answer further important questions about bioinformatics code.
Toward better bioinformatics software
Our work provides data to enhance the ongoing community-wide conversation around reproducibility and software quality in bioinformatics. Several features of our data suggest a need for community-wide software standards, including the widespread absence of open source licenses (46% of main repositories have no detectable license), the number of repositories not appearing to use version control effectively (12% of main repositories added all new files on a single day, while 40% have a median commit message length less than 20 characters), and the apparent lack of reuse of the software (28% of papers in the main dataset have never been cited by articles in PubMed Central, while 68% have fewer than five citations) ( S8 Table ). Similarly, a study based on text mining found that over 70% of bioinformatics software resources described in PubMed Central were never reused [ 45 ]. These orthogonal lines of evidence support the need for the already growing efforts toward supporting better software in bioinformatics and scientific research in general.
Existing efforts to improve research software include the Software Sustainability Institute [ 46 , 47 ], which works toward a mission of improving software to enable more effective research; Better Scientific Software [ 48 ], a project that provides resources to improve scientific and engineering software; and Software Carpentry [ 49 – 51 ], which provides highly practical training for research computing. In addition, several reviews recommend specific practices for the software development lifecycle in academic science. In [ 8 ], the author provides specific recommendations to improve usability of command line bioinformatics software. The authors of [ 52 ] recommend specific software engineering practices for scientific computing. In [ 9 ], the authors outline several practices for the entire software development lifecycle. In [ 53 ], members of a small biology lab describe their efforts to bring better software development practices to their lab. In [ 54 ], the author advocates for changes at the institutional and societal levels that would lead to better software and better science.
Our contribution to this conversation, in addition to the specific conclusions from our analysis, is to demonstrate that it is possible to study bioinformatics software at the atomic level using hard data. With continued updates, this paradigm will enable a more effective, data-driven conversation around software practices in the bioinformatics community.
Identification of bioinformatics repositories on GitHub
GitHub repositories containing bioinformatics code were found through their mention in published journal articles pertaining to bioinformatics topics. Briefly, a literature search identified articles that were likely to pertain to bioinformatics topics and contained mentions of GitHub. Manual curation identified the subset of these articles treating bioinformatics topics, using a detailed definition of bioinformatics. GitHub repository names were automatically extracted from the bioinformatics articles. Mentions of each repository in each article were manually examined to identify repositories containing code for the paper, as opposed to mentions of outside repositories. Repository names were manually deduplicated and fixed for other noticeable issues such as inclusion of extra text due to the automatic parsing of context around the repository name. Repository names were automatically checked for validity using the GitHub API, and repositories with issues in this check were manually fixed or removed if the repository no longer existed. The final set included 1,720 repositories. In addition to the 1,720 repositories identified through the literature search, we also curated a separate set of 23 high-profile repositories—highly popular and respected tools in the bioinformatics community—based on the high volume of posts about these projects on the online forum Biostars [ 55 ]. The two datasets are referred to throughout the paper as the “main” and “high-profile” datasets. See Supplemental Section 2 for details. The repositories are listed in S4 and S5 Tables.
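The extraction and validity-check steps might look roughly like the sketch below; the regex, the trailing-punctuation cleanup, and the unauthenticated API call are simplifications for illustration, not the authors' actual pipeline (which is available in their repository).

```python
# Sketch: pull candidate GitHub repository names from article text and check that
# each resolves through the GitHub REST API. Pattern and cleanup are simplified.
import re
import requests

REPO_PATTERN = re.compile(r"github\.com/([\w.-]+/[\w.-]+)", re.IGNORECASE)

def candidate_repos(text: str) -> set:
    # Strip trailing punctuation that the regex may pick up from sentence context.
    return {m.rstrip(".),;") for m in REPO_PATTERN.findall(text)}

def repo_exists(full_name: str) -> bool:
    resp = requests.get(f"https://api.github.com/repos/{full_name}", timeout=30)
    return resp.status_code == 200

sample = "Code is provided at https://github.com/pamelarussell/github-bioinformatics."
for name in candidate_repos(sample):
    print(name, "exists:", repo_exists(name))
```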
Extraction of repository data from GitHub API
Repository data were extracted from the GitHub REST API v3 [ 16 ] and saved to tables on Google BigQuery [ 56 ] for efficient downstream analysis. Data extracted for each repository include repository-level metrics, file information, file creation dates, file contents, commits, and licenses. GitHub API responses were obtained using the PycURL library [ 57 ]. The JSON responses were converted to database records and pushed to tables on BigQuery using the BigQuery-Python library [ 58 ]. See Supplemental Section 3 for details.
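A minimal sketch of the API-extraction step is shown below, using the requests library in place of PycURL and printing records instead of pushing them to BigQuery; the endpoints are from the public GitHub REST API v3, but the surrounding structure is illustrative only.

```python
# Sketch: fetch repository metadata and recent commit records from the GitHub
# REST API v3. Unauthenticated calls are rate-limited; add a token for real runs.
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}

def get_repo(full_name: str) -> dict:
    resp = requests.get(f"{API}/repos/{full_name}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def get_commits(full_name: str, per_page: int = 100) -> list:
    resp = requests.get(f"{API}/repos/{full_name}/commits",
                        headers=HEADERS, params={"per_page": per_page}, timeout=30)
    resp.raise_for_status()
    return resp.json()

repo = get_repo("pamelarussell/github-bioinformatics")
print(repo["full_name"], repo["forks_count"], repo["stargazers_count"], repo["subscribers_count"])
print(len(get_commits(repo["full_name"])), "commits fetched from the first page")
```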
Topic modeling of article abstracts
We used latent Dirichlet allocation (LDA) [ 59 ] to infer topics for abstracts of the articles announcing each repository in the main dataset. From the LDA model, we identified terms that were primarily associated with a single topic. We chose a model with eight topics due to its maximal coherence of concepts within the top topic-specialized terms. We manually assigned a label to each of the eight topics that captures a summary of the top terms. We then classified each article abstract into one or more topics. Details are in Supplemental Section 4.
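The topic-modeling step could be prototyped along the following lines with scikit-learn's LDA implementation; the paper does not state which LDA library was used, and the two sample abstracts and hyperparameters here are purely illustrative (eight topics mirrors the model size chosen in the paper).

```python
# Sketch: LDA topic modeling over article abstracts, then listing the top terms
# per topic so labels can be assigned manually. Inputs and settings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "We present a tool for RNA-seq transcript quantification and expression analysis.",
    "A web application for interactive visualization of genomic variants.",
]  # in practice, one abstract per repository in the main dataset

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=8, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: abstracts, columns: topic weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {k}: {', '.join(top)}")
```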
We identified 515,017 total files among the repositories in the main dataset and 22,396 total files in the high-profile dataset. Contents of 425,967 and 18,501 files respectively (349,834 and 16,917 with unique contents) with size under 999KB were saved to tables in BigQuery for further analysis. (See Supplemental Section 3.) We used cloc (Count Lines of Code) version 1.72 [ 60 ] to identify the programming language, count lines of code and comments, and extract comment-stripped source code for each file. A total of 221,343 unique files in the main dataset and 11,425 in the high-profile dataset had an identifiable programming language. Language execution modes were obtained from [ 61 ]. Type systems were obtained from [ 62 ]. Further details are presented in Supplemental Section 5.
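Invoking cloc over a checked-out repository and reading back per-language counts could look like the sketch below, assuming the cloc binary is installed and on the PATH; the directory argument is a placeholder.

```python
# Sketch: run cloc on a repository directory and collect per-language file and
# line counts from its JSON report. Assumes the `cloc` executable is installed.
import json
import subprocess

def cloc_summary(repo_dir: str) -> dict:
    out = subprocess.run(["cloc", "--json", "--quiet", repo_dir],
                         capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)
    report.pop("header", None)  # drop cloc's run metadata
    return report  # language -> {"nFiles": ..., "blank": ..., "comment": ..., "code": ...}

for language, counts in cloc_summary("path/to/repo").items():
    if language != "SUM":
        print(language, counts["nFiles"], "files,", counts["code"], "lines of code")
```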
We identified the number of commit authors and outside contributors for each repository. For commit authors, we attempted to count unique people by collapsing users with the same name or login. For outside contributors, we counted commit authors whose author ID is never a committer ID for the repository. The counts of forks, subscribers and stargazers were returned directly from the GitHub API. Further details are presented in Supplemental Section 6.
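The author/committer distinction maps directly onto set operations over commit records; a minimal sketch follows, with two illustrative commits shaped like GitHub API responses.

```python
# Sketch: outside contributors are commit authors whose user ID never appears as
# a committer ID for the repository. The two commit records are illustrative.
commits = [
    {"author": {"id": 101}, "committer": {"id": 101}},  # core team member's own commit
    {"author": {"id": 202}, "committer": {"id": 101}},  # outside pull request merged by 101
]

author_ids = {c["author"]["id"] for c in commits if c.get("author")}
committer_ids = {c["committer"]["id"] for c in commits if c.get("committer")}

outside_contributors = author_ids - committer_ids
print(len(author_ids), "commit authors,", len(outside_contributors), "outside contributor(s)")
```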
Gender analysis
We attempted to infer a gender for each commit author, committer, and article author using the Genderize.io API [ 30 ], which returns a gender call and probability of correctness for a given first name. Names were first cleaned to remove noise such as single-word handles or organization names, and then the first word of each cleaned full name was submitted to Genderize. We accepted gender calls whose reported probability was 0.8 or greater. We proceeded with analysis of “female” and “male” categories only. We assume that transgender and non-binary contributors have names that reflect their gender identity. There may be erroneous calls for individuals who do not identify with a binary gender. The gender calls are also expected to include a few errors for cisgender individuals as we accept calls with global probability of 0.8 or higher.
To analyze the gender breakdown of developers, we counted unique full names of authors and committers. For commits, we joined commit records to genders by the full name of the commit author and counted individual commits. For paper authors, we counted individual authorships on papers instead of unique individuals, reasoning that multiple different authorships for the same individual should be counted separately. We analyzed team composition for the 504 projects in the main dataset for which we could infer a gender for at least 75% of developers and 75% of paper authors (Fig H in S1 File ). We calculated the Shannon index of diversity [ 32 ] for the 602 repositories in the main dataset for which we could infer a gender for at least 75% of developers (Fig I in S1 File ). Details are described in Supplemental Section 7.
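The name-based inference step could be sketched as follows, calling the public Genderize.io endpoint and applying the 0.8 probability cutoff described above; the cleaning logic here is far simpler than the authors' pipeline and is only illustrative.

```python
# Sketch: infer a gender call for the first word of a cleaned name via Genderize.io,
# keeping only calls with reported probability >= 0.8. Cleaning is simplified.
import requests

def gender_call(full_name: str, threshold: float = 0.8):
    first = full_name.strip().split()[0].lower()
    if not first.isalpha():
        return None  # skip handles, initials, and organization-like names
    resp = requests.get("https://api.genderize.io", params={"name": first}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if data.get("gender") and float(data.get("probability", 0)) >= threshold:
        return data["gender"]
    return None

print(gender_call("Pamela H. Russell"))
```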
We defined project duration as the time span between the first and last commit timestamps for the repository. Metrics describing monthly activity are with respect to the number of months in the project duration. We identified the initial commit time for each file by taking the earliest timestamp of all commits touching the file. Details are described in Supplemental Section 8.
Proxy for project impact
We defined “commits after publication” to be true if the latest commit timestamp at the time we accessed the data was after the day the associated article appeared in PubMed. Articles were identified and article metadata were extracted as described in Supplemental Section 2. Repository data were extracted from the GitHub API as described in Supplemental Section 3. Details are described in Supplemental Section 9.
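The proxy reduces to a single timestamp comparison per repository; a minimal sketch with illustrative dates:

```python
# Sketch: "commits after publication" is true when the latest commit timestamp
# postdates the article's appearance in PubMed. Dates below are illustrative.
from datetime import datetime, timezone

last_commit_time = datetime(2018, 9, 14, 10, 30, tzinfo=timezone.utc)  # from the GitHub API
pubmed_date = datetime(2017, 6, 2, tzinfo=timezone.utc)                # from article metadata

commits_after_publication = last_commit_time > pubmed_date
print(commits_after_publication)  # True -> the code was touched after the paper appeared
```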
Availability of data and software
All repository data extracted from the GitHub API, except file contents, are available at https://doi.org/10.17605/OSF.IO/UWHX8 . For file contents, in the absence of explicit open source licenses for the majority of repositories studied, we recorded the Git URL for the specific version of each file so that the exact dataset can be reconstructed using our downstream scripts. Additionally, we have removed personal identifying information from commit records, but have included API references for each commit record so that the full records can be reconstructed. Software to generate the dataset and replicate the results in the paper is available at https://github.com/pamelarussell/github-bioinformatics . See Supplemental Section 1 for details on the data and software.
Supporting information
S1 File. Supplemental information, methods, and figures.
https://doi.org/10.1371/journal.pone.0205898.s001
S1 Table. Definition of bioinformatics topics.
https://doi.org/10.1371/journal.pone.0205898.s002
S2 Table. Manual classification of articles as bioinformatics or not.
https://doi.org/10.1371/journal.pone.0205898.s003
S3 Table. Automatic identification of GitHub repository names in articles.
https://doi.org/10.1371/journal.pone.0205898.s004
S4 Table. Manual curation of GitHub repository names.
https://doi.org/10.1371/journal.pone.0205898.s005
S5 Table. High-profile repositories.
https://doi.org/10.1371/journal.pone.0205898.s006
S6 Table. Programming language type systems.
https://doi.org/10.1371/journal.pone.0205898.s007
S7 Table. Programming language execution modes.
https://doi.org/10.1371/journal.pone.0205898.s008
S8 Table. Calculated repository features.
https://doi.org/10.1371/journal.pone.0205898.s009
Acknowledgments
We thank Debashis Ghosh, Wladimir Labeikovsky, and Matthew Mulvahill for helpful conversations and comments on the manuscript. We thank the GitHub support staff for their effort in determining how we could work within the GitHub Terms of Service to publish a reproducible study.
References
- 3. Scope Guidelines | Bioinformatics | Oxford Academic [Internet]. [cited 19 Mar 2018]. Available: https://academic.oup.com/bioinformatics/pages/scope_guidelines
- 5. Computational biology and bioinformatics—Latest research and news | Nature [Internet]. 7 Mar 2018 [cited 24 Mar 2018]. Available: https://www.nature.com/subjects/computational-biology-and-bioinformatics
- 10. Git [Internet]. [cited 24 Mar 2018]. Available: https://git-scm.com/
- 11. Stack Overflow Developer Survey 2018. In: Stack Overflow [Internet]. [cited 18 Mar 2018]. Available: https://insights.stackoverflow.com/survey/2018/
- 12. Build software better, together [Internet]. Github; Available: https://github.com
- 13. GitHub Octoverse 2017 [Internet]. Github; Available: https://octoverse.github.com/
- 15. Instructions for Authors | Bioinformatics | Oxford Academic [Internet]. [cited 27 Apr 2018]. Available: https://academic.oup.com/bioinformatics/pages/instructions_for_authors
- 16. GitHub API v3 | GitHub Developer Guide [Internet]. Github; Available: https://developer.github.com/v3/
- 17. GitHub Terms of Service—User Documentation [Internet]. Github; Available: https://help.github.com/articles/github-terms-of-service/
- 18. Ray B, Posnett D, Filkov V, Devanbu P. A large scale study of programming languages and code quality in github. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM; 2014. pp. 155–165. https://doi.org/10.1145/2635868.2635922
- 19. Kochhar PS, Bissyandé TF, Lo D, Jiang L. An Empirical Study of Adoption of Software Testing in Open Source Projects. 2013 13th International Conference on Quality Software. 2013. pp. 103–112. https://doi.org/10.1109/QSIC.2013.57
- 21. Borges H, Hora A, Valente MT. Understanding the Factors that Impact the Popularity of GitHub Repositories [Internet]. arXiv [cs.SE]. 2016. Available: http://arxiv.org/abs/1606.04984
- 23. Ma W, Chen L, Zhou Y, Xu B. What Are the Dominant Projects in the GitHub Python Ecosystem? 2016 Third International Conference on Trustworthy Systems and their Applications (TSA). 2016. pp. 87–95. https://doi.org/10.1109/TSA.2016.23
- 24. Sheoran J, Blincoe K, Kalliamvakou E, Damian D, Ell J. Understanding “Watchers” on GitHub. Proceedings of the 11th Working Conference on Mining Software Repositories. New York, NY, USA: ACM; 2014. pp. 336–339. https://doi.org/10.1145/2597073.2597114
- 25. Spotlight on Bioinformatics. NatureJobs. Nature Publishing Group; 2016; https://doi.org/10.1038/nj0478
- 26. Blei DM. Probabilistic Topic Models. Commun ACM. New York, NY, USA: ACM; 2012;55: 77–84. https://doi.org/10.1145/2133806.2133826
- 28. Boost C++ Libraries [Internet]. [cited 18 Mar 2018]. Available: http://www.boost.org/
- 29. Babraham Bioinformatics—FastQC A Quality Control tool for High Throughput Sequence Data [Internet]. [cited 18 Mar 2018]. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- 30. Strømgren C. Genderize.io | Determine the gender of a first name [Internet]. [cited 25 Jan 2018]. Available: https://genderize.io/
- 33. GitHub Privacy Statement—User Documentation [Internet]. Github; Available: https://help.github.com/articles/github-privacy-statement/
- 34. National Science Foundation, National Center for Science and Engineering Statistics. Doctorate Recipients from U.S. Universities: 2016 [Internet]. Alexandria, VA.: National Science Foundation; 2017. Report No.: Special Report NSF 18–304. Available: https://www.nsf.gov/statistics/2018/nsf18304/
- 37. Vasilescu B, Posnett D, Ray B, van den Brand MGJ, Serebrenik A, Devanbu P, et al. Gender and Tenure Diversity in GitHub Teams. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. New York, NY, USA: ACM; 2015. pp. 3789–3798. https://doi.org/10.1145/2702123.2702549
- 41. bioRxiv.org —the preprint server for Biology [Internet]. [cited 24 Mar 2018]. Available: https://www.biorxiv.org/
- 46. The Software Sustainability Institute | Software Sustainability Institute [Internet]. [cited 2 May 2018]. Available: https://www.software.ac.uk/
- 47. The Software Sustainability Institute: changing research software attitudes and practices | Software Sustainability Institute [Internet]. [cited 2 May 2018]. Available: https://www.software.ac.uk/software-sustainability-institute-changing-research-software-attitudes-and-practices
- 48. Better Scientific Software [Internet]. [cited 2 May 2018]. Available: https://bssw.io/pages/about
- 49. Software Carpentry. In: Software Carpentry [Internet]. [cited 2 May 2018]. Available: https://software-carpentry.org/
- 56. BigQuery—Analytics Data Warehouse | Google Cloud Platform. In: Google Cloud Platform [Internet]. [cited 19 Mar 2018]. Available: https://cloud.google.com/bigquery/
- 57. Kjetil Jacobsen MFXJO. PycURL Home Page [Internet]. [cited 19 Mar 2018]. Available: http://pycurl.io/
- 58. Treat T. BigQuery-Python [Internet]. Github; Available: https://github.com/tylertreat/BigQuery-Python
- 60. cloc [Internet]. Github; Available: https://github.com/AlDanial/cloc
- 61. Wikipedia contributors. List of programming languages by type. In: Wikipedia, The Free Encyclopedia [Internet]. 12 Dec 2017 [cited 15 Mar 2018]. Available: https://en.wikipedia.org/w/index.php?title=List_of_programming_languages_by_type&oldid=814994307
- 62. Wikipedia contributors. Comparison of type systems. In: Wikipedia, The Free Encyclopedia [Internet]. 5 Sep 2017 [cited 15 Mar 2018]. Available: https://en.wikipedia.org/w/index.php?title=Comparison_of_type_systems&oldid=799049191
