Abstract:
The number of bioinformatics tools has grown exponentially over the last three decades. At the same time, many tools become outdated as they are discontinued. Staying up to date is difficult and costly in time and effort. Existing systems that ease tool discovery primarily attempt to index all available tools according to their functionalities. Some of them provide the number of citations each tool has received to indicate its popularity. These lists have grown large, calling for more targeted retrieval approaches. Since tools are typically used in conjunction, a recent approach allows users to browse tools according to expert-maintained pipelines. However, generating and updating these pipelines remains manual. Moreover, the actual combined use of the tools is not provided. This thesis proposes an automated pipeline derivation method based on literature mining and cross-citation analysis. The analysis patterns and the tool usage patterns were recovered from the data. Recommendation models were built from the data and the derived patterns. Evaluating the models also shed light on actual tool selection behavior. During 2009-2016, tool functionality combined with overall popularity was highly predictive of whether a tool was chosen. Over this period, substituting overall popularity with local popularity within the recovered analysis patterns became increasingly predictive, reaching parity with the overall-popularity model in 2016. These results imply that the recovered patterns resemble the pipelines used in the community. Lastly, the recommendation models were queried with the pipelines to obtain community-accepted best practices. These analysis patterns and best practices can inform experts about the status quo of the field and can serve as guidelines for newcomers entering it.