Data analytics is the term applied to the process of analysing and visualising data, with the goal of drawing conclusions and understanding from the data. Data analytics is becoming increasingly important as laboratories have to process and interpret the ever-increasing volumes of data that their systems generate.
In the laboratory, the primary purpose of data analytics is to verify or disprove existing scientific models to provide better understanding of the organisation’s current and future products or processes.
Data mining is a related process that utilises software to uncover patterns, trends and relationships in datasets. Although data analytics and data mining are often thought of in the same context, frequently in connection with ‘big data’, they have different objectives.
Data mining can broadly be defined as a ‘secondary data analysis’ process for knowledge discovery. It analyses data that may have originally been collected for other reasons. This differentiates it from data analytics, where the primary objective is based on either exploratory data analysis (EDA), in which new features in the data are discovered, or confirmatory data analysis (CDA), in which existing hypotheses are tested and either confirmed or rejected.
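The EDA/CDA distinction can be sketched in a few lines of Python: summary statistics describe what the data looks like (EDA), while a one-sample t-statistic tests a pre-stated hypothesis against it (CDA). The assay measurements and the target value of 5.0 below are invented purely for illustration.

```python
import math
import statistics

# Hypothetical assay results (values invented for illustration)
measurements = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# Exploratory data analysis: what does the data look like?
mean = statistics.fmean(measurements)
stdev = statistics.stdev(measurements)
print(f"mean={mean:.3f}, stdev={stdev:.3f}")

# Confirmatory data analysis: test a hypothesis stated in advance,
# e.g. 'the true mean is 5.0', via a one-sample t-statistic.
target = 5.0
n = len(measurements)
t_stat = (mean - target) / (stdev / math.sqrt(n))
print(f"t-statistic vs target {target}: {t_stat:.3f}")
```

A t-statistic near zero (as here) gives no grounds to reject the hypothesised mean; in practice it would be compared against a t-distribution at a chosen significance level.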
In recent years, some of the major laboratory informatics vendors have started to offer data analysis and visualisation tools in their product portfolios. These tools typically provide a range of statistical procedures to facilitate data analysis, and visual output to help with interpretation. Alongside integrated data analytics tools, vendors increasingly offer generic software that can extract and process data from individual systems through to multiple platforms and formats.
The benefit of integrated data analysis tools is that they provide a seamless means of accessing data, eliminating concerns about incompatible data formats. As with any other laboratory software, defining functional and user requirements are essential steps in making the right choice. Key areas to focus on are that the tools have appropriate access to laboratory and other data sources; that they provide the required statistical tools; and that they offer presentation and visualisation capabilities consistent with broader company preferences and standards.
Data analytics plays an important role in the generation of scientific knowledge and, as with other aspects of ‘knowledge management’, it is important to understand the relationship between technology, processes and people. In particular, staff need to have the appropriate skills to interpret, rationalise, and articulate the output presented by the data analysis tools. To take full advantage of data analytics, it should be considered as part of a holistic process that starts with the design of the experiment.
A quote attributed to Sir Ronald Fisher, ca 1938, captures this point: ‘To call in the statistician after the experiment is done, may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.’
The explosion of AI
In the last five years, the advent of artificial intelligence (AI) and machine learning (ML) has begun to enable new ways to carry out research or to automate activities such as data analytics.
This field requires large data volumes and considerable computing power, but promises to shed light on scientific challenges that were previously too large or complex to be understood through traditional workflows.
In the most basic sense, AI refers to the ability of a computer model to display intelligence. Colloquially, the term is often used to describe a computer model that can mimic functions that humans associate with the human mind, such as learning and problem-solving.
ML is a subset of this discipline that focuses on the use of algorithms and statistical models to perform a specific task without using explicit instructions, relying on patterns and inference instead. ML builds a mathematical model based on sample data, known as ‘training data’, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.
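The idea of learning from sample data rather than explicit instructions can be illustrated with a minimal nearest-centroid classifier in plain Python. The decision rule is never written down; it emerges from the labelled training points (all values below are invented for illustration).

```python
import math

# Labelled training data: two-feature points with invented labels.
# The 'rule' separating the classes is derived from this data,
# not programmed explicitly.
training_data = [
    ((1.0, 1.2), "pass"), ((0.8, 1.0), "pass"), ((1.1, 0.9), "pass"),
    ((3.0, 3.2), "fail"), ((2.9, 3.1), "fail"), ((3.2, 2.8), "fail"),
]

def fit(samples):
    """Learn one centroid (mean point) per class label."""
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sx / counts[lbl], sy / counts[lbl])
            for lbl, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Assign the label of the nearest learned centroid."""
    return min(centroids,
               key=lambda lbl: math.dist(point, centroids[lbl]))

centroids = fit(training_data)
print(predict(centroids, (1.0, 1.0)))  # lands in the 'pass' cluster
print(predict(centroids, (3.0, 3.0)))  # lands in the 'fail' cluster
```

Real ML libraries implement far richer models, but the shape is the same: a fit step that extracts parameters from training data, and a predict step that applies them to unseen inputs.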
ML is related to computational statistics. The study of mathematical optimisation delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study that focuses on exploratory data analysis through unsupervised learning.
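Unsupervised learning of the kind used in data mining can be sketched with a tiny k-means clustering loop: no labels are supplied, and the groups emerge from the data alone (the one-dimensional values below are invented for illustration).

```python
# Toy k-means on 1-D data: the algorithm groups unlabelled values
# around k centres without being told what the groups mean.
values = [1.0, 1.2, 0.9, 8.0, 8.3, 7.8, 8.1]
k = 2
centres = [values[0], values[3]]  # simple deterministic initialisation

for _ in range(10):  # a few refinement iterations suffice here
    clusters = [[] for _ in range(k)]
    for v in values:
        # assign each value to its nearest centre
        nearest = min(range(k), key=lambda i: abs(v - centres[i]))
        clusters[nearest].append(v)
    # move each centre to the mean of its assigned values
    centres = [sum(c) / len(c) for c in clusters]

print(sorted(centres))  # two cluster centres, one per natural group
```

The two centres settle near the means of the two natural groupings in the data; interpreting what those groups represent is left to the analyst, which is exactly the ‘knowledge discovery’ role described above.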
Deep learning (DL) is part of a broader family of ML methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep learning uses architectures such as deep neural networks, which have been applied to fields including computer vision, speech recognition, natural language processing, bioinformatics, drug design and medical image analysis, where they have produced results comparable to, and in some cases surpassing, human expert performance.
Deep learning uses ML techniques to solve real-world problems by making use of neural networks that simulate human decision-making. Using this technology effectively can be costly, as the technique requires huge datasets for training. This is because there is a huge number of parameters that must be learned by the algorithm, which needs sufficient training data to make accurate predictions. For example, a deep learning algorithm could be trained to ‘learn’ what cancer looks like in a medical image. However, it would take an enormous dataset of images for it to understand the minor details that distinguish cancer from healthy cells.
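The notion of parameters being learned from data can be shown at the smallest possible scale: a single logistic ‘neuron’ trained by gradient descent in plain Python. Deep networks stack millions of such units across many layers, which is why they need far larger datasets; here just two weights and a bias are learned (the task, recovering the logical OR function, is chosen purely for illustration).

```python
import math

# One logistic neuron learning the OR function by gradient descent.
# The only parameters to learn are two weights and a bias.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(2000):  # training epochs
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        err = out - target
        # gradient step for the cross-entropy loss
        w1 -= lr * err * x1
        w2 -= lr * err * x2
        b -= lr * err

predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b))
               for (x1, x2), _ in data]
print(predictions)  # recovers the OR truth table: [0, 1, 1, 1]
```

Scaling this up from three parameters to millions is what drives deep learning's appetite for data and compute.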
Development of new techniques
As far back as 1995, scientific journals began to see papers detailing the scientific benefits of implementing AI, particularly in the clinical laboratory, as techniques such as image recognition can be employed to make scientists’ jobs quicker and easier.
A paper from Place et al in 1995, Use of artificial intelligence in analytical systems for the clinical laboratory, states: ‘The incorporation of information-processing technology into analytical systems in the form of standard computing software has recently been advanced by the introduction of AI, both as expert systems and as neural networks.
‘AI is characterised by its ability to deal with incomplete and imprecise information and to accumulate knowledge. Expert systems, building on standard computing techniques, depend heavily on the domain experts and knowledge engineers that have programmed them to represent the real world. Neural networks are intended to emulate the pattern-recognition and parallel processing capabilities of the human brain, and are taught rather than programmed. The future may lie in a combination of the recognition ability of the neural network and the rationalisation capability of the expert system.’
New approaches to science
In the laboratory, improvements in instrumental systems, data management and data integrity and standardisation are opening up new possibilities for scientists and researchers to make use of AI and ML.
The Pistoia Alliance has a useful list of AI and DL papers, links and articles to further expand the knowledge base for scientists who want to adopt AI and ML into their workflows. This can be used as a starting point for those who want to begin using AI, or for scientists who want guidance on the type of workflows that are already being seen in the laboratory today.
Examples can be found covering pharmaceutical research, medicine and healthcare, but the potential for AI goes beyond just these disciplines. Predictive analytics, for example, is one tool that the Pistoia Alliance is using to better understand laboratory instruments and how they might fail over time.
A paper from Almeida et al, Synthetic organic chemistry driven by artificial intelligence, published last June in Nature, notes that ‘By examining the underlying concepts, we aim to demystify AI for bench chemists in order that they may embrace it as a tool, rather than fear it as a competitor, spur future research by pinpointing the gaps in knowledge, and delineate how chemical AI will run in the era of digital chemistry.’
The right data management strategies, careful consideration of metadata, and thought about how best to store data so that it can be used in future AI and ML workflows are essential to the pursuit of AI in the laboratory. Utilising technologies such as LIMS and ELN enables lab users to catalogue data, providing context and instrument parameters that can then be fed into AI or ML systems. Without the correct data, or with mismatched data types, AI and ML will not be possible, or at the very least could introduce undue bias when comparing data from disparate sources.
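One way to picture the role of catalogued data here is a record that carries its context (units, instrument, method) alongside the raw value, so a downstream pipeline can exclude mismatched data before training. The schema and field names below are hypothetical, not any particular LIMS or ELN vendor's data model.

```python
from dataclasses import dataclass

# Hypothetical record schema -- field names are illustrative only,
# not any particular LIMS/ELN vendor's data model.
@dataclass
class Measurement:
    sample_id: str
    value: float
    unit: str
    instrument: str
    method: str

records = [
    Measurement("S-001", 4.98, "mg/L", "HPLC-01", "METHOD-A"),
    Measurement("S-002", 5.12, "mg/L", "HPLC-01", "METHOD-A"),
    Measurement("S-003", 0.0051, "g/L", "HPLC-02", "METHOD-B"),
]

# Before feeding an ML model, keep only records whose metadata
# matches; silently mixing units or methods would bias the
# training set.
training_set = [r.value for r in records
                if r.unit == "mg/L" and r.method == "METHOD-A"]
print(training_set)
```

Without the metadata fields, the third record's different unit and method would be indistinguishable from the others, which is precisely the disparate-source bias described above.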