Big data taking medicine out of its comfort zone

Monitoring and evaluating the performance of drugs in the age of Big Data

Carolina Ödman
13 min read · May 25, 2017


This is a write-up of a TechTalk I gave in April in Cape Town, which looks at ways Big Data could impact the pharmaceutical industry, while raising the issues that hold this back.

Drugs: from idea to market

We first need to understand how drugs are created and brought to market. It is a complex and long process. It can often take over a decade for a drug to reach the shelves of pharmacies. A drug starts from an idea, which gets tested in a laboratory in vitro, possibly simulated, then tested again in a laboratory but in vivo on animals, after which it gets submitted for clinical trials. This is the phase in which drugs are first tested in humans.

A phase one clinical trial involves at most a few dozen individuals, and assesses the initial effect of the drug. Phase two involves a couple of hundred people, and is aimed at recommending dosage. Phase three can involve up to a few thousand people and serves to confirm the safety and the benefits of the drug.

Once this has been completed, a drug can be approved (if all goes well), and once it is on the market, some post-marketing studies are usually carried out to assess the long-term effects of the drug and spot any previously unencountered adverse effects. This long and complicated process is very expensive.

The process from idea to market often takes over a decade.

How much does it cost to bring a new drug to the market?

To start with, it costs highly qualified staff, materials and laboratory equipment. Add to that the cost of clinical trials, in which volunteers are usually compensated for their participation, and a vast amount of administrative overhead related to the complex regulations ruling the pharmaceutical market. The total cost is extremely difficult to estimate before starting to invest in the research, but it can be worked out in retrospect.

A 2013 Forbes study shows that the development of a new drug can cost its pharmaceutical company a staggering US$350 million. When taking into account the cost of the drugs that fail at some stage of the process, that cost soars to new heights. Bearing in mind that only around 10% of drugs that reach phase one of clinical trials are eventually approved, companies can be looking at a bill of about US$5 billion per new drug.
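As a rough sanity check on those figures, the arithmetic is simple: if each candidate costs on the order of US$350 million to develop and only about one in ten candidates entering clinical trials makes it through, every approved drug also carries the bill of the failures. A minimal sketch using the figures cited above (the Forbes estimate of ~US$5 billion also folds in earlier pre-clinical failures and other costs):

```python
# Back-of-the-envelope estimate of the cost per approved drug, using the
# figures cited above: ~US$350M development cost per candidate, and roughly
# a 10% approval rate for drugs entering phase-one clinical trials.
cost_per_candidate = 350e6   # US$, development cost of one drug candidate
approval_rate = 0.10         # ~10% of phase-one candidates are approved

# Each approved drug must, on average, also pay for the failed candidates:
cost_per_approved_drug = cost_per_candidate / approval_rate
print(f"US$ {cost_per_approved_drug / 1e9:.1f} billion per approved drug")
```

Even this crude ratio already lands in the billions before counting pre-clinical failures.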

Once a drug is on the market, it is protected by a patent that can cover the composition of the drug as well as processes related to its manufacture for 20 or 25 years depending on the market. That is not very long to recoup that level of investment. In some countries like India or China, government regulations have been put in place, that shorten the patent protection time to enable generic drugs to reach the market faster, in an effort to make healthcare more affordable.


Improving the success rate of drugs on the path to market is one area of focus to reduce the cost of drug development. Some new thinking along the lines of capitalising on the body’s own defences instead of replacing them appears to yield better results. Using tissue-engineered living surrogates before clinical trials is also very promising, as are simulations, for which High Performance Computing (HPC) is currently required.

This is also an area where Big Data can assist. With massive data sets monitoring thousands of parameters, from the chemical compounds of a drug to the living context of the patient, artificial intelligence can hint at patterns of effectiveness, which can be emulated in new drug development.

How clinical trials work

Clinical trials are based on randomised controlled trials: quantitative studies in which participants (patients, volunteers) are allocated at random to receive either a clinical intervention or none at all. The latter group gets the sugar pill, the placebo. Randomised controlled trials compare the outcomes of the different groups and statistically assess whether the drug has a significant effect.
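The statistical core of such a trial can be sketched in a few lines. Below is a toy example with invented outcome scores: participants are split into treatment and placebo groups, and a simple permutation test asks how often a difference in mean outcome at least as large as the observed one would arise by chance alone if the drug had no effect.

```python
import random
import statistics

random.seed(42)

# Toy randomised controlled trial: invented outcome scores for a
# treatment group and a placebo group (higher = better outcome).
treatment = [7.1, 6.8, 7.4, 6.9, 7.6, 7.2, 6.7, 7.3]
placebo   = [6.2, 6.5, 6.0, 6.4, 6.1, 6.6, 6.3, 5.9]

observed_diff = statistics.mean(treatment) - statistics.mean(placebo)

# Permutation test: if the drug had no effect, the group labels would be
# interchangeable, so we shuffle them many times and count how often a
# difference at least as large as the observed one arises by chance.
pooled = treatment + placebo
n_treat = len(treatment)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n_treat]) - statistics.mean(pooled[n_treat:])
    if diff >= observed_diff:
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed_diff:.3f}, p < {max(p_value, 1/trials):.4f}")
```

A small p-value means the observed difference is very unlikely under pure chance, which is exactly the reasoning a phase-three trial formalises, at much larger scale and with proper pre-registered methodology.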

What are the shortcomings of this method?

Mind over matter

It is known that a placebo isn’t just ‘nothing’ or a sugar pill. The strength of the placebo effect, where a patient gets better even though they are not getting any drug, has been shown to vary over time and to depend on the patient’s psychology.

On a systemic level, it has been shown that direct advertising of medication to the public, only allowed in the USA and in New Zealand, has led to a generally less doubtful attitude towards medication, and a stronger placebo effect in those two countries. On an individual level, research has also shown that the nature of the interaction and the relationship between the carer and the patient has an influence on the outcome of the care, sugar pill or not.

The path to a post-antibiotic age

What stimulates the rise of resistant bacterial strains is non-lethal exposure to antibiotics, which leads bacteria to evolve and develop resistance. This exposure can happen in several ways. The widespread use of antibiotics in agriculture, for example, is a well-known problem. Large industrial agro-alimentary producers, in an effort to reduce loss of cattle for example, will systematically give them antibiotics. These end up in the waste water and in the products derived from the cattle, which also eventually end up in waterways. Another common path of antibiotics into the wild, where bacteria can develop resistance, is the misuse of antibiotic drugs. But we are not currently in a position to optimise antibiotic use so that the drug is only delivered to the place in the body where the infection rages, and in just the right amount to kill the infectious agent without too much of it being excreted into the world again.


The effectiveness of a drug is a function of many parameters, including the age and sex of the patient, their metabolic rates, the amount of drug administered, and many others. When carrying out a mathematical analysis of the effectiveness of a drug, ignoring a parameter, e.g. gender, doesn’t mean the effectiveness is insensitive to that parameter; it merely means that it is being ignored. Parameters are often ignored either by tradition (e.g. gender: women have traditionally been underrepresented in clinical trials because of hormonal variations deemed to have uncontrollable effects and to lead to greater uncertainty in the results) or by convenience and cost (e.g. systematically analysing the gut microbiome of hundreds of people in a clinical trial is likely to make the total cost of the trial prohibitive). Moreover, the number of parameters that can be analysed with any level of significance is limited by the number of available data points, i.e. the number of participants in a trial.
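That last point, that the number of analysable parameters is limited by the number of participants, is easy to see with a quick sketch: each binary parameter you stratify on splits the cohort again, so subgroup sizes shrink exponentially. The participant count below is illustrative, assuming balanced splits.

```python
# Each binary parameter (sex, smoker, over-65, ...) used for stratification
# splits the cohort again, so subgroups shrink exponentially: k binary
# parameters divide the participants into 2**k cells.
participants = 3000  # e.g. a large phase-three trial (illustrative)
for k in range(7):
    cells = 2 ** k
    print(f"{k} parameters -> {cells:2d} subgroups of ~{participants // cells} participants each")
```

With just six binary parameters, a 3000-person trial is down to subgroups of ~46 people, far too few to say anything statistically significant about each combination.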

Let us look at gender, for example. The recommended doses of certain drugs have been reduced by as much as 50% for women following the ‘discovery’ of different effects depending on the sex of the patient. Pharmacokinetics and pharmacodynamics, the study of how drugs are processed by the human body, are now flourishing fields of research. It has been reported that fewer than 10% of New Drug Applications (NDAs) in the US included a sex analysis, and that even observed differences of 40% or more in pharmacokinetics between men and women led to no different dosing recommendations[1].

Considering this, it is clear that smaller sub-groups, such as ethnic minorities, remain sorely underrepresented in traditional drug development; accounting for diversity is often done a posteriori, and is not a requirement pre-approval.

What possibilities does the rise of Big Data open up?

Big Data takes us from a snapshot, a limited view of a situation, to a continuous (if not real-time), multi-parameter, multi-agent ocean of data that can be analysed, sliced and diced to surface insights and build knowledge that limited data couldn’t previously deliver.

Big Data is the difference between a crash test (break a few cars and see what happens to the dummies) and self-driving cars (continuous evaluation of, and reaction to driving conditions, obstacles and other vehicles). Big Data is the difference between a once-off environmental impact assessment before a construction project and connected, smart cities.

What can Big Data mean for drug development and clinical interventions?

Let us now speculate. No need to dream of new technology, it is all here already. For the sake of the argument, let us imagine that all the technology that we have today is simply available and that data collection happens automagically… (We’ll get back to that later)

Imagine an enormous database with endless parameters monitored, where information about anyone being prescribed any medicine or seeing a physician for whatever reason is entered with all possible contextual information, including their location, gender, ethnicity, other medical conditions, eating and drinking information, metabolic rates, socio-economic situation, etc. Add to that the health monitoring data of the quantified self, sleep, activity levels, step counters, Internet of Things home scales and fridges. Add simple tech like urine analysis strips. Imagine that there’s one in every nappy for example and that all that data gets fed into this data ocean. What can we gain from this?

Better medicine

With a data ocean like that, many dream of achieving personalised medicine, where a patient is continuously monitored and their medical regimen adapted according to their individual response to treatment.

But that, for now, sounds like another prize available only to the wealthiest few per cent of the world’s population. Let us make this relevant to more people. Take South Africa, for example: a vast, understudied and underserviced population with complex, compounding medical conditions, comprising several distinct ethnic groups as well as people straddling ethnicities or defying grouping altogether. How can we use the data ocean to help?

We can firstly solve the mathematics problem highlighted above. One of the powers of machine learning and artificial intelligence is the capability to interpolate intelligently where data is missing. Specifically, when other parameters/dimensions are available in which the data is not absent, the interpolation becomes that much more robust. For example, if we do not have the weight of a patient but we have their age, gender, location, and cultural and socio-economic background, AI can estimate a likely weight range for that patient. Whatever stereotypes just appeared in your minds, trust that artificial intelligence is cleverer than that.

So, entering the available, if incomplete, data about a patient from, e.g., a remote rural area without access to a multitude of medical exams, such an engine can yield not only a near-optimised recommended course of treatment based on similar complex profiles in the ocean of data, but also what in particular to watch out for. The data ocean, and artificial intelligence applied to it, can identify likely susceptibilities and factors of elevated risk for that patient. That is personalised medicine without the personal hospital crew.
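To give a flavour of how such interpolation works, here is a deliberately minimal k-nearest-neighbour sketch in pure Python, with invented patient records. A real system would use far more dimensions, normalise the features so that age and height are on comparable scales, and return a likely range rather than a point estimate.

```python
import math

# Hypothetical, invented patient records: (age, height_cm, weight_kg).
# We estimate a missing weight from the k most similar patients on the
# dimensions we do have (a toy k-nearest-neighbour imputation).
records = [
    (34, 162, 61), (58, 175, 82), (41, 168, 70),
    (29, 158, 55), (63, 172, 79), (37, 165, 66),
]
query = (40, 166)  # age and height known, weight missing

def estimate_weight(query, records, k=3):
    # Rank known patients by distance in the observed dimensions only.
    by_distance = sorted(records, key=lambda r: math.dist(query, r[:2]))
    nearest = by_distance[:k]
    # Average the weights of the k nearest neighbours.
    return sum(r[2] for r in nearest) / k

print(f"estimated weight: {estimate_weight(query, records):.1f} kg")
```

The more non-missing dimensions each record carries, the better the notion of “similar patients” becomes, which is exactly why a richer data ocean makes the interpolation more robust.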

Better pharmacology

With a data ocean, we can also weaken assumptions that were previously required to make sense of limited data, like assuming that medicines work the same all the time. In a data ocean like this, the drugs themselves can become the variables that are studied.

Thus, instead of only understanding drugs’ effect as a function of patients and their context, we can analyse clinical outcomes on patients as a function of drugs and their effectiveness. We can change perspective. We can potentially identify where and under which conditions certain drugs become less effective over time, or when new adverse effects arise.
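A toy illustration of that change of perspective, with invented records: instead of fixing the drug and studying patients, we group outcomes by drug and time period, so a drug whose effectiveness is declining shows up directly.

```python
from collections import defaultdict

# Hypothetical, invented outcome records: (drug, year, recovered).
# Grouping outcomes by drug and period turns the drug itself into the
# variable under study, surfacing changes in effectiveness over time.
records = [
    ("drug_a", 2015, True), ("drug_a", 2015, True), ("drug_a", 2015, False),
    ("drug_a", 2016, True), ("drug_a", 2016, False), ("drug_a", 2016, False),
    ("drug_b", 2015, True), ("drug_b", 2016, True), ("drug_b", 2016, True),
]

totals = defaultdict(lambda: [0, 0])  # (drug, year) -> [recovered, total]
for drug, year, recovered in records:
    totals[(drug, year)][0] += recovered
    totals[(drug, year)][1] += 1

for (drug, year), (rec, tot) in sorted(totals.items()):
    print(f"{drug} {year}: {rec}/{tot} recovered ({rec / tot:.0%})")
```

In this invented data, drug_a’s recovery rate drops from one year to the next; at the scale of a real data ocean, the same grouping logic (with proper statistics on top) is what continuous drug performance monitoring would look like.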

This is potentially very important, as it presents us with the opportunity to carry out a whole new kind of pharmacological research based on continuous data mining and continuous drug performance monitoring, to which the systems we currently have in place are not well adapted.

Indeed, the scientific publication is by design a snapshot. Government agencies’ drug approval processes leave no room for continuous re-evaluation, as they are at best reactive. Medical training is not adapted to data-driven deployment of clinical best practice, and neither are marketing efforts aimed at physicians and medical practitioners. Nothing is designed to cater for continuous, critical, multidimensional monitoring and large-scale personal optimisation of drug development and administration. We have now officially left the comfort zone.

What are the challenges?

The challenges at this stage will be many and complex, with aspects related to ethics and law, culture, market forces, etc. For example, it can lead to less choice for patients. If pharmaceutical companies are in a position to foresee the decline in market size, efficacy, revenue potential, etc. of a drug, they may decide against making a generic of that particular drug, seeing it as not worth it. Patients who really need that drug would then be stuck with the premium price of the original manufacturer, which is not challenged by competing products to reduce its margins.

Data availability, compatibility and accessibility

Long before it comes to that though, the biggest challenge for this dream is clearly the availability and accessibility of data. Challenges include:

· Accuracy of clinical data: There is a known bias in research towards positive results. The Cochrane meta-analyses are a good way to become aware of it, but even so, the research environment has reward systems that may encourage data manipulation. Such flaws are sheltered by the fact that non-reproducibility in research is a very real issue.


· The potential of big data can only be achieved if the data is available, which raises issues of data governance, and accessible, which raises issues of formats and interface, and when different data sets are compatible, for which a lot of work is needed. Imagine for example that some national database of medical records becomes available, as well as a suitably anonymised data set from a major health monitoring wearable tech firm. The users of that firm’s trackers may well be in the medical records database as well, but there is no way to link the two unless some form of identification system is in place. And no such identification system currently exists without carrying risks to privacy.

· Also, risk and reputation play a role. Considering the legal liabilities for businesses associated with loss of life, who will be willing to take risks and share their data when mistakes happen, or things go wrong just because they do, for fear of someone seeking damages?



I should point out that everything above is merely speculation on the possible evolution of medicine and pharmacology that Big Data, HPC and AI can catalyse. From that perspective, today’s system may appear flawed, but it is the best we have now. This system also has the advantage of always getting better, progressing, innovating and breaking new ground nationally and internationally thanks to public and private initiatives.

So yes, the picture painted above may seem out of reach for now, but in a world where we are so many, and where so many are suffering, can we afford not to tap into the potential of better understanding how medicine works, to go along with bioinformatics and other research to better understand how the human body works?

Note on the audience questions

I picked up two main reactions from the audience in the room: firstly, a great concern about data privacy, and secondly, the question of how to overcome that issue while sharing as much data as possible.

There was mention of identity tokenisation and other technological concepts, to which my answer was that I cannot currently foresee a technological solution to privacy, not because of a lack of technology but because of human nature. Being social animals, we are always open, and that should not change, but the privacy question remains.

The second main reaction was quite enthusiastic, as many mentioned great pilot initiatives where people would volunteer their data. I did have a word of warning, however, which goes along the lines of: This is health we are talking about. Health issues like disease and other conditions are extremely intimate and personal, as they have to do with our bodies and souls. Therefore, while it can be great to see data points added to the ocean of health data, we should still only approach the issue of seeing people as data with great humility and respect for how people deal with their good or ill health, and realise what an effort it is to volunteer and share deeply personal information.

References (when not linked directly in the text):

[1] Anderson GD, “Sex and racial differences in pharmacological response: where is the evidence? Pharmacogenetics, pharmacokinetics, and pharmacodynamics.”, J Womens Health (Larchmt). 2005 Jan-Feb;14(1):19–29.



Thank you to Robyn Farah and KAT-O for the invitation to talk.

Image credits: All images are Creative Commons CC0 licensed or my own unless stated otherwise. Apologies where credit and right to use is unknown, as also stated above.


