Spark and laboratory data

July 25th, 2018


At Diaceutics we use a number of tools to leverage our data and generate actionable insights. We mine data to forecast market trends and to understand the changing biomarker landscape. This lets us track the uptake of new biomarker targets over time, review the dissemination of new companion diagnostics across the globe and build a picture of the patient journey and related events (the number of tests and their timeliness, for example) linked to health outcomes. There are many methods for generating valuable insights from lab data.

A key technology we leverage is Apache Spark®. We use this big data technology to analyze our proprietary laboratory test data for our pharmaceutical clients. Spark is part of our regular processes: it handles the weekly uploads to our data warehouse and can parse the entirety of our historical data in just one hour. This is possible through parallel processing, with Amazon Web Services (AWS) as our cloud hosting provider allowing us to scale up data processing as needed. When we deploy Spark we spin up just the right number of machines to process the data, while Amazon manages the machines and software. Furthermore, Spark is extensible, featuring machine learning libraries that can be used to mine deeper insights from our laboratory data.

We have used Spark both for analysis and for processing. One example of how we leverage its processing power is to aggregate all biomarkers tested per patient over their entire history. For analysis, we can take the list of tests performed and calculate the probability that one test will lead to another. We can also take the list of biomarkers and cluster patients according to their treatment histories. Having one place to bulk-process records is valuable, as we discover deep insights by running our machine learning algorithms over many different slices of the data.
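The test-to-test probability idea can be sketched in plain Python before being scaled out with Spark. The patient histories and test names below are hypothetical, purely for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical ordered test histories, one list per patient.
histories = [
    ["EGFR", "ALK", "PD-L1"],
    ["EGFR", "ALK"],
    ["EGFR", "PD-L1"],
]

# Count consecutive test pairs across all patients.
pair_counts = defaultdict(Counter)
for tests in histories:
    for current, following in zip(tests, tests[1:]):
        pair_counts[current][following] += 1

# Normalize counts into probabilities: P(next test | current test).
transition = {
    current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
    for current, counts in pair_counts.items()
}

print(transition)
```

At scale, the same pair-counting and normalization would run as Spark aggregations across the full record set rather than a Python loop.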

Spark can also be used to correct anomalous data and to fill gaps. Manually entered or missing data can be inferred or given context. For example, the body site of a biopsy is often entered by hand by a physician or pathologist, and spelling errors can prevent accurate tracking of a sample’s origin. When information is missing, Spark lets us cross-reference the body site recorded for a given test event against the entire history of that sample, allowing us to infer where in the body the sample originated.
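The gap-filling logic amounts to carrying a sample's last known body site forward through its event history. A minimal sketch in plain Python, with hypothetical field names and values (in Spark this would typically be a window function partitioned by sample):

```python
# Hypothetical test events for one sample, ordered by date; the body
# site is sometimes missing at the point of manual entry.
events = [
    {"sample_id": "s1", "date": "2018-01-02", "body_site": "lung"},
    {"sample_id": "s1", "date": "2018-02-10", "body_site": None},
    {"sample_id": "s1", "date": "2018-03-15", "body_site": None},
]

def fill_body_site(events):
    """Carry the last known body site forward through a sample's history."""
    last_known = None
    filled = []
    for event in sorted(events, key=lambda e: e["date"]):
        if event["body_site"] is not None:
            last_known = event["body_site"]
        filled.append({**event, "body_site": event["body_site"] or last_known})
    return filled

for event in fill_body_site(events):
    print(event["date"], event["body_site"])
```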

The possibilities for Spark are broad, and many large companies use it regularly, including Alibaba, Amazon, Autodesk, Tencent and TripAdvisor. TripAdvisor, for example, processes every review ever added to its site with Spark, applying natural language processing to make the content more useful.

Spark originated at UC Berkeley’s AMPLab in 2009 and became a top-level Apache project in 2014, and its use in the health sciences continues to grow. At Diaceutics, we are continually looking for new ways to leverage the latest technologies to actively break down barriers to deliver better testing, and therefore better treatment, for patients.

#ApacheSpark #Labdata #Diaceutics #MachineLearning #BigData  

Refs: 

http://spark.apache.org/powered-by.html 

https://conferences.oreilly.com/strata/big-data-conference-ca-2015/public/schedule/detail/40463 

http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/ 
