Data Engineering On Databricks
Healthcare: A leading therapeutics company revolutionising data engineering and transformation with Databricks
About Data Engineering On Databricks
A leading therapeutics company is committed to developing novel therapies with the potential to transform the lives of people with debilitating disorders of the brain. The company pursues new pathways to improve brain health and runs depression, neurology, and neuropsychiatry franchise programs that aim to change how brain disorders are perceived and treated. It seeks to transform the practice of neuroscience research and rethink how central nervous system (CNS) disorders are understood and treated. Its mission is to make medicines that matter and to pioneer solutions that deliver life-changing brain health medicines, so people can get better, sooner, and every person can thrive.
The Challenge
A pioneering therapeutics company is focused on delivering life-changing brain health medicines and therapies, with an emphasis on drug and compound research and development. They use translational data to drive efficiency in drug development, explore the impact of their proprietary compounds, and understand their potential in treating disorders of the brain. With AntStack, they have designed a portal that offers accurate, balanced, and current scientific information to support medical professionals.
Our Goals
Their initial expectations involved building pipelines to load data from various sources, performing the required transformations on that data, and making it available to business users and analysts, all on the Databricks platform. The data to be loaded ranged from research and development data on tests, drugs, and compounds to commercial and customer data collected both internally and from external vendors. The data sources ranged from SFTP servers and external RDBMS databases to text and CSV files delivered via AWS S3. The requirement also involved the eventual development of a framework and process that could be adapted to any use case and handle any type and scale of data.
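As a rough illustration of the kind of ingestion these pipelines perform, the sketch below shows how CSV files landed in S3 and a table from an external RDBMS might be loaded into bronze Delta tables on Databricks. The bucket, table names, JDBC endpoint, and secrets scope are assumptions, not the company's actual configuration.

```python
# Minimal PySpark ingestion sketch (hypothetical paths, tables, and credentials).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

# CSV files delivered to S3 -> bronze layer
csv_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/landing/research_results/")   # hypothetical bucket/prefix
)
csv_df.write.format("delta").mode("append").saveAsTable("bronze.research_results")

# External RDBMS table via JDBC -> bronze layer
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://example-host:5432/commercial")   # hypothetical endpoint
    .option("dbtable", "public.customer_orders")
    .option("user", dbutils.secrets.get("rdbms", "user"))              # secrets scope is an assumption
    .option("password", dbutils.secrets.get("rdbms", "password"))      # dbutils is notebook-provided
    .load()
)
jdbc_df.write.format("delta").mode("append").saveAsTable("bronze.customer_orders")
```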
Technology Advancement with New Serverless Platform
The therapeutics company was facing challenges with its existing system and was delighted with the following outcomes:
Speed and Reliability Goals
While they use https://healthchecks.io/ for selected use cases and follow a general practice of maintaining checklists for quality and sanity checks, they now wanted to improve their primary metrics for speed and reliability, as well as the ability to apply the aforementioned process and framework to non-generic use cases and ad hoc requirements. AntStack was able to provide resilient solutions to hurdles in data loading and transformation within a relatively short period while maintaining data quality.
Simple and Effective Cron Job Monitoring
They were looking for a notification system for nightly backups, weekly reports, cron jobs, and scheduled tasks, many of which were not running on time. AntStack solved this with a simple process flow: a user generates a unique ping URL for their background job and updates the job to send an HTTP request to that URL every time it runs. When the job does not ping Healthchecks.io on time, Healthchecks.io alerts the user. This simple yet effective solution helped them deliver on time.
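A minimal sketch of this pattern is shown below: the job does its work, then pings its unique Healthchecks.io URL so a missed or late run triggers an alert. The check UUID and job body are placeholders.

```python
# Hypothetical example of the Healthchecks.io ping flow described above.
import requests

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder; each job has its own ping URL

def run_nightly_backup():
    # ... the actual backup / report / scheduled task goes here ...
    pass

if __name__ == "__main__":
    run_nightly_backup()
    # Signal success; if this ping does not arrive on schedule, Healthchecks.io alerts the team.
    requests.get(PING_URL, timeout=10)
```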
Seamless Integration with External Storage Services
The therapeutics company did not want a serverless implementation; instead, they wanted Databricks to take care of spinning up, managing, and orchestrating the compute clusters used for the ETL process, as well as the SQL endpoints used for querying and analytics. AntStack used Databricks to integrate seamlessly with external storage services and to provide job orchestration and workflow capabilities, along with Git integration for source control and a preconfigured Spark environment. Rich notebooks with support for multiple languages, including SQL, Python, and shell, made the trade-off of managed clusters over serverless computing worthwhile.
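To give a sense of how such orchestration can be wired up, the sketch below defines a Databricks job that runs an ETL notebook on a Databricks-managed cluster using the databricks-sdk Python client. The job name, cluster ID, and notebook path are assumptions, not the production setup.

```python
# Hedged sketch: creating a Databricks job for an ETL notebook via the databricks-sdk client.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up workspace URL and token from the environment

job = w.jobs.create(
    name="nightly-bronze-to-silver",                       # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="load_and_transform",
            existing_cluster_id="0101-000000-abcd1234",    # hypothetical managed cluster
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/data-eng/etl/load_and_transform"  # hypothetical path
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```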
Data Loading and Transformation
The work involved loading data and applying the required transformations, which could range from adding new columns, performing aggregations, and joining tables, to combining data from multiple bronze tables into one or more target tables across the refined/silver and trusted/gold layers. The implementation used the Databricks platform to load data from various sources with the Spark methods available through PySpark and Spark SQL. Source data is ingested into a raw/bronze layer using the read methods Spark supports for each source. The cleaned and transformed data is then made available to business users and analysts via the SQL warehouses (endpoints) provided by Databricks, with granular permissions. Databricks notebooks and workflows perform the bulk of the loading and transformation, while the AWS Glue Data Catalog acts as the Hive metastore, alongside other AWS services such as SES for reporting.
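The following is a hedged sketch of this bronze-to-silver-to-gold flow in PySpark. Table and column names are illustrative assumptions, not the company's actual schema.

```python
# Illustrative bronze -> silver -> gold transformation on Databricks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided as `spark` in Databricks notebooks

# Silver: clean and enrich a bronze table (derived columns, type fixes).
silver_df = (
    spark.table("bronze.customer_orders")
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("net_amount", F.col("amount") - F.col("discount"))
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.customer_orders")

# Gold: join and aggregate silver tables into a business-facing table.
gold_df = (
    spark.table("silver.customer_orders").alias("o")
    .join(spark.table("silver.customers").alias("c"), "customer_id")
    .groupBy("c.region", "o.order_date")
    .agg(F.sum("o.net_amount").alias("total_net_amount"))
)
gold_df.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales_by_region")
```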
A Fast and Reliable Framework
The therapeutics company lacked a fast and reliable framework to load and transform large volumes of data spread across multiple sources, systems, and teams. AntStack offered a solution built around processes and templates for handling common, generic data loading scenarios, streamlining the time between collecting the data and being able to explore it. This approach shortened turnaround times and made it easier to identify and focus on the remaining pain points, since the generic cases were now handled faster.
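As a rough illustration of what such a reusable template can look like, the sketch below drives bronze loading from a small configuration list, so a new feed only needs a new config entry rather than new pipeline code. The config format, paths, and table names are assumptions.

```python
# Hypothetical config-driven loading template for generic bronze ingestion.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def load_to_bronze(source: dict) -> None:
    """Load one configured source into its bronze table using the declared format."""
    df = (
        spark.read
        .options(**source.get("options", {}))
        .format(source["format"])
        .load(source["path"])
    )
    df.write.format("delta").mode(source.get("mode", "append")).saveAsTable(source["target"])

sources = [
    {"format": "csv", "path": "s3://example-bucket/landing/trials/",        # hypothetical
     "options": {"header": "true"}, "target": "bronze.trials"},
    {"format": "json", "path": "s3://example-bucket/landing/vendor_feed/",  # hypothetical
     "target": "bronze.vendor_feed"},
]

for source in sources:
    load_to_bronze(source)
```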