Hello Lifelines Community,
I am currently working with very large datasets on which I'd like to perform survival analysis using the Cox Proportional Hazards model. I appreciate the robustness of lifelines for survival analysis but face the challenge of handling my dataset, which is too large to fit into memory on a single machine.
I run my data preprocessing in PySpark on a Spark cluster, since its distributed nature is a good fit for data of this size.
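For context, here is a minimal sketch of the workflow I have in mind (the dataset path and the `duration`/`event` column names are placeholders, not my real schema). The `toPandas()` call is exactly where things break down, since it collects the full dataset onto the driver:

```python
from pyspark.sql import SparkSession
from lifelines import CoxPHFitter

spark = SparkSession.builder.appName("cox-at-scale").getOrCreate()

# Preprocessing happens in Spark, where the full dataset lives.
sdf = spark.read.parquet("s3://my-bucket/survival_data.parquet")  # placeholder path
sdf = sdf.dropna()  # stand-in for the real preprocessing steps

# The bottleneck: collecting to the driver requires the entire
# dataset to fit into a single machine's memory.
pdf = sdf.toPandas()

cph = CoxPHFitter()
cph.fit(pdf, duration_col="duration", event_col="event")
cph.print_summary()
```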
Given that lifelines does not natively support a distributed environment like Apache Spark, I wanted to ask for recommendations or best practices on using lifelines efficiently in a distributed context, or for alternative solutions for Cox regression at scale that play well with Spark or similar distributed systems.
Specifically, I am wondering:
- Is there a recommended approach to performing Cox regression on a large dataset using lifelines with PySpark?
- Has anyone worked around the memory limitations by combining Spark and lifelines in a creative way? For reference, a naive sample-and-fit stopgap is sketched after this list.
- Are there plans to support distributed survival analysis on the lifelines roadmap?
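To make the second question concrete, the only stopgap I can picture (continuing from the sketch above; the sample fraction, seed, and penalizer value are arbitrary) is fitting on a random sample small enough for the driver, which sidesteps rather than solves the problem:

```python
# Naive stopgap: fit on a random sample small enough to collect.
# This trades statistical power for feasibility and is not a true
# distributed fit, hence the question.
sample_pdf = sdf.sample(fraction=0.01, seed=42).toPandas()

cph = CoxPHFitter(penalizer=0.1)  # light regularization for stability on a subsample
cph.fit(sample_pdf, duration_col="duration", event_col="event")
cph.print_summary()
```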
Any guidance or shared experiences would be greatly appreciated.
Thank you for your time and assistance.
Best regards,
TX