Hello Lifelines Community,
I am currently working with very large datasets on which I'd like to perform survival analysis using the Cox Proportional Hazards model. I appreciate the robustness of lifelines for survival analysis but face the challenge of handling my dataset, which is too large to fit into memory on a single machine.
I run my data preprocessing in PySpark on a Spark cluster, since its distributed nature is a good fit for data of this size.
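For context, here is a minimal sketch of the workflow I have in mind (the dataset path and the `duration`/`event` column names are placeholders, not my real schema). The `toPandas()` call is exactly where things break down, since it collects the full dataset onto the driver:

```python
from pyspark.sql import SparkSession
from lifelines import CoxPHFitter

spark = SparkSession.builder.appName("cox-at-scale").getOrCreate()

# Preprocessing happens in Spark, where the full dataset lives.
sdf = spark.read.parquet("s3://my-bucket/survival_data.parquet")  # placeholder path
sdf = sdf.dropna()  # stand-in for the real preprocessing steps

# The bottleneck: collecting to the driver requires the entire
# dataset to fit into a single machine's memory.
pdf = sdf.toPandas()

cph = CoxPHFitter()
cph.fit(pdf, duration_col="duration", event_col="event")
cph.print_summary()
```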
Given that lifelines does not natively support a distributed environment like Apache Spark, I wanted to ask for recommendations or best practices on using lifelines efficiently in a distributed context, or for alternative solutions for Cox regression at scale that play well with Spark or similar distributed systems.
Specifically, I am wondering:
- Is there a recommended approach to performing Cox regression on a large dataset using lifelines with PySpark?
- Has anyone worked around the memory limitations by combining Spark and lifelines in a creative way? For reference, a naive sample-and-fit stopgap is sketched after this list.
- Are there plans to support distributed survival analysis on the lifelines roadmap?
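To make the second question concrete, the only stopgap I can picture (continuing from the sketch above; the sample fraction, seed, and penalizer value are arbitrary) is fitting on a random sample small enough for the driver, which sidesteps rather than solves the problem:

```python
# Naive stopgap: fit on a random sample small enough to collect.
# This trades statistical power for feasibility and is not a true
# distributed fit, hence the question.
sample_pdf = sdf.sample(fraction=0.01, seed=42).toPandas()

cph = CoxPHFitter(penalizer=0.1)  # light regularization for stability on a subsample
cph.fit(sample_pdf, duration_col="duration", event_col="event")
cph.print_summary()
```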
Any guidance or shared experiences would be greatly appreciated.
Thank you for your time and assistance.
Best regards,
TX