Howdy!
My name is Arash Pakbin. I am a Ph.D. candidate in the Computer Science and Engineering Department at Texas A&M University, working under the mentorship of Dr. Bobak J. Mortazavi and Dr. Donald K.K. Lee. I previously interned at Tesla Autopilot, where I gained valuable industry experience. My dissertation, "Interpretable Functional Data Boosting in Survival Analysis," introduces a boosting algorithm for hazard estimation. For more details on this project, visit its GitHub page.
In a recent project, I developed a multimodal real-time risk monitoring system for ICU settings by leveraging a language model. The system integrates time-series data with clinical notes to enhance real-time risk assessment. I am also exploring techniques for summarizing clinical texts with language models so the summaries can be used to train smaller models with constrained context lengths.
I am currently seeking my first industry position, where I can apply my expertise in these areas. Feel free to reach out; email is my preferred method of communication.
Sample Code
Here are a few examples of my work:
- In this project, I developed a data preprocessing pipeline that extends Python with C++ for scalability: data is generated in Python and processed in C++, using NumPy and ctypes on the Python side and OpenMP on the C++ side.
- In this project, I implemented a U-Net architecture in PyTorch. The design features two types of convolution layers: a standard conv2d layer and an optimized layer that cuts the parameter count with a smaller kernel, preserving the temporal receptive field by shifting channels along the time dimension.
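The first example above follows the standard NumPy-plus-ctypes pattern. Here is a minimal, self-contained sketch of that pattern, using libc's `qsort` as a stand-in for the project's custom OpenMP-parallelized C++ library: the array is created in Python and handed to native code by pointer, with no copying.

```python
import ctypes
import numpy as np

# Load the C standard library (POSIX); a stand-in for a custom C++ shared library.
libc = ctypes.CDLL(None)

# qsort comparator type: receives pointers to two array elements.
CMPFUNC = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_double),
    ctypes.POINTER(ctypes.c_double),
)
libc.qsort.restype = None
libc.qsort.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_size_t, CMPFUNC]

def compare(a, b):
    # Return negative / zero / positive, like strcmp.
    return (a[0] > b[0]) - (a[0] < b[0])

# Data is generated in Python as a NumPy array...
data = np.array([3.0, 1.0, 2.0])

# ...and sorted in place by the native function through a raw pointer.
libc.qsort(
    data.ctypes.data_as(ctypes.c_void_p),
    len(data),
    data.itemsize,
    CMPFUNC(compare),
)
# data is now [1.0, 2.0, 3.0]
```

The same pointer-passing idea scales to a custom C++ function compiled with OpenMP and loaded via `ctypes.CDLL`.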
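The second example describes a convolution that trades kernel size for a channel shift along time. A hypothetical PyTorch sketch of that idea follows; the class name, shift fraction, and tensor layout are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class ShiftConv2d(nn.Module):
    """Sketch: shift a fraction of channels forward and backward along the
    time axis, then apply a 1x1 convolution. The shift lets a small kernel
    see neighboring time steps, keeping the temporal receptive field while
    using far fewer parameters than a larger conv2d kernel."""

    def __init__(self, channels: int, shift_div: int = 4):
        super().__init__()
        self.fold = channels // shift_div
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, features)
        f = self.fold
        out = torch.zeros_like(x)
        out[:, :f, 1:] = x[:, :f, :-1]            # one group shifted forward in time
        out[:, f:2 * f, :-1] = x[:, f:2 * f, 1:]  # one group shifted backward
        out[:, 2 * f:] = x[:, 2 * f:]             # remaining channels untouched
        return self.conv(out)
```

A quick shape check: `ShiftConv2d(8)(torch.randn(2, 8, 16, 4))` returns a tensor of the same `(2, 8, 16, 4)` shape.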
Projects
Here is a selection of my key projects, along with an overview of each.
Text Summarization for Multimodal Pretraining of Medical Data
May 2024 – Present
- ICU clinical notes can be lengthy, requiring large context lengths for language models to capture the full content.
- Training language models with large context lengths is computationally prohibitive.
- We use a Llama3-8B model to summarize notes through in-context learning so that they fit into a BERT model with a 512-token context length.
- More details coming soon!
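As a rough illustration of the fit-into-context step, here is a toy sketch. Every name here is a hypothetical stand-in: the real pipeline prompts a Llama3-8B model, which this sketch mocks with simple truncation, and the even token budgeting is an assumption for illustration.

```python
MAX_CONTEXT = 512  # the downstream BERT model's context length, in tokens

def summarize(note: str, token_budget: int) -> str:
    """Stand-in for the LLM summarizer: just truncate to the token budget."""
    return " ".join(note.split()[:token_budget])

def compress_notes(notes: list[str], max_context: int = MAX_CONTEXT) -> str:
    # Split the downstream model's context budget evenly across the notes,
    # then summarize each note into its share.
    budget = max_context // max(len(notes), 1)
    return " ".join(summarize(n, budget) for n in notes)
```

For example, compressing two 600-token notes yields a combined text of at most 512 whitespace tokens, fitting the downstream context window.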
Real-time, Multimodal Invasive Ventilation Risk Monitoring using Language Models and BoXHED
Nov 2023 – Jun 2024
- Conventional invasive ventilation monitoring in ICUs often ignores clinical notes and relies solely on time-series data.
- Achieved AUROC of 0.86 and AUCPR of 0.35 by combining clinical text data with time-series data.
- Used a T5 language model to extract numerical representations from clinical notes.
- Reduced inference latency by 8x with PyTorch multiprocessing for multi-GPU inference.
- Supervised and mentored an undergraduate student throughout this project.
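The multi-GPU inference pattern above can be sketched with Python's multiprocessing. This toy version shards notes across CPU worker processes; the dummy encoder and the round-robin sharding scheme are illustrative assumptions, whereas the real system runs a T5 encoder pinned to one GPU per worker.

```python
import multiprocessing as mp

def encode_shard(shard):
    """Stand-in for per-device T5 encoding: a dummy word-count feature."""
    return [len(note.split()) for note in shard]

def parallel_encode(notes, n_workers=2):
    # Shard the notes round-robin and encode the shards in parallel processes.
    shards = [notes[i::n_workers] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        results = pool.map(encode_shard, shards)
    # Undo the round-robin sharding to restore the original order.
    out = [None] * len(notes)
    for w, res in enumerate(results):
        for j, v in enumerate(res):
            out[w + j * n_workers] = v
    return out
```

Because the shards are independent, throughput scales roughly with the number of workers until the devices are saturated.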
Scalable Boosting of Dynamic Survival Analysis
Jan 2020 – Dec 2023
- Created BoXHED, a Python package for nonparametric hazard estimation.
- Achieved 35%+ error reduction and 900x speedup with GPU and multicore CPU support.
- Published BoXHED1.0 at ICML and BoXHED2.0 in the Journal of Statistical Software (preprint on arXiv).
- Integrated the code with the XGBoost codebase and made it available on PyPI.
- Open-source code is available on GitHub.
Teaching Experience
- Served as a Teaching Assistant for multiple semesters in Machine Learning (graduate and undergraduate), Information Retrieval (undergraduate), and Operating Systems (undergraduate).
- Delivered lectures on Python coding as part of the Machine Learning course.
Skills
These are the areas where I excel:
Languages | Python / C++ / CUDA C / R |
Tools (Python) | PyTorch / Hugging Face Transformers / pandas / Weights & Biases / DeepSpeed / NumPy |
Machine Learning | Transformers / Generative AI / Language Models / NLP / Distributed Training / Model Transparency / Boosting |
General | Linear Algebra / Time Series Analysis / Statistical Modeling / AWS |