Howdy!

My name is Arash Pakbin. I am a Ph.D. candidate in the Computer Science and Engineering Department at Texas A&M University, working under the mentorship of Dr. Bobak J. Mortazavi and Dr. Donald K.K. Lee. I previously interned at Tesla Autopilot, where I gained valuable industry experience. My dissertation, "Interpretable Functional Data Boosting in Survival Analysis," introduces a boosting algorithm for hazard estimation. For more details on this project, visit its GitHub page.
In recent projects, I developed a multimodal real-time risk monitoring system for ICU settings by leveraging a language model. This system integrates time-series data with clinical notes to improve risk assessment in real time. Additionally, I am exploring techniques for summarizing clinical texts with language models and using the resulting summaries to train smaller models with constrained context lengths.
I am currently seeking my first industry position, where I can apply my expertise in these areas. Feel free to reach out—email is my preferred method of communication.

Sample Code

Here are a few examples of my work:

  • In this project, I developed a data preprocessing pipeline that extends Python with C++: data is generated in Python and processed in C++ for scalability, using NumPy and ctypes on the Python side and OpenMP on the C++ side (see the first sketch below).
  • In this project, I implemented a U-Net architecture in PyTorch. The design features two types of convolution layers: a standard Conv2d layer and an optimized convolution that reduces parameters by shrinking the kernel while preserving the receptive field through shifting channels along the time dimension (see the second sketch below).
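
A minimal sketch of the Python side of the first example, assuming the C++ routine is compiled into a shared library. The library name, function name, and signature (libprocess.so, process_batch) are illustrative placeholders, not the project's actual code:

    # Hand a NumPy array to an OpenMP-parallelized C++ routine via ctypes.
    # Assumed C++ export: extern "C" void process_batch(const double* in, double* out, int n);
    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libprocess.so")  # shared library built from the C++ code (assumed name)
    lib.process_batch.argtypes = [
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # input buffer
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # output buffer
        ctypes.c_int,                                                    # number of elements
    ]
    lib.process_batch.restype = None

    def process(batch: np.ndarray) -> np.ndarray:
        """Generate data in Python, process it in C++ for scalability."""
        batch = np.ascontiguousarray(batch, dtype=np.float64)
        out = np.empty_like(batch)
        lib.process_batch(batch, out, int(batch.size))
        return out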
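
And a sketch of the optimized convolution from the second example, following my reading of the description above: a fraction of the channels is shifted along the time dimension before a small-kernel convolution, so temporal context is kept with fewer parameters. The class name, shift fraction, and tensor layout are assumptions:

    import torch
    import torch.nn as nn

    class ShiftedConv(nn.Module):
        """Shift a fraction of the channels by +/- one step along the time axis,
        then mix channels with a 1x1 convolution. The shift supplies the temporal
        receptive field that a larger kernel would otherwise provide."""
        def __init__(self, in_ch: int, out_ch: int, shift_frac: float = 0.25):
            super().__init__()
            self.n_shift = max(1, int(in_ch * shift_frac))
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # fewer parameters than a 3x3 kernel

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time, features); layout assumed for illustration
            n = self.n_shift
            shifted = x.clone()
            shifted[:, :n, 1:, :] = x[:, :n, :-1, :]            # first group: shift forward in time
            shifted[:, :n, :1, :] = 0
            shifted[:, n:2 * n, :-1, :] = x[:, n:2 * n, 1:, :]  # second group: shift backward in time
            shifted[:, n:2 * n, -1:, :] = 0
            return self.conv(shifted)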

Projects

Here is a selection of my key projects, along with an overview of each.

Text Summarization for Multimodal Pretraining of Medical Data

May 2024 – Present
  • ICU clinical notes can be lengthy, requiring large context lengths for language models to capture the full content.
  • Training language models with large context lengths is computationally prohibitive.
  • We use a Llama3-8B model to summarize notes through in-context learning so that they fit into a BERT model with a 512-token context length (see the sketch after this list).
  • More details coming soon!
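
A minimal sketch of the summarize-then-encode idea behind this project; the model identifiers, the prompt, and the generation settings are illustrative assumptions, and the actual pipeline involves considerably more care with prompting and evaluation:

    from transformers import AutoModel, AutoTokenizer, pipeline

    long_note = "..."  # placeholder for a lengthy ICU clinical note

    # 1) Summarize the note with an instruction-tuned Llama-3 model via in-context prompting.
    summarizer = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
    prompt = ("Summarize the following ICU note in a few sentences, keeping the "
              "clinically relevant findings:\n\n" + long_note)
    summary = summarizer(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]

    # 2) The much shorter summary now fits within BERT's 512-token context window.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")
    note_embedding = bert(**inputs).last_hidden_state[:, 0]  # [CLS] representation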

Real-time, Multimodal Invasive Ventilation Risk Monitoring using Language Models and BoXHED

Nov 2023 – Jun 2024
  • Conventional invasive ventilation monitoring in ICUs often ignores clinical notes and relies solely on time-series data.
  • Achieved AUROC of 0.86 and AUCPR of 0.35 by combining clinical text data with time-series data.
  • Used a T5 language model to extract numerical representations from clinical notes (see the sketch after this list).
  • Reduced inference latency by a factor of 8 using PyTorch multiprocessing for multi-GPU inference.
  • Supervised and mentored an undergraduate student throughout this project.
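
A sketch of how numerical representations can be pulled out of clinical notes with a T5 encoder; the checkpoint (t5-base) and the mean pooling are assumptions for illustration, not necessarily the exact configuration used in the project:

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    encoder = T5EncoderModel.from_pretrained("t5-base")

    def embed_note(note: str) -> torch.Tensor:
        """Return a fixed-size vector for one clinical note (mean-pooled encoder states)."""
        inputs = tokenizer(note, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)
        return hidden.mean(dim=1).squeeze(0)              # (d_model,)

    # The resulting vector can be appended to the time-series features fed to the risk model.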

Scalable Boosting of Dynamic Survival Analysis

Jan 2020 – Dec 2023
Estimation of hazard values as a function of time and covariates.
  • Created BoXHED, a Python package for nonparametric hazard estimation (see the data sketch after this list).
  • Achieved 35%+ error reduction and 900x speedup with GPU and multicore CPU support.
  • Published BoXHED1.0 at ICML and BoXHED2.0 in the Journal of Statistical Software (preprint on arXiv).
  • Integrated the code into the XGBoost codebase and made it available on PyPI.
  • Open-source code is available on GitHub.
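
To make the problem concrete, here is an illustrative layout of the longitudinal (start-stop) data that dynamic hazard estimation operates on; the column names are generic placeholders, and the exact schema BoXHED expects is documented on its GitHub page:

    import pandas as pd

    data = pd.DataFrame({
        "ID":      [1, 1, 2],        # subject identifier
        "t_start": [0.0, 2.5, 0.0],  # start of each observation epoch
        "t_end":   [2.5, 4.0, 3.0],  # end of each observation epoch
        "X_0":     [0.7, 1.2, 0.4],  # time-varying covariate value over the epoch
        "delta":   [0, 1, 0],        # 1 if the event occurred at t_end, else 0
    })
    # BoXHED boosts an estimate of the hazard lambda(t, x): the instantaneous
    # event rate at time t given covariate value x, learned from such epochs.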

Teaching Experience

  • Served as a Teaching Assistant for multiple semesters in Machine Learning (graduate and undergraduate), Information Retrieval (undergraduate), and Operating Systems (undergraduate).
  • Delivered lectures on Python coding as part of the Machine Learning course.

Skills

These are the areas where I excel:

Languages: Python / C++ / CUDA C / R
Tools (Python): PyTorch / Transformers / Hugging Face / Pandas / Weights & Biases / DeepSpeed / NumPy
Machine Learning: Transformers / Generative AI / Language Models / NLP / Distributed Training / Model Transparency / Boosting
General: Linear Algebra / Time Series Analysis / Statistical Modeling / AWS