Howdy!

My name is Arash Pakbin. I am a Ph.D. candidate in the Computer Science and Engineering Department at Texas A&M University, working under the mentorship of Dr. Bobak J. Mortazavi and Dr. Donald K.K. Lee. I previously interned at Tesla Autopilot, where I gained valuable industry experience. My dissertation, "Interpretable Functional Data Boosting in Survival Analysis," introduces a boosting algorithm for hazard estimation. For more details on this project, visit its GitHub page.
In recent projects, I developed a multimodal real-time risk monitoring system for ICU settings by leveraging a language model. This system integrates time-series data with clinical notes to improve risk assessment in real time. Additionally, I am exploring techniques for summarizing clinical texts with language models and using the resulting summaries to train smaller models with constrained context lengths.
I am currently seeking my first industry position, where I can apply my expertise in these areas. Feel free to reach out—email is my preferred method of communication.

Sample Code

Here are a few examples of my work:

  • In this project, I developed a data preprocessing pipeline that extends Python with C++: data is generated in Python and processed in C++ for scalability, using NumPy and ctypes on the Python side and OpenMP on the C++ side (see the first sketch below).
  • In this project, I implemented a U-Net architecture in PyTorch. The design features two types of convolution layers: a standard Conv2d layer and an optimized convolution that reduces parameters by shrinking the kernel while preserving the receptive field through shifting channels along the time dimension (see the second sketch below).
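
A minimal sketch of the Python side of the first example, assuming the C++ routine is compiled into a shared library. The library name, function name, and signature (libprocess.so, process_batch) are illustrative placeholders, not the project's actual code:

    # Hand a NumPy array to an OpenMP-parallelized C++ routine via ctypes.
    # Assumed C++ export: extern "C" void process_batch(const double* in, double* out, int n);
    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libprocess.so")  # shared library built from the C++ code (assumed name)
    lib.process_batch.argtypes = [
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # input buffer
        np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),  # output buffer
        ctypes.c_int,                                                    # number of elements
    ]
    lib.process_batch.restype = None

    def process(batch: np.ndarray) -> np.ndarray:
        """Generate data in Python, process it in C++ for scalability."""
        batch = np.ascontiguousarray(batch, dtype=np.float64)
        out = np.empty_like(batch)
        lib.process_batch(batch, out, int(batch.size))
        return out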
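
And a sketch of the optimized convolution from the second example, following my reading of the description above: a fraction of the channels is shifted along the time dimension before a small-kernel convolution, so temporal context is kept with fewer parameters. The class name, shift fraction, and tensor layout are assumptions:

    import torch
    import torch.nn as nn

    class ShiftedConv(nn.Module):
        """Shift a fraction of the channels by +/- one step along the time axis,
        then mix channels with a 1x1 convolution. The shift supplies the temporal
        receptive field that a larger kernel would otherwise provide."""
        def __init__(self, in_ch: int, out_ch: int, shift_frac: float = 0.25):
            super().__init__()
            self.n_shift = max(1, int(in_ch * shift_frac))
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # fewer parameters than a 3x3 kernel

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time, features); layout assumed for illustration
            n = self.n_shift
            shifted = x.clone()
            shifted[:, :n, 1:, :] = x[:, :n, :-1, :]            # first group: shift forward in time
            shifted[:, :n, :1, :] = 0
            shifted[:, n:2 * n, :-1, :] = x[:, n:2 * n, 1:, :]  # second group: shift backward in time
            shifted[:, n:2 * n, -1:, :] = 0
            return self.conv(shifted)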

Projects

Here is a selection of my key projects, along with an overview of each.

Text Summarization for Multimodal Pretraining of Medical Data

May 2024 – Present
  • ICU clinical notes can be lengthy, requiring large context lengths for language models to capture the full content.
  • Training language models with large context lengths is computationally prohibitive.
  • We use a Llama3-8B model to summarize notes through in-context learning so that they fit into a BERT model with a 512-token context length (see the sketch after this list).
  • More details coming soon!
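
A minimal sketch of the summarize-then-encode idea behind this project; the model identifiers, the prompt, and the generation settings are illustrative assumptions, and the actual pipeline involves considerably more care with prompting and evaluation:

    from transformers import AutoModel, AutoTokenizer, pipeline

    long_note = "..."  # placeholder for a lengthy ICU clinical note

    # 1) Summarize the note with an instruction-tuned Llama-3 model via in-context prompting.
    summarizer = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")
    prompt = ("Summarize the following ICU note in a few sentences, keeping the "
              "clinically relevant findings:\n\n" + long_note)
    summary = summarizer(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]

    # 2) The much shorter summary now fits within BERT's 512-token context window.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    inputs = tokenizer(summary, truncation=True, max_length=512, return_tensors="pt")
    note_embedding = bert(**inputs).last_hidden_state[:, 0]  # [CLS] representation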

Real-time, Multimodal Invasive Ventilation Risk Monitoring using Language Models and BoXHED

Nov 2023 – Jun 2024
  • Conventional invasive ventilation monitoring in ICUs often ignores clinical notes and relies solely on time-series data.
  • Achieved AUROC of 0.86 and AUCPR of 0.35 by combining clinical text data with time-series data.
  • Used a T5 language model to extract numerical representations from clinical notes (see the sketch after this list).
  • Reduced inference latency by a factor of 8 using PyTorch multiprocessing for multi-GPU inference.
  • Supervised and mentored an undergraduate student throughout this project.
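
A sketch of how numerical representations can be pulled out of clinical notes with a T5 encoder; the checkpoint (t5-base) and the mean pooling are assumptions for illustration, not necessarily the exact configuration used in the project:

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    encoder = T5EncoderModel.from_pretrained("t5-base")

    def embed_note(note: str) -> torch.Tensor:
        """Return a fixed-size vector for one clinical note (mean-pooled encoder states)."""
        inputs = tokenizer(note, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)
        return hidden.mean(dim=1).squeeze(0)              # (d_model,)

    # The resulting vector can be appended to the time-series features fed to the risk model.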

Scalable Boosting of Dynamic Survival Analysis

Jan 2020 – Dec 2023
Estimation of hazard values as a function of time and covariates.
  • Created BoXHED, a Python package for nonparametric hazard estimation (see the data sketch after this list).
  • Achieved 35%+ error reduction and 900x speedup with GPU and multicore CPU support.
  • Published BoXHED1.0 at ICML and BoXHED2.0 in the Journal of Statistical Software (preprint on arXiv).
  • Integrated the code into the XGBoost codebase and made it available on PyPI.
  • Open-source code is available on GitHub.
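
To make the problem concrete, here is an illustrative layout of the longitudinal (start-stop) data that dynamic hazard estimation operates on; the column names are generic placeholders, and the exact schema BoXHED expects is documented on its GitHub page:

    import pandas as pd

    data = pd.DataFrame({
        "ID":      [1, 1, 2],        # subject identifier
        "t_start": [0.0, 2.5, 0.0],  # start of each observation epoch
        "t_end":   [2.5, 4.0, 3.0],  # end of each observation epoch
        "X_0":     [0.7, 1.2, 0.4],  # time-varying covariate value over the epoch
        "delta":   [0, 1, 0],        # 1 if the event occurred at t_end, else 0
    })
    # BoXHED boosts an estimate of the hazard lambda(t, x): the instantaneous
    # event rate at time t given covariate value x, learned from such epochs.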

Teaching Experience

  • Served as a Teaching Assistant for multiple semesters in Machine Learning (graduate and undergraduate), Information Retrieval (undergraduate), and Operating Systems (undergraduate).
  • Delivered lectures on Python coding as part of the Machine Learning course.

Skills

These are the areas where I excel:

Languages: Python / C++ / CUDA C / R
Tools (Python): PyTorch / Transformers / Hugging Face / Pandas / Weights & Biases / DeepSpeed / NumPy
Machine Learning: Transformers / Generative AI / Language Models / NLP / Distributed Training / Model Transparency / Boosting
General: Linear Algebra / Time Series Analysis / Statistical Modeling / AWS