Joshua Fan


Note: these are mostly old class projects. See my CV for more recent information.

Edit Embedding via Reinforcement Learning (with Shunfu Mao)

The goal of this project was to learn an embedding for edit distance, such that the Hamming distance between two embedded strings approximates the true edit distance between the original strings. Such an embedding would make clustering by edit distance more efficient, with applications to RNA transcriptome analysis, protein analysis, and spell checking. We attempted to use Seq2Seq neural networks with attention to first learn an existing edit embedding, and then to improve it with the REINFORCE algorithm. Unfortunately, initial results were not positive, but I hope to revisit this problem.
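To make the two distances concrete, here is a minimal Python sketch (not code from the project) that computes edit distance with the standard dynamic program and Hamming distance by position-wise comparison. The function names and the example strings are illustrative assumptions. It shows why raw strings are a poor "embedding": a one-character shift gives a small edit distance but a large Hamming distance, which is the gap a learned embedding would need to close.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


# Illustrative example: the second string is the first one shifted by one character.
original = "ACGTACGT"
shifted = "CGTACGTA"
print(edit_distance(original, shifted))     # 2 (delete the leading A, append an A)
print(hamming_distance(original, shifted))  # 8 (every position differs)
```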

Poster | Report

Optimization Algorithms for Single-Cell Transcriptome Analysis (with Sumit Mukherjee)

During a graduate Machine Learning course, we worked on solving a variant of a matrix factorization problem more efficiently to infer cell types from single-cell transcriptomic data. Previous approaches relied on very slow optimization procedures, and we sought to speed them up. We implemented various online optimization algorithms and techniques (such as Stochastic Gradient Descent, Stochastic Variance Reduced Gradient, and Adagrad) in MATLAB and Python to minimize the loss function more efficiently. We compared the convergence rates of these methods and verified that the cell archetypes identified were usable for downstream applications such as lineage estimation.
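As a rough illustration of the kind of comparison described above, the following Python sketch fits a plain low-rank factorization X ≈ U Vᵀ with per-entry stochastic updates, using either a fixed SGD step or Adagrad's per-parameter step. This is an assumed simplification, not the project's code: the actual problem was a variant of matrix factorization with its own loss, SVRG is omitted, and the function names, learning rate, and initialization here are hypothetical.

```python
import math
import random


def mean_squared_error(X, U, V):
    """Average squared reconstruction error of X by U V^T."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            pred = sum(u * v for u, v in zip(U[i], V[j]))
            total += (pred - x) ** 2
    return total / (len(X) * len(X[0]))


def factorize(X, rank, method="sgd", lr=0.05, epochs=200, seed=0):
    """Fit X ~ U V^T one entry at a time; method is "sgd" or "adagrad"."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(m)]
    # Adagrad state: running sum of squared gradients for every parameter.
    GU = [[0.0] * rank for _ in range(n)]
    GV = [[0.0] * rank for _ in range(m)]
    entries = [(i, j) for i in range(n) for j in range(m)]
    history = [mean_squared_error(X, U, V)]
    for _ in range(epochs):
        rng.shuffle(entries)
        for i, j in entries:
            err = sum(u * v for u, v in zip(U[i], V[j])) - X[i][j]
            for k in range(rank):
                # Gradients of 0.5 * err^2, both computed before either update.
                gu, gv = err * V[j][k], err * U[i][k]
                if method == "adagrad":
                    GU[i][k] += gu * gu
                    GV[j][k] += gv * gv
                    U[i][k] -= lr * gu / (math.sqrt(GU[i][k]) + 1e-8)
                    V[j][k] -= lr * gv / (math.sqrt(GV[j][k]) + 1e-8)
                else:
                    U[i][k] -= lr * gu
                    V[j][k] -= lr * gv
        history.append(mean_squared_error(X, U, V))
    return U, V, history
```

Calling factorize on the same matrix with method="sgd" and method="adagrad" and plotting the two history lists gives a simple per-epoch convergence comparison.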

Poster | Report

Storage and Retrieval of Robotic Laser Range Data in Database Systems (with Kaiyu Zheng)

For my graduate Databases course project, Kaiyu Zheng and I implemented a database for laser-range scans to allow for efficient content-based querying and retrieval of images. We used the Bag-of-Visual-Words representation to store images as feature vectors, and implemented the Flexible Image Database System and Locality Sensitive Hashing techniques to speed up nearest-neighbor search. Such a system could be used to expand image training data sets by inferring labels for unseen images from similar stored images.
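The idea behind using Locality Sensitive Hashing for nearest-neighbor search can be sketched as follows. This is a minimal single-table example using random-hyperplane hashing (which targets cosine similarity); the project description does not say which LSH family was used, so that choice, the class name, and the parameters are all assumptions. Vectors that fall in the same bucket become candidates, and only those candidates are ranked exactly.

```python
import math
import random
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class HyperplaneLSH:
    """Single hash table keyed by the sign pattern against random hyperplanes."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                     for plane in self.planes)

    def add(self, key, vec):
        self.buckets[self.signature(vec)].append((key, vec))

    def query(self, vec):
        """Return keys from the query's bucket, most similar first."""
        candidates = self.buckets.get(self.signature(vec), [])
        ranked = sorted(candidates,
                        key=lambda item: cosine_similarity(vec, item[1]),
                        reverse=True)
        return [key for key, _ in ranked]
```

A single table can miss true neighbors that land in a different bucket; practical systems usually query several independent tables and merge the candidates, trading memory for recall.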

Poster | Report

Political Speech Clustering (with Irina Tolkova)

For my undergraduate Machine Learning course project, Irina Tolkova and I implemented several document clustering techniques, such as K-Means++, bisecting K-Means, and spectral clustering, and applied them to tf-idf vectors of 2016 political campaign speeches. These algorithms produced interesting and interpretable clusters of speeches, grouped by candidate and by issue.
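The pipeline described above can be sketched in a few dozen lines of Python. The example below is an illustrative assumption rather than the project's code: it uses whitespace tokenization, a basic tf-idf weighting with unit-length normalization, and K-Means with K-Means++ seeding on sparse dictionary vectors. Bisecting K-Means and spectral clustering are not shown, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a unit-length sparse tf-idf vector (dict of word -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
               for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b))


def kmeans_plus_plus(vectors, k, iterations=20, seed=0):
    """Cluster sparse vectors; returns one cluster label per vector."""
    rng = random.Random(seed)
    # K-Means++ seeding: each new center is sampled with probability
    # proportional to its squared distance from the nearest existing center.
    centers = [rng.choice(vectors)]
    while len(centers) < k:
        weights = [min(squared_distance(v, c) for c in centers) for v in vectors]
        if sum(weights) == 0:
            centers.append(rng.choice(vectors))  # every point already sits on a center
        else:
            centers.append(rng.choices(vectors, weights=weights)[0])
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center for every vector.
        labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                  for v in vectors]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                words = set().union(*members)
                centers[c] = {w: sum(m.get(w, 0.0) for m in members) / len(members)
                              for w in words}
    return labels
```

Given a list of speech transcripts named speeches (a hypothetical variable), kmeans_plus_plus(tfidf_vectors(speeches), k=5) would return a cluster label for each speech; the result depends on the seed, as with any K-Means variant.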


Contextual Bandits notes (with Lalit Jain, Neeraja Abhyankar, Kunhui Zhang)

For an Online and Adaptive Machine Learning course, I worked with two other graduate students and a postdoc (Lalit Jain) to survey recent research on contextual bandits and create a report synthesizing important results from that area.