Hi, I'm Sanjay Rathee

Data Scientist

I am Sanjay Rathee, a Data Scientist with eight years of experience in Big Data analytics and Machine Learning algorithms. I am particularly interested in applying the latest computing technology to real-life problems. At present, I am working as a research scientist in the Department of Oncology at the University of Oxford. I have expertise in designing machine learning models that predict a cancer patient's response to a particular treatment from the patient's gene expression data.

My Skills

Big Data Analytics

I have been working with Big Data for the last 9 years. It was my major during my master's and Ph.D.

Hadoop/Spark/Flink

90%

HBase/Hive/Kafka

85%

R/Python/C/C++

80%

Scala/Java

90%

Machine Learning

I have been working with Machine Learning algorithms for the last 7 years. It was my major during my Ph.D.

Association Rule Mining

95%

Deep Learning

90%

Recommendation Systems

85%

Classification

95%

Education

2018-Present

Postdoc Research Scientist

University of Oxford

Implemented pipelines to predict the response of cancer patients to radiotherapy. Linear Machine Learning models are used to find differentially expressed genes for cancer patients.

2014-2018

PhD Computer Engineering

Indian Institute of Technology Mandi (IIT Mandi)

Title: Distributed Algorithms for Alignment and Analysis of Big Data generated by Next-Generation Sequencing using Big Data Frameworks.

2011-2013

M.Tech Computer Engineering

UIET Kurukshetra University

Title: Big Data Analytics using Apache Hadoop. Majors: Big Data Analytics, Machine Learning, Distributed Computing Frameworks

2007-2011

B.Tech Computer Engineering

Maharshi Dayanand University Rohtak

Final Project Titles: City Event Management System and Organisation Software Management System

2005-2007

Secondary School

Vaish Public School Rohtak

Major: Engineering Sciences
Subjects: Mathematics, Physics, Chemistry, Physical Science

Projects

Prediction Model for Cancer Data

Implemented pipelines to predict the response of cancer patients to radiotherapy. Linear machine learning models are used to find differentially expressed genes in cancer patients. Support vector machines then use these genes to predict each patient's response to radiotherapy. Clinical trial data is also used to improve the accuracy of the predictor.
Keywords: R, General Linear Model, Penalized Linear Regression, Lasso, Random Forest, Support Vector Machines
Blog: Predicting-response-in-cancer-patients-using-machine-learning-models

StreamAligner

Proposed StreamAligner, the first sequence aligner that can map streams of reads to a reference genome. It can automate the sequencing, alignment, and analysis tasks. Implemented StreamAligner in Java on top of the Apache Spark streaming engine.
Keywords: Java, Apache Spark, Distributed Sequence Alignment, Stream support
GitHub: https://github.com/sanjaysinghrathi/StreamAligner

AVLR-Mapper

Proposed a new sequence alignment algorithm, AVLR-Mapper, which outperforms nearly all current sequence aligners. It is the first sequence aligner with a distributed index generator and an efficient search mechanism. Implemented AVLR-Mapper in Java on Apache Spark, a MapReduce-based distributed computing platform.
Keywords: Java, Apache Spark, Distributed Sequence Alignment
GitHub: https://github.com/sanjaysinghrathi/AVLR-Mapper

R-Apriori

Proposed a new distributed association rule mining algorithm, R-Apriori, to find frequent patterns, and used it to inform business strategies. Implemented R-Apriori in Scala on Apache Spark, a MapReduce-based distributed computing platform. R-Apriori uses a reduced approach to cut computation during the second iteration of Apriori. The reduced approach dramatically lowers the computational cost by eliminating the candidate generation step and avoiding costly comparisons. Our studies show that this approach is many times faster than both classical Apriori and the state-of-the-art Apriori implementation on Spark across different datasets.
Keywords: Scala, Apache Spark, Distributed Association Rule Mining
Article: https://dl.acm.org/doi/10.1145/2809890.2809893
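
The idea behind the reduced second iteration can be illustrated with a small single-machine sketch. This is plain Python rather than the Scala/Spark implementation, and it reflects only one reading of the description above: each transaction is pruned to its frequent items and the pairs it contains are counted directly, so no candidate set is built or compared against. The function names and the toy baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_items(transactions, min_support):
    """First pass: count single items and keep those meeting min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, c in counts.items() if c >= min_support}


def frequent_pairs_reduced(transactions, min_support):
    """Second pass without an explicit candidate set (illustrative only).

    Each transaction is pruned to its frequent items, and the pairs that
    actually occur in it are counted directly.
    """
    keep = frequent_items(transactions, min_support)
    pair_counts = Counter()
    for t in transactions:
        pruned = sorted(set(t) & keep)  # sorted so each pair has one key
        pair_counts.update(combinations(pruned, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


# Toy data, made up for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
result = frequent_pairs_reduced(baskets, min_support=3)
```

With a minimum support of 3, four pairs survive in this toy data. In a distributed setting the per-transaction pair counting would map naturally onto a map and reduce-by-key step, but that part is not shown here.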


Adaptive-Miner

Proposed an adaptive approach to frequent itemset mining that combines the conventional Apriori and reduced Apriori approaches to find frequent patterns. Adaptive-Miner, a distributed algorithm based on this adaptive approach, is implemented on Spark. Adaptive-Miner outperformed both conventional and reduced Apriori on nearly all datasets and in all iterations. These association rule mining algorithms are used for Bioinformatics applications such as motif discovery, SNP discovery, subgraph mining, Omics detection, and classification.
Keywords: Scala, Apache Spark, Distributed Association Rule Mining
GitHub: https://github.com/sanjaysinghrathi/Adaptive-Miner
Paper: https://link.springer.com/article/10.1186/s40537-018-0112-0
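
A rough single-machine sketch of what choosing between the two approaches per iteration could look like is given below, in plain Python. The cost model is an assumption invented for this example and is not taken from the paper: the conventional plan is charged one subset check per candidate per transaction, and the reduced plan is charged one unit per k-combination of each pruned transaction. All function names and the toy baskets are hypothetical, and candidates are always generated here only to keep the sketch short.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gen_candidates(prev_frequent, k):
    """Brute-force stand-in for Apriori candidate generation: keep a
    k-itemset only if all of its (k-1)-subsets are frequent."""
    items = sorted({i for s in prev_frequent for i in s})
    return {
        c for c in combinations(items, k)
        if all(sub in prev_frequent for sub in combinations(c, k - 1))
    }


def count_conventional(transactions, candidates):
    """Conventional plan: test every candidate against every transaction."""
    counts = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    return counts


def count_reduced(transactions, keep, k):
    """Reduced plan: prune each transaction, then count its k-combinations."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t) & keep), k))
    return counts


def adaptive_miner(transactions, min_support):
    """Pick the cheaper plan each iteration under a hypothetical cost model."""
    item_counts = Counter(i for t in transactions for i in set(t))
    frequent = {(i,): c for i, c in item_counts.items() if c >= min_support}
    result, plan, k = dict(frequent), [], 2
    while frequent:
        keep = {i for s in frequent for i in s}
        candidates = gen_candidates(set(frequent), k)
        # Assumed costs, for illustration only.
        cost_conventional = len(candidates) * len(transactions)
        cost_reduced = sum(comb(len(set(t) & keep), k) for t in transactions)
        if cost_reduced < cost_conventional:
            plan.append("reduced")
            counts = count_reduced(transactions, keep, k)
        else:
            plan.append("conventional")
            counts = count_conventional(transactions, candidates)
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result, plan


# Toy data, made up for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
result, plan = adaptive_miner(baskets, min_support=3)
```

On this toy data the sketch picks the reduced plan for the second iteration, where every pair of frequent items would be a candidate, and the conventional plan for the third, where only one candidate remains. Both plans return the same counts for itemsets that meet the minimum support, so the choice affects only the work done.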


F-Apriori

An implementation of the R-Apriori approach on Apache Flink. It is a distributed association rule mining algorithm for finding frequent patterns in Big Data. This work is a natural sequel to our earlier work and focuses on implementing, testing, and benchmarking Apriori on Apache Flink and comparing it with the Spark implementation.
Keywords: Scala, Apache Flink, Distributed Association Rule Mining
Paper: https://ieeexplore.ieee.org/document/7732135

Contact

Contact Info

Currently, I am working as a postdoctoral researcher in the Bioinformatics Hub, a collaborative bioinformatics research service recently launched by the Institute for Radiation Oncology. The Hub's mission is to provide collaborative support for Bioinformatics projects and ad-hoc training for biologists and clinical fellows. Its expertise ranges from experimental design to high-throughput Next Generation Sequencing (NGS) analysis, single-cell RNA-seq analysis, and microarray analysis, as well as other common bioinformatics analysis techniques.

Address

Warren Crescent, Headington, Oxford, UK

Skype

sanjaysinghrathi@gmail.com

Email

sanjaysinghrathi@gmail.com
