
Saihiel Bakshi

Data Engineer, Scientist, Developer

Contact Me

About Me

I am a Data Specialist and Developer who is passionate about Big Data Engineering, Machine Learning, and Statistical Modeling. I graduated from the University of Toronto with a degree in Computer Science, Applied Statistics, and Mathematics. In my spare time I can often be found reading research papers or following current events.

I consider myself a self-starter and a forever-student, and I am always looking to learn more and further hone my skillset. I enjoy spending my time reading, programming, playing tennis, or solving challenging problems with unorthodox solutions!

I currently work as a Data Engineer at Springboard Data Management in Toronto, Canada.

Latest Projects

A selection of some of my projects:



Generating News Headlines Using Autoencoders

Writing a good news headline is an art: it requires a strong command of language to grab the reader's attention in a single sentence. In this project I created a model that takes an original news headline and generates a related but new one. This project focuses on the application of deep learning to natural language processing.

Find out more


Ensemble Model for Kaggle Competition on Beijing Pollution Data

This data analysis and regression project won me a private Kaggle competition held for University of Toronto students, with over 200 participating teams and an initially undisclosed dataset. I built a weighted ensemble combining a linear Generalised Additive Model (GAM) with an Extreme Gradient Boosted Tree Regressor (XGBoost).
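A weighted blend like this can be sketched with NumPy alone. The two prediction arrays below are synthetic stand-ins for the fitted GAM and boosted-tree outputs, and the grid search over the blend weight is an illustrative choice, not the competition code:

```python
import numpy as np

# Synthetic stand-ins for the two fitted models' validation predictions.
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)                          # true targets
pred_gam = y_val + rng.normal(scale=0.5, size=200)    # "GAM" predictions (noisier)
pred_xgb = y_val + rng.normal(scale=0.3, size=200)    # "boosted tree" predictions

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Grid-search the blend weight on held-out data.
weights = np.linspace(0, 1, 101)
scores = [rmse(y_val, w * pred_gam + (1 - w) * pred_xgb) for w in weights]
best_w = weights[int(np.argmin(scores))]
blended = best_w * pred_gam + (1 - best_w) * pred_xgb
```

Because the grid includes the weights 0 and 1, the blend can never do worse on the validation set than either model alone.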

Find out more


Analytical Report on Sparse Group LASSO Research Paper

An analytical report on the Sparse Group Lasso method developed by Friedman et al. (2013). I also created a presentation summarizing my findings and analysis of the paper, and used R to reproduce its results and demonstrate the efficacy of the methodology using Monte Carlo simulations.
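For reference, the sparse group lasso penalty combines an ordinary lasso term with a group-lasso term. A minimal NumPy sketch of the penalty itself (the coefficients and group structure below are made up for illustration):

```python
import numpy as np

def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty:
    alpha*lam*||beta||_1 + (1-alpha)*lam * sum_l sqrt(p_l)*||beta_l||_2,
    where p_l is the size of group l."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))                        # lasso part
    group_l2 = 0.0
    for g in groups:                                 # g: index array for one group
        idx = np.asarray(g)
        group_l2 += np.sqrt(idx.size) * np.linalg.norm(beta[idx])
    return alpha * lam * l1 + (1 - alpha) * lam * group_l2

# alpha=1 recovers the lasso; alpha=0 recovers the group lasso.
beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3])]
pen = sparse_group_lasso_penalty(beta, groups, lam=0.5, alpha=0.5)
```

The penalty yields sparsity both at the group level (whole groups zeroed out) and within groups, which is what the report's simulations probe.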

Find out more


Modelling Songs By Release Dates

Built Logistic Regression and K-Nearest Neighbour models to predict which century a song was released in, using the "YearPredictionMSD Data Set" from the Million Song Dataset. I built all my models from scratch using only Python and NumPy.
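A from-scratch k-nearest-neighbour classifier in the same spirit, using only NumPy. The toy data below is illustrative, not the YearPredictionMSD features:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test row by majority vote among its k nearest
    training rows under Euclidean distance, using only NumPy."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    votes = y_train[nearest]                  # (n_test, k) array of labels
    # Majority vote per test row.
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy example: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 0.05], [5.05, 5.05]])
preds = knn_predict(X_train, y_train, X_test, k=3)
```

The broadcasting trick computes all pairwise distances at once; for a dataset the size of YearPredictionMSD one would chunk the test set to keep memory bounded.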

Find out more

Other Projects

Automated Marking System for Grading SQL Assignments (Code Proprietary)

Built an automated marking system to grade SQL assignments for students at the University of Toronto. I also designed a partial marking system that leverages string similarity algorithms to assign partial marks to student submissions. The system is now being used by the University in subsequent database courses, and has increased the efficiency of the Computer Science department by reducing the number of hours Teaching Assistants spend marking assignments by 30%.
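The partial-marking idea can be sketched with Python's standard-library SequenceMatcher. The normalisation and thresholds below are illustrative guesses, not the proprietary scheme:

```python
from difflib import SequenceMatcher

def partial_mark(student_sql, reference_sql, full_marks=10.0, min_ratio=0.6):
    """Assign partial marks proportional to string similarity against a
    reference answer, after light whitespace/case normalisation.
    Thresholds are illustrative, not the production scheme."""
    norm = lambda s: " ".join(s.lower().split())
    ratio = SequenceMatcher(None, norm(student_sql), norm(reference_sql)).ratio()
    if ratio >= 0.999:        # effectively an exact match after normalisation
        return full_marks
    if ratio < min_ratio:     # too dissimilar to earn any credit
        return 0.0
    return round(full_marks * ratio, 1)

reference = "SELECT name FROM students WHERE gpa > 3.5 ORDER BY name"
exact = partial_mark("select name  from students where gpa > 3.5 order by name", reference)
```

In practice a grader would also compare query *results* against a reference database, with string similarity only breaking ties on non-executing submissions.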

Read More

Algorithm for Cleaning Client Data for American Express (Code Proprietary)

Designed a system for removing duplicate client data from massive overlapping databases. The algorithm was designed to handle cases where people's names were spelled differently, addresses were incomplete, or records contained many null values. I built the system in Java and Hive (Hadoop) using fuzzy matching algorithms. My system saved the department a $10,000 monthly subscription cost while providing the same services more accessibly.
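The matching idea can be sketched with standard-library string similarity. This greedy Python version is illustrative only: the field names and threshold are assumptions, and the production system ran in Java/Hive:

```python
from difflib import SequenceMatcher

def _key(record):
    """Concatenate the compared fields, lower-cased and whitespace-normalised;
    missing or null fields become empty strings."""
    return " ".join(" ".join(str(record.get(f) or "").lower().split())
                    for f in ("name", "address"))

def dedupe(records, threshold=0.85):
    """Greedy fuzzy deduplication: keep a record only if it is not
    sufficiently similar to any already-kept record."""
    kept = []
    for rec in records:
        key = _key(rec)
        if all(SequenceMatcher(None, key, _key(k)).ratio() < threshold for k in kept):
            kept.append(rec)
    return kept

clients = [
    {"name": "Jonathan Smith", "address": "12 King St W, Toronto"},
    {"name": "Jonathon Smith", "address": "12 King Street W, Toronto"},  # near-duplicate
    {"name": "Priya Rao", "address": None},                              # null field
]
unique = dedupe(clients)
```

The pairwise comparison here is quadratic; at American Express scale one would first block records (e.g. by phonetic name key) and only compare within blocks, which is what makes a Hadoop implementation worthwhile.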

Automated Foreign Currency Exchange Trading System (Code Proprietary)

Created a system that trades the forex market for my investors using strategies I developed and designed myself. The underlying model leverages Machine Learning techniques and performs time-series analysis on incoming live tick data. I scraped raw time-series data for every available tick from the last 20 years using MetaQuotes Language, applied a range of cleaning techniques and feature transformations, and created a backtesting application in Python. The system uses multiple Recurrent Neural Networks with LSTM cells and is hosted on AWS.
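A minimal sketch of the kind of backtesting loop described. The stub signals stand in for the RNN's output, and the flat per-trade spread cost is an assumption:

```python
import numpy as np

def backtest(prices, signals, spread=0.0001):
    """Vectorised tick backtest: positions in {-1, 0, +1}.
    P&L = position * next price change, minus a flat spread cost
    each time the position changes (an illustrative cost model)."""
    prices = np.asarray(prices, dtype=float)
    signals = np.asarray(signals, dtype=float)
    returns = np.diff(prices)                 # price change per tick
    pnl = signals[:-1] * returns              # position held into each change
    # A trade occurs whenever the position changes (including the first entry).
    trades = np.abs(np.diff(np.concatenate([[0.0], signals])))[:-1]
    return float(np.sum(pnl - trades * spread))

prices = [1.1000, 1.1005, 1.1003, 1.1010]
signals = [1, 1, 0, 0]    # long for the first two ticks, then flat
total = backtest(prices, signals)
```

A real backtester would also model slippage, partial fills, and bid/ask quotes rather than a single mid price, but the accounting skeleton is the same.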

Past Work Experience

Teaching Assistant, Databases - University of Toronto (Sept. 2019 - May 2020)

  • Led tutorials for 150+ students on SQL, database management systems and relational data modeling.
  • Identified a marking workflow that could be automated and built a Python system to do so, increasing the department's time-efficiency by almost 30%.
  • Invigilated and graded examinations and assignments based on Relational Databases concepts and SQL scripting/querying.

Data Scientist, Intern - Prolifics (May 2019 - Sept. 2019)

  • Explored various data sources, uncovered a business opportunity in the client's chat-bot requests using NLP, and deployed a sentiment analysis model that increased customer engagement by 4%.
  • Spearheaded the development of multiple machine learning models and experiments for client propensity modelling, customer behaviour analysis and target market identification.
  • Tech lead for the "Data Science in a Box" project, developed using AWS S3, Stitch, Snowflake, IBM Watson Studio, DataRobot and sklearn.
  • Presented actionable insights derived from tested hypotheses on diverse datasets to project managers and cross-functional teams.
  • Engaged with stakeholders throughout the organization to identify opportunities for leveraging machine learning and AI algorithms to drive business decisions.
  • Created a comparative analysis of automated machine learning solutions, such as DataRobot, IBM AutoAI and auto-sklearn.

Big Data Engineer, Intern - American Express (May 2017 - Sept. 2017)

  • Leveraged data insights derived from statistical tests to optimize a direct marketing campaign, which led to a 1.7% increase in the conversion rate of pre-approved clients.
  • Championed the development of big data algorithms centred around deduplication and entity resolution using Hadoop and Java.
  • Liaised with cross-functional teams across the enterprise to derive data-driven multi-cluster parallel solutions for business use cases deployed in Hadoop.
  • Developed and monitored end-to-end pipeline for data extraction, transformation and loading (ETL) jobs that sourced data from Hadoop.