Resume - 2019-06-17 21:13:38

ABOUT
Currently a data scientist at Spotify.

SKILLS
Python, SQL, Spark, TensorFlow/Keras, Scala, Hadoop/Hive, PHP, APIs, git, HTML5, AWS, GCP, Windows/OSX/Linux, TCP/IP, HTTP/S, SSL, SSH, FTP, Microsoft Excel/Office

EXPERIENCE
Spotify, Boston — Senior Data Scientist (June 2020 - PRESENT)
Product data scientist at Spotify.

Publicis Groupe, Boston — Lead Data Scientist (February 2019 - June 2020)
Machine learning and data science development for Publicis Spine.

Cosine Similarity Spark - 2019-06-18 08:57:30

Cosine similarity between a static vector and each vector in a Spark data frame

Ever want to calculate the cosine similarity between a static vector and each vector in a Spark data frame? Probably not, as this is an absurdly niche problem to solve, but if you ever have, here's how to do it using spark.sql and a UDF.

# imports we'll need
import numpy as np
from pyspark.
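Stripped of the Spark plumbing, the core computation can be sketched in plain numpy; the UDF-wrapping comment below is an illustration, not the post's exact code:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In Spark, a fixed query vector would be closed over and the function
# registered as a UDF, roughly (sketch, names assumed):
#   sim_udf = F.udf(lambda v: cos_sim(query_vec, v), DoubleType())
#   df.withColumn("sim", sim_udf("features"))
```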

Spark + s3a:// = ❤️ - 2019-06-18 09:31:24

Typically our data science AWS workflows follow this sequence:

1. Turn on EC2.
2. Copy data from S3 via awscli to the local machine file system.
3. Code references local data via /path/to/data/.
4. ???
5. Profit.

However, if the data you need to reference is relatively small, or you're only passing over the data once, you can use s3a:// and stream the data directly from S3 into your code. Say we have this script as visits_by_day.
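As a rough sketch, assuming the hadoop-aws s3a connector and AWS credentials are configured (the bucket and key below are made up):

```python
# Build an s3a:// URI so Spark can stream the object directly from S3
# instead of copying it to local disk first.
def s3a_uri(bucket, key):
    return f"s3a://{bucket}/{key.lstrip('/')}"

# Usage with PySpark (requires hadoop-aws on the classpath; sketch only):
#   spark = SparkSession.builder.getOrCreate()
#   df = spark.read.csv(s3a_uri("my-bucket", "visits/2019/06/"), header=True)
```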

List of Lists - 2019-06-18 21:17:24

There are a seemingly infinite number of ways to flatten a list of lists in base Python. By flattening, I mean reducing the dimension of a list of lists. In numpy it would be something like:

import numpy as np
np.arange(1, 13).reshape(4, 3)
# array([[ 1,  2,  3],
#        [ 4,  5,  6],
#        [ 7,  8,  9],
#        [10, 11, 12]])

# becomes
np.arange(1, 13).reshape(4, 3).reshape(-1)
# array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

But what is the fastest and cleanest way in base Python?
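A few of the usual base-Python candidates, sketched on the same 4x3 example:

```python
from itertools import chain

nested = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Nested list comprehension: usually the cleanest option.
flat_comp = [x for row in nested for x in row]

# itertools.chain.from_iterable: typically among the fastest.
flat_chain = list(chain.from_iterable(nested))

# sum() with an empty-list start: concise but quadratic-time, since each
# addition copies the accumulated list; avoid for large inputs.
flat_sum = sum(nested, [])
```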

Using Scala UDFs in PySpark - 2019-10-14 14:18:21

It is often necessary to write UDFs in Python to extend Spark's native functionality, especially when it comes to Spark's Vector objects (the required data structure for feeding data to Spark's Machine Learning library). However, because of the serialization that must take place when passing Python objects to the JVM and back, Python UDFs in Spark are inherently slow. To minimize compute time when using UDFs, it is often much faster to write the UDF in Scala and call it from Python.
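A minimal sketch of the Python side, assuming the Scala UDF has already been compiled into a jar passed to Spark via --jars or spark.jars; the class name com.example.udfs.VecNorm and the DDL return-type string are illustrative assumptions:

```python
# Sketch: expose a JVM (Scala) UDF to PySpark so it runs entirely on the
# JVM, avoiding Python<->JVM serialization per row. Class name is hypothetical.
def register_scala_udf(spark, sql_name, java_class, return_type="double"):
    """Register a compiled Scala UDF class under sql_name for use in spark.sql()."""
    spark.udf.registerJavaFunction(sql_name, java_class, return_type)

# Usage (sketch; requires a running SparkSession with the jar attached):
#   spark = SparkSession.builder.config("spark.jars", "my-udfs.jar").getOrCreate()
#   register_scala_udf(spark, "vec_norm", "com.example.udfs.VecNorm")
#   spark.sql("SELECT vec_norm(features) AS n FROM feats").show()
```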

Plotting COVID-19 Cases in the US - 2020-04-22 20:07:54

Data supplied by the NYTimes GitHub. Beautiful (and easy) plots with Plotly Express. Maps powered by Mapbox (OpenStreetMap). Full Python code on GitHub.

Map View
Note this was limited to the 50 states, DC, and PR.

K-Means Clustering on COVID-19 Cases
Thought it might be interesting to run K-Means on all the cases in the US. To do so, I exploded the county dataset from its aggregated form so that each row represents a single case of COVID-19.
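The explode step can be sketched in pandas; the column names and values below are made-up illustrations, not the post's actual schema:

```python
import pandas as pd

# Toy county-level frame: one row per county with an aggregate case count.
counties = pd.DataFrame({
    "lat":   [42.36, 40.71],
    "lon":   [-71.06, -74.01],
    "cases": [3, 2],
})

# Repeat each county's coordinates once per case, so every row is a single
# COVID-19 case, the per-observation shape K-Means clusters on.
per_case = (
    counties.loc[counties.index.repeat(counties["cases"]), ["lat", "lon"]]
    .reset_index(drop=True)
)
```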

Massachusetts COVID Vaccination Rates - 2021-05-06 21:10:56

Data supplied by mass.gov. Full Python code on GitHub.

MA Choropleth Map
The following is a choropleth map (made with Folium) that shows the percent of each MA zip code with at least one dose of the available COVID vaccines. Mouse over each zip code to see additional information. Updated 7/15/21.