Cosine similarity between a static vector and each vector in a Spark data frame

Ever want to calculate the cosine similarity between a static vector and each vector in a Spark data frame? Probably not, as this is an absurdly niche problem to solve, but if you ever have, here’s how to do it using spark.sql and a UDF.
# imports we'll need
import numpy as np
from pyspark.
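A minimal sketch of the idea, not necessarily the code this post goes on to use. It assumes the vectors live in an `array<double>` column (called `features` here, a hypothetical name) and uses the DataFrame API's `F.udf` rather than a UDF registered for `spark.sql`. The helper names `cosine_similarity` and `with_cosine_similarity` are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D sequences; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def with_cosine_similarity(df, static_vector, vector_col="features", out_col="cos_sim"):
    """Add a column with the similarity of each row's vector to static_vector.

    Assumes `vector_col` is an array<double> column (hypothetical name).
    """
    # Imported lazily so the NumPy helper above works without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # A plain list of floats keeps the UDF closure small and easy to pickle.
    static = [float(x) for x in static_vector]

    # Null vectors yield a null similarity instead of raising inside the UDF.
    sim_udf = F.udf(
        lambda v: None if v is None else cosine_similarity(v, static),
        DoubleType(),
    )
    return df.withColumn(out_col, sim_udf(F.col(vector_col)))
```

With a data frame `df` in hand, `with_cosine_similarity(df, [1.0, 0.0, 2.0])` would return `df` plus a `cos_sim` column. A Python UDF runs row by row, so on large data this is slower than native column expressions.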
Typically our data science AWS workflows follow this sequence:
1. Turn on EC2.
2. Copy data from S3 via awscli to the local machine's file system.
3. Code references local data via /path/to/data/.
4. ???
5. Profit.

However, if the data you need to reference is relatively small, or you’re only passing over the data once, you can use s3a:// and stream the data directly from S3 into your code.
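As a rough illustration of the s3a:// route, here is a minimal sketch that assumes the code is a Spark job, that the hadoop-aws connector is on the classpath, and that AWS credentials can be resolved (for example via an EC2 instance role). The helper names, bucket, and key are hypothetical.

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_csv_from_s3(spark, bucket, key):
    """Read a CSV from S3 into a Spark DataFrame without a local copy.

    `spark` is an existing SparkSession; the S3A connector and credentials
    are assumed to be configured already.
    """
    return spark.read.csv(s3a_uri(bucket, key), header=True, inferSchema=True)
```

Calling `read_csv_from_s3(spark, "my-bucket", "data/visits.csv")` would then replace the copy-to-disk step, at the cost of re-reading from S3 on every pass over the data.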
Say we have this script as visits_by_day.