VectorAssembler example - dense and sparse output
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("IForestExample") \
    .getOrCreate()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
data = [
    {'feature1': 1., 'feature2': 0., 'feature3': 0.3, 'feature4': 0.01},
    {'feature1': 10., 'feature2': 3., 'feature3': 0.9, 'feature4': 0.1},
    {'feature1': 101., 'feature2': 13., 'feature3': 0.9, 'feature4': 0.91},
    {'feature1': 111., 'feature2': 11., 'feature3': 1.2, 'feature4': 1.91},
    {'feature1': 0., 'feature2': 0., 'feature3': 0., 'feature4': 0.1},  # issue happens when I add this line
]
# use a VectorAssembler to gather the features as Vectors (dense)
assembler = VectorAssembler(
    inputCols=list(data[0].keys()),
    outputCol="features"
)
df = spark.createDataFrame(data)
df.printSchema()
df = assembler.transform(df)
df.show()
# last line, features column: a sparse vector
# +--------+--------+--------+--------+--------------------+
# |feature1|feature2|feature3|feature4| features|
# +--------+--------+--------+--------+--------------------+
# | 1.0| 0.0| 0.3| 0.01| [1.0,0.0,0.3,0.01]|
# | 10.0| 3.0| 0.9| 0.1| [10.0,3.0,0.9,0.1]|
# | 101.0| 13.0| 0.9| 0.91|[101.0,13.0,0.9,0...|
# | 111.0| 11.0| 1.2| 1.91|[111.0,11.0,1.2,1...|
# | 0.0| 0.0| 0.0| 0.1| (4,[3],[0.1])|
# +--------+--------+--------+--------+--------------------+
@kai-zh-666
How do you determine whether the transformed vectors are dense?

@mkaranasou
Author

mkaranasou commented Jun 8, 2021

Whether a vector is sparse or dense shows in its format and length. A dense vector prints one value per feature; a sparse vector prints its size followed by only the non-zero indices and values. This example showcases that VectorAssembler does not represent all vectors the same way: the features entry of the last row here is a lot shorter than the previous ones. If you look at that row, all but one of its values equal 0., so VectorAssembler chooses the sparse representation, which makes sense storage-wise.
But the current Isolation Forest implementation does not handle sparse vectors.

If you are referring to this comment, it is outdated; I will update it.
This gist is part of this article: https://towardsdatascience.com/isolation-forest-and-pyspark-part-2-76f7cd9cee56, which showcases that Isolation Forest needs dense vectors only, so VectorAssembler's output cannot be fed to it as-is.
