Created
March 24, 2020 14:30
VectorAssembler example - dense and sparse output
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile

conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')

spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("IForestExample") \
    .getOrCreate()

temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"

data = [
    {'feature1': 1., 'feature2': 0., 'feature3': 0.3, 'feature4': 0.01},
    {'feature1': 10., 'feature2': 3., 'feature3': 0.9, 'feature4': 0.1},
    {'feature1': 101., 'feature2': 13., 'feature3': 0.9, 'feature4': 0.91},
    {'feature1': 111., 'feature2': 11., 'feature3': 1.2, 'feature4': 1.91},
    {'feature1': 0., 'feature2': 0., 'feature3': 0., 'feature4': 0.1},  # issue happens when I add this line
]

# use a VectorAssembler to gather the features into a single vector column
# (the output vectors may be dense or sparse)
assembler = VectorAssembler(
    inputCols=list(data[0].keys()),
    outputCol="features"
)

df = spark.createDataFrame(data)
df.printSchema()
df = assembler.transform(df)
df.show()

# last line, features column: a sparse vector
# +--------+--------+--------+--------+--------------------+
# |feature1|feature2|feature3|feature4|            features|
# +--------+--------+--------+--------+--------------------+
# |     1.0|     0.0|     0.3|    0.01|  [1.0,0.0,0.3,0.01]|
# |    10.0|     3.0|     0.9|     0.1|  [10.0,3.0,0.9,0.1]|
# |   101.0|    13.0|     0.9|    0.91|[101.0,13.0,0.9,0...|
# |   111.0|    11.0|     1.2|    1.91|[111.0,11.0,1.2,1...|
# |     0.0|     0.0|     0.0|     0.1|       (4,[3],[0.1])|
# +--------+--------+--------+--------+--------------------+
Whether a feature vector is shown as dense or sparse depends on how it is stored. A dense vector lists every value, so its printed length equals the number of features. A sparse vector stores only the vector size, the indices of the non-zero entries, and their values, e.g. (4,[3],[0.1]). This example showcases that the VectorAssembler does not represent all rows the same way, so the features value in the last row is a lot shorter than the previous ones. That's because all but one of the features in that row are equal to 0., so the VectorAssembler chooses to represent it as sparse, which makes more sense this way. But the current Isolation Forest implementation does not handle sparse vectors.
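The choice between the two representations can be sketched in plain Python. This is a minimal sketch, assuming the VectorAssembler relies on Spark's compressed-vector rule (sparse is picked when 1.5 * (nnz + 1) < size, where nnz is the number of non-zero values). The function name `picks_sparse` is hypothetical and is not part of any PySpark API.

```python
def picks_sparse(values):
    """Return True if the sparse form is estimated to be the smaller one.

    Assumed rule (believed to mirror Spark's Vector.compressed):
    sparse is chosen when 1.5 * (nnz + 1) < size.
    """
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)  # number of non-zero entries
    return 1.5 * (nnz + 1.0) < size


# The same five rows as in the gist, in feature1..feature4 order.
rows = [
    [1.0, 0.0, 0.3, 0.01],
    [10.0, 3.0, 0.9, 0.1],
    [101.0, 13.0, 0.9, 0.91],
    [111.0, 11.0, 1.2, 1.91],
    [0.0, 0.0, 0.0, 0.1],
]

for row in rows:
    print(row, "sparse" if picks_sparse(row) else "dense")
```

Under that assumption only the last row comes out as sparse, which agrees with the df.show() output in the gist.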
If you are referring to this comment, it is outdated. I will update it.
This gist is part of this article: https://towardsdatascience.com/isolation-forest-and-pyspark-part-2-76f7cd9cee56. It showcases that Isolation Forest needs dense vectors only, so it is not possible to use VectorAssembler for this.
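One possible workaround, not necessarily the one the article uses, is to expand every vector into its dense form before fitting. The sketch below shows the expansion in plain Python, plus a hypothetical helper `with_dense_features` that would apply the same idea to a DataFrame column through a UDF. The helper name and the `features_dense` column name are assumptions made for illustration.

```python
def sparse_to_dense_values(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a full list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def with_dense_features(df, input_col="features", output_col="features_dense"):
    """Hypothetical helper: add a column holding DenseVector copies of input_col."""
    # Imported lazily so the pure-Python helper above runs without pyspark installed.
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql import functions as F

    # toArray() returns all values, including zeros, for both sparse and dense vectors.
    to_dense = F.udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())
    return df.withColumn(output_col, to_dense(F.col(input_col)))


# The sparse row from the gist, (4,[3],[0.1]), written out in full.
print(sparse_to_dense_values(4, [3], [0.1]))  # [0.0, 0.0, 0.0, 0.1]
```

With the gist's DataFrame this would be called as `df = with_dense_features(df)`, after which the dense column could be passed on as the features column.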