from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile

conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')

spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("IForestExample") \
    .getOrCreate()

temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"

data = [
    {'feature1': 1., 'feature2': 0., 'feature3': 0.3, 'feature4': 0.01},
    {'feature1': 10., 'feature2': 3., 'feature3': 0.9, 'feature4': 0.1},
    {'feature1': 101., 'feature2': 13., 'feature3': 0.9, 'feature4': 0.91},
    {'feature1': 111., 'feature2': 11., 'feature3': 1.2, 'feature4': 1.91},
    {'feature1': 0., 'feature2': 0., 'feature3': 0., 'feature4': 0.1},  # issue happens when I add this line
]

# use a VectorAssembler to gather the features as Vectors (dense)
assembler = VectorAssembler(
    inputCols=list(data[0].keys()),
    outputCol="features"
)

df = spark.createDataFrame(data)
df.printSchema()
df = assembler.transform(df)
df.show()

# last line, features column: a sparse vector
# +--------+--------+--------+--------+--------------------+
# |feature1|feature2|feature3|feature4|            features|
# +--------+--------+--------+--------+--------------------+
# |     1.0|     0.0|     0.3|    0.01| [1.0,0.0,0.3,0.01]|
# |    10.0|     3.0|     0.9|     0.1| [10.0,3.0,0.9,0.1]|
# |   101.0|    13.0|     0.9|    0.91|[101.0,13.0,0.9,0...|
# |   111.0|    11.0|     1.2|    1.91|[111.0,11.0,1.2,1...|
# |     0.0|     0.0|     0.0|     0.1|       (4,[3],[0.1])|
# +--------+--------+--------+--------+--------------------+
The representation of a feature vector determines whether it is dense or sparse. A dense vector stores every value and prints as [1.0,0.0,0.3,0.01]; a sparse vector stores only the size, the non-zero indices, and their values, and prints as (size,[indices],[values]) — which is why the last row above is so much shorter than the others. This example showcases that the VectorAssembler does not handle all rows the same way: in the last row all but one of the features are equal to 0.0, so the VectorAssembler chooses the sparse representation, since it makes more sense storage-wise.
But the current Isolation Forest implementation does not handle sparse vectors.
If you are referring to this comment, it is outdated. I will update it.
This gist is part of this article: https://towardsdatascience.com/isolation-forest-and-pyspark-part-2-76f7cd9cee56. It showcases that this Isolation Forest implementation accepts dense vectors only, so the output of the VectorAssembler cannot be used directly.
How do you determine that the transformed vectors are dense?