The Iris dataset is saved in a data package with two resources: the data itself and the codelist for the ‘Species’ variable:
> library(datapackage)
> dp <- opendatapackage("iris")
> dp
[iris]
Location: <./>
Resources:
[iris] The Iris Dataset
[Species-codelist] Iris Species Codes
Get the data. The ‘Species’ field contains numeric codes:
> iris <- dpresource(dp, "iris") |> dpgetdata()
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2       1
2          4.9         3.0          1.4         0.2       1
3          4.7         3.2          1.3         0.2       1
4          4.6         3.1          1.5         0.2       1
5          5.0         3.6          1.4         0.2       1
6          5.4         3.9          1.7         0.4       1
Printing the column shows that a codelist is associated with ‘Species’:
> iris$Species
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3
attr(,"fielddescriptor")
Field Descriptor:
name :"Species"
type :"integer"
codelist:"Species-codelist"
We can also get that codelist directly from the column, because the field descriptor stored with the column has a link back to the data package:
> dpcodelist(iris$Species)
  code      label
1    1     setosa
2    2 versicolor
3    3  virginica
We can use this codelist to convert the field from its numeric values to a factor:
> dptofactor(iris$Species) |> head()
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
We can also specify that all fields with an associated codelist should be converted to factor:
> dpresource(dp, "iris") |> dpgetdata(to_factor = TRUE) |> head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
One annoying little thing in R is that some operations remove attributes. For example, when subsetting rows, the link to the data package and codelist is lost and the following no longer works:
> tmp <- iris[c(1, 51, 101), ]
> dptofactor(tmp$Species)
Warning: Field does not have an associated code list. Returning original vector.
[1] 1 2 3
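The attribute loss can be reproduced in base R without the package. The following is a minimal sketch on a toy data frame; the attribute name "fielddescriptor" is taken from the output above, but its contents here are made up for illustration:

```r
# Toy column carrying a custom attribute (contents are illustrative only).
df <- data.frame(Species = c(1L, 2L, 3L))
attr(df$Species, "fielddescriptor") <- list(name = "Species",
                                            codelist = "Species-codelist")

# Extracting the whole column keeps the attribute ...
has_attr_full <- !is.null(attr(df$Species, "fielddescriptor"))

# ... but row subsetting rebuilds each column with `[`, which drops it.
sub <- df[c(1, 3), , drop = FALSE]
has_attr_sub <- !is.null(attr(sub$Species, "fielddescriptor"))

print(c(full = has_attr_full, subset = has_attr_sub))
```

This is standard behaviour of `[` on plain vectors: only names, dims and dimnames survive subsetting.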
In that case we have to get the codelist ourselves when we want to convert to factor:
> codel <- dp |> dpresource("iris") |> dpfield("Species") |> dpcodelist()
> # or
> # codel <- dp |> dpresource("Species-codelist") |> dpgetdata()
> dptofactor(tmp$Species, codelist = codel)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
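Another option might be to copy the attribute back onto the subset by hand. The sketch below uses base R only; `copy_attr` is a hypothetical helper, not part of the package, and whether `dptofactor` would then find the codelist again is an assumption that has not been verified here:

```r
# Hypothetical helper: copy one named attribute from `from` onto `to`.
copy_attr <- function(from, to, which = "fielddescriptor") {
  attr(to, which) <- attr(from, which)
  to
}

# Toy data standing in for the iris data frame read from the package.
df <- data.frame(Species = c(1L, 2L, 3L))
attr(df$Species, "fielddescriptor") <- list(name = "Species",
                                            codelist = "Species-codelist")

sub <- df[c(1, 3), , drop = FALSE]                 # attribute is dropped here
sub$Species <- copy_attr(df$Species, sub$Species)  # and restored here
str(attr(sub$Species, "fielddescriptor"))
```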
And even if we don’t have any facilities for automatically using the codelist (e.g. only the facilities offered by version 1 of the datapackage spec), we can use existing functionality to see what the codelist is, read it, and apply it:
> dp |> dpresource("iris") |> dpfield("Species")
Field Descriptor:
name :"Species"
type :"integer"
codelist:"Species-codelist"
> codel <- dp |> dpresource("Species-codelist") |> dpgetdata()
> tmp$Species <- factor(tmp$Species, levels = codel[[1]], labels = codel[[2]])
> tmp
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
101          6.3         3.3          6.0         2.5  virginica
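When applying a codelist by hand like this, note that `factor()` silently turns any code that is missing from `levels` into `NA`. A minimal sketch of a stricter conversion follows; `codes_to_factor` is a hypothetical helper, and the `codel` data frame is a stand-in for the codelist read from the "Species-codelist" resource:

```r
# Stand-in for the codelist resource: first column codes, second labels.
codel <- data.frame(code = 1:3,
                    label = c("setosa", "versicolor", "virginica"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: convert codes to a factor, failing loudly on codes
# that are absent from the codelist instead of silently producing NA.
codes_to_factor <- function(x, codelist) {
  unknown <- setdiff(unique(x[!is.na(x)]), codelist[[1]])
  if (length(unknown) > 0) {
    stop("codes not in codelist: ", paste(unknown, collapse = ", "))
  }
  factor(x, levels = codelist[[1]], labels = codelist[[2]])
}

# Missing values in the data stay missing; unknown codes raise an error.
species <- codes_to_factor(c(1L, 2L, 3L, NA), codel)
print(species)
```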