Search and Download datasets from data.gov.in
---
title: "Search and Download datasets from data.gov.in"
author: "Sainath Adapa"
date: "25 March 2017"
output:
  ioslides_presentation:
    highlight: pygments
    widescreen: yes
    smaller: true
    df_print: paged
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(message = FALSE, cache = TRUE)
```

## What is data.gov.in

- The Open Government Data (OGD) Platform India, data.gov.in, supports the Open Data initiative of the Government of India.
- Government of India ministries, departments, and their organizations use the portal to publish the datasets, documents, services, tools, and applications they have collected, for public use.
- It aims to increase transparency in the functioning of the Government and to open avenues for more innovative uses of government data that offer different perspectives.

## Searching and Downloading a dataset from data.gov.in

1. Enter keywords and search.
2. Click on the relevant search result. This takes you to the catalog containing that dataset.
3. Go through the pages in the catalog to find the right dataset.
4. Download the dataset.

## Step 1 - Enter keywords and search

![](pic-1.png)

## Step 2 - Click on the relevant search result

![](pic-2.png)

## Step 3 - Find the dataset in the catalog

![](pic-3.png)

## Step 3 - Find the dataset in the catalog

**If you are lucky, the dataset you are looking for will be on the first page**

![](pic-4.png)

## ogdindiar package

- OGD (Open Government Data) INDIA R
- Available at https://github.com/steadyfish/ogdindiar
- Not on CRAN yet. Install using the command `devtools::install_github("steadyfish/ogdindiar")`
- Provides functions to search and download datasets from data.gov.in
- Since there is no API available for searching datasets, these functions use `rvest` to do web scraping
- Also, ogdindiar provides a function (`fetch_data`) to download a dataset using the API. Note that not all datasets have API access.
- Refer to the vignettes for more information:
    1. [API access](https://github.com/steadyfish/ogdindiar/blob/master/vignettes/basic-usage-vignette.md)
    2. [Search functionality](https://github.com/steadyfish/ogdindiar/blob/master/vignettes/search-functionality.md)
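
## ogdindiar package

To give a feel for the scraping approach, here is a minimal sketch of pulling names and links out of a results page with `rvest`. The HTML snippet, the `div.result` selector, and the `results_df` name are made up for illustration; the real data.gov.in markup and the selectors that ogdindiar uses internally may well differ.

```{r}
library(rvest)

# Hypothetical markup standing in for a search-results page.
# The real data.gov.in HTML and CSS classes may differ.
page <- read_html('
  <div class="result"><a href="/catalog/census-a">Census A</a></div>
  <div class="result"><a href="/catalog/census-b">Census B</a></div>
')

# Select every link inside a result block
links <- html_nodes(page, 'div.result a')

# Collect the link text and target into a data frame
results_df <- data.frame(
  name = html_text(links),
  link = html_attr(links, 'href'),
  stringsAsFactors = FALSE
)
results_df
```

The `name` and `link` columns mirror the columns used from `catalogs_df` on the following slides.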

## Search for the right catalog

```{r}
library(dplyr)
library(ogdindiar)
catalogs_df <- search_for_datasets(search_terms = 'age sex population',
                                   limit_catalog_pages = Inf,
                                   limit_catalogs = 10,
                                   return_catalog_list = TRUE)
catalogs_df %>% glimpse
```

## Search for the right catalog

```{r}
catalogs_df$name
```

## Get the list of datasets from the catalog

```{r}
datasets_df <- get_datasets_from_a_catalog(catalog_link = catalogs_df$link[1],
                                           limit_dataset_pages = Inf,
                                           limit_datasets = 5)
datasets_df %>% glimpse
```

## Get the list of datasets from the catalog

```{r}
datasets_df$name
```

## Download a dataset

```{r}
datasets_df %>% slice(6) %>% glimpse
```

```{r}
download_dataset(urllink = datasets_df$excel[6], filepath = 'delhi.xls')
```

## Reading the dataset

```{r}
delhi_data_df <- readxl::read_excel('delhi.xls')
delhi_data_df %>% glimpse
```

## Exploring the dataset

```{r}
library(ggplot2)
gg <- delhi_data_df %>%
  filter(`Age-Group` %in% 'All ages',
         `Area Name` %in% 'UNION TERRITORY - DELHI 07') %>%
  select(`Birth Place`, `Place of Enumeration - Total Persons`) %>%
  setNames(c('var', 'val')) %>%
  slice(8:42) %>%
  arrange(desc(val)) %>%
  head(n = 15) %>%
  mutate(var = factor(var, levels = rev(var))) %>%
  ggplot() +
  geom_bar(aes(x = var, y = val), stat = 'identity') +
  coord_flip(expand = FALSE) +
  theme_bw() +
  # Labels follow the aesthetics, not the flipped axes:
  # x holds the state, y holds the person count
  xlab('State') +
  ylab('Population') +
  scale_y_continuous(labels = scales::comma) +
  ggtitle('Delhi Residents - Top states by Place of Birth', subtitle = 'Excluding Delhi')
```

## Exploring the dataset

```{r echo=FALSE}
gg
```

## Go play!

Download data for other states from this catalog, and explore inter-state migration patterns.

Take a look at the code in https://github.com/sainathadapa/population-pyramid-states-india if you get stuck.
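
## Go play!

As a possible starting point, the sketch below downloads every dataset listed in `datasets_df` from the earlier slides, using the same `download_dataset()` call shown before. It assumes that datasets without an Excel file have `NA` in the `excel` column, and the `state-01.xls` style file names are a made-up convention. The chunk is not evaluated because it needs network access.

```{r, eval=FALSE}
library(ogdindiar)

# Assumes datasets_df from the earlier slides, with an `excel` column of URLs.
# Dropping NA entries is a guess at how missing links are represented.
excel_links <- datasets_df$excel[!is.na(datasets_df$excel)]

for (i in seq_along(excel_links)) {
  # Hypothetical naming scheme: state-01.xls, state-02.xls, ...
  download_dataset(urllink = excel_links[i],
                   filepath = sprintf('state-%02d.xls', i))
}
```

Each file can then be read with `readxl::read_excel()` as on the "Reading the dataset" slide.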

## Thanks!

- If you run into problems with the functions, or have ideas for contributions, please create an issue on the GitHub page
- You can contact me on Twitter: @sainathadapa
- This presentation was written in R Markdown. Code for this presentation is available at https://gist.github.com/sainathadapa/72a2412f512dde220307b1f907dc62f6