seandavi · March 15, 2019 15:33
diff --git a/omicidx_intro.Rmd b/omicidx_intro.Rmd
 ---
 title: "Playing with OmicIDX"
 author: "Sean Davis"
 date: "3/14/2019"
 output: 
  BiocStyle::html_document:
    toc_float: True
 ---

 # Introduction to the OmicIDX API

 OmicIDX parses and then serves public genomics repository metadata. These metadata
 are growing quickly, updated often, and are now very large when taken as a whole.
 Because of the interrelated nature of the metadata and the myriad approaches and use cases that 
 exist, including search, bulk download, and even data mining, we serve the data via
 a GraphQL endpoint.

 Currently, OmicIDX contains the SRA and Biosample metadata sets. These overlap with 
 each other, but SRA metadata contains deeper metadata than Biosample data on the same
 samples. However, Biosample contains many more samples and currently includes metadata 
 about a subset of NCBI GEO, all SRA samples, and some additional samples from projects 
 like Genbank.

 # GraphQL for accessing OmicIDX data

 [GraphQL] is a query language for APIs and a runtime for fulfilling those queries with existing data. 
 GraphQL provides a complete and understandable description of the data in the API, gives clients 
 the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time, 
 and enables powerful developer tools.

 GraphQL has only a *single url*, called the endpoint, which allows access to all data in the 
 API. GraphQL is also a *query language*. It is the GraphQL query that is submitted to the GraphQL
 endpoint that results in data being returned. 


 [GraphQL]: https://graphql.org/

 ## What is a GraphQL query?

 A GraphQL query looks a bit like JSON, except without quotes or commas. Here is an example
 GraphQL query for a fictitious GraphQL API.

 ```
 { 
  allCharacters {
    name
  }
 }
 ```

 If we had a server that contained Star Wars trivia, the response from the server might look like:

 ```
 { "data": {
    "allCharacters": [
      { 
        "name":"Luke"
      },
      { 
        "name": "Darth"
      },
      ...
    ]
  }
 }
 ```

 If we changed the query to:

 ```
 { 
  allCharacters {
    name
    mass
  }
 }
 ```

 the response would now look like:

 ```
 { "data": {
    "allCharacters": [
      { 
        "name":"Luke",
        "mass": 80
      },
      { 
        "name": "Darth",
        "mass": 140
      },
      ...
    ]
  }
 }
 ```

 ## How do I know what is in the GraphQL endpoint?

 The GraphQL **schema** describes the data model(s) contained in the GraphQL endpoint. GraphQL is 
 strongly typed, has the concept of relationships between data types, and is self-documenting. 
 In fact, one can use the GraphQL endpoint to discover what is in the endpoint. I will not go 
 into the details right now, but this *introspection* capability makes possible some powerful
 tooling. One of the most ubiguitous is the so-called **Graph*i*QL** (note the *i* in the name) 
 tool.

 ## Exercise 1

 Navigate to [Graph*i*QL] and follow along with the video (no sound).[^urlchange]

 [^urlchange]: The urls in this document *are subject to change*. 

 <iframe width="560" height="315" src="https://www.youtube.com/embed/1Zg_Fbt56kc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

 [Graph*i*QL]: http://graphql-omicidx.cancerdatasci.org/graphiql

 # Querying OmicIDX programmatically

 GraphQL is quite easy to work with programmatically. All queries are made via a post request to the GraphQL endpoint. The POST request needs to be JSON encoded and must include the `"query"` key. A simple example of 
 the JSON for a basic query might look like:

 ```
 { "query": "{ heros { name weight } }"}
 ```

 Note that the query string (ie., `"{ heros { name weight } }"`) is just a string. It is not formated as JSON itself. 

 - Current OmicIDX GraphQL endpoint[^urlchange]: http://graphql-omicidx.cancerdatasci.org/graphql

 For example, let us get the first 500 SRA studies (500 is a limit to the number of results that we can retrieve in one go. 
 The GraphQL might look like this:

 ```
 {
  allSraStudies {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
 }
 ```

 From exercise 1, you know that you can copy this query into the Graph*i*QL browser and get results. How about using [curl](https://curl.haxx.se/)? Note that I am doing some gymnastics below to make this work in one 
 command line. In practice, one would probably save the GraphQL query as a file (in json format) and then
 post that file[^curlnote]. 

 [^curlnote]: *For examples of how to post JSON using curl, see [this stackoverflow post](https://stackoverflow.com/questions/7172784/how-to-post-json-data-with-curl-from-terminal-commandline-to-test-spring-rest?rq=1)

 ```{bash}
 curl --silent \
  -X POST \
  -H "Content-Type: application/json" \
  http://graphql-omicidx.cancerdatasci.org/graphql \
  --data @- << EOF
  
  { "query":"{
  allSraStudies(first: 1) {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
 }"
 }
 EOF
 ```

 Note that the return value is simply JSON. All our normal tools for working with JSON are available to us
 to process and manipulate the results. 

 # Using OmicIDX from R

 As a review, we know that:

 - GraphQL queries look like JSON, but are not quite the same.
 - We can use the [Graph*i*QL] graphical tool to help us write our query, including autocompleting fields, etc.
 - GraphQL queries are `POST`ed to the graphql endpoint and results are returned as JSON.

 In order to query from R, then, we need to:

 1. Define our query.
 2. `POST` the query, encoded as a simple JSON data structure.
 3. Deal with the resulting JSON data that comes back as a result of the post in #2.

 ## Defining the query

 Again, the easiest way to define the query is to experiment witht he [Graph*i*QL] query tool. Once a query works as expected, the query can be reused from R.

 In this case, we are going to continue with the query we worked with above:

 ```
 {
  allSraStudies(first: 1) {
    edges {
      node {
        accession
        title
        abstract
      }
    }
  }
 }
 ```


 ## Performing the query

 The `CRANpkg('httr')` package performs http requests, including `POST` requests.

 With our query defined, we need to prepare the POST body which will be sent to the server as JSON. In R, 
 basicc lists are the information equivalent of JSON objects.

 ```{r}
 post_body = list(query = "
 {
  allSraStudies(first: 20) {
    edges {
      node {
        accession
        title
        sraExperimentsByStudyAccession {
          totalCount
        }
      }
    }
  }
 }
 ")
 ```

 The URL for the OmicIDX endpoint is: `http://graphql-omicidx.cancerdatasci.org/graphql`.[^urlchange]

 ```{r}
 endpoint = "http://graphql-omicidx.cancerdatasci.org/graphql"
 ```

 I am going to lead the reader a bit here and jump to a "handler" function that will
 convert the returned JSON into convenient R data structures. 

 ```{r}
 handler = function(response) {
  jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)
 }
 ```

 Finally, we are ready to perform our query. 

 ```{r}
 resp = httr::POST(endpoint, body = post_body, encode = 'json')
 resp
 ```

 And finally, convert the response to convenient R data structure.

 ```{r}
 result = handler(resp)
 knitr::kable(result$data[[1]]$edges)
 ```

 We can change our query a bit to get some more details.

 ```{r}
 study_experiment_counts_query = "
 {
  allSraStudies(first: 10) {
    edges {
      node {
        accession
        title
        sraExperimentsByStudyAccession {
          totalCount
        }
      }
    }
  }
 }
 "
 ```

 And streamline our code a bit as a function.

 ```{r}
 # Borrow "handler" function from above
 # Borrow "endpoint" URL from above
 graphql_query = function(gql) {
  .handler = function(response) {
    jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)$data[[1]]$edges
  }
  post_body = list(query = gql)
  resp = httr::POST(endpoint, body = post_body, encode = 'json')
  return(.handler(resp))
 }
 ```

 ```{r}
 knitr::kable(graphql_query(study_experiment_counts_query))
 ```

 One might be interested in filtering the results, also. For example, to search study titles for "cancer" 
 (case insensitive) we can add a filter to the query.

 ```{r}
 cancer_studies_query = '
 {
  allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
    edges {
      node {
        accession
        title
      }
    }
  }
 }
 '
 knitr::kable(graphql_query(cancer_studies_query))
 ```

 To check the total count of available studies with "cancer" in the title, we can again write a query for that. Note that this query 

 ```{r}
 cancer_studies_query = '
 {
  allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
    totalCount
  }
 }
 '
 ```

 A keen eye will notice that there is not an `edges` component in this query. Our `handler` above returns
 the `edges` component, though. So, we are going to drop back to basic httr code here, again. 

 ```{r}
 httr::content(
  httr::POST(endpoint, body = list(query=cancer_studies_query), encode='json'),
  as="parsed"
 )
 ```

 ## Related objects

 One of the unique aspects of GraphQL is the ability to treat data as a graph, with data entities related
 to each other logically linked. For example, we can fetch SRA experiments and link them each to the study
 under which they were performed.

 ```{r}
 experiments_with_study = '
 {
  allSraExperiments(first: 5) {
    edges{
      node{
        accession
        title
        libraryStrategy
        sraStudyByStudyAccession{
          title
          accession
        }
      }
    }
  }
 }
 '
 ```

 The graphql query from R still works the same way, so we execute as before.

 ```{r}
 result = graphql_query(experiments_with_study)
 knitr::kable(result)
 ```

 ## Fetching all results

 **TODO**

 - Describe cursors

 ## Using variables

 **TODO**

 - Describe adding variables to POST body


 # Conclusion

 **TODO**
	---
	title: "Playing with OmicIDX"
	author: "Sean Davis"
	date: "3/14/2019"
	output:
	BiocStyle::html_document:
	toc_float: True
	---

	# Introduction to the OmicIDX API

	OmicIDX parses and then serves public genomics repository metadata. These metadata
	are growing quickly, updated often, and are now very large when taken as a whole.
	Because of the interrelated nature of the metadata and the myriad approaches and use cases that
	exist, including search, bulk download, and even data mining, we serve the data via
	a GraphQL endpoint.

	Currently, OmicIDX contains the SRA and Biosample metadata sets. These overlap with
	each other, but SRA metadata contains deeper metadata than Biosample data on the same
	samples. However, Biosample contains many more samples and currently includes metadata
	about a subset of NCBI GEO, all SRA samples, and some additional samples from projects
	like Genbank.

	# GraphQL for accessing OmicIDX data

	[GraphQL] is a query language for APIs and a runtime for fulfilling those queries with existing data.
	GraphQL provides a complete and understandable description of the data in the API, gives clients
	the power to ask for exactly what they need and nothing more, makes it easier to evolve APIs over time,
	and enables powerful developer tools.

	GraphQL has only a single url, called the endpoint, which allows access to all data in the
	API. GraphQL is also a query language. It is the GraphQL query that is submitted to the GraphQL
	endpoint that results in data being returned.


	[GraphQL]: https://graphql.org/

	## What is a GraphQL query?

	A GraphQL query looks a bit like JSON, except without quotes or commas. Here is an example
	GraphQL query for a fictitious GraphQL API.

	```
	{
	allCharacters {
	name
	}
	}
	```

	If we had a server that contained Star Wars trivia, the response from the server might look like:

	```
	{ "data": {
	"allCharacters": [
	{
	"name":"Luke"
	},
	{
	"name": "Darth"
	},
	...
	]
	}
	}
	```

	If we changed the query to:

	```
	{
	allCharacters {
	name
	mass
	}
	}
	```

	the response would now look like:

	```
	{ "data": {
	"allCharacters": [
	{
	"name":"Luke",
	"mass": 80
	},
	{
	"name": "Darth",
	"mass": 140
	},
	...
	]
	}
	}
	```

	## How do I know what is in the GraphQL endpoint?

	The GraphQL schema describes the data model(s) contained in the GraphQL endpoint. GraphQL is
	strongly typed, has the concept of relationships between data types, and is self-documenting.
	In fact, one can use the GraphQL endpoint to discover what is in the endpoint. I will not go
	into the details right now, but this introspection capability makes possible some powerful
	tooling. One of the most ubiguitous is the so-called *GraphiQL* (note the i in the name)
	tool.

	## Exercise 1

	Navigate to [GraphiQL] and follow along with the video (no sound).[^urlchange]

	[^urlchange]: The urls in this document are subject to change.

	<iframe width="560" height="315" src="https://www.youtube.com/embed/1Zg_Fbt56kc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

	[GraphiQL]: http://graphql-omicidx.cancerdatasci.org/graphiql

	# Querying OmicIDX programmatically

	GraphQL is quite easy to work with programmatically. All queries are made via a post request to the GraphQL endpoint. The POST request needs to be JSON encoded and must include the `"query"` key. A simple example of
	the JSON for a basic query might look like:

	```
	{ "query": "{ heros { name weight } }"}
	```

	Note that the query string (ie., `"{ heros { name weight } }"`) is just a string. It is not formated as JSON itself.

	- Current OmicIDX GraphQL endpoint[^urlchange]: http://graphql-omicidx.cancerdatasci.org/graphql

	For example, let us get the first 500 SRA studies (500 is a limit to the number of results that we can retrieve in one go.
	The GraphQL might look like this:

	```
	{
	allSraStudies {
	edges {
	node {
	accession
	title
	abstract
	}
	}
	}
	}
	```

	From exercise 1, you know that you can copy this query into the GraphiQL browser and get results. How about using [curl](https://curl.haxx.se/)? Note that I am doing some gymnastics below to make this work in one
	command line. In practice, one would probably save the GraphQL query as a file (in json format) and then
	post that file[^curlnote].

	[^curlnote]: *For examples of how to post JSON using curl, see [this stackoverflow post](https://stackoverflow.com/questions/7172784/how-to-post-json-data-with-curl-from-terminal-commandline-to-test-spring-rest?rq=1)

	```{bash}
	curl --silent \
	-X POST \
	-H "Content-Type: application/json" \
	http://graphql-omicidx.cancerdatasci.org/graphql \
	--data @- << EOF

	{ "query":"{
	allSraStudies(first: 1) {
	edges {
	node {
	accession
	title
	abstract
	}
	}
	}
	}"
	}
	EOF
	```

	Note that the return value is simply JSON. All our normal tools for working with JSON are available to us
	to process and manipulate the results.

	# Using OmicIDX from R

	As a review, we know that:

	- GraphQL queries look like JSON, but are not quite the same.
	- We can use the [GraphiQL] graphical tool to help us write our query, including autocompleting fields, etc.
	- GraphQL queries are `POST`ed to the graphql endpoint and results are returned as JSON.

	In order to query from R, then, we need to:

	1. Define our query.
	2. `POST` the query, encoded as a simple JSON data structure.
	3. Deal with the resulting JSON data that comes back as a result of the post in #2.

	## Defining the query

	Again, the easiest way to define the query is to experiment witht he [GraphiQL] query tool. Once a query works as expected, the query can be reused from R.

	In this case, we are going to continue with the query we worked with above:

	```
	{
	allSraStudies(first: 1) {
	edges {
	node {
	accession
	title
	abstract
	}
	}
	}
	}
	```


	## Performing the query

	The `CRANpkg('httr')` package performs http requests, including `POST` requests.

	With our query defined, we need to prepare the POST body which will be sent to the server as JSON. In R,
	basicc lists are the information equivalent of JSON objects.

	```{r}
	post_body = list(query = "
	{
	allSraStudies(first: 20) {
	edges {
	node {
	accession
	title
	sraExperimentsByStudyAccession {
	totalCount
	}
	}
	}
	}
	}
	")
	```

	The URL for the OmicIDX endpoint is: `http://graphql-omicidx.cancerdatasci.org/graphql`.[^urlchange]

	```{r}
	endpoint = "http://graphql-omicidx.cancerdatasci.org/graphql"
	```

	I am going to lead the reader a bit here and jump to a "handler" function that will
	convert the returned JSON into convenient R data structures.

	```{r}
	handler = function(response) {
	jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)
	}
	```

	Finally, we are ready to perform our query.

	```{r}
	resp = httr::POST(endpoint, body = post_body, encode = 'json')
	resp
	```

	And finally, convert the response to convenient R data structure.

	```{r}
	result = handler(resp)
	knitr::kable(result$data[[1]]$edges)
	```

	We can change our query a bit to get some more details.

	```{r}
	study_experiment_counts_query = "
	{
	allSraStudies(first: 10) {
	edges {
	node {
	accession
	title
	sraExperimentsByStudyAccession {
	totalCount
	}
	}
	}
	}
	}
	"
	```

	And streamline our code a bit as a function.

	```{r}
	# Borrow "handler" function from above
	# Borrow "endpoint" URL from above
	graphql_query = function(gql) {
	.handler = function(response) {
	jsonlite::fromJSON(httr::content(response, as='text'),flatten = TRUE)$data[[1]]$edges
	}
	post_body = list(query = gql)
	resp = httr::POST(endpoint, body = post_body, encode = 'json')
	return(.handler(resp))
	}
	```

	```{r}
	knitr::kable(graphql_query(study_experiment_counts_query))
	```

	One might be interested in filtering the results, also. For example, to search study titles for "cancer"
	(case insensitive) we can add a filter to the query.

	```{r}
	cancer_studies_query = '
	{
	allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
	edges {
	node {
	accession
	title
	}
	}
	}
	}
	'
	knitr::kable(graphql_query(cancer_studies_query))
	```

	To check the total count of available studies with "cancer" in the title, we can again write a query for that. Note that this query

	```{r}
	cancer_studies_query = '
	{
	allSraStudies(first: 10 filter: {title: {includesInsensitive: "cancer"}}) {
	totalCount
	}
	}
	'
	```

	A keen eye will notice that there is not an `edges` component in this query. Our `handler` above returns
	the `edges` component, though. So, we are going to drop back to basic httr code here, again.

	```{r}
	httr::content(
	httr::POST(endpoint, body = list(query=cancer_studies_query), encode='json'),
	as="parsed"
	)
	```

	## Related objects

	One of the unique aspects of GraphQL is the ability to treat data as a graph, with data entities related
	to each other logically linked. For example, we can fetch SRA experiments and link them each to the study
	under which they were performed.

	```{r}
	experiments_with_study = '
	{
	allSraExperiments(first: 5) {
	edges{
	node{
	accession
	title
	libraryStrategy
	sraStudyByStudyAccession{
	title
	accession
	}
	}
	}
	}
	}
	'
	```

	The graphql query from R still works the same way, so we execute as before.

	```{r}
	result = graphql_query(experiments_with_study)
	knitr::kable(result)
	```

	## Fetching all results

	TODO

	- Describe cursors

	## Using variables

	TODO

	- Describe adding variables to POST body


	# Conclusion

	TODO
No results found