# Initialize the scroll
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',
    size=1000,
    body={
        # Your query's body
    })
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# Start scrolling
while (scroll_size > 0):
    print "Scrolling..."
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results that we returned in the last scroll
    scroll_size = len(page['hits']['hits'])
    print "scroll size: " + str(scroll_size)
    # Do something with the obtained page
Very helpful! Thanks!
+1 Works like a charm. Thank you!
+100 Thank you!
In 6.x there is no need to call es.clear_scroll; the default is clear_scroll=True.
I have something like this in my Python program:

response = requests.get('https://example.com/_search?q=@version:2&scroll=2m')
data = json.loads(response.text)
sid = data['_scroll_id']
scroll_size = data['hits']['total']
# Until this line it works as expected

while scroll_size > 0:
    data = {"scroll": "2m", "scroll_id": sid}
    response = requests.post('https://example.com/_search/scroll', data=data)
    data = json.loads(response.text)
    # do something
    sid = data['_scroll_id']
# But here, in the requests.post() line, I get the error: {'message': 'Not Found', 'code': 404}
Any thoughts on what I am doing wrong and what should be changed? Thanks!
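For reference, a minimal sketch of the same loop sending the scroll continuation as an explicit JSON body (the example.com endpoint and the query string are just the ones from the question above, not a real cluster):

import requests

# Initial search, keeping the scroll context open for 2 minutes
response = requests.get('https://example.com/_search?q=@version:2&scroll=2m')
data = response.json()
sid = data['_scroll_id']

while len(data['hits']['hits']):
    # do something with data['hits']['hits'], including the first page

    # Send the scroll id as a JSON body; requests sets the
    # application/json Content-Type header when json= is used
    response = requests.post('https://example.com/_search/scroll',
                             json={'scroll': '2m', 'scroll_id': sid})
    data = response.json()
    sid = data['_scroll_id']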
Working with a Python generator
Works with ES >= 5.
1. Utility generator method:
import math

# assumes an Elasticsearch client instance named es is available in scope
def scroll(index, doc_type, query_body, page_size=100, debug=False, scroll='2m'):
    page = es.search(index=index, doc_type=doc_type, scroll=scroll, size=page_size, body=query_body)
    sid = page['_scroll_id']
    scroll_size = page['hits']['total']
    total_pages = math.ceil(scroll_size / page_size)
    page_counter = 0
    if debug:
        print('Total items : {}'.format(scroll_size))
        print('Total pages : {}'.format(math.ceil(scroll_size / page_size)))
    # Start scrolling
    while scroll_size > 0:
        # Get the number of results that we returned in the last scroll
        scroll_size = len(page['hits']['hits'])
        if scroll_size > 0:
            if debug:
                print('> Scrolling page {} : {} items'.format(page_counter, scroll_size))
            yield total_pages, page_counter, scroll_size, page
        # Get the next page
        page = es.scroll(scroll_id=sid, scroll=scroll)
        page_counter += 1
        # Update the scroll ID
        sid = page['_scroll_id']
Usage:

index = 'cases_*'
doc_type = 'detail'
query = { "query": { "match_all": {} }, "_source": ['caseId'] }
page_size = 1000

for total_pages, page_counter, page_items, page_data in scroll(index, doc_type, query, page_size=page_size):
    print('total_pages={}, page_counter={}, page_items={}'.format(total_pages, page_counter, page_items))
    # do what you need with page_data
+1 Saved me a lot of time. Thanks!
Thank you @evrycollin +1
For my project requirements, I need to fetch more than 10k documents. I used the Elasticsearch scroll API with Python to do that. Here is my sample code:
url = 'http://hostname:portname/_search/scroll'
scroll_url = 'http://hostname:portname/_search?scroll=2m'

query = {"query": {"bool": {"must": [{"match_all": {}}, {"range": {"@timestamp": {"gt": "now-24h", "lt": "now-1h", "time_zone": "-06:00"}}}], "must_not": [], "should": []}}, "from": 0, "size": 10, "sort": [], "aggs": {}}

response = requests.post(scroll_url, json=query).json()
sid = response['_scroll_id']
hits = response['hits']
total = hits["total"]

while total > 0:
    scroll = '2m'
    scroll_query = json.dumps({"scroll": scroll, "scroll_id": sid})
    response1 = requests.post(url, data=scroll_query).json()
    sid = response1['_scroll_id']
    hits = response1['hits']
    total = len(response1['hits']['hits'])
    for each in hits['hits']:
        ...  # process each hit
Scroll worked perfectly, the way I wanted it to, but later I was informed that because of this scroll the Elasticsearch schema got corrupted and the indexes were recreated.
Is it true that scroll modifies the ES structure, or is something wrong with my code? Please let me know.
+1 helpful
Thank you so much!
@muelli Thanks, you are brilliant, hope more people check out this API!
ES 6.3. This example makes my Elasticsearch service crash when trying to scroll 110k documents with size=10000, somewhere between the 5th and 7th iteration.
systemctl status elasticsearch
elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2018-08-08 20:58:10 EEST; 21s ago
Docs: http://www.elastic.co
Process: 5860 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=127)
Main PID: 5860 (code=exited, status=127)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:112)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.cli.Command.main(Command.java:90)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Aug 08 20:57:18 myhost elasticsearch[5860]: at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:85)
Aug 08 20:57:18 myhost elasticsearch[5860]: 2018-08-08 20:57:18,490 main ERROR Null object returned for RollingFile in Appenders.
Aug 08 20:57:18 myhost elasticsearch[5860]: 2018-08-08 20:57:18,491 main ERROR Unable to locate appender "rolling" for logger config "root"
Aug 08 20:58:10 myhost systemd[1]: elasticsearch.service: Main process exited, code=exited, status=127/n/a
Aug 08 20:58:10 myhost systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
No logs in /var/log/elasticsearch/elasticsearch.log
Thanks for making a simple example, very useful.
For others who use this example, keep in mind that the initial es.search not only returns the first scroll_id that you'll use for scrolling, but also contains hits that you'll want to process before initiating your first scroll. For most people this is probably obvious, but for the 'challenged' (like me), be sure to do something like:

page = es.search( ..... })
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# before you scroll, process your current batch of hits
for hit in page['hits']['hits']:
    do_stuff

# Start scrolling
while (scroll_size > 0)
    ...
Excellent! Important point to keep in mind 👍
This is extremely slow for me. I used elasticsearch.helpers.scan instead and not only did it not crash my server, but it was much faster.
@sibblegp please see: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/breaking_50_search_changes.html#_literal_search_type_scan_literal_removed
Scroll requests sorted by _doc have been optimized to more efficiently resume from where the previous request stopped, so this will have the same performance characteristics as the former scan search type.
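For reference, a minimal sketch of what that can look like with the Python client (the index name and localhost URL are assumptions, not from the gist):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # assumed local cluster

# Sorting by _doc skips scoring and is the post-5.x replacement for
# search_type='scan' when scrolling through large result sets.
page = es.search(
    index='yourIndex',   # hypothetical index name
    scroll='2m',
    size=1000,
    body={
        'sort': ['_doc'],
        'query': {'match_all': {}}
    })
sid = page['_scroll_id']

while len(page['hits']['hits']):
    # process page['hits']['hits'] here, including the first batch
    page = es.scroll(scroll_id=sid, scroll='2m')
    sid = page['_scroll_id']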
Thanks a lot!!
nice
Many thanks! Very handy!
Warning: this code has a bug, it will throw out the first search result (i.e. the first 1000 items). A co-worker of mine copy-pasted this, causing us to waste a few hours.
This code doesn't work anymore in ES 6.4. I found another solution here: https://stackoverflow.com/questions/28537547/how-to-correctly-check-for-scroll-end
response = es.search(
    index='index_name',
    body=<your query here>,
    scroll='10m'
)
scroll_id = response['_scroll_id']

while len(response['hits']['hits']):
    # process results
    print([item["_id"] for item in response["hits"]["hits"]])
    response = es.scroll(scroll_id=scroll_id, scroll='10m')
Process the result right at the beginning of the while loop to avoid missing the first search result.
The scroll id can change: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn’t always change — in any case, only the most recently received _scroll_id should be used.
Here is a simplified version that will work if the scroll id changes:
response = es.search(
    index='index_name',
    body=<your query here>,
    scroll='10m'
)

while len(response['hits']['hits']):
    # process results
    print([item["_id"] for item in response["hits"]["hits"]])
    response = es.scroll(scroll_id=response['_scroll_id'], scroll='10m')
Thanks @feydan !
Great thanks @feydan !
Is there any such method in Ruby?
Thanks
For a huge query you can use:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')

# return a generator
response = helpers.scan(es,
                        index='yourIndex',
                        scroll='10m',
                        size=1000,
                        query={
                            # query body
                        })

# iterate documents one by one
for row in response:
    print(row['_source'])
Tested with Elasticsearch 7.8 and Python 3.9 on a query hitting ~500k documents.
Hi eavilesmejia,
I've used your code but I am getting empty results when writing out with Python's open().
I'm basically trying to extract a portion of the _source and write it to a text file.
The paths and configuration of the log file are all set up; when I use es.search as opposed to helpers.scan, the code works and writes to my text file fine. But because of the 10k limit issue, I'm looking at helpers.scan.

g = open(LOG, 'a+')
for row in response:
    g.write('$')
    g.write(row['_source']['messagetype'])
    g.write('$')
    g.write('\n')
g.close

The code snippet above writes nothing to my text file.
Do you mind testing writing to a logfile on your setup and sharing your code?
I have tested it with code very similar to this:

import csv
from elasticsearch import Elasticsearch, helpers

def main():
    with open("/tmp/yesterday-all-events.csv", "w") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=[], extrasaction='ignore')
        for i, row in enumerate(get_scrolled_query()):
            if i == 0:
                # build the CSV header from the first document's fields
                writer.fieldnames = list(filter(lambda x: x.startswith('ca.'), row['_source'].keys()))
                writer.writeheader()
            writer.writerow(row["_source"])

def get_scrolled_query():
    es = Elasticsearch('http://localhost:9200')
    return helpers.scan(es,
                        index='my-index', scroll='40m',
                        size=8000,
                        query={
                            "query": {
                                "range": {
                                    "@timestamp": {
                                        "gte": "now-1d/d",
                                        "lt": "now/d"
                                    }
                                }
                            }
                        })

if __name__ == '__main__':
    main()
In my case I am getting all of yesterday's events and writing the result into a CSV file using the DictWriter class. I wanted to filter the fields that start with ca. to be used as the CSV header, for example the fields ca.version, ca.date_time and more that are on my index.
The 10k limit is handled by helpers.scan by doing scroll requests based on size (in this case 8000) until there is no more data to return, and finally the scroll is cleared by default at the end of the scan process (that's the reason I don't mind using '40m' of TTL).
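As a side note, here is a minimal sketch of spelling out that default explicitly (clear_scroll is the parameter name in the elasticsearch-py helpers I am assuming here; check your client version):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch('http://localhost:9200')

# clear_scroll=True is already the default: the scroll context is
# released as soon as the scan generator is exhausted, so a long TTL
# like '40m' only matters if the consumer stalls between pages.
rows = helpers.scan(es,
                    index='my-index',
                    scroll='40m',
                    size=8000,
                    clear_scroll=True,
                    query={"query": {"match_all": {}}})

for row in rows:
    pass  # process row['_source'] here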
You should also clear the scroll when done to free memory in Elasticsearch. Otherwise it will keep the memory until the scroll timeout.
E.g.
es.clear_scroll(body={'scroll_id': [sid]}, ignore=(404, ))