#!/usr/bin/env python
"""
Here is a simple example of how you can
use a generator to fetch a row from a database
"""

import MySQLdb


def subscriber_generator():
    """
    Simple generator example to return a row
    from a MySQL database
    """
    # For simplicity, we're connecting locally
    db = MySQLdb.connect(user='root', db='your_db_name')
    # Get a cursor
    cursor = db.cursor()
    # Execute your MySQL query
    cursor.execute(
        'SELECT id,first_name FROM subscribers_subscribers '
        + 'WHERE id > %s AND id < %s '
        + 'ORDER BY first_name ASC', (1234, 1334))
    # Iterate through fetchall(), but yield the row
    for row in cursor.fetchall():
        yield row


def main():
    """
    Main
    """
    # You can use the next() method to get a row
    print(subscriber_generator().next())
    # Or you can iterate through all with a for loop
    for row in subscriber_generator():
        print(row)


if '__main__' == __name__:
    main()
This is really helpful. Thank you
I'm glad you found it useful! 🙌
Hi @drincruz , very good code, a nice contribution to the public code community.
I'm learning about generators and their efficient use of memory, and I have a doubt:
About line 32: Doesn't .fetchall()
complete the SQL query and load the entire result into a list?
If so, isn't that the same as just using a list?
@odravison thank you! 🙏 it's a bit old, but i'm glad it's useful.
About line 32: Doesn't .fetchall() conclude the SQL query and get all the result to a list?
If so, Isn't the same as using a list at all?
This is a very good question! 💯 I think the important part of generators is that they are lazily evaluated; meaning if we didn't have lines 44:45, the generator body wouldn't run at all. A better-versed Python expert can probably explain it more precisely than me, but here's a simple example I'll try to explain it with below.
Consider this piece of code, yield_data.py here:
yield_data.py
import pymysql.cursors
from cProfile import Profile


def connect():
    # Connect to the database
    connection = pymysql.connect(
        host='127.0.0.1',
        user='root',
        password='secret',
        port=3306,
        database='mydb'
    )
    return connection


def get_cursor(connection):
    # Create a cursor object
    cursor = connection.cursor()
    return cursor


def yield_all_rows(connection):
    cursor = connection.cursor()
    query = "SELECT * FROM my_table"
    cursor.execute(query)
    for row in cursor.fetchall():
        yield row
    cursor.close()


def commit(connection):
    # Commit the changes
    connection.commit()


def close(cursor, connection):
    # Close the cursor and connection
    cursor.close()
    connection.close()


def main():
    connection = connect()
    p = Profile()
    p.enable()
    yield_all_rows(connection)
    p.disable()
    p.print_stats(sort='ncalls')
    connection.close()


if __name__ == '__main__':
    main()
Since we're using a generator here, we're lazy loading the data (on demand), so very little work actually happens. We are calling yield_all_rows, but we're not consuming any of the data it would produce. So, our profiler gives us the following:
python src/yield_data.py
1 function calls in 0.000 seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
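To see the same lazy behavior without a database, here's a tiny standalone sketch (the function name noisy_rows is made up for illustration; the print stands in for cursor.execute()):

```python
def noisy_rows():
    print("query executed")  # stands in for the expensive DB call
    for i in range(3):
        yield i

gen = noisy_rows()   # nothing printed yet: the body has not started running
rows = list(gen)     # now "query executed" prints and the rows are produced
print(rows)          # [0, 1, 2]
```

Calling a generator function only builds a generator object; the body runs only once something iterates it, which is why the profile above is essentially empty.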
But if we update the call to actually print out the values, we see the query calls go through. Consider this updated code for yield_data.py:
def main():
    connection = connect()
    p = Profile()
    p.enable()
    for row in yield_all_rows(connection):
        print(row)
    p.disable()
    p.print_stats(sort='ncalls')
    connection.close()
If we run the program now with the profiler, we'll see the following:
yield_data.py profiler results
$ python src/yield_data.py
760569 function calls in 0.203 seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
100046 0.006 0.000 0.006 0.000 {built-in method builtins.len}
80017 0.008 0.000 0.008 0.000 {method 'append' of 'list' objects}
60025 0.010 0.000 0.016 0.000 protocol.py:147(read_length_encoded_integer)
60025 0.006 0.000 0.006 0.000 protocol.py:114(read_uint8)
60024 0.017 0.000 0.051 0.000 protocol.py:165(read_length_coded_string)
60024 0.015 0.000 0.018 0.000 protocol.py:62(read)
60018 0.008 0.000 0.008 0.000 {method 'decode' of 'bytes' objects}
40017 0.016 0.000 0.016 0.000 {method 'settimeout' of '_socket.socket' objects}
40016 0.007 0.000 0.019 0.000 {method 'read' of '_io.BufferedReader' objects}
40016 0.018 0.000 0.055 0.000 connections.py:775(_read_bytes)
20008 0.003 0.000 0.003 0.000 {built-in method _struct.unpack}
20008 0.002 0.000 0.002 0.000 protocol.py:55(__init__)
20008 0.002 0.000 0.002 0.000 protocol.py:208(is_error_packet)
20008 0.025 0.000 0.087 0.000 connections.py:730(_read_packet)
20005 0.002 0.000 0.002 0.000 protocol.py:187(is_eof_packet)
20003 0.003 0.000 0.005 0.000 connections.py:1268(_check_packet_is_eof)
20003 0.001 0.000 0.203 0.000 yield_data.py:21(yield_all_rows)
20002 0.031 0.000 0.096 0.000 connections.py:1340(_read_row_from_packet)
46 0.000 0.000 0.000 0.000 socket.py:732(readable)
46 0.000 0.000 0.012 0.000 socket.py:693(readinto)
46 0.012 0.000 0.012 0.000 {method 'recv_into' of '_socket.socket' objects}
46 0.000 0.000 0.000 0.000 {method '_checkClosed' of '_io._IOBase' objects}
46 0.000 0.000 0.000 0.000 {method '_checkReadable' of '_io._IOBase' objects}
6 0.000 0.000 0.000 0.000 protocol.py:271(get_column_length)
5 0.000 0.000 0.000 0.000 cursors.py:65(_get_db)
5 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {method 'unpack_from' of '_struct.Struct' objects}
4 0.000 0.000 0.000 0.000 protocol.py:177(read_struct)
3 0.000 0.000 0.000 0.000 protocol.py:234(__init__)
3 0.000 0.000 0.000 0.000 protocol.py:259(description)
3 0.000 0.000 0.000 0.000 protocol.py:238(_parse_field_descriptor)
2 0.000 0.000 0.000 0.000 cursors.py:83(_nextset)
2 0.000 0.000 0.000 0.000 cursors.py:97(nextset)
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.203 0.203 connections.py:810(_read_query_result)
1 0.000 0.000 0.203 0.203 connections.py:1198(read)
1 0.000 0.000 0.000 0.000 connections.py:539(cursor)
1 0.000 0.000 0.203 0.203 connections.py:552(query)
1 0.000 0.000 0.000 0.000 connections.py:1178(__init__)
1 0.000 0.000 0.000 0.000 connections.py:800(_write_bytes)
1 0.013 0.013 0.191 0.191 connections.py:1327(_read_rowdata_packet)
1 0.000 0.000 0.191 0.191 connections.py:1281(_read_result_packet)
1 0.000 0.000 0.000 0.000 {built-in method _struct.pack}
1 0.000 0.000 0.000 0.000 {method 'sendall' of '_socket.socket' objects}
1 0.000 0.000 0.000 0.000 cursors.py:294(fetchall)
1 0.000 0.000 0.203 0.203 cursors.py:133(execute)
1 0.000 0.000 0.000 0.000 cursors.py:336(_do_get_result)
1 0.000 0.000 0.000 0.000 cursors.py:34(__init__)
1 0.000 0.000 0.000 0.000 cursors.py:45(close)
1 0.000 0.000 0.000 0.000 cursors.py:110(mogrify)
1 0.000 0.000 0.000 0.000 protocol.py:183(is_ok_packet)
1 0.000 0.000 0.203 0.203 cursors.py:319(_query)
1 0.000 0.000 0.000 0.000 protocol.py:323(__init__)
1 0.000 0.000 0.000 0.000 cursors.py:326(_clear_result)
1 0.000 0.000 0.000 0.000 cursors.py:70(_check_executed)
1 0.000 0.000 0.000 0.000 protocol.py:205(is_load_local_packet)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 connections.py:834(_execute_command)
1 0.000 0.000 0.000 0.000 connections.py:1359(_get_descriptions)
To get back to your point: this last snippet will produce a similar set of calls as if we were just storing the data in a list. So I guess it really depends on whether you need the data held in memory all at once or produced on demand.
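One caveat worth adding: with fetchall() the driver still buffers every row on the client before the generator yields anything, so the generator only saves memory downstream of the cursor. This rough sketch compares just that downstream part, holding values in a list versus producing them on demand (sizes are CPython-specific and approximate):

```python
import sys

n = 1_000_000

# Materialize everything up front, like consuming fetchall() into a list:
as_list = [i for i in range(n)]

# Produce values on demand, like iterating a generator:
as_gen = (i for i in range(n))

# The list object itself is several MB; the generator is a small,
# constant-size object regardless of n.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))
```

sys.getsizeof only measures the container object, not the integers it references, but the gap between the two is still orders of magnitude.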
For Python 3.12 I used:
print(next(subscriber_generator()))
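Right: in Python 3 the .next() method was removed from generators, so the gist's subscriber_generator().next() raises AttributeError there. The built-in next() (or the __next__() method it calls) is the way to advance a generator one step:

```python
def gen():
    yield "first"
    yield "second"

g = gen()
print(next(g))       # first
print(g.__next__())  # second
# g.next() would raise AttributeError on Python 3
```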
I think it's not the best use of memory resources. How about replacing this
for row in cursor.fetchall():
    yield row
with this
batch_size = 1000  # your size
while True:
    rows = cursor.fetchmany(batch_size)
    if not rows:
        break
    yield from rows  # yield rows one at a time, matching the original
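This batching idea can be tried without MySQL using a stand-in cursor (FakeCursor and row_generator below are made-up names; FakeCursor mimics only the DB-API fetchmany() behavior):

```python
class FakeCursor:
    """Minimal stand-in for a DB-API cursor's fetchmany()."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchmany(self, size):
        # Return the next `size` rows, or an empty list when exhausted
        batch = self._rows[self._pos:self._pos + size]
        self._pos += size
        return batch


def row_generator(cursor, batch_size=1000):
    # At most batch_size rows are buffered in Python at a time,
    # but callers still receive one row per iteration.
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows


cursor = FakeCursor(range(10))
print(list(row_generator(cursor, batch_size=4)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that with the default buffered cursor in MySQLdb/PyMySQL the full result set is already on the client after execute(), so fetchmany() mainly bounds Python-side row construction; an unbuffered (server-side) cursor is needed to avoid fetching everything from the server up front.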
This is really helpful. Thank you