#!/usr/bin/env python
"""
Here is a simple example of how you can
use a generator to fetch a row from a database
"""

import MySQLdb


def subscriber_generator():
    """
    Simple generator example to return a row
    from a MySQL database
    """
    # For simplicity, we're connecting locally
    db = MySQLdb.connect(user='root', db='your_db_name')
    # Get a cursor
    cursor = db.cursor()
    # Execute your MySQL query
    cursor.execute(
        'SELECT id,first_name FROM subscribers_subscribers '
        + 'WHERE id > %s AND id < %s '
        + 'ORDER BY first_name ASC', (1234, 1334))
    # Iterate through fetchall(), but yield the row
    for row in cursor.fetchall():
        yield row


def main():
    """
    Main
    """
    # You can use the next() method to get a row
    print(subscriber_generator().next())
    # Or you can iterate through all with a for loop
    for row in subscriber_generator():
        print(row)


if '__main__' == __name__:
    main()
This is really helpful. Thank you
I'm glad you found it useful! 🙌
Hi @drincruz , very good code, a nice contribution to the public code community.
I'm learning about generators and their efficient use of memory, and I have a doubt:
About line 32: Doesn't .fetchall()
complete the SQL query and load the entire result into a list?
If so, isn't that the same as just using a list?
@odravison thank you! 🙏 it's a bit old, but i'm glad it's useful.
About line 32: Doesn't .fetchall() conclude the SQL query and get all the result to a list?
If so, Isn't the same as using a list at all?
This is a very good question! 💯 I think the important part of generators is that they are lazily evaluated; meaning if we didn't have lines 44:45, the generator body wouldn't run at all. A better-versed Python expert can probably explain it more precisely than me, but here's a simple example I'll try to explain it with below.
Consider this piece of code, yield_data.py here:
yield_data.py
import pymysql.cursors
from cProfile import Profile


def connect():
    # Connect to the database
    connection = pymysql.connect(
        host='127.0.0.1',
        user='root',
        password='secret',
        port=3306,
        database='mydb'
    )
    return connection


def get_cursor(connection):
    # Create a cursor object
    cursor = connection.cursor()
    return cursor


def yield_all_rows(connection):
    cursor = connection.cursor()
    query = "SELECT * FROM my_table"
    cursor.execute(query)
    for row in cursor.fetchall():
        yield row
    cursor.close()


def commit(connection):
    # Commit the changes
    connection.commit()


def close(cursor, connection):
    # Close the cursor and connection
    cursor.close()
    connection.close()


def main():
    connection = connect()
    p = Profile()
    p.enable()
    yield_all_rows(connection)
    p.disable()
    p.print_stats(sort='ncalls')
    connection.close()


if __name__ == '__main__':
    main()
Since we're using a generator here, we're lazy loading the data (on demand), so very little work actually happens. We are calling yield_all_rows, but we're not consuming any of the data it would produce. So, our profiler gives us the following:
python src/yield_data.py
1 function calls in 0.000 seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
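To see the same lazy behavior without a database, here's a tiny standalone sketch (the function name noisy_rows is made up for illustration; the print stands in for cursor.execute()):

```python
def noisy_rows():
    print("query executed")  # stands in for the expensive DB call
    for i in range(3):
        yield i

gen = noisy_rows()   # nothing printed yet: the body has not started running
rows = list(gen)     # now "query executed" prints and the rows are produced
print(rows)          # [0, 1, 2]
```

Calling a generator function only builds a generator object; the body runs only once something iterates it, which is why the profile above is essentially empty.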
But if we update the call to actually print out the values, we see the query calls go through. Consider this updated code for yield_data.py:
def main():
    connection = connect()
    p = Profile()
    p.enable()
    for row in yield_all_rows(connection):
        print(row)
    p.disable()
    p.print_stats(sort='ncalls')
    connection.close()
If we run the program now with the profiler, we'll see the following:
yield_data.py profiler results
$ python src/yield_data.py
760569 function calls in 0.203 seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
100046 0.006 0.000 0.006 0.000 {built-in method builtins.len}
80017 0.008 0.000 0.008 0.000 {method 'append' of 'list' objects}
60025 0.010 0.000 0.016 0.000 protocol.py:147(read_length_encoded_integer)
60025 0.006 0.000 0.006 0.000 protocol.py:114(read_uint8)
60024 0.017 0.000 0.051 0.000 protocol.py:165(read_length_coded_string)
60024 0.015 0.000 0.018 0.000 protocol.py:62(read)
60018 0.008 0.000 0.008 0.000 {method 'decode' of 'bytes' objects}
40017 0.016 0.000 0.016 0.000 {method 'settimeout' of '_socket.socket' objects}
40016 0.007 0.000 0.019 0.000 {method 'read' of '_io.BufferedReader' objects}
40016 0.018 0.000 0.055 0.000 connections.py:775(_read_bytes)
20008 0.003 0.000 0.003 0.000 {built-in method _struct.unpack}
20008 0.002 0.000 0.002 0.000 protocol.py:55(__init__)
20008 0.002 0.000 0.002 0.000 protocol.py:208(is_error_packet)
20008 0.025 0.000 0.087 0.000 connections.py:730(_read_packet)
20005 0.002 0.000 0.002 0.000 protocol.py:187(is_eof_packet)
20003 0.003 0.000 0.005 0.000 connections.py:1268(_check_packet_is_eof)
20003 0.001 0.000 0.203 0.000 yield_data.py:21(yield_all_rows)
20002 0.031 0.000 0.096 0.000 connections.py:1340(_read_row_from_packet)
46 0.000 0.000 0.000 0.000 socket.py:732(readable)
46 0.000 0.000 0.012 0.000 socket.py:693(readinto)
46 0.012 0.000 0.012 0.000 {method 'recv_into' of '_socket.socket' objects}
46 0.000 0.000 0.000 0.000 {method '_checkClosed' of '_io._IOBase' objects}
46 0.000 0.000 0.000 0.000 {method '_checkReadable' of '_io._IOBase' objects}
6 0.000 0.000 0.000 0.000 protocol.py:271(get_column_length)
5 0.000 0.000 0.000 0.000 cursors.py:65(_get_db)
5 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {method 'unpack_from' of '_struct.Struct' objects}
4 0.000 0.000 0.000 0.000 protocol.py:177(read_struct)
3 0.000 0.000 0.000 0.000 protocol.py:234(__init__)
3 0.000 0.000 0.000 0.000 protocol.py:259(description)
3 0.000 0.000 0.000 0.000 protocol.py:238(_parse_field_descriptor)
2 0.000 0.000 0.000 0.000 cursors.py:83(_nextset)
2 0.000 0.000 0.000 0.000 cursors.py:97(nextset)
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.203 0.203 connections.py:810(_read_query_result)
1 0.000 0.000 0.203 0.203 connections.py:1198(read)
1 0.000 0.000 0.000 0.000 connections.py:539(cursor)
1 0.000 0.000 0.203 0.203 connections.py:552(query)
1 0.000 0.000 0.000 0.000 connections.py:1178(__init__)
1 0.000 0.000 0.000 0.000 connections.py:800(_write_bytes)
1 0.013 0.013 0.191 0.191 connections.py:1327(_read_rowdata_packet)
1 0.000 0.000 0.191 0.191 connections.py:1281(_read_result_packet)
1 0.000 0.000 0.000 0.000 {built-in method _struct.pack}
1 0.000 0.000 0.000 0.000 {method 'sendall' of '_socket.socket' objects}
1 0.000 0.000 0.000 0.000 cursors.py:294(fetchall)
1 0.000 0.000 0.203 0.203 cursors.py:133(execute)
1 0.000 0.000 0.000 0.000 cursors.py:336(_do_get_result)
1 0.000 0.000 0.000 0.000 cursors.py:34(__init__)
1 0.000 0.000 0.000 0.000 cursors.py:45(close)
1 0.000 0.000 0.000 0.000 cursors.py:110(mogrify)
1 0.000 0.000 0.000 0.000 protocol.py:183(is_ok_packet)
1 0.000 0.000 0.203 0.203 cursors.py:319(_query)
1 0.000 0.000 0.000 0.000 protocol.py:323(__init__)
1 0.000 0.000 0.000 0.000 cursors.py:326(_clear_result)
1 0.000 0.000 0.000 0.000 cursors.py:70(_check_executed)
1 0.000 0.000 0.000 0.000 protocol.py:205(is_load_local_packet)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 connections.py:834(_execute_command)
1 0.000 0.000 0.000 0.000 connections.py:1359(_get_descriptions)
To get back to your point: this last snippet will produce a similar set of calls as if we were just storing the data in a list. So I guess it really depends on whether you need the data held in memory all at once or produced on demand.
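One caveat worth adding: with fetchall() the driver still buffers every row on the client before the generator yields anything, so the generator only saves memory downstream of the cursor. This rough sketch compares just that downstream part, holding values in a list versus producing them on demand (sizes are CPython-specific and approximate):

```python
import sys

n = 1_000_000

# Materialize everything up front, like consuming fetchall() into a list:
as_list = [i for i in range(n)]

# Produce values on demand, like iterating a generator:
as_gen = (i for i in range(n))

# The list object itself is several MB; the generator is a small,
# constant-size object regardless of n.
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_gen))
```

sys.getsizeof only measures the container object, not the integers it references, but the gap between the two is still orders of magnitude.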
For Python 3.12 I used:
print(next(subscriber_generator()))
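Right: in Python 3 the .next() method was removed from generators, so the gist's subscriber_generator().next() raises AttributeError there. The built-in next() (or the __next__() method it calls) is the way to advance a generator one step:

```python
def gen():
    yield "first"
    yield "second"

g = gen()
print(next(g))       # first
print(g.__next__())  # second
# g.next() would raise AttributeError on Python 3
```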
I think it's not the best use of memory resources. How about replacing this
for row in cursor.fetchall():
    yield row
with this
batch_size = 1000  # your size
while True:
    rows = cursor.fetchmany(batch_size)
    if not rows:
        break
    yield from rows  # yield rows one at a time, matching the original
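This batching idea can be tried without MySQL using a stand-in cursor (FakeCursor and row_generator below are made-up names; FakeCursor mimics only the DB-API fetchmany() behavior):

```python
class FakeCursor:
    """Minimal stand-in for a DB-API cursor's fetchmany()."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchmany(self, size):
        # Return the next `size` rows, or an empty list when exhausted
        batch = self._rows[self._pos:self._pos + size]
        self._pos += size
        return batch


def row_generator(cursor, batch_size=1000):
    # At most batch_size rows are buffered in Python at a time,
    # but callers still receive one row per iteration.
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows


cursor = FakeCursor(range(10))
print(list(row_generator(cursor, batch_size=4)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that with the default buffered cursor in MySQLdb/PyMySQL the full result set is already on the client after execute(), so fetchmany() mainly bounds Python-side row construction; an unbuffered (server-side) cursor is needed to avoid fetching everything from the server up front.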
This is really helpful. Thank you