
What Happens When you Import a Python Module? | by Xiaoxu Gao | Aug, 2022



Deep dive into the import system

Photo by Mike van den Bos from Unsplash

Reusability is one of the key measures of code quality: the extent to which code can be used in different programs with minimal change. In Python, we use the import statement to bring in code from a module. But have you ever been curious about how import is implemented behind the scenes? In this article, we will dive deep into Python’s import system. We will also discuss an interesting problem: circular imports. Grab a tea, and let’s get straight to the article.

Module vs. Package

Python code is organized into modules and packages. A module is a single Python file, and a package is a collection of modules. Consider the following example of importing a module:

import random
random.randint(1,10)

random is a module in Python’s standard library. The first line imports the random module and makes it available; the second line calls its randint() function. If you open an IDE and step into the import, you will see that the code lives in the random.py file.

You can also import randint like this:

from random import randint
randint(1,10)

Let’s check out an example from a package:

import pandas
pandas.DataFrame()

At first glance, you can’t really tell whether pandas is a module or a package. But if you step into the import, it takes you to pandas/__init__.py instead of pandas.py. A package contains submodules or, recursively, sub-packages, and __init__.py is the entry point of the package.

The import statement is not the only way to import a module; functions like importlib.import_module() and the built-in __import__() can also be used.

>>> import importlib
>>> importlib.import_module('random')
<module 'random' from '/Users/xiaoxu/.pyenv/versions/3.9.0/lib/python3.9/random.py'>
>>> __import__('random')
<module 'random' from '/Users/xiaoxu/.pyenv/versions/3.9.0/lib/python3.9/random.py'>
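One place where the function form helps is when the module name is only known at run time. The following is a minimal sketch under that assumption; the helper name load_by_name is hypothetical. It also shows a difference between the two functions for dotted names: importlib.import_module() returns the submodule, while __import__() returns the top-level package.

```python
import importlib


def load_by_name(module_name):
    """Import a module given its dotted name as a string (hypothetical helper)."""
    return importlib.import_module(module_name)


# The name could come from configuration or user input.
rand = load_by_name("random")
value = rand.randint(1, 10)
print(value)

# For a dotted name, import_module() returns the submodule itself,
# whereas __import__() returns the top-level package.
sub = importlib.import_module("json.decoder")
top = __import__("json.decoder")
print(sub.__name__)
print(top.__name__)
```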

Package.__init__.py

So what is __init__.py?

A regular Python package contains an __init__.py file. When the package is imported, this __init__.py file is implicitly executed and the objects it defines are bound to names in the package’s namespace. This file can be left empty.

Let’s see an example. I have a folder structure like this. p1 is my package and m1 is a submodule.

folder structure (Created by Xiaoxu Gao)

Inside m1.py, I have a variable DATE that I want to use in main.py. I will create several versions of __init__.py and see how each affects the import in main.py.

# m1.py
DATE = "2022-01-01"

Case 1: empty __init__.py file

Since the __init__.py file is empty, importing p1 does not import any submodule, so p1 doesn’t know that m1 exists. If we import m1 explicitly using from p1 import m1, then everything inside m1.py becomes available. But then we are importing a module rather than the package. As you can imagine, if your package has a lot of submodules, you need to import every one of them explicitly, which can be quite tedious.

# main.py
import p1
p1.m1.DATE
>> AttributeError: module 'p1' has no attribute 'm1'

from p1 import m1
from p1 import m2, m3 ...  # needs to explicitly import every submodule
m1.DATE
>> Works!!
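Case 1 can be reproduced without setting up a project by hand. The sketch below writes a throwaway package into a temporary directory; the package name demo_p1 is hypothetical, chosen so it does not clash with a real p1 on your machine. It checks that the package has no m1 attribute until the submodule is imported explicitly.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: demo_p1/__init__.py (empty) and demo_p1/m1.py.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_p1"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "m1.py").write_text('DATE = "2022-01-01"\n')

# Make the temporary directory importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_p1

# With an empty __init__.py, the submodule has not been imported yet.
has_m1_before = hasattr(demo_p1, "m1")

# Importing the submodule also binds it as an attribute on the package.
import demo_p1.m1

has_m1_after = hasattr(demo_p1, "m1")
print(has_m1_before, has_m1_after, demo_p1.m1.DATE)
```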

Case 2: import submodules in __init__.py file

Instead of leaving it empty, we import everything from m1 in the __init__.py file. Then import p1 in main.py makes the variables in m1.py available on the package, and you can access p1.DATE directly without knowing which module it comes from.

# __init__.py
from .m1 import * # or from p1.m1 import *
from .m2 import *
# main.py
import p1
p1.DATE

You might have noticed the dot before m1. It is a shortcut that tells Python to search in the current package. It’s an example of a relative import. The equivalent absolute import names the current package explicitly, like from p1.m1 import *.

There is a caveat though. If another submodule in the package contains the same variable, the one that is imported later will overwrite the previous one.
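The overwrite caveat is easy to see in a small experiment. This is a sketch with hypothetical names (a temporary package demo_p2 whose two submodules both define DATE), not part of the original example project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package whose two submodules define the same name.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_p2"
pkg.mkdir()
(pkg / "m1.py").write_text('DATE = "2022-01-01"\n')
(pkg / "m2.py").write_text('DATE = "2023-12-31"\n')
(pkg / "__init__.py").write_text("from .m1 import *\nfrom .m2 import *\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_p2

# The star import of m2 runs last, so its DATE wins at package level.
print(demo_p2.DATE)
# The value from m1 is still reachable through the submodule itself.
print(demo_p2.m1.DATE)
```

Accessing the value through the submodule, as in the last line, is one way to stay unambiguous when names collide.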

The advantage of a non-empty __init__.py is that everything in the submodules is already available to clients when they import the package, so the client code looks neater.

How does Python find modules and packages?

The system that finds modules and packages in Python is called the import machinery. It comprises finders, loaders, a cache, and an orchestrator.

Import Machinery (Created by Xiaoxu Gao)

1. Search module in cached sys.modules

Every time you import a module, the first place searched is the sys.modules dictionary. The keys are module names and the values are the module objects themselves. sys.modules acts as a cache: if the module is there, it is returned immediately; otherwise, the import system has to search for it.

Back to the previous example. When we import p1, two entries are added to sys.modules: the package p1 itself (loaded from __init__.py) and the submodule p1.m1 (loaded from m1.py).

import p1
import sys
print(sys.modules)
{
'p1': <module 'p1' from '/xiaoxu/sandbox/p1/__init__.py'>,
'p1.m1': <module 'p1.m1' from '/xiaoxu/sandbox/p1/m1.py'>
...
}

If we import it twice, the second import will read from the cache. But if we deliberately delete the entry from sys.modules dictionary, then the second import will return a new module object.

# read from cache
import p1
import sys
old = p1
import p1
new = p1
assert old is new

# read from system
import p1
import sys
old = p1
del sys.modules['p1']
import p1
new = p1
assert old is not new

2. Search module spec

If the module is not in the sys.modules dictionary, Python asks each meta path finder in sys.meta_path, in order, whether it can locate the module by calling the finder’s find_spec() method.

import sys
print(sys.meta_path)
[ <class '_frozen_importlib.BuiltinImporter'>,
<class '_frozen_importlib.FrozenImporter'>,
<class '_frozen_importlib_external.PathFinder'>]

The BuiltinImporter is used for built-in modules. The FrozenImporter is used to locate frozen modules. The PathFinder searches the import path for modules, relying on the following:

  • sys.path
  • sys.path_hooks
  • sys.path_importer_cache
  • __path__
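To see which finder claims a given module, you can walk sys.meta_path by hand. The sketch below is a simplified illustration rather than the real import algorithm; the helper name first_finder_for is hypothetical. It calls find_spec() on each finder in order and stops at the first one that returns a spec.

```python
import sys


def first_finder_for(name):
    """Return (finder, spec) for the first meta path finder that locates `name`."""
    for finder in sys.meta_path:
        find_spec = getattr(finder, "find_spec", None)
        if find_spec is None:
            continue  # skip finders without a find_spec() method
        spec = find_spec(name, None)
        if spec is not None:
            return finder, spec
    return None, None


for name in ("sys", "json"):
    finder, spec = first_finder_for(name)
    finder_name = getattr(finder, "__name__", type(finder).__name__)
    print(name, "->", finder_name, spec.origin)

# A name that no finder recognizes yields no spec at all.
print(first_finder_for("no_such_module_xyz_123"))
```

Your environment may register extra finders in sys.meta_path, so the list you see can be longer than the three shown above.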

Let’s check out what is in sys.path.

import sys
print(sys.path)
[ '/xiaoxu/sandbox',
'/xiaoxu/.pyenv/versions/3.9.0/lib/python39.zip',
'/xiaoxu/.pyenv/versions/3.9.0/lib/python3.9',
'/xiaoxu/.pyenv/versions/3.9.0/lib/python3.9/lib-dynload',
'/xiaoxu/.local/lib/python3.9/site-packages',
'/xiaoxu/.pyenv/versions/3.9.0/lib/python3.9/site-packages']

PathFinder uses its find_spec() method to locate the module and return its module spec, which ends up as the module’s __spec__ attribute. The spec is an object that holds the module’s metadata. One of its attributes is loader, which tells the import machinery which loader to use when creating the module.

import p1
print(p1.__spec__)
ModuleSpec(name='p1', loader=<_frozen_importlib_external.SourceFileLoader object at 0x1018b6ac0>, origin='/xiaoxu/sandbox/p1/__init__.py', submodule_search_locations=['/xiaoxu/sandbox/p1'])
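A spec can also be looked up without writing an import statement, using importlib.util.find_spec(). The short sketch below inspects the spec of the standard library package json (chosen here only because it is always available) and shows that a name nobody can find produces None.

```python
import importlib.util

spec = importlib.util.find_spec("json")
print(spec.name)
print(type(spec.loader).__name__)
print(spec.origin)
# Packages carry a list of directories to search for their submodules.
print(spec.submodule_search_locations)

# A module that cannot be found has no spec.
missing = importlib.util.find_spec("no_such_module_xyz_123")
print(missing)
```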

3. Load the module

Once the module spec is found, the import machinery uses its loader attribute to create and initialize the module, and stores the module in the sys.modules dictionary. The pseudo-code in the “Loading” section of the Python import system documentation shows what happens during the loading portion of import.
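The loading steps can be carried out by hand with importlib.util. This is a rough sketch of the idea, not the exact logic of the import machinery; the file and module name demo_mod are hypothetical.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary location.
src = Path(tempfile.mkdtemp()) / "demo_mod.py"
src.write_text('DATE = "2022-01-01"\n')

# 1. Build a spec that records the name, the origin, and the loader.
spec = importlib.util.spec_from_file_location("demo_mod", str(src))
# 2. Create an empty module object from the spec.
module = importlib.util.module_from_spec(spec)
# 3. Register it in sys.modules before executing its code.
sys.modules[spec.name] = module
# 4. Let the loader execute the source in the module's namespace.
spec.loader.exec_module(module)

# A later import statement is now served from the sys.modules cache.
import demo_mod

print(module.DATE, demo_mod is module)
```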

Python Circular Imports

Finally, let’s look at an interesting import problem: circular imports. A circular import occurs when two or more modules depend on each other. In this example, m2.py depends on m1.py and m1.py depends on m2.py.

module dependency (Created by Xiaoxu Gao)

# m1.py
import m2
m2.do_m2()

def do_m1():
    print("m1")

# m2.py
import m1
m1.do_m1()

def do_m2():
    print("m2")

# main.py
import m1
m1.do_m1()
>> AttributeError: partially initialized module 'm1' has no attribute 'do_m1' (most likely due to a circular import)

Python couldn’t find the attribute do_m1 on module m1. So why does this happen? The graph illustrates the process. When main.py imports m1, Python executes m1.py line by line. The first thing it finds is import m2, so it starts executing m2.py. The first line there is import m1, but since Python hasn’t finished executing m1.py yet, we get a half-initialized module object. When m2.py then calls m1.do_m1(), which Python hasn’t defined yet, it raises an AttributeError.

Circular Imports (Created by Xiaoxu Gao)
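The failure can be reproduced in a single script. The sketch below writes two mutually dependent modules into a temporary directory (the names circ_a_demo and circ_b_demo are hypothetical stand-ins for m1 and m2) and captures the error raised by the import. The exact wording of the message may vary between Python versions.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# circ_a_demo imports circ_b_demo before defining do_a() ...
(root / "circ_a_demo.py").write_text(
    "import circ_b_demo\n"
    "circ_b_demo.do_b()\n"
    "def do_a():\n"
    "    print('a')\n"
)
# ... and circ_b_demo calls do_a() on the half-initialized circ_a_demo.
(root / "circ_b_demo.py").write_text(
    "import circ_a_demo\n"
    "circ_a_demo.do_a()\n"
    "def do_b():\n"
    "    print('b')\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

error_message = None
try:
    import circ_a_demo
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```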

So how do we fix a circular import? In general, circular imports are the result of bad design. Most of the time, the dependency isn’t actually required. A simple solution is to merge both functions into a single module.

# m.py
def do_m1():
    print("m1")

def do_m2():
    print("m2")

# main.py
import m
m.do_m1()
m.do_m2()

Sometimes, the merged module can become very large. Another solution is to defer the import of m2 until it is needed. This can be done by placing import m2 inside the function do_m1(). In this case, Python will define all the functions in m1.py first and load m2.py only when do_m1() is called.

# m1.py
def do_m1():
    import m2
    m2.do_m2()
    print("m1")

def do_m1_2():
    print("m1_2")

# m2.py
import m1

def do_m2():
    m1.do_m1_2()
    print("m2")

# main.py
import m1
m1.do_m1()

Many codebases use deferred imports not necessarily to solve circular dependencies but to speed up startup time. An example from Airflow is its advice not to write top-level code that is not necessary to build DAGs, because the speed of parsing top-level code affects both the performance and the scalability of Airflow.

# example from Airflow doc
import pendulum

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_python_operator",
    schedule_interval=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as dag:

    def print_array():
        import numpy as np
        # <- THIS IS HOW NUMPY SHOULD BE IMPORTED IN THIS CASE

        a = np.arange(15).reshape(3, 5)
        print(a)
        return a

    run_this = PythonOperator(
        task_id="print_the_context",
        python_callable=print_array,
    )

Conclusion

As always, I hope you find this article useful and inspiring. We take many things in Python for granted, but it gets interesting when you discover how they work internally. Hope you enjoyed it. Cheers!


