API Reference¶
AzFileClient¶
class azfs.AzFileClient(credential: Union[str, azure.identity._credentials.default.DefaultAzureCredential, None] = None, connection_string: Optional[str] = None)¶
AzFileClient can:
- list files in blob (also with wildcard *),
- check if a file exists,
- read csv as pd.DataFrame, and json as dict from blob,
- write pd.DataFrame as csv, and dict as json to blob.
Examples
>>> import azfs
>>> from azure.identity import DefaultAzureCredential

credential is not required if your environment is on AAD

>>> azc = azfs.AzFileClient()

credential is required if your environment is not on AAD

>>> credential = "[your storage account credential]"
>>> azc = azfs.AzFileClient(credential=credential)
>>> # or
>>> credential = DefaultAzureCredential()
>>> azc = azfs.AzFileClient(credential=credential)

connection_string is also accepted

>>> connection_string = "[your connection_string]"
>>> azc = azfs.AzFileClient(connection_string=connection_string)
get/download¶
azfs.AzFileClient.get(self, path: str, offset: int = None, length: int = None, **kwargs) → Union[bytes, str, _io.BytesIO, dict]¶
Get data from Azure Blob Storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- offset –
- length –
- **kwargs –
Returns: some data
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read a csv file in azure blob storage

>>> data = azc.get(path=csv_path)

`download()` is the same method as `get()`

>>> data = azc.download(path=csv_path)
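The docstring leaves offset and length undescribed; presumably they select a byte range of the blob, as in the underlying Azure SDK download call. A minimal sketch under that assumption:

>>> # assumption: offset/length behave as a byte range, mirroring
>>> # azure-storage-blob's download_blob(offset=..., length=...)
>>> head = azc.get(path=csv_path, offset=0, length=1024)  # first 1 KiB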
azfs.AzFileClient.read_line_iter(self, path: str) → iter¶
Read a text file line by line with an iterator.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: data of the path as an iterator
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> for l in azc.read_line_iter(path=csv_path):
...     print(l.decode("utf-8"))
azfs.AzFileClient.read_csv(self, path: str, **kwargs) → pandas.core.frame.DataFrame¶
Get csv data as pd.DataFrame from Azure Blob Storage. Supports csv and also csv.gz.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- **kwargs – keywords to pass to pd.read_csv(), such as header, encoding.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read and write csv files in azure blob storage

>>> df = azc.read_csv(path=csv_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_csv_az(csv_path)
azfs.AzFileClient.read_table(self, path: str, **kwargs) → pandas.core.frame.DataFrame¶
Get tsv data as pd.DataFrame from Azure Blob Storage. Supports tsv.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.tsv
- **kwargs – keywords to pass to pd.read_csv(), such as header, encoding.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> tsv_path = "https://testazfs.blob.core.windows.net/test_container/test1.tsv"

you can read and write tsv files in azure blob storage

>>> df = azc.read_table(path=tsv_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_table_az(tsv_path)
azfs.AzFileClient.read_pickle(self, path: str, compression='gzip') → pandas.core.frame.DataFrame¶
Get pickled-pandas data as pd.DataFrame from Azure Blob Storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.pkl
- compression – acceptable keywords are: gzip, bz2, xz. gzip is the default value.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> pkl_path = "https://testazfs.blob.core.windows.net/test_container/test1.pkl"

you can read and write pickle files in azure blob storage

>>> df = azc.read_pickle(path=pkl_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_pickle_az(pkl_path)

you can use a different compression

>>> with azc:
>>>     df = pd.read_pickle_az(pkl_path, compression="bz2")
azfs.AzFileClient.read_json(self, path: str, **kwargs) → dict¶
Read a json file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.json
- **kwargs – keywords to pass to json.loads(), such as parse_float.
Returns: dict
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> json_path = "https://testazfs.blob.core.windows.net/test_container/test1.json"

you can read and write json files in azure blob storage

>>> azc.read_json(path=json_path)
pyspark-like method¶
You can read multiple files at once, using multiprocessing or filter functions.
azfs.AzFileClient.read(self, *, path: Union[str, List[str]] = None, use_mp: bool = False, cpu_count: Optional[int] = None, file_format: str = 'csv') → azfs.az_file_client.DataFrameReader¶
Read csv, parquet, or pickle files in Azure Blob, PySpark-style.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- use_mp – defaults to False
- cpu_count – defaults to mp.cpu_count()
- file_format – determined by which function you call
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> blob_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> df = azc.read().csv(blob_path)
# result is the same as azc.read_csv(blob_path)
>>> blob_path_list = [
...     "https://testazfs.blob.core.windows.net/test_container/test1.csv",
...     "https://testazfs.blob.core.windows.net/test_container/test2.csv"
... ]
>>> df = azc.read().csv(blob_path_list)
# result is the same as pd.concat([each data-frame])
# in addition, you can use `*`
>>> blob_path_pattern = "https://testazfs.blob.core.windows.net/test_container/test*.csv"
>>> df = azc.read().csv(blob_path_pattern)
# you can use multiprocessing with the `use_mp` argument
>>> df = azc.read(use_mp=True).csv(blob_path_pattern)
# if you want to filter or apply some method, you can use your own function as below
>>> def filter_function(_df: pd.DataFrame, _id: str) -> pd.DataFrame:
...     return _df[_df['id'] == _id]
>>> df = azc.read(use_mp=True).apply(function=filter_function, _id="aaa").csv(blob_path_pattern)
put/upload¶
azfs.AzFileClient.put(self, path: str, data) → bool¶
Upload data to blob or data_lake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- data – some data to upload.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> data = "id,name\n1,test_data"

you can write a file to azure blob storage

>>> _result = azc.put(path=csv_path, data=data)

`upload()` is the same method as `put()`

>>> _result = azc.upload(path=csv_path, data=data)
azfs.AzFileClient.write_csv(self, path: str, df: pandas.core.frame.DataFrame, **kwargs) → bool¶
Output a pandas dataframe to a csv file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- df – pd.DataFrame to upload.
- **kwargs – keywords to pass to df.to_csv(), such as encoding, index.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read and write csv files in azure blob storage

>>> azc.write_csv(path=csv_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_csv_az(csv_path)
azfs.AzFileClient.write_table(self, path: str, df: pandas.core.frame.DataFrame, **kwargs) → bool¶
Output a pandas dataframe to a tsv file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.tsv
- df – pd.DataFrame to upload.
- **kwargs – keywords to pass to df.to_csv(), such as encoding, index.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> tsv_path = "https://testazfs.blob.core.windows.net/test_container/test1.tsv"

you can read and write tsv files in azure blob storage

>>> azc.write_table(path=tsv_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_table_az(tsv_path)
azfs.AzFileClient.write_pickle(self, path: str, df: pandas.core.frame.DataFrame, compression='gzip') → bool¶
Output a pandas dataframe to a pickle file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.pkl
- df – pd.DataFrame to upload.
- compression – acceptable keywords are: gzip, bz2, xz. gzip is the default value.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> pkl_path = "https://testazfs.blob.core.windows.net/test_container/test1.pkl"

you can read and write pickle files in azure blob storage

>>> azc.write_pickle(path=pkl_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_pickle_az(pkl_path)

you can use a different compression

>>> with azc:
>>>     df.to_pickle_az(pkl_path, compression="bz2")
azfs.AzFileClient.write_json(self, path: str, data: dict, **kwargs) → bool¶
Output a dict to a json file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.json
- data – dict to upload
- **kwargs – keywords to pass to json.dumps(), such as indent.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> json_path = "https://testazfs.blob.core.windows.net/test_container/test1.json"

you can read and write json files in azure blob storage

>>> azc.write_json(path=json_path, data={"": ""})
file enumerating¶
azfs.AzFileClient.ls(self, path: str, attach_prefix: bool = False) → list¶
List blob files in blob or dfs.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container
- attach_prefix – return the full path if True, otherwise only the file name
Returns: list of azure blob files
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container"
>>> azc.ls(csv_path)
[
    "test1.csv",
    "test2.csv",
    "test3.csv",
    "directory_1",
    "directory_2"
]
>>> azc.ls(path=csv_path, attach_prefix=True)
[
    "https://testazfs.blob.core.windows.net/test_container/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/test2.csv",
    "https://testazfs.blob.core.windows.net/test_container/test3.csv",
    "https://testazfs.blob.core.windows.net/test_container/directory_1",
    "https://testazfs.blob.core.windows.net/test_container/directory_2"
]
azfs.AzFileClient.glob(self, pattern_path: str) → List[str]¶
Currently only supports * (wildcard). By default, glob() lists the specified files with a formatted URL.
Parameters: pattern_path – ex: https://<storage_account_name>.blob.core.windows.net/<container>/*/*.csv
Returns: list of specified files filtered by the wildcard
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> path = "https://testazfs.blob.core.windows.net/test_container/some_folder"

ls() lists all files in the folder, like

>>> azc.ls(path)
[
    "test1.csv",
    "test2.csv",
    "test3.csv",
    "test1.json",
    "test2.json",
    "directory_1",
    "directory_2"
]

glob() lists the specified files according to the wildcard, with a formatted URL by default

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/*.csv"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test2.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test3.csv"
]

glob() can take the wildcard anywhere in the path

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.*"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.json"
]

and also deeper folders

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/*/*.csv"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/directory_1/deeper_test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/directory_2/deeper_test2.csv"
]
Raises: AzfsInputError – when * is used in the root folder under a container.
azfs.AzFileClient.exists(self, path: str) → bool¶
Check whether the specified file exists or not.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: True if the file exists, otherwise False
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.exists(path=csv_path)
True
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/not_exist_test1.csv"
>>> azc.exists(path=csv_path)
False
file manipulating¶
azfs.AzFileClient.info(self, path: str) → dict¶
Get file properties, such as name, creation_time, last_modified_time, size, content_hash(md5).
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: dict of file info
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.info(path=csv_path)
{
    "name": "test1.csv",
    "size": "128KB",
    "creation_time": "",
    "last_modified": "",
    "etag": "etag...",
    "content_type": "",
    "type": "file"
}
azfs.AzFileClient.rm(self, path: str) → bool¶
Delete the file in blob.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: True if the target file is correctly removed.
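Examples

The original docstring gives no example; a minimal usage sketch (the True return value assumes the target file exists and is removed):

>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.rm(path=csv_path)
True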
azfs.AzFileClient.cp(self, src_path: str, dst_path: str, overwrite=False) → bool¶
Copy the data from src_path to dst_path.
Parameters:
- src_path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- dst_path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test2.csv
- overwrite –
Returns: bool
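Examples

No example appears in the original docstring; a minimal sketch of copying a blob within a container:

>>> import azfs
>>> azc = azfs.AzFileClient()
>>> src_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> dst_path = "https://testazfs.blob.core.windows.net/test_container/test2.csv"
>>> azc.cp(src_path=src_path, dst_path=dst_path, overwrite=True)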
TableStorage¶
class azfs.TableStorage(account_name: str, account_key: str, database_name: str)¶
A class for manipulating TableStorage in a Storage Account. The class provides the simple methods below:
- create
- read
- update
- delete (not yet)
The class is intended to be used by delegation, not extension.
Parameters:
- account_name – name of the Storage Account
- account_key – key for the Storage Account
- database_name – name of the StorageTable database
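Examples

No example is given for TableStorage itself; below is a minimal, partly hypothetical sketch. The constructor arguments follow the signature above, but the exact signatures of create/read/update are not documented here, so those calls are shown with placeholders:

>>> import azfs
>>> table_client = azfs.TableStorage(
...     account_name="{storage_account_name}",
...     account_key="{credential}",
...     database_name="{database_name}"
... )
>>> # hypothetical calls; for a documented interface, use TableStorageWrapper below
>>> table_client.create(...)
>>> table_client.read(...)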
class azfs.TableStorageWrapper(account_name, account_key, database_name, partition_key: str, row_key_name: str = 'id_')¶
Wrapper for the TableStorage class.
Parameters:
- account_name – name of the Storage Account
- account_key – key for the Storage Account
- database_name – name of the StorageTable database
- partition_key –
- row_key_name –
Examples
>>> import json
>>> import azfs
>>> from datetime import datetime
>>> from pytz import timezone
>>> tokyo = timezone('Asia/Tokyo')
>>> cons = {
...     "account_name": "{storage_account_name}",
...     "account_key": "{credential}",
...     "database_name": "{database_name}"
... }

# you can manipulate data through `simple_table_client`

>>> simple_table_client = azfs.TableStorageWrapper(partition_key="simple_table", **cons)

# store data according to the keyword-arguments you put
# by default, `id_` is converted to `RowKey`, so `id_` itself is not stored

>>> simple_table_client.put(id_="1", message="hello_world")
{'PartitionKey': 'simple_table', 'message': 'hello_world', 'RowKey': '1'}

# to get all the data, simply call

>>> simple_table_client.get()

# or filter with a specific value;
# `id_` is configured as `RowKey` by default

>>> simple_table_client.get(id_="1")
[
    {
        'PartitionKey': 'simple_table',
        'RowKey': '1',
        'Timestamp': datetime.datetime(2020, 10, 10, 3, 15, 57, 874427, tzinfo=tzutc()),
        'message': 'hello_world',
        'etag': 'W/"datetime\'2020-10-10T03%3A15%3A57.8744271Z\'"'
    }
]

# in addition, you can store data in a different way

>>> complex_client = azfs.TableStorageWrapper(partition_key="complex_table", **cons)
>>> @complex_client.overwrite_pack_data_to_put()
... def modify_put_data(id_: str, message: str):
...     alt_message = json.dumps({datetime.now(tz=tokyo).isoformat(): message}, ensure_ascii=False)
...     return {"id_": id_, "message": alt_message}

# now the data is stored in the modified form

>>> complex_client.put(id_="2", message="hello_world")
{
    'PartitionKey': 'complex_table',
    'message': '{"2020-10-10T12:26:57.442718+09:00": "hello_world"}',
    'RowKey': '2'
}

# you can also modify the update function, here with a value restriction

>>> @complex_client.overwrite_pack_data_to_update(allowed={"message": ["ERROR", "RUNNING", "SUCCESS"]})
... def modify_update_data(id_: str, message: str):
...     d = complex_client.get(id_=id_)
...     message_dict = json.loads(d[0]['message'])
...     if type(message_dict) is dict:
...         message_dict[datetime.now(tz=tokyo).isoformat()] = message
...     else:
...         message_dict = {datetime.now(tz=tokyo).isoformat(): message}
...     data = {
...         "id_": id_,
...         "message": json.dumps(message_dict, ensure_ascii=False)
...     }
...     return data
>>> complex_client.update(id_="2", message="RUNNING")
{
    'PartitionKey': 'complex_table',
    'RowKey': '2',
    'message': '{"2020-10-10T12:26:57.442718+09:00": "hello_world", "2020-10-10T13:00:23.602943+09:00": "RUNNING"}'
}
BlobPathDecoder¶
class azfs.BlobPathDecoder(path: Union[None, str] = None)¶
A class to decode the Azure Blob Storage URL format.
Examples
>>> import azfs
>>> path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> blob_path_decoder = azfs.BlobPathDecoder()
>>> blob_path_decoder.decode(path=path).get()
(testazfs, blob, test_container, test1.csv)
>>> blob_path_decoder.decode(path=path).get_with_url()
(https://testazfs.blob.core.windows.net, blob, test_container, test1.csv)