API Reference¶
AzFileClient¶
class azfs.AzFileClient(credential: Union[str, azure.identity._credentials.default.DefaultAzureCredential, None] = None, connection_string: Optional[str] = None)¶
AzFileClient can:
- list files in blob (also with wildcard *),
- check if a file exists,
- read csv as pd.DataFrame, and json as dict from blob,
- write pd.DataFrame as csv, and dict as json to blob.
Examples
>>> import azfs
>>> from azure.identity import DefaultAzureCredential

credential is not required if your environment is on AAD

>>> azc = azfs.AzFileClient()

credential is required if your environment is not on AAD

>>> credential = "[your storage account credential]"
>>> azc = azfs.AzFileClient(credential=credential)
>>> # or
>>> credential = DefaultAzureCredential()
>>> azc = azfs.AzFileClient(credential=credential)

connection_string is also accepted

>>> connection_string = "[your connection_string]"
>>> azc = azfs.AzFileClient(connection_string=connection_string)
get/download¶
azfs.AzFileClient.get(self, path: str, offset: int = None, length: int = None, **kwargs) → Union[bytes, str, _io.BytesIO, dict]¶
Get data from Azure Blob Storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- offset –
- length –
- **kwargs –
Returns: some data
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read a csv file in azure blob storage

>>> data = azc.get(path=csv_path)

`download()` is the same method as `get()`

>>> data = azc.download(path=csv_path)
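The docstring leaves offset and length undescribed; presumably they select a byte range of the blob, as in the underlying Azure SDK download call. A minimal sketch under that assumption:

>>> # assumption: offset/length behave as a byte range, mirroring
>>> # azure-storage-blob's download_blob(offset=..., length=...)
>>> head = azc.get(path=csv_path, offset=0, length=1024)  # first 1 KiB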
azfs.AzFileClient.read_line_iter(self, path: str) → iter¶
Read a text file line by line with an iterator.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: data of the path as an iterator
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> for l in azc.read_line_iter(path=csv_path):
...     print(l.decode("utf-8"))
azfs.AzFileClient.read_csv(self, path: str, **kwargs) → pandas.core.frame.DataFrame¶
Get csv data as pd.DataFrame from Azure Blob Storage. Supports csv and also csv.gz.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- **kwargs – keywords to pass to pd.read_csv(), such as header, encoding.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read and write csv files in azure blob storage

>>> df = azc.read_csv(path=csv_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_csv_az(csv_path)
azfs.AzFileClient.read_table(self, path: str, **kwargs) → pandas.core.frame.DataFrame¶
Get tsv data as pd.DataFrame from Azure Blob Storage. Supports tsv.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.tsv
- **kwargs – keywords to pass to pd.read_csv(), such as header, encoding.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> tsv_path = "https://testazfs.blob.core.windows.net/test_container/test1.tsv"

you can read and write tsv files in azure blob storage

>>> df = azc.read_table(path=tsv_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_table_az(tsv_path)
azfs.AzFileClient.read_pickle(self, path: str, compression='gzip') → pandas.core.frame.DataFrame¶
Get pickled-pandas data as pd.DataFrame from Azure Blob Storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.pkl
- compression – acceptable keywords are: gzip, bz2, xz. gzip is the default value.
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> pkl_path = "https://testazfs.blob.core.windows.net/test_container/test1.pkl"

you can read and write pickle files in azure blob storage

>>> df = azc.read_pickle(path=pkl_path)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df = pd.read_pickle_az(pkl_path)

you can use a different compression

>>> with azc:
>>>     df = pd.read_pickle_az(pkl_path, compression="bz2")
azfs.AzFileClient.read_json(self, path: str, **kwargs) → dict¶
Read a json file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.json
- **kwargs – keywords to pass to json.loads(), such as parse_float.
Returns: dict
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> json_path = "https://testazfs.blob.core.windows.net/test_container/test1.json"

you can read and write json files in azure blob storage

>>> azc.read_json(path=json_path)
pyspark-like method¶
You can read multiple files at once, using multiprocessing or filter functions.
azfs.AzFileClient.read(self, *, path: Union[str, List[str]] = None, use_mp: bool = False, cpu_count: Optional[int] = None, file_format: str = 'csv') → azfs.az_file_client.DataFrameReader¶
Read csv, parquet, or pickle files in Azure Blob, PySpark-style.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- use_mp – defaults to False
- cpu_count – defaults to mp.cpu_count()
- file_format – determined by which function you call
Returns: pd.DataFrame
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> blob_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> df = azc.read().csv(blob_path)
# result is the same as azc.read_csv(blob_path)
>>> blob_path_list = [
...     "https://testazfs.blob.core.windows.net/test_container/test1.csv",
...     "https://testazfs.blob.core.windows.net/test_container/test2.csv"
... ]
>>> df = azc.read().csv(blob_path_list)
# result is the same as pd.concat([each data-frame])
# in addition, you can use `*`
>>> blob_path_pattern = "https://testazfs.blob.core.windows.net/test_container/test*.csv"
>>> df = azc.read().csv(blob_path_pattern)
# you can use multiprocessing with the `use_mp` argument
>>> df = azc.read(use_mp=True).csv(blob_path_pattern)
# if you want to filter or apply some method, you can use your own function as below
>>> def filter_function(_df: pd.DataFrame, _id: str) -> pd.DataFrame:
...     return _df[_df['id'] == _id]
>>> df = azc.read(use_mp=True).apply(function=filter_function, _id="aaa").csv(blob_path_pattern)
put/upload¶
azfs.AzFileClient.put(self, path: str, data) → bool¶
Upload data to blob or data_lake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- data – some data to upload.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> data = "id,name\n1,test_data"

you can write a file to azure blob storage

>>> _result = azc.put(path=csv_path, data=data)

`upload()` is the same method as `put()`

>>> _result = azc.upload(path=csv_path, data=data)
azfs.AzFileClient.write_csv(self, path: str, df: pandas.core.frame.DataFrame, **kwargs) → bool¶
Output a pandas dataframe to a csv file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- df – pd.DataFrame to upload.
- **kwargs – keywords to pass to df.to_csv(), such as encoding, index.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"

you can read and write csv files in azure blob storage

>>> azc.write_csv(path=csv_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_csv_az(csv_path)
azfs.AzFileClient.write_table(self, path: str, df: pandas.core.frame.DataFrame, **kwargs) → bool¶
Output a pandas dataframe to a tsv file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.tsv
- df – pd.DataFrame to upload.
- **kwargs – keywords to pass to df.to_csv(), such as encoding, index.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> tsv_path = "https://testazfs.blob.core.windows.net/test_container/test1.tsv"

you can read and write tsv files in azure blob storage

>>> azc.write_table(path=tsv_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_table_az(tsv_path)
azfs.AzFileClient.write_pickle(self, path: str, df: pandas.core.frame.DataFrame, compression='gzip') → bool¶
Output a pandas dataframe to a pickle file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.pkl
- df – pd.DataFrame to upload.
- compression – acceptable keywords are: gzip, bz2, xz. gzip is the default value.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> pkl_path = "https://testazfs.blob.core.windows.net/test_container/test1.pkl"

you can read and write pickle files in azure blob storage

>>> azc.write_pickle(path=pkl_path, df=df)

Using the `with` statement, you can use `pandas`-like methods

>>> with azc:
>>>     df.to_pickle_az(pkl_path)

you can use a different compression

>>> with azc:
>>>     df.to_pickle_az(pkl_path, compression="bz2")
azfs.AzFileClient.write_json(self, path: str, data: dict, **kwargs) → bool¶
Output a dict to a json file in Datalake storage.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.json
- data – dict to upload
- **kwargs – keywords to pass to json.dumps(), such as indent.
Returns: True if correctly uploaded
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> json_path = "https://testazfs.blob.core.windows.net/test_container/test1.json"

you can read and write json files in azure blob storage

>>> azc.write_json(path=json_path, data={"": ""})
file enumerating¶
azfs.AzFileClient.ls(self, path: str, attach_prefix: bool = False) → list¶
List blob files in blob or dfs.
Parameters:
- path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container
- attach_prefix – return the full path if True, otherwise only the file name
Returns: list of azure blob files
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container"
>>> azc.ls(csv_path)
[
    "test1.csv",
    "test2.csv",
    "test3.csv",
    "directory_1",
    "directory_2"
]
>>> azc.ls(path=csv_path, attach_prefix=True)
[
    "https://testazfs.blob.core.windows.net/test_container/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/test2.csv",
    "https://testazfs.blob.core.windows.net/test_container/test3.csv",
    "https://testazfs.blob.core.windows.net/test_container/directory_1",
    "https://testazfs.blob.core.windows.net/test_container/directory_2"
]
azfs.AzFileClient.glob(self, pattern_path: str) → List[str]¶
Currently only supports * (wildcard). By default, glob() lists the specified files with a formatted URL.
Parameters: pattern_path – ex: https://<storage_account_name>.blob.core.windows.net/<container>/*/*.csv
Returns: list of specified files filtered by the wildcard
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> path = "https://testazfs.blob.core.windows.net/test_container/some_folder"

ls() lists all files in the folder, like

>>> azc.ls(path)
[
    "test1.csv",
    "test2.csv",
    "test3.csv",
    "test1.json",
    "test2.json",
    "directory_1",
    "directory_2"
]

glob() lists the specified files according to the wildcard, with a formatted URL by default

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/*.csv"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test2.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test3.csv"
]

glob() can take the wildcard anywhere in the path

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.*"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/test1.json"
]

and also deeper folders

>>> csv_pattern_path = "https://testazfs.blob.core.windows.net/test_container/some_folder/*/*.csv"
>>> azc.glob(pattern_path=csv_pattern_path)
[
    "https://testazfs.blob.core.windows.net/test_container/some_folder/directory_1/deeper_test1.csv",
    "https://testazfs.blob.core.windows.net/test_container/some_folder/directory_2/deeper_test2.csv"
]
Raises: AzfsInputError – when * is used in the root folder under a container.
azfs.AzFileClient.exists(self, path: str) → bool¶
Check whether the specified file exists or not.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: True if the file exists, otherwise False
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.exists(path=csv_path)
True
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/not_exist_test1.csv"
>>> azc.exists(path=csv_path)
False
file manipulating¶
azfs.AzFileClient.info(self, path: str) → dict¶
Get file properties, such as name, creation_time, last_modified_time, size, content_hash(md5).
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: dict of file info
Examples
>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.info(path=csv_path)
{
    "name": "test1.csv",
    "size": "128KB",
    "creation_time": "",
    "last_modified": "",
    "etag": "etag...",
    "content_type": "",
    "type": "file"
}
azfs.AzFileClient.rm(self, path: str) → bool¶
Delete the file in blob.
Parameters: path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
Returns: True if the target file is correctly removed.
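Examples

The original docstring gives no example; a minimal usage sketch (the True return value assumes the target file exists and is removed):

>>> import azfs
>>> azc = azfs.AzFileClient()
>>> csv_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> azc.rm(path=csv_path)
True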
azfs.AzFileClient.cp(self, src_path: str, dst_path: str, overwrite=False) → bool¶
Copy the data from src_path to dst_path.
Parameters:
- src_path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test1.csv
- dst_path – Azure Blob path URL format, ex: https://testazfs.blob.core.windows.net/test_container/test2.csv
- overwrite –
Returns: bool
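Examples

No example appears in the original docstring; a minimal sketch of copying a blob within a container:

>>> import azfs
>>> azc = azfs.AzFileClient()
>>> src_path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> dst_path = "https://testazfs.blob.core.windows.net/test_container/test2.csv"
>>> azc.cp(src_path=src_path, dst_path=dst_path, overwrite=True)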
TableStorage¶
class azfs.TableStorage(account_name: str, account_key: str, database_name: str)¶
A class for manipulating TableStorage in a Storage Account. The class provides the simple methods below:
- create
- read
- update
- delete (not yet)
The class is intended to be used by delegation, not extension.
Parameters:
- account_name – name of the Storage Account
- account_key – key for the Storage Account
- database_name – name of the StorageTable database
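Examples

No example is given for TableStorage itself; below is a minimal, partly hypothetical sketch. The constructor arguments follow the signature above, but the exact signatures of create/read/update are not documented here, so those calls are shown with placeholders:

>>> import azfs
>>> table_client = azfs.TableStorage(
...     account_name="{storage_account_name}",
...     account_key="{credential}",
...     database_name="{database_name}"
... )
>>> # hypothetical calls; for a documented interface, use TableStorageWrapper below
>>> table_client.create(...)
>>> table_client.read(...)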
class azfs.TableStorageWrapper(account_name, account_key, database_name, partition_key: str, row_key_name: str = 'id_')¶
Wrapper for the TableStorage class.
Parameters:
- account_name – name of the Storage Account
- account_key – key for the Storage Account
- database_name – name of the StorageTable database
- partition_key –
- row_key_name –
Examples
>>> import json
>>> import azfs
>>> from datetime import datetime
>>> from pytz import timezone
>>> tokyo = timezone('Asia/Tokyo')
>>> cons = {
...     "account_name": "{storage_account_name}",
...     "account_key": "{credential}",
...     "database_name": "{database_name}"
... }

# you can manipulate data through `simple_table_client`

>>> simple_table_client = azfs.TableStorageWrapper(partition_key="simple_table", **cons)

# store data according to the keyword-arguments you put
# by default, `id_` is converted to `RowKey`, so `id_` itself is not stored

>>> simple_table_client.put(id_="1", message="hello_world")
{'PartitionKey': 'simple_table', 'message': 'hello_world', 'RowKey': '1'}

# to get all the data, simply call

>>> simple_table_client.get()

# or filter with a specific value;
# `id_` is configured as `RowKey` by default

>>> simple_table_client.get(id_="1")
[
    {
        'PartitionKey': 'simple_table',
        'RowKey': '1',
        'Timestamp': datetime.datetime(2020, 10, 10, 3, 15, 57, 874427, tzinfo=tzutc()),
        'message': 'hello_world',
        'etag': 'W/"datetime\'2020-10-10T03%3A15%3A57.8744271Z\'"'
    }
]

# in addition, you can store data in a different way

>>> complex_client = azfs.TableStorageWrapper(partition_key="complex_table", **cons)
>>> @complex_client.overwrite_pack_data_to_put()
... def modify_put_data(id_: str, message: str):
...     alt_message = json.dumps({datetime.now(tz=tokyo).isoformat(): message}, ensure_ascii=False)
...     return {"id_": id_, "message": alt_message}

# now the data is stored in the modified form

>>> complex_client.put(id_="2", message="hello_world")
{
    'PartitionKey': 'complex_table',
    'message': '{"2020-10-10T12:26:57.442718+09:00": "hello_world"}',
    'RowKey': '2'
}

# you can also modify the update function, here with a value restriction

>>> @complex_client.overwrite_pack_data_to_update(allowed={"message": ["ERROR", "RUNNING", "SUCCESS"]})
... def modify_update_data(id_: str, message: str):
...     d = complex_client.get(id_=id_)
...     message_dict = json.loads(d[0]['message'])
...     if type(message_dict) is dict:
...         message_dict[datetime.now(tz=tokyo).isoformat()] = message
...     else:
...         message_dict = {datetime.now(tz=tokyo).isoformat(): message}
...     data = {
...         "id_": id_,
...         "message": json.dumps(message_dict, ensure_ascii=False)
...     }
...     return data
>>> complex_client.update(id_="2", message="RUNNING")
{
    'PartitionKey': 'complex_table',
    'RowKey': '2',
    'message': '{"2020-10-10T12:26:57.442718+09:00": "hello_world", "2020-10-10T13:00:23.602943+09:00": "RUNNING"}'
}
BlobPathDecoder¶
class azfs.BlobPathDecoder(path: Union[None, str] = None)¶
A class to decode the Azure Blob Storage URL format.
Examples
>>> import azfs
>>> path = "https://testazfs.blob.core.windows.net/test_container/test1.csv"
>>> blob_path_decoder = azfs.BlobPathDecoder()
>>> blob_path_decoder.decode(path=path).get()
(testazfs, blob, test_container, test1.csv)
>>> blob_path_decoder.decode(path=path).get_with_url()
(https://testazfs.blob.core.windows.net, blob, test_container, test1.csv)