Using the Azure Machine Learning Python SDK, learn how to create Azure Machine Learning datasets to access data for your local or remote experiments. See the Securely access data article to learn how datasets fit into Azure Machine Learning's overall data access workflow.
Azure ML provides two basic assets for working with data: datastores and datasets.
Each Azure ML workspace comes with a default datastore:
from azureml.core import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
It is also accessible through the Azure Portal (under the same resource group as your Azure ML Workspace).
Datastores are linked to workspaces and store connection information to Azure storage services, so that you can refer to them by name rather than remembering the connection information and secrets. Use the Datastore class to perform management operations, including registering, listing, getting, and removing datastores.
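For example, a minimal sketch of these management operations, assuming ws is the Workspace object from above and the datastore name is a placeholder:
from azureml.core import Datastore

# List all datastores registered to the workspace
for name, ds in ws.datastores.items():
    print(name, type(ds).__name__)

# Get a registered datastore by name
blob_ds = Datastore.get(ws, '<datastore-name>')

# Remove the registration from the workspace (the underlying storage is not deleted)
blob_ds.unregister()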
A data store is a repository for storing and managing collections of data; it includes not just databases but also simpler store types such as plain files, emails, and so on. A database is a collection of bytes managed by a database management system (DBMS).
A data set is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where each column of a table represents a particular variable and each row represents a given record of the data set in question. (Wikipedia)
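For tabular data, the SDK also offers TabularDataset. A minimal sketch, assuming datastore is the default datastore from above and the file path is a placeholder:
from azureml.core import Dataset

# Create a TabularDataset from a delimited file on the datastore (illustrative path)
tabular = Dataset.Tabular.from_delimited_files(path=(datastore, 'path/on/datastore/data.csv'))

# Materialize the tabular data as a pandas DataFrame
df = tabular.to_pandas_dataframe()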
Each workspace comes with a default datastore:
datastore = ws.get_default_datastore()
Connect to, or create, a datastore backed by one of the multiple data-storage options that Azure provides. For example:
To register a datastore via an account key:
from azureml.core import Datastore
datastore = Datastore.register_azure_blob_container(
workspace=ws,
datastore_name='<datastore-name>',
container_name='<container-name>',
account_name='<account-name>',
account_key='<account-key>',
)
To register a datastore using a SAS token:
datastore = Datastore.register_azure_blob_container(
workspace=ws,
datastore_name='<datastore-name>',
container_name='<container-name>',
account_name='<account-name>',
sas_token='<sas-token>',
)
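Other Azure storage types can be registered in a similar way. For instance, an Azure file share, sketched here with placeholder values:
datastore = Datastore.register_azure_file_share(
    workspace=ws,
    datastore_name='<datastore-name>',
    file_share_name='<file-share-name>',
    account_name='<account-name>',
    account_key='<account-key>',
)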
The workspace object ws has access to its datastores via ws.datastores: Dict[str, Datastore].
Any datastore that is registered to the workspace can thus be accessed by name:
datastore = ws.datastores['<name-of-registered-datastore>']
from azureml.core import Workspace
ws = Workspace.from_config()
datastore = ws.datastores['<name-of-datastore>']
For a datastore that was created using an account key we can use:
account_name, account_key = datastore.account_name, datastore.account_key
For a datastore that was created using a SAS token we can use:
sas_token = datastore.sas_token
The account name and account key can then be used to connect to the datastore directly, for example in Azure Storage Explorer.
To upload a local directory to a path on the datastore:
datastore.upload(
src_dir='./data',
target_path='<path/on/datastore>',
overwrite=True,
)
Use upload_files if you are dealing with multiple files in various locations:
datastore.upload_files(
files, # List[str] of absolute paths of files to upload
target_path='<path/on/datastore>',
overwrite=False,
)
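For example, the files list could be assembled with glob; a sketch, where the ./data/**/*.csv pattern is purely illustrative:
import glob
import os

# Collect absolute paths of the files to upload
files = [os.path.abspath(p) for p in glob.glob('./data/**/*.csv', recursive=True)]

datastore.upload_files(
    files,
    target_path='<path/on/datastore>',
    overwrite=False,
)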
To download data from the blob datastore:
datastore.download(
target_path, # str: local directory to download to
prefix='<path/on/datastore>',
overwrite=False,
)
To use a DataReference in a remote run, first connect to the basic resources: Workspace, ComputeTarget and Datastore.
from azureml.core import ComputeTarget, Datastore, Workspace
ws: Workspace = Workspace.from_config()
compute_target: ComputeTarget = ws.compute_targets['<compute-target-name>']
ds: Datastore = ws.get_default_datastore()
Create a DataReference as a mount:
data_ref = ds.path('<path/on/datastore>').as_mount()
Alternatively, create the DataReference as a download:
data_ref = ds.path('<path/on/datastore>').as_download()
Add this DataReference to a ScriptRunConfig as follows.
config = ScriptRunConfig(
source_directory='.',
script='script.py',
arguments=[str(data_ref)],  # str(data_ref) expands to the environment variable $AZUREML_DATAREFERENCE_example_data
compute_target=compute_target,  # the compute target the run executes on
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
The command-line argument str(data_ref) returns the environment variable $AZUREML_DATAREFERENCE_example_data. Finally, data_ref.to_config() instructs the run to mount the data on the compute target and to set the environment variable to the correct value.
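Inside script.py, the resolved path can then be read from the command-line argument; a short sketch:
# script.py (sketch): the argument resolves to the mounted path at runtime
import sys

data_path = sys.argv[1]  # value of $AZUREML_DATAREFERENCE_example_data
print('Data is available at:', data_path)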
Specify a path_on_compute to reference your data without the need for command-line arguments:
data_ref = ds.path('<path/on/datastore>').as_mount()
data_ref.path_on_compute = '/tmp/data'
config = ScriptRunConfig(
source_directory='.',
script='script.py',
compute_target=compute_target,
)
config.run_config.data_references[data_ref.data_reference_name] = data_ref.to_config()
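With path_on_compute set, script.py can read from the fixed location directly; for example, as a sketch:
# script.py (sketch): read the data from the fixed path configured above
import os

print(os.listdir('/tmp/data'))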
A dataset can be created and registered from a folder on your local machine. Note that src_dir must point to a folder, not a file.
Warning: upload_directory is an experimental method that may change at any time. For more information, see https://aka.ms/azuremlexperimental.
from azureml.core import Dataset
# upload the data to datastore and create a FileDataset from it
folder_data = Dataset.File.upload_directory(src_dir="path/to/folder", target=(datastore, "self-defined/path/on/datastore"))
dataset = folder_data.register(workspace=ws, name="<dataset_name>")
The following code snippet demonstrates how to create a Dataset given a relative path on the datastore. The path can point to either a folder (e.g. local/test/) or a single file (e.g. local/test/data.tsv).
from azureml.core import Dataset
# create input dataset
data = Dataset.File.from_files(path=(datastore, "path/on/datastore"))
dataset = data.register(workspace=ws, name="<dataset_name>")
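A registered dataset can later be retrieved by name; a short sketch:
from azureml.core import Dataset

# Retrieve the latest version of the registered dataset by name
dataset = Dataset.get_by_name(workspace=ws, name='<dataset_name>')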
OutputFileDatasetConfig
Use OutputFileDatasetConfig to copy the output of a run to a destination on a datastore and register it as a dataset:
from azureml.core import ScriptRunConfig
from azureml.data import OutputFileDatasetConfig
output_data = OutputFileDatasetConfig(
destination=(datastore, "path/on/datastore"),
name="<output_name>",
)
config = ScriptRunConfig(
source_directory=".",
script="run.py",
arguments=["--output_dir", output_data.as_mount()],
)
# register your OutputFileDatasetConfig as a dataset
output_data_dataset = output_data.register_on_complete(name="<dataset_name>", description="<dataset_description>")
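On the consuming side, run.py might write its results into the directory passed via --output_dir; a sketch, where the file name output.txt is purely illustrative:
# run.py (sketch): write results into the mounted output directory
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--output_dir')
args = parser.parse_args()

os.makedirs(args.output_dir, exist_ok=True)
with open(os.path.join(args.output_dir, 'output.txt'), 'w') as f:
    f.write('some output')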
To upload a local directory ./data/:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='./data', target_path='<path/on/datastore>', overwrite=True)
This will upload the entire local directory ./data to the default datastore of the workspace ws.
To create a dataset from a directory <path/on/datastore> on the datastore:
datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, '<path/on/datastore>'))
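To check which files the dataset references, a quick sketch:
# List the file paths the FileDataset refers to
print(dataset.to_path())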
To reference data from a dataset in a ScriptRunConfig, you can either mount or download it:
dataset.as_mount(path_on_compute): mount the dataset to a remote run.
dataset.as_download(path_on_compute): download the dataset to a remote run.
Both as_mount and as_download accept an optional parameter path_on_compute, which specifies where the data is made available on the compute target.
If path_on_compute is None, the data will be downloaded into a temporary directory.
If path_on_compute starts with a /, it will be treated as an absolute path. (If you specify an absolute path, make sure the job has permission to write to that directory.)
This data can be used in a remote run, for example in mount mode:
Use the following code in run.py to submit the run:
from azureml.core import Experiment
experiment = Experiment(ws, '<experiment-name>')
arguments = [dataset.as_mount()]
config = ScriptRunConfig(source_directory='.', script='train.py', arguments=arguments)
experiment.submit(config)
The following code can be used in train.py to consume the mounted dataset:
import os
import sys
data_dir = sys.argv[1]
print("===== DATA =====")
print("DATA PATH: " + data_dir)
print("LIST FILES IN DATA DIR...")
print(os.listdir(data_dir))
print("================")