Develop a model using JupyterLab
The Splunk App for Data Science and Deep Learning (DSDL) leverages predefined JupyterLab Notebook workflows so you can build, test, and operationalize customized models with the Splunk platform.
The DSDL model building workflow includes processes that occur outside of the Splunk platform ecosystem, leveraging third-party infrastructure such as Docker, Kubernetes, and JupyterLab. Any third-party infrastructure processes are not supported by Splunk.
After installing and configuring DSDL, you can perform the following high-level steps to build a model using the Splunk App for Data Science and Deep Learning with JupyterLab:
- Preprocess and select your data
- Create a new Notebook in JupyterLab
- Send sample data to a container
- Develop Notebook code in JupyterLab
- Train the model
- Inspect and validate the model
- Test the model
- Iterate or retrain the model in JupyterLab
- Operationalize the model
Preprocess and select your data
DSDL works best when you provide a clean matrix of data as the foundation for building your model. Complete any data preprocessing using SPL to take full advantage of the Assistant's capabilities. To learn more, see Preparing your data for machine learning in the MLTK User Guide.
Perform the following steps:
- In the Splunk App for Data Science and Deep Learning, select Configuration > Containers and make sure that the __dev__ container is running, and that you can access JupyterLab.
- Use the Search tab to identify the data from which you want to develop a model.
Create a new Notebook in JupyterLab
Complete these steps in JupyterLab:
- From Configuration > Containers, select the JupyterLab button. This opens JupyterLab in a new tab.
- Open the barebone_template.ipynb file from the Notebooks folder. This file is pre-populated with helpful Notebook steps you can leverage or edit to match your use case.
- From the JupyterLab main menu, select File > Save Notebook As. Save this copy of the Notebook with a well-defined naming convention.
Names must not include spaces but can include underscores.
- As soon as you save your Jupyter Notebook, all relevant cells that contain the predefined functions are automatically exported into a Python module. This file is located in the /app/model folder and is then called dynamically from your fit, apply, and summary commands.
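As a rough illustration of what the exported module can contain, the following sketch shows a minimal set of functions. Only the names fit, apply, and summary come from the text above. The init function, the argument lists, the feature_0 field name, and the dictionary-based model are assumptions made for illustration, and plain Python lists of dictionaries stand in for the data that DSDL passes to these functions. Check barebone_template.ipynb for the actual signatures.

```python
import sys


def init(df, param):
    """Create an untrained model object (hypothetical helper)."""
    return {"mean": None, "rows_seen": 0}


def fit(model, df, param):
    """Train on the rows in df and return training info to the search."""
    # feature_0 is a hypothetical field name used only for this sketch.
    values = [float(row["feature_0"]) for row in df]
    model["mean"] = sum(values) / len(values)
    model["rows_seen"] = len(values)
    return {"message": "model trained", "rows": len(values)}


def apply(model, df, param):
    """Return one prediction per input row (here, the training mean)."""
    return [{"predicted": model["mean"]} for _ in df]


def summary(model=None):
    """Report metadata about the runtime and, if given, the model."""
    info = {"python_version": sys.version.split()[0]}
    if model is not None:
        info["rows_seen"] = model["rows_seen"]
    return info
```

Keeping each function small and free of Notebook-only state makes it easier to confirm that the exported module behaves the same way as the Notebook cells.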
Send sample data to a container
- Select a subset of the data you want to use for your model. Optionally you can use the sample command as shown in the following example:
... | sample 1000
- Send the sample data to the container. Include the features that you are looking to model in the search command as shown in the following example:
| fit MLTKContainer mode=stage algo=my_notebook features_* into app:MyFirstModel
The mode=stage setting does not actually fit the data. This command indicates that you want to transfer a dataset over to the container to be worked on in the JupyterLab environment. In most cases you can start with a simple, small dataset of a given structure of features. This helps you to speed up the typical data science iteration cycle.
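To pick up the staged data inside the Notebook, a loader along the following lines can help. This is a minimal sketch that assumes the staged dataset arrives as a CSV file with a JSON parameter file next to it, both named after the model. The load_staged function, the data_dir argument, and the file naming are hypothetical, so check the template Notebook for the helper that DSDL actually provides.

```python
import csv
import json
import os


def load_staged(name, data_dir="data"):
    """Read a staged dataset and its parameters.

    Assumption: staging produced <data_dir>/<name>.csv and
    <data_dir>/<name>.json. Adjust both to match your container.
    """
    with open(os.path.join(data_dir, name + ".csv"), newline="") as f:
        rows = list(csv.DictReader(f))  # one dict per event, values as strings
    with open(os.path.join(data_dir, name + ".json")) as f:
        param = json.load(f)
    return rows, param
```

Under these assumptions, a call such as load_staged("MyFirstModel") would return the staged rows and the parameters that accompanied the search.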
Develop Notebook code in JupyterLab
- Write your code in the JupyterLab Notebook you created. Test that the Notebook operates on the data sent over to the container in the previous task. Take advantage of JupyterLab Notebooks to execute parts of your code and rapidly develop your modeling ideas.
- You may choose to resend a different subset of your data during this code development phase if the original sample does not contain enough records or features for testing purposes.
- After you are satisfied that your code is operating as expected, make sure to save the Notebook.
- Check that the corresponding .py module file in the /app/model/ directory correctly reflects the code of your Notebook and does not contain any Python indentation or spelling errors which can break your code.
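One way to check the exported module for syntax and indentation errors is to parse it without running it. The following is a minimal sketch using the Python standard library. The check_module_source function is a hypothetical helper, not part of DSDL, and it does not catch misspelled variable or function names, which only surface when the code runs.

```python
import ast


def check_module_source(source, filename="<module>"):
    """Return None if the source parses, else a short error description."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f"{filename}:{err.lineno}: {err.msg}"
    return None
```

To use it, read the exported file (for example, a file named after your Notebook in the /app/model/ directory), pass its contents to check_module_source, and review any message that comes back.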
Changes in model code are available at the next call of your model. This call can be triggered by a scheduled search or when users work with your model.
Use robust naming conventions that clearly separate models in development from those in production. Additionally, you can use Git or MLflow for enhanced version control or model lifecycle management.
Train the model
- Split your data into a training set and a testing set. Optionally you can use the sample command for this split or Jupyter if preferred. You can use partitions and seed values for consistent sampling as shown in the following example:
... | sample partitions=10 seed=42 | where partition_number<7
- Run the fit command for your algorithm on the training dataset as shown in the following example:
... | fit MLTKContainer algo=my_notebook epochs=100 features_* into app:MyFirstModel
- Ensure that you are passing in the correct features and parameters. You can pass any key=value based parameters which are exposed in the parameters of the fit and apply functions in your Notebook.
- If training runs successfully, the training results are returned to you.
If you receive errors, check the job inspector and the search.log. You might need to return to the Jupyter Notebook to update the code.
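If you prefer to split the data in Jupyter rather than with the sample command, the seeded partition technique from the first step can be sketched in plain Python. The split_by_partition function and its arguments are hypothetical, and their defaults mirror the SPL example (10 partitions, seed 42, partitions below 7 used for training). This sketch does not reproduce the exact partition assignments that the sample command makes. It only shows the same idea: a fixed seed gives a repeatable split.

```python
import random


def split_by_partition(rows, partitions=10, train_below=7, seed=42):
    """Assign each row a random partition number and split on a threshold."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        partition_number = rng.randrange(partitions)
        if partition_number < train_below:
            train.append(row)
        else:
            test.append(row)
    return train, test
```

With the defaults, roughly 70 percent of the rows land in the training set, and calling the function again with the same seed and the same row order returns the same split.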
Inspect and validate the model
You can use TensorBoard to check and validate how your neural network model evolved over its training epochs. Review histograms and other insights provided in TensorBoard to further improve or tune your model.
Select the TensorBoard button on the Containers page.
Test the model
- Select the test dataset by partitioning the data again as shown in the following example:
... | sample partitions=10 seed=42 | where partition_number>=7
- Apply the model on the test dataset as shown in the following example:
... | apply MyFirstModel
- Use the score command to evaluate the accuracy of the model using relevant metrics. For example, a classification report or confusion matrix for a classification algorithm, or R squared and RMSE for a regression or forecast algorithm.
If you receive errors, check the job inspector and the search.log. You might need to return to the Jupyter Notebook to update the code.
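For a quick sanity check of regression results inside the Notebook, the two regression metrics named above can be computed by hand. The following is a minimal sketch using the standard definitions of RMSE and R squared. The function names are hypothetical, and the sketch is not a replacement for the score command.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For actual values [3, 5, 7] and predictions [2, 5, 8], the residual sum of squares is 2 and the total sum of squares is 8, so R squared is 0.75 and RMSE is the square root of 2/3.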
Iterate or retrain the model in JupyterLab
After the model is tested and achieves good results, revisit your model to adapt to new data or updated business requirements. Revisiting the model is an opportunity to improve your model code or retrain it with new data.
Operationalize the model
After the model is performing to an acceptable standard, create an alert, report, or dashboard to monitor new data as it comes into the Splunk platform. Use the apply command on the data features identified with the trained model.
You can use the Splunk platform to monitor your model performance and results, including keeping track of metrics and alerting on model degradation.
If a model needs to be served from a dedicated container, you can share it by changing the permission of the model to global.
This documentation applies to the following versions of Splunk® App for Data Science and Deep Learning: 5.0.0, 5.1.0, 5.1.1, 5.1.2