Creating a Scikit Learn Layer for AWS Lambda

In the last week of November 2018, Amazon Web Services announced a new feature for AWS Lambda called layers. Layers are a handy way to share common libraries and dependencies between your Lambda functions without having to add them to your deployment package. This is especially useful for libraries that are large and can take considerable time to upload or update.

I have had a tough time trying to package my code plus scikit-learn and its associated dependencies to under 50 MB, which is the AWS Lambda limit for a compressed deployment package.

Amazon provides one of the most useful layers out of the box: SciPy and NumPy. However, I build and use many ML models with scikit-learn, and I would like to be able to create a custom layer for it!

Creating the layer

Depending on how you install scikit-learn, it may or may not run on Lambda. To be safe, it is better to create the package on an Amazon Linux instance or container; that way we build the package for the target OS.

I have created a Docker image that can be helpful for creating a scikit-learn layer. Just run the following command to start the Docker container:

docker run -it --rm -v $HOME/Code/python:/app onema/amazonlinux4lambda bash

Here ~/Code/python is where you will be installing the scikit-learn package on your host machine. We must put the library in a python directory, as this is the expected location for a Python layer.
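
For reference, this is the layout the zip file should have once we are done. Lambda extracts each layer under /opt and adds the layer's python directory to the Python import path; the contents shown under python/ are indicative and will vary with the scikit-learn version:

scikitlearn.zip
└── python/
    ├── sklearn/
    └── ...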

cd /app
mkdir -p scikitlearn/python
cd scikitlearn/
# Install scikit-learn into the python/ directory of the layer
pip3 install --ignore-installed --target=python scikit-learn
# Drop NumPy and SciPy; the AWS-provided layer already supplies them
rm -rf python/numpy* python/scipy*
# Zip the layer contents; the archive ends up in /app (the mounted host directory)
zip -r ../scikitlearn.zip .
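
Before uploading, it is worth confirming that the archive stays well under the Lambda size limits. A quick sanity check (assuming the du and unzip utilities are available in the container):

du -h ../scikitlearn.zip
unzip -l ../scikitlearn.zip | tail -n 1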

Notice that I removed the NumPy and SciPy packages from the installation; this is because we are going to use the existing layer that AWS provides for these libraries.

At this point, upload your scikitlearn.zip package to an S3 bucket you have access to, for example, YOUR_BUCKET_NAME.
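
If you prefer the command line, the upload is a single command (YOUR_BUCKET_NAME is a placeholder for a bucket you own):

aws s3 cp scikitlearn.zip s3://YOUR_BUCKET_NAME/scikitlearn.zip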

Publishing the layer

Now we are going to publish the new layer. Exit the container, navigate to where the zip file was generated ($HOME/Code/python/), and run the following commands (assuming you have the AWS CLI installed and AWS credentials in place):

BUCKET_NAME=YOUR_BUCKET_NAME
aws lambda publish-layer-version  \
    --layer-name Python36-SciKitLearnTest  \
    --description "Latest version of scikit learn for python 3.6"  \
    --license-info "BSD"  \
    --content S3Bucket=$BUCKET_NAME,S3Key=scikitlearn.zip  \
    --compatible-runtimes python3.6
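
The command returns a JSON response that includes the LayerVersionArn of the newly published version. If you need to look it up later, you can list the versions of the layer:

aws lambda list-layer-versions --layer-name Python36-SciKitLearnTest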

Creating a new function

To test the layer, I have decided to use a modified version of the MLReview Topic Modeling with Scikit Learn code.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic :{topic_idx}")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        
def lambda_handler(event, context):
    dataset = fetch_20newsgroups(data_home='/tmp/', shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
    documents = dataset.data
    
    no_features = 1000
    
    # NMF is able to use tf-idf
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
    tfidf_feature_names = tfidf_vectorizer.get_feature_names()
    
    # LDA can only use raw term counts because it is a probabilistic graphical model
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
    tf_feature_names = tf_vectorizer.get_feature_names()
    
    no_topics = 20
    
    # Run NMF
    nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
    
    # Run LDA
    lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
    
    no_top_words = 10
    display_topics(nmf, tfidf_feature_names, no_top_words)
    display_topics(lda, tf_feature_names, no_top_words)

NOTE:

Please note that under normal circumstances you would not load your data and train your model in a Lambda function; rather, you would create the model ahead of time, store it in an S3 bucket, and use the function to make predictions on new data. This code is only for the sake of testing the scikit-learn layer.
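
As a rough sketch of that pattern, a prediction-only handler could look like the following; the bucket name, key, and the event shape are hypothetical, and I am assuming the model was serialized with joblib:

import boto3
from sklearn.externals import joblib  # joblib ships with scikit-learn as of this writing

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Download the pre-trained model from S3 into /tmp, the only writable path in Lambda
    s3.download_file('YOUR_BUCKET_NAME', 'models/model.joblib', '/tmp/model.joblib')
    model = joblib.load('/tmp/model.joblib')
    # 'features' is a hypothetical key holding a list of feature vectors
    return {'predictions': model.predict(event['features']).tolist()}

In practice you would download and load the model once, outside the handler, so that warm invocations can reuse it.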

Create a new lambda function using the AWS console and follow these steps:

  • Author from scratch
  • Name: SciKitLearnLayerTest
  • Runtime: Python 3.6
  • Role: Create a new role or use an existing one; we only need basic Lambda execution permissions
  • Click on create

Now you have a new Lambda function; on this screen follow these steps:

  • Paste the code above into the lambda_function code editor
  • From the Basic Settings, increase the memory to 512 MB and the timeout to 5 min
  • Click on the Layers button
  • In the Referenced layers section select add layer
  • Select AWSLambda-Python36-SciPy1x and the latest available version and add it
  • Once again click on add layer
  • Select Python36-SciKitLearnTest, pick the latest version, and add it (this is our layer)
  • From the main Lambda screen click on Save (you can also attach the layers from the CLI, as shown below)
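
If you prefer the CLI over the console, the same layers can be attached with update-function-configuration. The ARNs below are placeholders; substitute the region, account ID, and version numbers that apply to you (the LayerVersionArn from the publish step, and the SciPy layer ARN shown in the console):

aws lambda update-function-configuration \
    --function-name SciKitLearnLayerTest \
    --layers \
        arn:aws:lambda:REGION:ACCOUNT_ID:layer:AWSLambda-Python36-SciPy1x:2 \
        arn:aws:lambda:REGION:ACCOUNT_ID:layer:Python36-SciKitLearnTest:1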

Now we are ready to test!

  • Click on the test button
  • Give the event a name, e.g. test
  • Click on Create and then Test again

For a 512 MB function, the code takes about 3 minutes to run.
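
You can also invoke the function from the CLI. Given the long runtime, an asynchronous invocation is more convenient, and the topics will show up in the function's CloudWatch logs:

aws lambda invoke --function-name SciKitLearnLayerTest --invocation-type Event out.json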

Conclusion

There you have it! Now the code consists of a single file, and we have managed to set up scikit-learn as a standalone layer.

I hope this helps you experiment and work with scikit-learn in AWS Lambda without the hassle of packaging your code and dependencies while keeping everything under 50 MB.

References

  1. AWS Lambda Layers — Path
  2. AWS Lambda Layers — Manage
  3. Topic Modeling with Scikit Learn
  4. amazonlinux4lambda Docker Image