Wednesday 28 April 2021

iOS 11 Machine Learning with Core ML + AVCapturePhoto Explored


Originally published in 2017 on the IBM blog. The entire project initiative failed and all the articles were removed, including this one. I am republishing the archive here just to keep my content alive. In 2021, this content is not all that useful any more.


 
It was indeed a wow moment for many developers when Apple announced the new iOS 11 and Xcode 9 additions at the WWDC 2017 event. I was one of those who got excited and truly wanted to explore a few of them. Core ML, the machine learning capability, is one such feature that I tried out. Today, the enterprise's focus is not on any single technology like cloud, mobile or ML, but on how to gel these entities together and bring out the best scalable solution for complex real-world and enterprise problems. I feel it is very important for us developers to get some hands-on experience with hot topics such as Machine Learning and Virtual Reality, given the direction the industry is moving in.

What is Core ML all about?
Core ML is a framework that enables the device to run and process machine learning models. The heart of the machine learning execution process is the trained model, i.e. the mlmodel file. Today, Apple's developer documentation site https://developer.apple.com/machine-learning/ includes links to several trained machine learning models in the Core ML compatible format, for us to download and try out. These models are predominantly image classifiers and object recognizers; that is, they find an object in a given image, identify the type of image, and so on. It makes good sense for any developer to start with an image classifier or image analysis model, as it is simple and easy to understand the flow and doesn't need complex sample data as input. All it requires is a simple image as input; the model processes it, predicts, and outputs a descriptive result.

I created a simple single-view iOS app, where the Core ML model implementation takes the camera-captured image as input and outputs the prediction/classification description as the result. To get comparative results from various models and to understand the significance of the ml models here, I added two different sample mlmodel files: ResNet50 and Places205-GoogLeNet. As I progressed with the implementation, I understood that these models do not take a processed, compressed image or a UIImage directly as input; instead they take a CVPixelBuffer.
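
As a rough sketch of that flow, here is roughly how a prediction call looks once a CVPixelBuffer is available. This assumes the Xcode-generated class is named Resnet50 and exposes the typical classLabel output of Apple's published classifier models; the exact names depend on the bundled mlmodel file.

import CoreML

// A minimal sketch, assuming Xcode generated a `Resnet50` class from the
// bundled mlmodel file. The initializer and the prediction(image:) signature
// mirror the model's input/output specification.
func classify(pixelBuffer: CVPixelBuffer) {
    do {
        let model = Resnet50()
        // The generated input name ("image") comes from the mlmodel spec.
        let output = try model.prediction(image: pixelBuffer)
        // classLabel is the typical top-prediction output of Apple's
        // published classifier models.
        print("Prediction: \(output.classLabel)")
    } catch {
        print("Core ML prediction failed: \(error)")
    }
}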

What is CVPixelBuffer & why do ML models use it?
Per Apple's definition, a Core Video pixel buffer is an image buffer that holds pixels in main memory, i.e. physical RAM. It is important to understand the core logic behind Core ML's image classification and object recognition methodology.
As I mentioned above, the trained model and its use by the Core ML framework are the key things behind the ML. Machine learning for object recognition predominantly uses artificial neural networks and deep learning algorithms, with trained data sets as inputs, to predict the output. I will skip the details of deep learning as it is a vast topic.
To keep it simple and specific to our use case of object prediction: these generic ML algorithms take an input data set of numbers and execute the algorithm against enormous other numeric data sets to predict the most probable matching result. A digital image is a rectangular grid of pixels and exists as an array of bytes in RAM or on any storage device. Each pixel typically consists of 8 bits, or 1 byte, and can be represented as a number from 0 to 255.
Image: Representation of an image as a pixel array of numbers
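
To make that concrete, here is a toy illustration (not tied to any Core ML API) of a 3x3 grayscale image as nothing more than nine numbers between 0 and 255:

// A toy 3x3 grayscale "image": each pixel is one byte (0 = black, 255 = white).
let tinyImage: [[UInt8]] = [
    [  0, 128, 255],
    [ 64, 192,  32],
    [255,   0, 128]
]
// Flattened, this is exactly the kind of raw number array an ML algorithm
// consumes: [0, 128, 255, 64, 192, 32, 255, 0, 128]
let flattened = tinyImage.flatMap { $0 }
print(flattened)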

When the generic algorithm processes the input image data as numbers against lots of other numbers, it arrives at a decision: most likely the given image is 'Object X', or most likely the given image is 'not Object X'. In our case, a CVPixelBuffer is the memory space in the device's RAM where the image is held as a raw pixel array of numbers (refer to the image above). Hence, our example Core ML models are configured to take the unprocessed, original pixel array for the most accurate results. As I mentioned, this is one set of inputs to the ml model, which also requires other similar data sets to execute the deep learning algorithms. So what we use here as an example ml model is a packaged, trained machine learning model built with popular Python-based tools like Keras, Caffe, scikit-learn, etc. [see the reference links below]
One may ask why we don't just pass the UIImage to these models and let the ML model implementation take care of reconstructing the pixel array from the UIImage. Technically it is possible, but the downside is the accuracy of the results. Images downloaded from the internet or available on our devices and local storage are most likely processed images (like JPEG), in which the raw image is compressed with some loss of information. That loss may not be noticeable to the human eye, but it could make a lot of difference to an algorithm consuming and processing the data.
So, there are two factors here that contribute to the accuracy of the prediction output:
1. The type of trained ML Models.
2. The Input Image Quality in Pixels.

When working with image-based ML concepts, it is important to use the AVFoundation framework, as it enables the app to work with core media such as camera images, microphone recordings, etc. iOS 11 introduced considerable API changes to Camera & Media Capture, such as the AVCapturePhotoCaptureDelegate methods, AVCapturePhoto, etc. A CVPixelBuffer is now available as an instance property of the AVCapturePhoto object.
In the sample PoC app I created, a utility class "CameraUtility" has been added that implements the new AVCapturePhotoCaptureDelegate protocol method (beta). Modules are also added to play around with the new APIs and AVCapturePhotoSettings.
For example:
// To capture only a compressed image, such as JPEG
func simpleCapture() {
    let jpegFormat: [String: Any] = [AVVideoCodecKey: AVVideoCodecType.jpeg]
    let settings = AVCapturePhotoSettings(format: jpegFormat)
    photoOutput?.capturePhoto(with: settings, delegate: self)
}

Finally, when I implemented everything and passed the CVPixelBuffer instance to the MLModel input, Xcode threw a strange runtime error about a mismatch in the height and width of the pixel buffer. This points to another important thing to focus on: the mlmodel file's Model Evaluation Parameters section, which describes the input and output specification as originally defined in the trained Python-based model, in the format shown below. This is a read-only property and we cannot change the specification in the Xcode settings. Xcode auto-generates the Swift equivalent class for this model specification so that the app's modules can consume it.
Image: The mlmodel file's Model Evaluation Parameters (input/output specification) as shown in Xcode
As per the spec, the mlmodel accepts the input image at 224x224 dimensions. But the AVCaptureSession preset uses a standard 16:9 resolution that can scale up or down proportionally but never matches the square dimensions the model file requires. Luckily, I managed to find a utility method on a developer's blog that takes a standard UIImage as input and returns a CVPixelBuffer at 224x224. As mentioned above, this is a reverse conversion of a processed UIImage to a CVPixelBuffer, which may not give results as accurate as input from a raw camera pixel buffer. I will continue my research to crack this properly, but for now I am going with the reverse, lossy conversion. It may be a more complex implementation, but it gives a better learning experience and understanding.
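
The utility I borrowed is broadly along these lines. This is a hedged sketch, not the exact code from that blog: allocate a 224x224 CVPixelBuffer, then redraw the UIImage into its memory with Core Graphics.

import UIKit
import CoreVideo

// A rough sketch of converting a processed UIImage into the 224x224
// CVPixelBuffer the mlmodel expects. This is the lossy "reverse"
// conversion discussed above, not the original raw camera buffer.
func pixelBuffer(from image: UIImage, width: Int = 224, height: Int = 224) -> CVPixelBuffer? {
    let attributes: [String: Any] = [
        kCVPixelBufferCGImageCompatibilityKey as String: true,
        kCVPixelBufferCGBitmapContextCompatibilityKey as String: true
    ]
    var buffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                                     kCVPixelFormatType_32ARGB,
                                     attributes as CFDictionary, &buffer)
    guard status == kCVReturnSuccess, let pixelBuffer = buffer else { return nil }

    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

    guard let context = CGContext(data: CVPixelBufferGetBaseAddress(pixelBuffer),
                                  width: width, height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue) else {
        return nil
    }

    // Flip the coordinate system and draw the scaled UIImage into the buffer's memory.
    UIGraphicsPushContext(context)
    context.translateBy(x: 0, y: CGFloat(height))
    context.scaleBy(x: 1, y: -1)
    image.draw(in: CGRect(x: 0, y: 0, width: width, height: height))
    UIGraphicsPopContext()

    return pixelBuffer
}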

The other, simpler and most recommended alternative is to use the Vision framework's VNCoreMLRequest APIs for the ML analysis, which handle the scaling and CVPixelBuffer conversion for you. Thanks to Apple.
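
A minimal sketch of that Vision-based route, again assuming the Xcode-generated Resnet50 class; Vision takes care of resizing and converting the image to the pixel format the model expects.

import Vision
import CoreML

// Classify a CGImage via Vision instead of feeding a CVPixelBuffer directly.
func classifyWithVision(cgImage: CGImage) {
    do {
        let visionModel = try VNCoreMLModel(for: Resnet50().model)
        let request = VNCoreMLRequest(model: visionModel) { request, error in
            guard let results = request.results as? [VNClassificationObservation],
                  let top = results.first else {
                print("No classification results: \(String(describing: error))")
                return
            }
            print("Prediction: \(top.identifier) (confidence \(top.confidence))")
        }
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        try handler.perform([request])
    } catch {
        print("Vision request failed: \(error)")
    }
}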

Below is a quick demo video recording of the PoC app, which uses the two different models to identify the image and the objects in it. This is perhaps the best way to understand why it is important to choose the right mlmodel. ResNet50 is trained to identify the object in the image, whereas Places205-GoogLeNet is trained to identify the image as a whole scene (e.g. airport, beach, etc.). Hence, the former gives favorable results, whereas the latter is mostly inaccurate, because that model is trained to identify the image as a scene and not as a collection of objects.


Source Code - GIT Link:
I have uploaded the entire project for reference to my personal GitHub: https://github.com/sangy05/CoreML-WithAVCapture-iOS11. It is bundled with the coreml file as well. Install iOS 11 beta on your device and Xcode 9 on your Mac. Clone the source, do a clean build, and run on the iPhone device.

Trying out Core ML successfully is just the start of exploring machine learning capabilities, and by no means deep insight into ML. Image recognition is just one feature of machine learning algorithms. Amazon, the e-commerce giant, suggests that ML focuses on important areas such as:
  • Forecasting & Pricing
  • Recommendations and Product Search
  • eCommerce Fraud Detection
  • Predictive Help for Sellers
  • Inventory Management
  • Visual Search/Computer Vision
  • Text Search, NLP, Summarization. 
Way to go!!! We as developers should focus more on understanding the core concepts and, importantly, try to gel these entities together to provide solutions for the enterprise community. I will continue to explore more and share my thoughts. I would appreciate any queries, comments and recommendations on my blog.


References:
  • https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Blog Disclaimer:
This is not a peer-reviewed article. The above content is written based on my understanding, with the intention of sharing my learnings and findings with my fellow developers and of motivating young developers :) The information may not be 100% accurate on a factual basis.

Happy Coding!!!!
