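January 25, 2022

Implementing a Virtual Background Feature in a Video Conferencing iOS App

Shigeki Suwa / Front-end Web Developer

Hello, Suwa-kun here.

I used to think of myself as just a guy who builds websites, but lately I've ended up working on WebRTC and even iOS apps.

Explosive growth 💪

This article covers how to apply a virtual background to the iOS camera feed with Google ML Kit and make the result usable from Twilio Programmable Video, along with some surrounding background knowledge.

Note: Google ML Kit does not currently support Swift Packages, so I don't recommend using it in iOS/macOS apps. If you can, you are probably better off with Core ML or VNGeneratePersonSegmentationRequest.

I've tried to write this so that it reads well from top to bottom, but feel free to jump around as needed.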

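Development environment

This article assumes the following development environment. The snippets may not work as-is in a different environment.

Xcode 13.2
Deployment target: iOS 14.0
pods
- GoogleMLKit/SegmentationSelfie 2.3.0
- TwilioVideo 4.6.0
Tested on: iPad Pro (12.9-inch, 3rd generation, iPadOS 15.3)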

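Capturing the camera feed

We start by getting video from the camera.
On iOS, camera video is captured with the AVFoundation framework. Apple's documentation has an overview.

Find the AVCaptureDevice that represents the camera, then create an AVCaptureDeviceInput, which abstracts it as an AV input.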

import AVFoundation

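class MyVideoProcessor {
    init?(position: AVCaptureDevice.Position = .front) {
        // The helper below declares the `position:` argument label, so it must be passed here.
        guard let input = createCaptureDeviceInput(position: position) else {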
            return nil
        }
    }
}

private func createCaptureDeviceInput(position: AVCaptureDevice.Position) -> AVCaptureDeviceInput? {
    let deviceDiscoverySession = AVCaptureDevice.DiscoverySession(
        deviceTypes: [.builtInWideAngleCamera],
        mediaType: .video,
        position: position
    )

    guard let device = deviceDiscoverySession.devices.first else {
        return nil
    }

    return try? AVCaptureDeviceInput(device: device)
}

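Create an AVCaptureVideoDataOutput, which represents the video output, and receive the video data through its delegate.

// The sample buffer delegate protocol inherits from NSObjectProtocol, so the class derives from NSObject.
class MyVideoProcessor: NSObject {
    // NSObject's init() is non-failable and cannot be overridden as failable, so this initializer takes a parameter.
    init?(position: AVCaptureDevice.Position = .front) {
        guard let input = createCaptureDeviceInput(position: position) else {
            return nil
        }
        // `self` can only be used after super.init().
        super.init()
        let output = AVCaptureVideoDataOutput()
        let dispatchQueue = DispatchQueue.global(qos: .userInteractive)
        output.setSampleBufferDelegate(self, queue: dispatchQueue)
    }
}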

extension MyVideoProcessor: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        /* Virtual background processing starts here (described later). */
    }
}

To improve performance, the ML Kit (iOS) documentation says to set

output.alwaysDiscardsLateVideoFrames = true

so we do as it says.

Create an AVCaptureSession to connect the input and the output.

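class MyVideoProcessor: NSObject {
    // Initialized inline so that every stored property is set before `self` is used in init.
    let captureSession = AVCaptureSession()

    init?(position: AVCaptureDevice.Position = .front) {
        /* omitted */

        captureSession.addInput(input)
        captureSession.addOutput(output)
        captureSession.startRunning()
    }
}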

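That completes capturing video from the camera.
To check that the AVCaptureSession is working, you can display its output on screen with AVCaptureVideoPreviewLayer, but that is outside the scope of this article. The article linked here may be a useful reference.

Fixing the camera video orientation

You might expect the camera video to rotate along with the iOS device, but apparently it does not.
We add code that corrects the video orientation so that it follows the orientation of the app's UI.

Write a function that maps UIInterfaceOrientation to AVCaptureVideoOrientation.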

func getVideoOrientation(from uiInterfaceOrientation: UIInterfaceOrientation) -> AVCaptureVideoOrientation? {
    switch uiInterfaceOrientation {
    case .portrait:
        return .portrait
    case .portraitUpsideDown:
        return .portraitUpsideDown
    case .landscapeLeft:
        return .landscapeLeft
    case .landscapeRight:
        return .landscapeRight
    case .unknown:
        return nil
    @unknown default:
        return nil
    }
}

In MyVideoProcessor, implement a function that takes a UIInterfaceOrientation and sets the video orientation on each AVCaptureConnection.

class MyVideoProcessor {
    func setVideoOrientation(with uiInterfaceOrientation: UIInterfaceOrientation) {
        guard let videoOrientation = getVideoOrientation(from: uiInterfaceOrientation) else {
            return
        }

        for connection in captureSession.connections {
            connection.videoOrientation = videoOrientation
        }
    }
}

Get the UIInterfaceOrientation from the UIWindowScene and pass it to MyVideoProcessor. The following example is for an app that uses SwiftUI.

import SwiftUI

@main
struct App: SwiftUI.App {
    @UIApplicationDelegateAdaptor var appDelegate: AppDelegate
}

class AppDelegate: NSObject, UIApplicationDelegate, ObservableObject {
    func application(_ application: UIApplication, configurationForConnecting connectingSceneSession: UISceneSession, options: UIScene.ConnectionOptions) -> UISceneConfiguration {
        let config = UISceneConfiguration(name: nil, sessionRole: connectingSceneSession.role)
        config.delegateClass = SceneDelegate.self
        return config
    }
}

class SceneDelegate: NSObject, UIWindowSceneDelegate, ObservableObject {
    @Published var interfaceOrientation: UIInterfaceOrientation = .unknown

    func sceneDidBecomeActive(_ scene: UIScene) {
        if let windowScene = scene as? UIWindowScene {
            interfaceOrientation = windowScene.interfaceOrientation
        }
    }

    func sceneWillEnterForeground(_ scene: UIScene) {
        if let windowScene = scene as? UIWindowScene {
            interfaceOrientation = windowScene.interfaceOrientation
        }
    }

    func windowScene(_ windowScene: UIWindowScene, didUpdate previousCoordinateSpace: UICoordinateSpace, interfaceOrientation previousInterfaceOrientation: UIInterfaceOrientation, traitCollection previousTraitCollection: UITraitCollection) {
        interfaceOrientation = windowScene.interfaceOrientation
    }
}

struct MyView: View {
    @EnvironmentObject private var sceneDelegate: SceneDelegate

    var body: some View {
        MyAnotherView()
            .onChange(of: sceneDelegate.interfaceOrientation, perform: setVideoOrientation)
    }

    private func setVideoOrientation(_ uiInterfaceOrientation: UIInterfaceOrientation) {
        myVideoProcessorInstance.setVideoOrientation(with: uiInterfaceOrientation)
    }
}

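Switching between the front and rear cameras

To switch between the front and rear cameras, recreate the AVCaptureDeviceInput. When you do, re-apply the video orientation to the connections of the AVCaptureSession.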

class MyVideoProcessor {
    func changeCameraPosition(position: AVCaptureDevice.Position) {
        if let previousInput = input {
            captureSession.removeInput(previousInput)
        }

        if let newInput = createCaptureDeviceInput(position: position) {
            input = newInput
            captureSession.addInput(newInput)
        }

        for connection in captureSession.connections {
            // Keep the value last set via `setVideoOrientation(with:)` in a stored property and re-apply it here.
            connection.videoOrientation = videoOrientation
        }
    }
}

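Creating a mask image with ML Kit

Now it's finally time to use ML Kit.
The goal here is to detect the person in the image, so we use Selfie segmentation. See the official documentation for installation steps and detailed guidance.

First, create a Segmenter.
To keep the mask handling described later simple, options.shouldEnableRawSizeMask is left disabled.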

import MLKit

class MyVideoProcessor {
    private let segmenter: Segmenter = {
        let options = SelfieSegmenterOptions()
        options.segmenterMode = .stream
        return Segmenter.segmenter(options: options)
    }()
}

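For each frame of the camera video, determine which part is the person and generate mask data. Following the tips in the documentation, this is done synchronously.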

class MyVideoProcessor {
    private func processFrame(sampleBuffer: CMSampleBuffer) {
        guard let mask = createSegmentationMask(of: sampleBuffer) else {
            return
        }
    }

    private func createSegmentationMask(of sampleBuffer: CMSampleBuffer) -> SegmentationMask? {
        let visionImage = VisionImage(buffer: sampleBuffer)
        return try? segmenter.results(in: visionImage)
    }
}

extension MyVideoProcessor: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        processFrame(sampleBuffer: sampleBuffer)
    }
}

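Here mask.buffer is a CVPixelBuffer, but its pixel data does not represent color the way four RGBA values would. Each pixel holds a single value from 0 to 1 that indicates whether it belongs to a person. I expected this to need special handling, but the Core Image framework described below seems to deal with it gracefully, so the buffer can be used as-is in the subsequent processing.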
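If you want to look at those values yourself, here is a minimal sketch. It assumes the mask uses one 32-bit float per pixel (kCVPixelFormatType_OneComponent32Float) and returns nil otherwise. The helper name personCoverage is my own and is not part of ML Kit.

```swift
import CoreVideo

/// Hypothetical helper: the fraction of mask pixels whose value is at least `threshold`.
/// Assumes one Float32 per pixel; returns nil for any other pixel format.
func personCoverage(in maskBuffer: CVPixelBuffer, threshold: Float = 0.5) -> Double? {
    guard CVPixelBufferGetPixelFormatType(maskBuffer) == kCVPixelFormatType_OneComponent32Float else {
        return nil
    }

    CVPixelBufferLockBaseAddress(maskBuffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(maskBuffer, .readOnly) }

    let width = CVPixelBufferGetWidth(maskBuffer)
    let height = CVPixelBufferGetHeight(maskBuffer)
    let bytesPerRow = CVPixelBufferGetBytesPerRow(maskBuffer)
    guard let base = CVPixelBufferGetBaseAddress(maskBuffer), width > 0, height > 0 else {
        return nil
    }

    var personPixels = 0
    for y in 0..<height {
        // Rows may be padded, so advance by bytesPerRow rather than width * 4.
        let row = (base + y * bytesPerRow).assumingMemoryBound(to: Float32.self)
        for x in 0..<width where row[x] >= threshold {
            personPixels += 1
        }
    }
    return Double(personPixels) / Double(width * height)
}
```

This loop runs on the CPU, so it is only meant for debugging or logging, not for per-frame compositing.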

Processing the camera video with Core Image

We use CIImage and CIFilter from the Core Image framework to apply the virtual background to each frame.

First, create a CIImage for the camera video and another for the mask.

class MyVideoProcessor {
    private func processFrame(sampleBuffer: CMSampleBuffer) {
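        /* omitted */

        // CMSampleBufferGetImageBuffer returns an optional, so unwrap it before use.
        guard let fgBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }
        let fgImage = CIImage(cvImageBuffer: fgBuffer)

        let maskImage = CIImage(cvPixelBuffer: mask.buffer)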
    }
}

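Using an image as the background

Prepare a CIImage for the background image. A CIImage can be initialized in various ways. Picking an image from the Photos app is covered later.

Creating the CIImage from an image file on every frame would probably hurt performance, so the background CIImage is kept in memory.

The code below makes the image fit nicely: it is scaled up or down to match the size of the camera video and then centered.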

class MyVideoProcessor {
    private let bgImageRaw = CIImage(/* ... */).orientationCorrected()

    private func processFrame(sampleBuffer: CMSampleBuffer) {
        /* omitted */

        guard let bgImage = bgImageRaw
            .scaled(toCover: fgImage.extent.size)?
            .centerCropped(to: fgImage.extent.size) else {
                return
            }
    }
}

extension CIImage {
    var exifOrientation: Int32? {
        let exif = properties["{Exif}"] as? [String: Any]
        let tiff = properties["{TIFF}"] as? [String: Any]
        let orientation = exif?["Orientation"] ?? tiff?["Orientation"]
        return orientation as? Int32
    }

    /// Corrects the image orientation based on its EXIF information.
    func orientationCorrected() -> CIImage {
        if let exifOrientation = exifOrientation {
            return oriented(forExifOrientation: exifOrientation)
        } else {
            return self
        }
    }

    /// Scales the image up or down, preserving its aspect ratio, so that it covers `size`.
    func scaled(toCover size: CGSize) -> CIImage? {
        let widthRatio = size.width / extent.width
        let heightRatio = size.height / extent.height
        let ratio = max(widthRatio, heightRatio)
        return scaled(toRatio: ratio)
    }

    func scaled(toRatio ratio: CGFloat) -> CIImage? {
        let filter = CIFilter(name: "CILanczosScaleTransform")
        filter?.setDefaults()
        filter?.setValue(self, forKey: "inputImage")
        filter?.setValue(ratio, forKey: "inputScale")
        return filter?.outputImage
    }

    /// Crops the image to `size`, centered.
    func centerCropped(to size: CGSize) -> CIImage? {
        let x = (extent.width - size.width) / 2
        let y = (extent.height - size.height) / 2
        let origin = CGPoint(x: x, y: y)
        let rect = CGRect(origin: origin, size: size)
        // Translate the cropped image so that its origin is back at (0, 0).
        let correction = CGAffineTransform(translationX: -x, y: -y)
        return cropped(to: rect).transformed(by: correction)
    }
}
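To make the arithmetic in scaled(toCover:) and centerCropped(to:) concrete, here is a small worked example that uses plain CGSize values. The helper names and the sizes (a 4000x3000 photo behind a 1920x1080 frame) are made up for illustration.

```swift
import Foundation

/// Hypothetical helper mirroring `scaled(toCover:)`: the factor that makes `source` cover `target`.
func coverScale(from source: CGSize, to target: CGSize) -> CGFloat {
    max(target.width / source.width, target.height / source.height)
}

/// Hypothetical helper mirroring `centerCropped(to:)`: the origin of the centered crop rectangle.
func centerCropOrigin(of scaled: CGSize, to target: CGSize) -> CGPoint {
    CGPoint(x: (scaled.width - target.width) / 2, y: (scaled.height - target.height) / 2)
}

// Illustrative sizes: a 4000x3000 photo behind a 1920x1080 camera frame.
let photo = CGSize(width: 4000, height: 3000)
let frame = CGSize(width: 1920, height: 1080)

// Width ratio is 0.48 and height ratio is 0.36; taking the larger one guarantees full coverage.
let ratio = coverScale(from: photo, to: frame)

// The scaled photo is about 1920x1440, so 360 rows overflow vertically.
let scaled = CGSize(width: photo.width * ratio, height: photo.height * ratio)

// Centering trims about 180 rows from the top and from the bottom.
let cropOrigin = centerCropOrigin(of: scaled, to: frame)
```

Using min instead of max would fit the whole photo inside the frame and leave uncovered bands, which is not what you want for a background.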

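Processing the background, for example blurring it

Prepare a CIImage by processing fgImage.

The code below simply blurs the camera video.
Compositing this with the foreground makes it look as if only the background is blurred.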

class MyVideoProcessor {
    private func processFrame(sampleBuffer: CMSampleBuffer) {
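        /* omitted */

        // clampedToExtent() makes the extent infinite, so crop back to the frame after blurring.
        guard let bgImage = fgImage.clampedToExtent().blur(radius: 15)?.cropped(to: fgImage.extent) else {
            return
        }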
    }
}

extension CIImage {
    func blur(radius: Double) -> CIImage? {
        let filter = CIFilter(name: "CIGaussianBlur")
        filter?.setDefaults()
        filter?.setValue(self, forKey: "inputImage")
        filter?.setValue(radius, forKey: "inputRadius")
        return filter?.outputImage
    }
}

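Compositing the foreground, mask, and background

The CIBlendWithMask filter composites the foreground (camera video), the mask, and the background.
The foreground is cut out according to the mask, and the area that was cut away is filled in with the background.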

class MyVideoProcessor {
    private func processFrame(sampleBuffer: CMSampleBuffer) {
        /* omitted */

        // blendWithMask returns an optional, so unwrap it before rendering.
        guard let blendedImage = fgImage.blendWithMask(mask: maskImage, background: bgImage) else {
            return
        }
    }
}

extension CIImage {
    func blendWithMask(mask: CIImage, background: CIImage) -> CIImage? {
        let filter = CIFilter(name: "CIBlendWithMask")
        filter?.setDefaults()
        filter?.setValue(self, forKey: "inputImage")
        filter?.setValue(mask, forKey: "inputMaskImage")
        filter?.setValue(background, forKey: "inputBackgroundImage")
        return filter?.outputImage
    }
}

The resulting blendedImage is the image with the virtual background applied.
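As a mental model, the blend can be thought of as a per-pixel linear interpolation: a mask value of 1 keeps the foreground, 0 shows the background, and values in between mix the two. The sketch below shows that idea for a single color channel. It is a conceptual illustration, not Core Image's actual implementation, and blendChannel is a name I made up.

```swift
/// Conceptual model only: interpolate between background and foreground by the mask value.
func blendChannel(foreground: Double, background: Double, mask: Double) -> Double {
    // Clamp so that out-of-range mask values cannot overshoot.
    let m = min(max(mask, 0), 1)
    return foreground * m + background * (1 - m)
}

let personPixel = blendChannel(foreground: 0.8, background: 0.2, mask: 1.0) // foreground is kept
let wallPixel = blendChannel(foreground: 0.8, background: 0.2, mask: 0.0)   // background shows through
let edgePixel = blendChannel(foreground: 0.8, background: 0.2, mask: 0.5)   // even mix of both
```

The in-between values are why the edges of the person look soft rather than jagged.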

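Using a CIImage as a TwilioVideo video source

Next, we make the video with the virtual background applied available to TwilioVideo. How to use TwilioVideo and Programmable Video is not covered here.

Rendering to a CVPixelBuffer

Convert blendedImage: CIImage into a CVPixelBuffer. Here it is written over the memory of fgBuffer: CVPixelBuffer (note: this may not actually be a good idea).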

class MyVideoProcessor {
    private let ciContext = CIContext()

    private func processFrame(sampleBuffer: CMSampleBuffer) {
        /* omitted */

        let outputBuffer = fgBuffer
        ciContext.render(blendedImage, to: outputBuffer)
    }
}
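If overwriting the camera's buffer turns out to be a problem, one alternative is to render into a separately allocated buffer. The following is a minimal sketch, assuming that a 32BGRA buffer is acceptable downstream. The helper names are hypothetical, and for per-frame use a CVPixelBufferPool would likely be a better fit than allocating every time.

```swift
import CoreImage
import CoreVideo

/// Hypothetical helper: allocates a BGRA buffer with the same dimensions as `source`.
func makeOutputBuffer(matching source: CVPixelBuffer) -> CVPixelBuffer? {
    let attributes: [String: Any] = [
        // IOSurface backing allows the GPU to render into the buffer.
        kCVPixelBufferIOSurfacePropertiesKey as String: [String: Any](),
    ]
    var buffer: CVPixelBuffer?
    let status = CVPixelBufferCreate(
        kCFAllocatorDefault,
        CVPixelBufferGetWidth(source),
        CVPixelBufferGetHeight(source),
        kCVPixelFormatType_32BGRA, // assumption: the consumer accepts BGRA
        attributes as CFDictionary,
        &buffer
    )
    return status == kCVReturnSuccess ? buffer : nil
}

/// Hypothetical helper: renders `image` into a fresh buffer instead of the camera's own buffer.
func render(_ image: CIImage, with context: CIContext, sizedLike source: CVPixelBuffer) -> CVPixelBuffer? {
    guard let output = makeOutputBuffer(matching: source) else {
        return nil
    }
    context.render(image, to: output)
    return output
}
```

Inside processFrame this would be used as `render(blendedImage, with: ciContext, sizedLike: fgBuffer)`, leaving fgBuffer untouched.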

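Implementing the VideoSource protocol

Make MyVideoProcessor conform to the TwilioVideo.VideoSource protocol. (CameraSource is TwilioVideo's built-in camera capturer class, which itself conforms to VideoSource.)
See the Programmable Video documentation for details.

import TwilioVideo

class MyVideoProcessor: NSObject, VideoSource {
    var isScreencast: Bool = false
    weak var sink: VideoSink?
    // Remembers the most recently requested output format.
    private var outputFormat: VideoFormat?

    func requestOutputFormat(_ outputFormat: VideoFormat) {
        self.outputFormat = outputFormat

        if let sink = self.sink {
            sink.onVideoFormatRequest(outputFormat)
        }
    }
}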

Feeding frames to the VideoSink

Feed the video frame data into the source's sink: VideoSink.

class MyVideoProcessor: NSObject, VideoSource {
    private func processFrame(sampleBuffer: CMSampleBuffer) {
        /* omitted */

        let timestamp = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)

        // The camera video orientation was already corrected before the virtual background processing, so `orientation: .up` is fixed.
        guard let videoFrame = VideoFrame(timestamp: timestamp, buffer: outputBuffer, orientation: .up) else {
            return
        }

        // `sink` is a weak optional, so use optional chaining.
        sink?.onVideoFrame(videoFrame)
    }
}

Using it as the source of a LocalVideoTrack

Now that it conforms to the protocol, MyVideoProcessor can be used as the video source of a LocalVideoTrack.

guard let source = MyVideoProcessor() else {
    return
}
let localVideoTrack = LocalVideoTrack(source: source)

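Bonus: background knowledge

The sections so far covered the processing itself: taking the camera video and applying a virtual background to it.
This section covers the background knowledge I needed when actually implementing the virtual background feature in an app.

CVPixelBuffer and pixel formats

From the CMSampleBuffer delivered to the delegate method AVCaptureVideoDataOutputSampleBufferDelegate.captureOutput described earlier, you can get a CVImageBuffer that represents one frame of video.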

extension MyVideoProcessor: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        let cvImageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer)
    }
}

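It may help to think of this as a collection of values (roughly, color codes) that represent the color of each pixel in the image.
A format that lists the four RGBA (Red, Green, Blue, Alpha) values in order is easy for humans to understand, but the data I got in this environment was in a YUV format.

Note: a snippet for identifying the pixel format of a CVPixelBuffer (= CVImageBuffer) 👉 Gist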
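One simple way to identify the format, independent of the Gist, is to decode the four-character code returned by CVPixelBufferGetPixelFormatType. The helper below is my own sketch. It works on a plain UInt32, so the decoding itself needs no framework.

```swift
/// Hypothetical helper: turns a pixel format code into its four-character form, e.g. "420v".
/// Codes that are not printable ASCII (such as 32 for 32ARGB) are returned as decimal numbers.
func fourCCString(_ code: UInt32) -> String {
    // Split the 32-bit value into four bytes, most significant first.
    let bytes = [24, 16, 8, 0].map { UInt8((code >> UInt32($0)) & 0xFF) }
    let isPrintable = bytes.allSatisfy { (0x20...0x7E).contains($0) }
    return isPrintable ? String(decoding: bytes, as: UTF8.self) : String(code)
}

// With CoreVideo imported, usage would look like:
//   fourCCString(CVPixelBufferGetPixelFormatType(pixelBuffer))
let biPlanarVideoRange = fourCCString(0x3432_3076) // "420v", a bi-planar YUV 4:2:0 format
let bgra = fourCCString(0x4247_5241)               // "BGRA"
```

A result starting with "420" means a YUV format, as described above.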

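If you write code that touches the pixel data directly, YUV is awkward to work with and needs to be converted. Core Image abstracts those operations away, though, so the approach in this article did not require any pixel format conversion.

If you want the frames in a specific pixel format, configure AVCaptureVideoDataOutput as follows and the CVImageBuffer will be delivered in that format.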

output.videoSettings = [
    kCVPixelBufferPixelFormatTypeKey as String : kCVPixelFormatType_32BGRA
]

Note: the definitions of CVPixelBuffer and CVImageBuffer look like this, which shows that they are all the same type.

public typealias CVPixelBuffer = CVImageBuffer

public typealias CVImageBuffer = CVBuffer

public class CVBuffer {}

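Metal and the GPU

The ML Kit documentation includes sample code that composites an image with the mask generated by Selfie segmentation. Reading it, I found that it performs floating-point arithmetic on the value of each pixel in the CVImageBuffer and modifies the pixel data directly. When I implemented the virtual background based on it, the result was choppy video at about 6 FPS 🥺.
Manipulating a CVImageBuffer directly in Swift runs on the CPU, which is not suited to real-time video processing like this.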
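To put that number in perspective: at 6 FPS each frame takes roughly 167 ms, while 30 FPS leaves a budget of only about 33 ms per frame. A small meter like the sketch below can help confirm what frame rate your own pipeline reaches. FrameRateMeter is a hypothetical helper, not something from the original implementation.

```swift
/// Hypothetical helper: estimates frames per second over a sliding time window.
struct FrameRateMeter {
    private var timestamps: [Double] = []
    let window: Double

    init(window: Double = 1.0) {
        self.window = window
    }

    /// Records a frame timestamp in seconds and returns the current estimate (0 until two frames exist).
    mutating func record(timestamp: Double) -> Double {
        timestamps.append(timestamp)
        // Drop frames that fell out of the window.
        timestamps.removeAll { timestamp - $0 > window }
        guard let first = timestamps.first, timestamps.count > 1, timestamp > first else {
            return 0
        }
        // N timestamps span N - 1 frame intervals.
        return Double(timestamps.count - 1) / (timestamp - first)
    }
}
```

In processFrame you could feed it `CMSampleBufferGetPresentationTimeStamp(sampleBuffer).seconds` and log the returned value.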

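This is where Metal comes in. Metal is the graphics API for iOS/macOS. In other words, processing that you want to run on the GPU is written using Metal. (Note: OpenGL and the like appear to have been deprecated.)
That said, Core Image appears to use Metal internally, so using Core Image and CIFilter gives you GPU-backed image processing with little effort.

For truly "involved" graphics processing, you write shader functions in .metal files using the Metal Shading Language (which is based on C++), compile them, write Swift code that calls them, and so on.
For a case like this virtual background, however, writing Swift code that uses Core Image was enough, as described above.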
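If you want to be explicit about the GPU, a CIContext can be created from a Metal device. This is a minimal sketch under my own assumptions. makeCIContext is a hypothetical helper, and turning off cacheIntermediates is a choice that suits video, where each frame is rendered once.

```swift
import CoreImage
import Metal

/// Hypothetical helper: creates a Metal-backed CIContext, falling back to the default context.
func makeCIContext() -> CIContext {
    guard let device = MTLCreateSystemDefaultDevice() else {
        // No Metal device available (for example, some virtualized environments).
        return CIContext()
    }
    // Intermediate results are not reused between video frames, so caching them only costs memory.
    return CIContext(mtlDevice: device, options: [.cacheIntermediates: false])
}
```

Creating a CIContext is expensive, so create it once and reuse it for every frame, as the stored ciContext property does earlier in this article.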

Picking an image from the Photos app

Use UIImagePickerController. In SwiftUI, wrap it in a UIViewControllerRepresentable.

Implement UIImagePickerControllerDelegate.imagePickerController to receive the data as shown below. Then initialize a CIImage with CIImage(image: UIImage), CIImage(contentsOf: URL), CIImage(data: Data), or similar, and use it as the virtual background.

func imagePickerController(_ picker: UIImagePickerController, didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey : Any]) {
    let uiImage = info[.originalImage] as? UIImage
    let tmpFileURL = info[.imageURL] as? URL
    // Avoid force-unwrapping: `.imageURL` is not guaranteed to be present.
    let data = tmpFileURL.flatMap { try? Data(contentsOf: $0) }
}
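For SwiftUI, the wrapper could look roughly like the sketch below. PhotoPicker and its onPick callback are names I made up for illustration, and the sketch only handles the UIImage path.

```swift
import CoreImage
import SwiftUI
import UIKit

/// Hypothetical SwiftUI wrapper around UIImagePickerController.
struct PhotoPicker: UIViewControllerRepresentable {
    /// Called with the picked image, or nil if the user cancelled or the image could not be converted.
    let onPick: (CIImage?) -> Void

    func makeCoordinator() -> Coordinator {
        Coordinator(onPick: onPick)
    }

    func makeUIViewController(context: Context) -> UIImagePickerController {
        let picker = UIImagePickerController()
        picker.sourceType = .photoLibrary
        picker.delegate = context.coordinator
        return picker
    }

    func updateUIViewController(_ uiViewController: UIImagePickerController, context: Context) {
        // Nothing to update: the picker is configured once when it is created.
    }

    final class Coordinator: NSObject, UIImagePickerControllerDelegate, UINavigationControllerDelegate {
        private let onPick: (CIImage?) -> Void

        init(onPick: @escaping (CIImage?) -> Void) {
            self.onPick = onPick
        }

        func imagePickerController(_ picker: UIImagePickerController, didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
            // CIImage(image:) is failable, so flatMap collapses both failure cases into nil.
            let image = (info[.originalImage] as? UIImage).flatMap { CIImage(image: $0) }
            picker.dismiss(animated: true)
            onPick(image)
        }

        func imagePickerControllerDidCancel(_ picker: UIImagePickerController) {
            picker.dismiss(animated: true)
            onPick(nil)
        }
    }
}
```

It can then be presented from a view with something like `.sheet(isPresented: $isPicking) { PhotoPicker { image in /* hand it to the processor */ } }`, where isPicking is a state variable of your own.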

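The file system and persisting files

File System Basics, in Apple's documentation archive, describes the file system of the storage available to an app.
According to it, in addition to the app itself (AppName.app), there are the following directories:

Documents/
Library/
tmp/

As you can see by checking the URL, an image picked from the Photos app or similar is copied into tmp/ as an image file.
After the app exits, the system may delete the contents of tmp/, so files you want to keep for the next launch should be moved to Documents/, iCloud, Core Data (described below), or the like.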
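Copying a picked file out of tmp/ could look like the following sketch. persistCopy is a hypothetical helper, and it assumes that replacing an existing file with the same name is acceptable.

```swift
import Foundation

/// Hypothetical helper: copies a temporary file into `directory` and returns the new URL.
func persistCopy(of temporaryURL: URL, in directory: URL, fileManager: FileManager = .default) throws -> URL {
    try fileManager.createDirectory(at: directory, withIntermediateDirectories: true)
    let destination = directory.appendingPathComponent(temporaryURL.lastPathComponent)
    // copyItem fails if the destination exists, so remove any previous copy first.
    if fileManager.fileExists(atPath: destination.path) {
        try fileManager.removeItem(at: destination)
    }
    try fileManager.copyItem(at: temporaryURL, to: destination)
    return destination
}
```

For Documents/, the directory argument would be `FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]`.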

Persisting binary data with Core Data

Core Data is a framework for persisting the data an app needs. Internally it uses SQLite or similar.
Storing binary data such as images in an RDB is normally a bad idea, but setting the External Storage option on a binary property abstracts this away nicely and makes it easy to handle binary data through Core Data. (Reference)

In a SwiftUI application, this lets you persist binary data through ordinary Core Data access using @FetchRequest() and FetchedResults<>.
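The External Storage option is normally a checkbox in Xcode's model editor, but the same setting can be expressed in code. The sketch below builds a model programmatically. The entity name BackgroundImage and the attribute name imageData are assumptions for illustration, not names from the actual app.

```swift
import CoreData

/// Hypothetical model: one entity with a binary attribute that allows external storage.
func makeBackgroundImageModel() -> NSManagedObjectModel {
    let imageData = NSAttributeDescription()
    imageData.name = "imageData"
    imageData.attributeType = .binaryDataAttributeType
    imageData.isOptional = true
    // Code equivalent of the "Allows External Storage" checkbox in the model editor.
    imageData.allowsExternalBinaryDataStorage = true

    let entity = NSEntityDescription()
    entity.name = "BackgroundImage"
    entity.properties = [imageData]

    let model = NSManagedObjectModel()
    model.entities = [entity]
    return model
}

/// Hypothetical helper: stores image bytes as a new BackgroundImage object.
func saveBackgroundImage(_ data: Data, in context: NSManagedObjectContext) throws {
    let object = NSEntityDescription.insertNewObject(forEntityName: "BackgroundImage", into: context)
    object.setValue(data, forKey: "imageData")
    try context.save()
}
```

With this option set, Core Data decides for itself whether to keep the bytes in the store or in a separate file, and the calling code stays the same either way.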

Closing

As you may have guessed from the fact that this article is on this blog, we at Oro also develop iOS apps.
We're looking for engineers to build all kinds of front-end applications with us, iOS included! 🥳

Shigeki Suwa, 🛠 Front-end Web Developer
Twitter🐦 @ztrehagem
GitHub🐙 @ztrehagem