It depends entirely on whether you want simultaneous capture or whether you're open to SLAM-based systems that convert video/image input into point-cloud data.
At the simplest (and lowest-quality) end, you'd use one camera and a generic ML-based depth-estimation model to turn a single, flat photo of someone's face into a 3D STL file. A step up is two cameras in sync for a stereoscopic approach; depending on which Pi you get, you can connect more than one camera module (ten seconds on Google will sort that out for your specific board), or you can go the sensible route and grab a couple of USB webcams, since the full-size Pis have four USB ports. Or you could use any number of the SLAM-based tracking-and-mapping approaches out there.
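To give a feel for why the stereo route works: once the two images are rectified, the horizontal offset (disparity) of a feature between them maps directly to depth. A minimal sketch of that relationship, with entirely made-up example numbers for the focal length, baseline, and disparity:

```python
# Depth from disparity for a rectified stereo pair.
# All three values below are hypothetical, for illustration only;
# real values come from calibrating your actual camera rig.
focal_px = 800.0      # focal length in pixels
baseline_m = 0.06     # distance between the two cameras, in metres
disparity_px = 24.0   # horizontal pixel offset of a matched feature

# Closer objects shift more between the two views, so depth is
# inversely proportional to disparity.
depth_m = focal_px * baseline_m / disparity_px
print(depth_m)  # 2.0 (metres)
```

In practice you'd let something like OpenCV's stereo block-matching compute a disparity value per pixel and then apply this formula across the whole image, which is what gives you the point cloud.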
Though, given the proliferation of things like iPhones, the Kinect, etc., and the fact that Pis are extremely overpriced right now due to the chip shortage, the real question is: why do you want to do this with a Raspberry Pi specifically?