Technically this whole task can be accomplished with one query, but I have found that separating identification from evaluation produces better, more consistent results.

Define evaluation criteria and describe a schema to structure the output. It seems that GPT-4 reliably outputs valid JSON when prompted to do so.

The built-in Python JSON module will only deserialize a string containing a valid JSON document, so it can be used to validate the response.