Python - Apache Beam outputs an empty file using the Dataflow runner, works fine with the direct runner. Dataflow does not raise any errors
I've been trying to run this Apache Beam script. It runs nightly through an Airflow DAG and works perfectly fine that way, so I'm (reasonably) confident the script itself is correct. I think the relevant part of the script is this:
def configure_pipeline(p, opt):
    """Specify PCollection and transformations in pipeline."""
    read_input_source = beam.io.ReadFromText(opt.input_path,
                                             skip_header_lines=1,
                                             strip_trailing_newlines=True)
    _ = (p
         | 'Read input' >> read_input_source
         | 'Get prediction' >> beam.ParDo(PredictionFromFeaturesDoFn())
         | 'Save to disk' >> beam.io.WriteToText(
             opt.output_path, file_name_suffix='.ndjson.gz'))
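For context, PredictionFromFeaturesDoFn is shaped roughly like this (a heavily simplified sketch; the real prediction logic is omitted and the field names here are placeholders):

import json
import apache_beam as beam

class PredictionFromFeaturesDoFn(beam.DoFn):
    """Simplified sketch: parse one JSON line, emit one prediction record."""
    def process(self, element):
        # element is a single (decompressed) line of the input file.
        features = json.loads(element)
        # ... the real model scoring happens here ...
        prediction = {'id': features.get('id'), 'score': 0.0}  # placeholder
        yield json.dumps(prediction)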
And to execute the script, I run this:
python beam_process.py \
    --project=my-project \
    --region=us-central1 \
    --runner=DataflowRunner \
    --temp_location=gs://staging/location \
    --job_name=beam-process-test \
    --max_num_workers=5 \
    --input_path="gs://path/to/file/input-000000000000.jsonl.gz" \
    --output_path="gs://path/to/output"
The job runs in Dataflow with no errors, but the output file is completely empty; nothing is written beyond the file itself. Running this with the DirectRunner and local directories, the process runs as expected and I get fully usable outputs. I've tried different inputs as well as different Cloud Storage buckets. The only thing I can think of is a permissions problem I'm unaware of. I can post the Dataflow job details (or at least what I'm able to see of them) if needed.
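One check I can think of to rule out the read step silently producing nothing is a custom metrics counter inside the DoFn; a minimal sketch (the counter name lines_read is my own):

from apache_beam.metrics import Metrics

class PredictionFromFeaturesDoFn(beam.DoFn):
    def __init__(self):
        # Appears under "Custom counters" on the Dataflow job page; a value
        # of 0 would mean no input lines ever reached this DoFn.
        self.lines_read = Metrics.counter(self.__class__, 'lines_read')

    def process(self, element):
        self.lines_read.inc()
        # ... rest of the DoFn unchanged ...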
EDIT
For the few who end up seeing this: I ended up fixing it, but the reason is still unknown to me. Adding quotes around the entire input argument:
    --max_num_workers=5 \
    '--input_path="gs://path/to/file/input-000000000000.jsonl.gz"' \
    --output_path="gs://path/to/output"
allows Dataflow to read the input.
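If anyone wants to dig into the quoting behaviour, a quick sanity check is to print the parsed value before the pipeline is built (a sketch assuming the script parses its options with argparse):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_path')
parser.add_argument('--output_path')
opt, pipeline_args = parser.parse_known_args()

# With the extra shell quotes, the value passed to the program can contain
# literal double quotes, e.g. '"gs://path/to/file/input-000000000000.jsonl.gz"'.
print(repr(opt.input_path))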
Tags: python, google-cloud-platform, apache-beam, dataflow, gcs