PDF check
List of tasks
This event acquires a lock on the data before executing and renews it after each task completes.
The event runs the following tasks in order:
- Sessions and Materials Collection
- Contributions Data Collection
- Proceedings Data Object Creation
- Download of the Papers
- Papers Report
- Papers Validation
Once all tasks have finished, the event returns a list of dictionaries containing metadata and errors.
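The lock-and-renew flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `ExpiringLock`, `run_event`, the time-to-live value, and the shape of the task callables are all hypothetical names introduced here.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


class ExpiringLock:
    """Hypothetical in-memory lock with a time-to-live (stand-in for the real lock)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.expires_at = 0.0
        self.renewals = 0

    def acquire(self) -> bool:
        # Refuse the lock while a previous holder's lease is still valid.
        if time.monotonic() < self.expires_at:
            return False
        self.expires_at = time.monotonic() + self.ttl_seconds
        return True

    def renew(self) -> None:
        # Push the expiry forward so the next task runs under a fresh lease.
        self.expires_at = time.monotonic() + self.ttl_seconds
        self.renewals += 1

    def release(self) -> None:
        self.expires_at = 0.0


# Assumed task shape: an async callable that mutates a shared state dict.
Task = Callable[[dict[str, Any]], Awaitable[None]]


async def run_event(lock: ExpiringLock, tasks: list[Task]) -> dict[str, Any]:
    """Run the tasks in order, renewing the lock after each one completes."""
    if not lock.acquire():
        raise RuntimeError("event is already running")
    state: dict[str, Any] = {}
    try:
        for task in tasks:
            await task(state)
            lock.renew()
    finally:
        lock.release()
    return state
```

Renewing after every task keeps the lease short, so a crashed run frees the data quickly, while a long run never loses the lock between tasks.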
Sessions and Materials Collection
This task collects sessions and materials related to the conference based on the provided information. In summary, it starts two parallel subtasks:
- `download_sessions`: Retrieves sessions related to the event from Indico and appends them to the `sessions` list.
- `download_materials`: Retrieves materials associated with the event from Indico and appends them to the `materials` list.
The two lists are then returned by the task.
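A minimal sketch of the two parallel subtasks, assuming an `asyncio`-based runtime. The helper `fetch_from_indico` and its canned data are hypothetical stand-ins for the real Indico requests; only the subtask names come from the text above.

```python
import asyncio
from typing import Any


async def fetch_from_indico(resource: str) -> list[dict[str, Any]]:
    """Hypothetical stand-in for an Indico request; returns canned data."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    canned = {
        "sessions": [{"code": "MOA", "title": "Opening"}],
        "materials": [{"title": "Logo", "filename": "logo.png"}],
    }
    return canned[resource]


async def download_sessions(sessions: list[dict[str, Any]]) -> None:
    sessions.extend(await fetch_from_indico("sessions"))


async def download_materials(materials: list[dict[str, Any]]) -> None:
    materials.extend(await fetch_from_indico("materials"))


async def collect_sessions_and_materials() -> list[list[dict[str, Any]]]:
    sessions: list[dict[str, Any]] = []
    materials: list[dict[str, Any]] = []
    # Run both subtasks concurrently and wait until both have finished.
    await asyncio.gather(download_sessions(sessions), download_materials(materials))
    return [sessions, materials]
```

Each subtask appends to its own list, so the two never write to shared data while running side by side.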
Contributions Data Collection
This task collects contributions and files associated with a conference based on the provided information. In summary, it starts parallel `download_contributions` subtasks that retrieve the contributions from Indico and append them to the `contributions` list.
The list is then returned by the task.
Proceedings Data Object Creation
This task builds a proceedings object and returns it. In particular, for each contribution, it sets the flag `is_included_in_pdf_check` to `True` if and only if that contribution has the `green` or `yellow` state in Indico.
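The flag rule can be illustrated with a short sketch. The `Contribution` class below is a hypothetical, trimmed-down record, and the lowercase spelling of the state values is an assumption; only the flag name and the green/yellow rule come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spelling of the Indico states that qualify a contribution.
INCLUDED_STATES = {"green", "yellow"}


@dataclass
class Contribution:
    """Hypothetical, trimmed-down contribution record."""
    code: str
    state: Optional[str] = None
    is_included_in_pdf_check: bool = False


def flag_contributions(contributions: list[Contribution]) -> list[Contribution]:
    for contribution in contributions:
        # True if and only if the state is green or yellow; any other
        # state (or no state at all) resets the flag to False.
        contribution.is_included_in_pdf_check = contribution.state in INCLUDED_STATES
    return contributions
```

Assigning the result of the membership test, rather than only setting `True` on a match, is what makes the rule an "if and only if".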
Download of the Papers
This task is responsible for downloading contribution papers associated with the proceedings data object. Here's an overview of what happens in this function:
- Extracting Papers: It extracts a list of `FileData` objects representing contribution papers whose contributions have the `green` or `yellow` state.
- Download: For each file data, in parallel, a download subtask is started that retrieves the PDF file and caches it.
Once all files have been downloaded, the task returns a list containing two sublists: the first sublist contains the updated proceedings object, and the second sublist contains the references to the downloaded PDFs.
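The download-and-cache step might look like the sketch below. It assumes an `asyncio` runtime and a file-based cache keyed on the download URL; `fetch_bytes`, `CACHE_DIR`, and the fields of this `FileData` are hypothetical, since the real cache backend and data model are not described here.

```python
import asyncio
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Hypothetical cache location; the real cache backend is not described here.
CACHE_DIR = Path(tempfile.gettempdir()) / "pdf_check_cache"


@dataclass(frozen=True)
class FileData:
    """Hypothetical subset of the fields a FileData object might carry."""
    filename: str
    download_url: str


async def fetch_bytes(url: str) -> bytes:
    """Hypothetical stand-in for the HTTP download of one PDF."""
    await asyncio.sleep(0)
    return b"%PDF-1.7 placeholder for " + url.encode()


async def download_cached(file: FileData, cache_dir: Path) -> Path:
    # Key the cache on the URL so a repeated run reuses the stored file.
    key = hashlib.sha256(file.download_url.encode()).hexdigest()
    target = cache_dir / f"{key}.pdf"
    if not target.exists():
        target.write_bytes(await fetch_bytes(file.download_url))
    return target


async def download_papers(
    proceedings: dict, files: list[FileData], cache_dir: Path = CACHE_DIR
) -> list[list]:
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One download subtask per file, all running concurrently.
    paths = await asyncio.gather(*(download_cached(f, cache_dir) for f in files))
    # Two sublists: the proceedings object, then the references to the PDFs.
    return [[proceedings], list(paths)]
```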
Papers Report
This task operates on the previously created proceedings object. Here's an overview of what happens:
- Paper Selection: It builds a list of `ContributionPaperData` objects representing conference papers that meet the `green` or `yellow` state condition.
- Papers Processing: For each paper, a subtask is initiated to perform the following steps:
    - Data Collection: The subtask collects various information from the PDF, such as text content, page details, and font information.
    - Keyword Extraction: It extracts keywords from the PDF's text content.
    - Data Organization: All the gathered information is structured into an object.
- Proceedings Update: The proceedings object is updated by incorporating all the extracted information.
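The per-paper processing and the final update can be sketched as follows. Several simplifications are assumed: the PDF parsing step is replaced by a pre-extracted `raw` dictionary, keyword extraction is a naive match against a known vocabulary (the real extraction method is not described here), and `PaperReport`, `build_report`, and `report_papers` are hypothetical names.

```python
import asyncio
import re
from dataclasses import dataclass, field


@dataclass
class PaperReport:
    """Hypothetical container for what one subtask extracts from one PDF."""
    page_count: int
    page_sizes: list[tuple[float, float]]
    fonts: list[dict]
    keywords: list[str] = field(default_factory=list)


def extract_keywords(text: str, vocabulary: list[str]) -> list[str]:
    # Naive whole-word, case-insensitive match against a known vocabulary.
    return [
        word for word in vocabulary
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)
    ]


async def build_report(raw: dict, vocabulary: list[str]) -> PaperReport:
    # `raw` stands in for the output of a PDF parser (text, pages, fonts).
    return PaperReport(
        page_count=len(raw["pages"]),
        page_sizes=[(p["width"], p["height"]) for p in raw["pages"]],
        fonts=raw["fonts"],
        keywords=extract_keywords(raw["text"], vocabulary),
    )


async def report_papers(
    proceedings: dict, raw_by_code: dict[str, dict], vocabulary: list[str]
) -> dict:
    codes = list(raw_by_code)
    # One subtask per paper, run concurrently.
    reports = await asyncio.gather(
        *(build_report(raw_by_code[code], vocabulary) for code in codes)
    )
    # Attach each report to the contribution it belongs to.
    for code, report in zip(codes, reports):
        proceedings["contributions"][code]["report"] = report
    return proceedings
```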
Papers Validation
This task validates data based on certain criteria. Here's an overview of what it does:
- Initialization: It retrieves the expected PDF page width and height from the settings.
- Data Collection and Validation Loop: For each contribution in the proceedings data:
    - Metadata is extracted and stored.
    - Error checks are performed:
        - Page size is checked against the specified PDF page width and height.
        - Font embedding is verified.
    - If errors are detected, they are appended to the `errors` list along with contribution details.
- Results: The function returns a list containing both the collected metadata and any errors encountered during validation.
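The validation loop above can be sketched as follows. The settings keys (`pdf_page_width`, `pdf_page_height`), the size tolerance, the dictionary layout of each contribution, and the function names are all assumptions made for illustration; only the two checks and the metadata/errors result come from the text.

```python
from typing import Any

# Allowed deviation in points; an assumption, the real tolerance is not documented here.
TOLERANCE = 1.0


def validate_paper(report: dict[str, Any], width: float, height: float) -> list[str]:
    problems: list[str] = []
    # Check 1: every page must match the configured size.
    for number, page in enumerate(report["pages"], start=1):
        if (abs(page["width"] - width) > TOLERANCE
                or abs(page["height"] - height) > TOLERANCE):
            problems.append(f"page {number}: size {page['width']}x{page['height']}")
    # Check 2: every font must be embedded in the PDF.
    for font in report["fonts"]:
        if not font["embedded"]:
            problems.append(f"font not embedded: {font['name']}")
    return problems


def validate_proceedings(
    contributions: list[dict[str, Any]], settings: dict[str, float]
) -> list[list[dict[str, Any]]]:
    # Initialization: read the expected page size (hypothetical setting names).
    width = settings["pdf_page_width"]
    height = settings["pdf_page_height"]
    metadata: list[dict[str, Any]] = []
    errors: list[dict[str, Any]] = []
    for contribution in contributions:
        details = {"code": contribution["code"], "title": contribution["title"]}
        metadata.append({**details, "page_count": len(contribution["report"]["pages"])})
        problems = validate_paper(contribution["report"], width, height)
        if problems:
            # Errors are stored together with the contribution details.
            errors.append({**details, "errors": problems})
    return [metadata, errors]
```

Metadata is collected for every contribution, while an entry in the errors list appears only when at least one check fails.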