ubii.processing_modules.ocr package

Subpackages

Submodules

ubii.processing_modules.ocr.tesseract_ocr module

Consider setting your locale to C, e.g. by exporting LC_CTYPE=C, since Tesseract code sometimes segfaults if the locale is set incorrectly.

This seems to be related to https://github.com/sirfz/tesserocr/issues/165, which appears to be fixed in Tesseract but not in the Leptonica library (see https://github.com/DanBloomberg/leptonica/issues/591).
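The locale can also be switched from inside Python instead of exporting an environment variable. The following is a minimal sketch using only the standard library; it assumes that setting LC_CTYPE in-process before the OCR module is imported has the same effect as exporting LC_CTYPE=C, and the helper name force_c_locale is hypothetical.

```python
import locale


def force_c_locale() -> str:
    """Switch LC_CTYPE to the portable "C" locale and return the active setting.

    Hypothetical helper: call it before importing the OCR module so the
    native libraries never see a different locale.
    """
    locale.setlocale(locale.LC_CTYPE, "C")
    # Calling setlocale without a value only queries the current setting.
    return locale.setlocale(locale.LC_CTYPE)


active_locale = force_c_locale()
```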

class ubii.processing_modules.ocr.tesseract_ocr.BaseModule

Bases: ProcessingRoutine

All OCR modules inherit from this processing module. It supplies the basic functionality of loading images and transforming them, and defines the protobuf specs.

id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

name

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

authors

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.RepeatedField

tags

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.RepeatedField

description

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

node_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

session_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

status

Field of type Status – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

processing_mode

Field of type ProcessingMode – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

inputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.RepeatedField

outputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.RepeatedField

language

Field of type Language – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

on_processing_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

on_created_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

on_halted_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

on_destroyed_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine

Type:

proto.fields.Field

__init__(context, mapping: Dict[str, Any] = None, eval_strings: bool = False, api_variables: Dict[str, Any] = None, api_args: Dict[str, Any] = None, filter_empty_boxes: bool = False, ocr_confidence: int = 70, result_fmt: str | Callable = '{text}', **kwargs)

Create a processing module

Parameters:
  • context – Client context, contains broker constants definitions

  • mapping – passed to protobuf initialization

  • eval_strings – evaluate protobuf method definitions

  • api_args – arguments passed to TesserOCR API initialization

  • api_variables – mapping of variable names to values that are set in the API after initialization

  • filter_empty_boxes – controls whether bounding boxes in which no text was detected are filtered out of the result instead of being returned

  • ocr_confidence – cutoff for ocr detection, characters with lower confidence will be discarded

  • result_fmt – This format string or callable defines how the ‘id’ field of the result will be formatted. Set it to a custom callable with kwargs to see which values will be passed to the callable (normally ocr result, confidence etc.)

  • **kwargs – passed to protobuf initialization
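To illustrate the result_fmt parameter, here is a minimal sketch of how a format string and a callable could be applied to the same values. The helper format_result, the helper inspect_values, and the key confidence are assumptions made for illustration; only the text key is implied by the default '{text}'.

```python
from typing import Any, Callable, Union


def format_result(result_fmt: Union[str, Callable[..., str]], **values: Any) -> str:
    """Build an id string from a format string or a callable (hypothetical helper)."""
    if callable(result_fmt):
        # A callable receives every value as a keyword argument.
        return result_fmt(**values)
    # str.format ignores keyword arguments the template does not use.
    return result_fmt.format(**values)


def inspect_values(**kwargs: Any) -> str:
    """Callable that only reports which keys were passed in."""
    return ", ".join(sorted(kwargs))


# Assumed example values; the real module decides which keys are passed.
values = {"text": "EXIT", "confidence": 93}
default_id = format_result("{text}", **values)
custom_id = format_result("{text} ({confidence}%)", **values)
seen_keys = format_result(inspect_values, **values)
```

Passing a callable such as inspect_values is one way to discover which values are available before writing a custom format string.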

on_init(context) None

Create TesserOCR API instance

read_image(image: Image2D, conversion=bgr_conversions.get)

Read image data from the protobuf message and set the image data used by the API instance. The image used by the API instance is always a grayscale version of the input image, because Tesseract OCR performs better on grayscale images.

Parameters:
  • image – protobuf image

  • conversion – method to get conversions from the ubii.proto.Image2D.data_format of the passed image to some format which is recognized by cv2.cvtColor(). You can pass None to not convert the image.

Returns:

result of the conversion

log_performance(context)

Save some information about the module performance. It is logged when the module is destroyed, since logging during processing would decrease performance significantly.

property api: PyTessBaseAPI | None

Reference to the loaded TesserOCR API instance

property image: ub.Image2D | None

Reference to protobuf image if an image was received as input, else None

on_processing(context)

Load image from input, also save performance info

on_destroyed(context)

Unload API and log performance

ocr_in_box(box: Tuple[int, int, int, int], min_confidence=70) str

Perform OCR task using the tesseract api in a box inside the loaded image

Parameters:
  • box – region of interest

  • min_confidence – cutoff to discard the OCR result

Returns:

recognized text in box

static object_2d_from_box(box: Tuple[int, int, int, int]) Object2D

Create protobuf message from integer tuple

Parameters:
  • box – box definition as (x, y, width, height)

Returns:

Protobuf message representing the box

static to_image_space(dimensions: Tuple[int, int], box: Tuple[int, int, int, int]) Tuple[float, float, float, float]

Convert absolute pixel coordinates to image space (i.e. coordinate in range [0, 1])

Parameters:
  • dimensions – image dimensions

  • box – box coordinates

Returns:

box in image coordinates
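The conversion to image space can be sketched as a division by the image size. This is an illustration only: it assumes that dimensions is ordered as (width, height) and that the box uses the (x, y, width, height) layout described for the other helpers.

```python
from typing import Tuple


def to_image_space(
    dimensions: Tuple[int, int], box: Tuple[int, int, int, int]
) -> Tuple[float, float, float, float]:
    """Scale a pixel box into the range [0, 1] (sketch, not the module's code)."""
    width, height = dimensions  # assumed order: (width, height)
    x, y, w, h = box
    # Horizontal values are divided by the width, vertical ones by the height.
    return x / width, y / height, w / width, h / height


relative_box = to_image_space((200, 100), (50, 25, 100, 50))
```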

static box2rec(box: Tuple[int, int, int, int]) Tuple[int, int, int, int]

Convert box in x, y, width, height format to coordinates of bottom left and top right corner points

Parameters:
  • box – tuple of x, y, width, height

Returns:

tuple of x1, y1, x2, y2

static rec2box(rec: Tuple[int, int, int, int]) Tuple[int, int, int, int]

Convert box in coordinates of bottom left and top right corner point to x, y, width, height

Parameters:
  • rec – tuple of x1, y1, x2, y2

Returns:

tuple of x, y, width, height
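The two box conversions are inverses of each other, which a short sketch makes explicit. It assumes that (x1, y1) equals (x, y) and that (x2, y2) is obtained by adding width and height; the actual implementation may differ in detail.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def box2rec(box: Box) -> Box:
    """(x, y, width, height) -> (x1, y1, x2, y2), assuming x2 = x + width."""
    x, y, w, h = box
    return x, y, x + w, y + h


def rec2box(rec: Box) -> Box:
    """(x1, y1, x2, y2) -> (x, y, width, height), the inverse of box2rec."""
    x1, y1, x2, y2 = rec
    return x1, y1, x2 - x1, y2 - y1


corners = box2rec((10, 20, 30, 40))
round_trip = rec2box(corners)
```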

static padded(box: PixelBox, padding: int = 0, dimensions: Tuple[int, int] | None = None)

Expand box by padding in all directions

Parameters:
  • box – tuple of x, y, width, height

  • padding – number of pixels that should be padded around the box

  • dimensions – used for clipping

Returns:

box – tuple of x, y, width, height for expanded box
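Padding with optional clipping can be sketched as follows. The sketch assumes that dimensions is (width, height) and that clipping limits the expanded box to the range from 0 to the image size; how the module itself clips is not stated here.

```python
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]


def padded(box: Box, padding: int = 0, dimensions: Optional[Tuple[int, int]] = None) -> Box:
    """Grow a (x, y, width, height) box by padding pixels on every side (sketch)."""
    x, y, w, h = box
    # Work on corner coordinates so that clipping is a simple min/max.
    x1, y1 = x - padding, y - padding
    x2, y2 = x + w + padding, y + h + padding
    if dimensions is not None:
        width, height = dimensions  # assumed order: (width, height)
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, width), min(y2, height)
    return x1, y1, x2 - x1, y2 - y1


unclipped = padded((10, 10, 20, 20), padding=2)
clipped = padded((1, 1, 20, 20), padding=5, dimensions=(22, 30))
```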

class ubii.processing_modules.ocr.tesseract_ocr.TesseractOCR_PURE

Bases: BaseModule

This module uses pure Tesseract OCR functionality without preprocessing to extract text bounding boxes and perform OCR in them

id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

name

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

authors

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

tags

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

description

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

node_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

session_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

status

Field of type Status – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

processing_mode

Field of type ProcessingMode – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

inputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

outputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

language

Field of type Language – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_processing_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_created_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_halted_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_destroyed_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

class ubii.processing_modules.ocr.tesseract_ocr.TesseractOCR_MSER

Bases: BaseModule

This module uses the MSER algorithm to perform preprocessing and extract _character_ bounding boxes and performs OCR using Tesseract for the characters in those boxes

id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

name

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

authors

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

tags

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

description

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

node_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

session_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

status

Field of type Status – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

processing_mode

Field of type ProcessingMode – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

inputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

outputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

language

Field of type Language – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_processing_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_created_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_halted_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_destroyed_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

COLOR_ONLY_MSER_OPTS = ['min_diversity', 'max_evolution', 'area_threshold', 'min_margin', 'edge_blur_size']

These options are only usable with colored input images. If any of them are present in the dictionary passed as mser_args during initialization, the module will not use a grayscale conversion for the MSER algorithm, which results in longer processing times.
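The rule above amounts to a membership check on the keys of mser_args. The following sketch shows one way to express it; the function name use_grayscale is hypothetical, and combining the check with the mser_grayscale flag through a logical AND is an assumption.

```python
from typing import Any, Dict, Optional

COLOR_ONLY_MSER_OPTS = [
    "min_diversity",
    "max_evolution",
    "area_threshold",
    "min_margin",
    "edge_blur_size",
]


def use_grayscale(mser_grayscale: bool, mser_args: Optional[Dict[str, Any]]) -> bool:
    """Return True if MSER should run on a grayscale image (hypothetical helper)."""
    args = mser_args or {}
    # Any color-only option forces MSER to run on the colored image.
    needs_color = any(option in args for option in COLOR_ONLY_MSER_OPTS)
    return mser_grayscale and not needs_color


default_choice = use_grayscale(True, {"max_variation": 0.25})
color_choice = use_grayscale(True, {"max_variation": 0.25, "min_margin": 0.003})
```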

__init__(context, mser_grayscale: bool = True, padding=2, mapping=None, eval_strings=False, api_args=None, mser_args=None, **kwargs)

Create a MSER preprocessing module

Parameters:
  • mser_grayscale – if True, run MSER algorithm on grayscale version of input image

  • padding – the value used to pad all extracted regions before OCR is performed

  • mser_args – passed to cv2.MSER_create(), default {‘max_variation’: 0.25}

classmethod contained(boxes: Iterable[Tuple[int, int, int, int]]) Iterable[bool]

For each box in the passed boxes, compute whether it is completely contained inside another box in boxes

Parameters:
  • box – box that should be queried

  • boxes – List of boxes in x, y, width, height format

Returns:

one item per box; an item is True if the corresponding box is contained inside another box, else False
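A containment check of this kind can be sketched with a pairwise comparison. This is a plain quadratic illustration rather than the module's implementation; it assumes that touching edges count as contained, so two identical boxes mark each other as contained.

```python
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]


def contained(boxes: Iterable[Box]) -> List[bool]:
    """For each (x, y, width, height) box, report whether another box encloses it."""
    boxes = list(boxes)
    flags = []
    for i, (x, y, w, h) in enumerate(boxes):
        inside = any(
            # The other box must start before and end after this box on both axes.
            j != i and ox <= x and oy <= y and x + w <= ox + ow and y + h <= oy + oh
            for j, (ox, oy, ow, oh) in enumerate(boxes)
        )
        flags.append(inside)
    return flags


flags = contained([(0, 0, 10, 10), (2, 2, 3, 3), (20, 20, 5, 5)])
```

Such flags could be used to drop character regions that are already covered by a larger region before OCR is performed.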

class ubii.processing_modules.ocr.tesseract_ocr.TesseractOCR_EAST

Bases: BaseModule

This module uses the EAST algorithm to preprocess the image and extract text bounding boxes, then uses Tesseract for OCR inside the boxes

id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

name

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

authors

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

tags

RepeatedField of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

description

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

node_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

session_id

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

status

Field of type Status – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

processing_mode

Field of type ProcessingMode – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

inputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

outputs

RepeatedField of type ModuleIO – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.RepeatedField

language

Field of type Language – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_processing_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_created_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_halted_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

on_destroyed_stringified

Field of type STRING – inherited from ProcessingModule – inherited from ProcessingRoutine – inherited from BaseModule

Type:

proto.fields.Field

__init__(context, mapping=None, eval_strings=False, api_args=None, nms_threshold: float = 0.5, min_detection_confidence: float = 0.7, merge_bounding_boxes: bool = False, **kwargs)

Parameters:
  • context – Client context, contains broker constants definitions

  • mapping – passed to protobuf initialization

  • eval_strings – evaluate protobuf method definitions

  • api_args – arguments passed to TesserOCR API initialization

  • nms_threshold – Threshold for non-maximum suppression of detected bounding boxes

  • min_detection_confidence – Minimum confidence for EAST detection

  • merge_bounding_boxes – determines if non-suppressed boxes get merged

  • **kwargs – passed to BaseModule initialization
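To show how nms_threshold and min_detection_confidence interact, here is a minimal pure-Python sketch of greedy non-maximum suppression. It is not the module's implementation, which may rely on a library routine instead; the function names, the intersection-over-union overlap measure, and the (x, y, width, height) box layout are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box],
    scores: Sequence[float],
    nms_threshold: float = 0.5,
    min_detection_confidence: float = 0.7,
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Step 1: drop detections below the confidence cutoff.
    candidates = [i for i, s in enumerate(scores) if s >= min_detection_confidence]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Step 2: keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.95, 0.3]
kept = non_max_suppression(boxes, scores)
```

In this example the last box is removed by the confidence cutoff and the second box is suppressed because it overlaps the first one by more than the threshold. Merging of the remaining boxes, as controlled by merge_bounding_boxes, is not covered by the sketch.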

property nms_threshold

Threshold for non-maximum suppression of the bounding boxes detected by EAST

property min_detection_confidence

Confidence cutoff for detection