
Advanced Guide: Avoiding Max Character Limits on the Microsoft Translator API by Auto-Batching Inputs | by Yousef Nami | Jan, 2023



Photo from Unsplash courtesy of Edurne.

The Microsoft Translator API [1] is one of the easiest translation services to set up, and it’s quite powerful, giving you free access to translators for a multitude of low- and high-resource languages. However, it has a 50000 max character limit per request [2] and a 2 million max character limit per hour of use (on the free tier). So while the service itself is very easy to set up and use, using it reliably is quite difficult.

In this article, I will explore methods for ensuring that:

  • Any arbitrary number of texts can be translated to any number of target languages while adhering to the max character limit, by auto-batching inputs
  • Consecutive requests are delayed such that they adhere to the max character limit per hour

At worst (free subscription), these methods will decrease wasted characters on partial translations and at best (paid subscription) they’ll save you cash.

The tutorial is targeted towards those with working knowledge of the Microsoft Translator API. If you are unfamiliar and would like a beginner-friendly guide on setting it up, check out my introduction article below:

The original translation code I wrote was simply a wrapper over requests that took a list of texts, a list of target languages, and optionally a source language as parameters, and called the translate endpoint of the translator API. The code snippet of the translation function is shown below:

def translate_text(
        text: Union[str, list],
        target_language: Union[str, list],
        source_language: Optional[str] = None,
        api_version: str = '3.0') -> tuple:
    """translates text using the Microsoft Translate API

    :param text: text to be translated. Either single or multiple (stored in a list)
    :param target_language: ISO format of target translation languages
    :param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
    :param api_version: api version to use, defaults to "3.0"
    :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
    """

    url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

    if isinstance(target_language, str):
        target_language = [target_language]

    # extend url with array parameters, e.g. f"{url}&to=de&to=ru"
    url = add_array_api_parameters(
        url,
        param_name='to',
        param_values=target_language
    )

    if source_language:
        url = f'{url}&from={source_language}'

    if isinstance(text, str):
        text = [text]

    body = [{'text': text_} for text_ in text]

    LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
    resp = requests.post(url, headers=HEADERS, json=body)
    status_code = resp.status_code

    if is_request_valid(status_code):
        return resp.json(), status_code

    return resp.text, status_code
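
For context, a call to this wrapper might look as follows. This is a minimal sketch: it assumes MICROSOFT_TRANSLATE_URL, HEADERS and the helper functions (add_array_api_parameters, is_request_valid, LOGGER) have already been set up with a valid subscription key, as in the introduction article.

translations, status_code = translate_text(
    text=["Hi this is a dummy text", "Oh no dummy text"],
    target_language=["de", "ru"],
)

# a successful response is a list with one entry per input text, roughly:
# [
#   {"translations": [{"text": "...", "to": "de"}, {"text": "...", "to": "ru"}]},
#   {"translations": [{"text": "...", "to": "de"}, {"text": "...", "to": "ru"}]},
# ]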

Upon use, I ran into the following challenges that caused my translation requests to fail:

  • Challenge 1: The total size of the texts exceeded the max character limit when translated to all target languages
  • Challenge 2: At least one of my texts exceeded the max character limit when translated to all target languages
  • Challenge 3: At least one of my texts exceeded the max character limit
  • Challenge 4: The total size of my texts was negligible but because of very frequent calls to the endpoint I was hitting a max number of requests limit

The next section will run through a first implementation of a translation function to deal with the 4 challenges above.

Challenge 1 — Too Many Texts, Too Many Languages

The first issue occurs when you have a list of texts such that no single text exceeds the max character limit when translated to its target languages, but the texts together do.

To put it concretely, this means that you can’t send a single request of n texts whose total size S is greater than 50000. From Microsoft’s documentation [2], we know that the size of a text is calculated as the number of characters in the text multiplied by the number of languages it’s being translated to. In Python:

texts = ["Hi this is a dummy text", "Oh no dummy text"]
target_languages = ["de", "ru"]
n_target_langs = len(target_languages)
text_sizes = [len(text) for text in texts]
total_size = sum(text_sizes) * n_target_langs
assert total_size <= 50000

As such, we need a function to bucket all input texts into batches whose total size does not exceed 50000. My first implementation resulted in the following code:

from typing import Dict, List, Union

def batch_by_size(sizes: Dict[int, int], limit: int, sort_docs: bool = False) -> List[Dict[str, Union[int, List[int]]]]:
    """Given a size mapping such as {document_id: size_of_document}, batches documents such that the total size of a batch of documents does not exceed a pre-specified limit

    :param sizes: mapping that gives document size for each document_id
    :param limit: size limit for each batch
    :param sort_docs: if True sorts `sizes` in descending order
    :return: [{'idx': [ids_for_batch], 'total_size': total_size_of_documents_in_batch}, ...]

    Example:
    >>> documents = ['Joe Smith is cool', 'Django', 'Hi']
    >>> sizes = {i: len(doc) for i, doc in enumerate(documents)}
    >>> limit = 10
    >>> batch_by_size(sizes, limit)
    [{'idx': [0], 'total_size': 17}, {'idx': [1, 2], 'total_size': 8}]
    """
    if sort_docs:
        sizes = {key: size for key, size in sorted(sizes.items(), key=lambda x: x[1], reverse=True)}

    batched_items = []
    sizes_iter = iter(sizes)
    key = next(sizes_iter)
    while key is not None:
        if not batched_items:
            batched_items.append({
                'idx': [key],
                'total_size': sizes[key]
            })
        else:
            size = sizes[key]
            if size > limit:
                LOGGER.warning(f'Document {key} exceeds max limit size: {size}>{limit}')
            total_size = batched_items[-1]['total_size'] + size
            if total_size > limit:
                batched_items.append({
                    'idx': [key],
                    'total_size': size
                })
            else:
                batched_items[-1]['idx'].append(key)
                batched_items[-1]['total_size'] = total_size
        key = next(sizes_iter, None)

    return batched_items
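
To make the behaviour concrete, here is a small usage sketch (the sizes already include the target-language multiplier, matching how the function is used below):

texts = ["Hi this is a dummy text", "Oh no dummy text", "Hello!"]
target_languages = ["de", "ru"]
n_target_langs = len(target_languages)

# size of each text = character count * number of target languages
sizes = {i: len(text) * n_target_langs for i, text in enumerate(texts)}  # {0: 46, 1: 32, 2: 12}

batches = batch_by_size(sizes, limit=50)
# text 0 gets its own batch (adding text 1 would give 78 > 50),
# texts 1 and 2 share a batch (32 + 12 = 44 <= 50):
# [{'idx': [0], 'total_size': 46}, {'idx': [1, 2], 'total_size': 44}]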

The translate_text function from above can now be modified to include this functionality:

MAX_CHARACTER_LIMITS = 50000

def translate_text(
        text: Union[str, list],
        target_language: Union[str, list],
        source_language: Optional[str] = None,
        api_version: str = '3.0') -> tuple:
    """translates text using the Microsoft Translate API

    :param text: text to be translated. Either single or multiple (stored in a list)
    :param target_language: ISO format of target translation languages
    :param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
    :param api_version: api version to use, defaults to "3.0"
    :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
    """

    url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

    if isinstance(target_language, str):
        target_language = [target_language]

    # extend url with array parameters, e.g. f"{url}&to=de&to=ru"
    url = add_array_api_parameters(
        url,
        param_name='to',
        param_values=target_language
    )

    if source_language:
        url = f'{url}&from={source_language}'

    if isinstance(text, str):
        text = [text]

    ### start of new code --------------------------------------------
    n_target_langs = len(target_language)

    # get the size of each document (len(text) * number of target languages)
    sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
    batched_texts = batch_by_size(sizes_dict, MAX_CHARACTER_LIMITS)

    # for each batch, translate the texts. If successful append to list
    translation_outputs = []
    for batch in batched_texts:
        doc_ids = batch['idx']
        batch_texts = [text[doc_id] for doc_id in doc_ids]
        body = [{'text': text_} for text_ in batch_texts]
        LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
        resp = requests.post(url, headers=HEADERS, json=body)
        status_code = resp.status_code
        if not is_request_valid(status_code):
            raise Exception(f'Translation failed for texts {doc_ids}')

        translation_output = resp.json()
        translation_outputs += translation_output

    return translation_outputs, status_code
    ### end of new code ----------------------------------------------

Note some key considerations that were made:

  • A failure in translating a single batch would fail the whole request
  • The batching function is O(n)
  • The translation loop is O(n) in the worst case (e.g. each text size hits the max limit). Realistically it will be less, however the algorithm will not try to minimise the number of batches, since it only linearly loops through the texts

Challenge 2 — Single Text, Too Many Languages

While the above code works for most cases, you will inevitably run into a case where a single text alone does not exceed the max character limit, but would if translated to all target languages. In Python:

max_limit = 10
text = "Hello!"
text_size = len(text) # 6
target_languages = ["de", "ru"]
n_target_langs = len(target_languages)
total_size = text_size * n_target_langs # 12
assert total_size <= max_limit # 12 <= 10, raises error!

In this case, the problematic text must be translated separately for a batch of target languages, and at worst, for each language separately. At this point though, we realise that there are two batching strategies that we can consider:

  • Batching where size is defined by len(text) * n_target_langs: batch items with the solution for Challenge 1. For texts whose size S = len(text) * n_target_langs > max_limit, further batch by target language (a minimal sketch of this follows after the note below)
  • Batching where size is defined by the optimal combination of text lengths and target languages: batch texts such that each batch can have an arbitrary number of languages associated with it, provided that sum(texts_in_batch) * n_target_langs_for_batch <= max_limit

Below is a pictorial representation of an example where the second method leads to a solution with fewer batches overall:

While the second batching strategy is an interesting algorithm design problem, in this use case its problems outweigh any possible benefits:

  • Firstly, Microsoft’s limits are based on the total characters consumed, and not on the number of requests. So the only theoretical benefit from decreasing the number of requests is speed
  • Secondly, it is hard to design an efficient algorithm for it, so any speed gains that you could get from shaving down the number of requests are negligible, if at all existent
  • Thirdly, and most importantly, breaking up texts by target language makes mapping them back very difficult. If there is a failure during the translation process you will have partially translated texts*

*Note: While you will also run into this problem if you use the first batching strategy, the frequency of such failures will be minimal. For most cases that I can think of, it’s more important to have complete translations (i.e. texts translated to all target languages) while dropping a small number of failed ones than to have partial translations (i.e. texts translated to a subset of the target languages) for all texts. It is also easier to re-translate the failed examples because you can extract the IDs of the failed texts, whereas filling gaps in partial translations is much more difficult.
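
Before modifying the function, here is a minimal sketch of the idea behind the first strategy: for a text that is too large to be sent to all target languages at once, work out how many languages fit into a single request and slice the language list accordingly (the values below are purely illustrative):

max_limit = 50000
text = "a" * 30000  # a single long (illustrative) text
target_languages = ["de", "ru", "fr", "es", "it"]
n_target_langs = len(target_languages)

assert len(text) <= max_limit                  # the text alone fits
assert len(text) * n_target_langs > max_limit  # but not for all 5 languages at once

# how many languages can we request at a time for this text?
langs_per_request = max_limit // len(text)  # 1 in this example
for start in range(0, n_target_langs, langs_per_request):
    lang_subset = target_languages[start: start + langs_per_request]
    print(lang_subset)  # one request per subset: ['de'], ['ru'], ...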

As such, I continued with the first batching strategy. This resulted in some modifications of the translation function:

MAX_CHARACTER_LIMITS = 50000

def translate_text(
        text: Union[str, list],
        target_language: Union[str, list],
        source_language: Optional[str] = None,
        api_version: str = '3.0') -> tuple:
    """translates text using the Microsoft Translate API

    :param text: text to be translated. Either single or multiple (stored in a list)
    :param target_language: ISO format of target translation languages
    :param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
    :param api_version: api version to use, defaults to "3.0"
    :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
    """

    url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

    if isinstance(target_language, str):
        target_language = [target_language]

    if source_language:
        url = f'{url}&from={source_language}'

    if isinstance(text, str):
        text = [text]

    n_target_langs = len(target_language)

    # get the size of each document (len(text) * number of target languages)
    sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
    batched_texts = batch_by_size(sizes_dict, MAX_CHARACTER_LIMITS)

    # for each batch, translate the texts. If successful append to list
    translation_outputs = []
    for batch in batched_texts:
        doc_ids = batch['idx']
        total_size = batch['total_size']
        ### start of new code --------------------------------------------

        # case where single doc too big to be translated to all languages,
        # but small enough that it can be translated to some languages
        if total_size > MAX_CHARACTER_LIMITS:
            translation_output = [dict(translations=[])]
            # necessarily only single document in batch; divide by n_target_langs
            # to recover the raw character length of the text
            doc_size = sizes_dict[doc_ids[0]] // n_target_langs
            batch_size = MAX_CHARACTER_LIMITS // doc_size  # languages per request
            batch_range = range(0, n_target_langs, batch_size)
            n_batches = len(batch_range)

            # batch by target languages
            for batch_id, start_lang_idx in enumerate(batch_range):
                end_lang_idx = start_lang_idx + batch_size
                target_languages_ = target_language[start_lang_idx: end_lang_idx]

                # rebuild the url for subset of langs
                url_ = add_array_api_parameters(
                    url,
                    param_name='to',
                    param_values=target_languages_
                )
                body = [{'text': text[doc_ids[0]]}]
                LOGGER.debug(f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_ids[0]}. Target languages: {target_languages_}')
                resp = requests.post(url_, headers=HEADERS, json=body)
                status_code = resp.status_code
                if not is_request_valid(status_code):
                    raise Exception(f'Translation failed for texts {doc_ids[0]}')
                partial_translation_output = resp.json()
                # concatenate outputs in correct format (single text in body,
                # so take the first element of the response)
                translation_output[0]['translations'] += partial_translation_output[0]['translations']

        else:
            # -- code as before, except translation_output now part of else
            batch_texts = [text[doc_id] for doc_id in doc_ids]
            body = [{'text': text_} for text_ in batch_texts]
            LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
            # rebuild url for all languages
            url_ = add_array_api_parameters(
                url,
                param_name='to',
                param_values=target_language
            )
            resp = requests.post(url_, headers=HEADERS, json=body)
            status_code = resp.status_code
            if not is_request_valid(status_code):
                raise Exception(f'Translation failed for texts {doc_ids}')

            translation_output = resp.json()
        ### end of new code ----------------------------------------------
        translation_outputs += translation_output

    return translation_outputs, status_code

As before, there are some key considerations here:

  • Still, we fail the entire translation if any single request fails. This is partially motivated by the fact that we don’t know how to resolve the status codes for partial translations, as our output expects a single status code.
  • The batching function is still O(n)
  • The translation loop has theoretically increased in computational complexity. In the worst case we now have O(n*k) where k is the number of languages. However, unlike before, we can now deal with cases where a text is large enough that it can’t be translated to all target languages in a single request, but small enough that it can be translated to subsets of the target languages as separate requests.

Challenge 3 — Single Large Text

We now reach the inevitable edge case: a single text that has more than 50000 characters. In this case, none of our previous batching methods work. The only thing we can consider doing is automatically splitting the text prior to translation. However, I decided to avoid this for the following reasons:

  • Single Responsibility Principle: the purpose of the translation function is to act as a wrapper over the Microsoft Translator API. In my view, any extra processing should be done prior to translation.
  • Increased Code Complexity: adding support for sentence splitting requires the use of sentence splitting libraries, which add unnecessary dependencies for a function whose role is to be a wrapper for the Microsoft API. It would also require lots of code for ensuring sentence splitting methods all return the same output format, ground truth labels prior to sentence splitting are preserved, and to enable split sentences to be mapped back together after translation.
  • Increased Output Ambiguity: the choice of sentence splitter could vary a lot. These could be naïve, classical or neural methods. The choice of sentence splitting method would inevitably impact the quality of the translations, and would raise ambiguities in mapping translated split sentences back together. In principle, sentence splitting should happen outside the translation function, be evaluated using some metric to determine split quality, and only then should the splits be translated.

Due to the above, for now I decided to log an error if such a text is encountered. Later we’ll look into how this fits in with the rest of the code when trying to resolve ambiguity in the case of partially failed translations.

        if total_size > MAX_CHARACTER_LIMITS:
            # -- start of new code
            # necessarily only single document in batch; recover its raw character length
            doc_size = sizes_dict[doc_ids[0]] // n_target_langs
            batch_size = MAX_CHARACTER_LIMITS // doc_size
            if not batch_size:
                msg = f'Text `{doc_ids[0]}` too large to be translated'
                if raise_error_on_translation_failure:
                    raise Exception(msg)
                LOGGER.error(msg)
            else:
                batch_range = range(0, n_target_langs, batch_size)
                translation_output = [dict(translations=[])]
            # -- end of new code

Challenge 4 — Too Many Requests

Finally, most cases that you encounter will not be edge cases. They will all fit within the max character limit. However, the issue you’ll run into is trying to call the translate API too many times in a short amount of time.

In order to avoid this, Microsoft recommends that your requests not exceed 2 million characters per hour, or roughly 33,300 per minute. We can delay each request just long enough to stay under this threshold using the following function:

import time

MAX_CHARACTER_LIMITS_PER_HOUR = 2000000

def _sleep(time_of_last_success, request_size):
    if time_of_last_success:
        time_diff_since_first_request = time.time() - time_of_last_success
        time_diff_needed_for_next_request = request_size / (MAX_CHARACTER_LIMITS_PER_HOUR / 3600)
        sleep_time = time_diff_needed_for_next_request - time_diff_since_first_request
        if sleep_time > 0:
            LOGGER.debug(f'Sleeping {sleep_time:.3g} seconds...')
            time.sleep(sleep_time)

Each time we make a successful request, we need to update the value of time_of_last_success (which is initially set to None). The request_size is the size of the request we’re currently making, e.g. total characters in the batch * num_languages. In my experience, this method helps avoid overloading the Microsoft API, however it is not optimal: I have managed to run translations without sleeping and without overloading the server. But without knowing exactly how Microsoft calculates the request limits, it is difficult to design a solution that is optimal for speed.
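
To illustrate how this slots into the request loop, here is a rough sketch. make_body is a hypothetical helper that builds the request body for a batch, and url, batched_texts, HEADERS and is_request_valid are assumed from the code above.

import time

time_of_last_success = None  # no successful request yet

for batch in batched_texts:
    request_size = batch['total_size']  # characters * number of target languages
    _sleep(time_of_last_success, request_size)

    resp = requests.post(url, headers=HEADERS, json=make_body(batch))
    if is_request_valid(resp.status_code):
        time_of_last_success = time.time()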

In the previous section, we designed a Proof-of-Concept solution that improved over our simple translation wrapper. We can now automatically:

  • Batch documents such that the total size of each batch does not exceed the max limit
  • Batch documents by language for cases where a single document is too large to be translated to all target languages
  • Get the IDs of documents that cannot be translated because they are larger than the max limit
  • Avoid max number of requests limits by slowing down requests based on the total size of requests we’ve made in a single session

However, we still had some unanswered questions and inefficient methods. In this section, we will introduce three enhancements to finalise the translation function:

  • Enhancement 1: decreasing the number of requests made by improving the batching functionality
  • Enhancement 2: defining a strategy for resolving partial translations
  • Enhancement 3: refactoring our code into reusable code blocks

Enhancement 1 — Decreasing the Number of Requests Made

Our previous method of batching the data was fast, but not the best for minimising the number of batches. Instead, we modify it so that each document is assigned to the batch that minimises the difference between the total batch size and the max character limit.

def batch_by_size_min_buckets(sizes: Dict[Union[int, str], int], limit: int, sort_docs: bool = True) -> List[Dict[str, Union[int, List[int]]]]:
    """Given a dictionary of documents and their sizes {doc_id: doc_size}, batch documents such that the total size of each batch <= limit. Algorithm designed to decrease the number of batches, but does not guarantee an optimal fit

    :param sizes: mapping that gives document size for each document_id, {doc_1: 10, doc_2: 20, ...}
    :param limit: size limit for each batch
    :param sort_docs: if True sorts `sizes` in descending order
    :return: [{'idx': [ids_for_batch], 'total_size': total_size_of_documents_in_batch}, ...]

    Example:
    >>> documents = ['Joe Smith is cool', 'Django', 'Hi']
    >>> sizes = {i: len(doc) for i, doc in enumerate(documents)}
    >>> limit = 10
    >>> batch_by_size_min_buckets(sizes, limit)
    [{'idx': [0], 'total_size': 17}, {'idx': [1, 2], 'total_size': 8}]
    """
    if sort_docs:
        sizes = {key: size for key, size in sorted(sizes.items(), key=lambda x: x[1], reverse=True)}

    batched_items = []
    sizes_iter = iter(sizes)
    key = next(sizes_iter)  # doc_id

    # -- helpers
    def _add_doc(key):
        batched_items.append({
            'idx': [key],
            'total_size': sizes[key]
        })

    def _append_doc_to_batch(batch_id, key):
        batched_items[batch_id]['idx'].append(key)
        batched_items[batch_id]['total_size'] += sizes[key]

    while key is not None:

        # initial condition
        if not batched_items:
            _add_doc(key)
        else:
            size = sizes[key]

            if size > limit:
                LOGGER.warning(f'Document {key} exceeds max limit size: {size}>{limit}')
                _add_doc(key)
            else:
                # find the batch that fits the current doc best
                batch_id = -1
                total_capacity = limit - size  # how much we can still fit
                min_capacity = total_capacity
                for i, batched_item in enumerate(batched_items):
                    total_size = batched_item['total_size']
                    remaining_capacity = total_capacity - total_size  # we want to minimise this

                    # current batch too large for doc, go to next batch
                    if remaining_capacity < 0:
                        continue
                    # current batch is a better fit for doc, save batch_id
                    elif remaining_capacity < min_capacity:
                        min_capacity = remaining_capacity
                        batch_id = i

                    # if perfect fit, break loop
                    if remaining_capacity == 0:
                        break

                if batch_id == -1:
                    _add_doc(key)
                else:
                    _append_doc_to_batch(batch_id, key)

        key = next(sizes_iter, None)
    return batched_items

Note some comparisons with the naïve method:

  • The batching function is now O(n²) at worst. While this may sound alarming, when we think of the whole system we notice that the main bottleneck in terms of time will come from requests, so we can justify O(n²) batching if it buys us a substantial decrease in the number of batches
  • While this is a substantial improvement over the naïve method for decreasing the number of batches, it is still an approximate solution
  • I added an optional sorting parameter sort_docs to sort the sizes in descending order. My hunch is that setting sort_docs=True should decrease the number of batches.

For an illustration of why this is still an approximate solution, see the figure below:

It’s worth noting that a random (unsorted) order can in theory result in the optimal solution. In this case, if the texts were organised in the following order [“Hello World”, “Bye”, “Hi”, “Hi World”, “Cry”, “Halo”] (as an example) then the algorithm without sorting would have led to the optimal solution. However, we will see that on average, sorting leads to solutions with fewer batches!
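
As a small experiment, you can run the two batching functions defined in this article side by side on that example. The limit of 12 below is arbitrary (with other limits the two methods may tie); with this particular limit the naive method ends up with 4 batches and the best-fit method with 3:

documents = ["Hello World", "Bye", "Hi", "Hi World", "Cry", "Halo"]
sizes = {i: len(doc) for i, doc in enumerate(documents)}
limit = 12

naive = batch_by_size(sizes, limit)                                  # 4 batches
best_fit = batch_by_size_min_buckets(sizes, limit, sort_docs=True)   # 3 batches
print(len(naive), len(best_fit))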

To get a better understanding of how sorting affects the algorithm, I ran a few simulations, here are some results:

  • How Does Sorting Affect Time Complexity?

The figure below shows the average time taken (over 50 iterations) for the four batching strategies we have (quadratic and naive, each with and without sorting). As expected, sorting slows down batching by a tiny bit. However, as argued earlier in the article, our main bottleneck in terms of time comes from requests, not batching, and since the difference in time between the methods is not massive, we can justify sorting if it decreases the number of batches.

  • How Does Sorting Affect The Number of Batches?

The figure below shows two important things:

  1. The quadratic batching algorithm substantially decreases the number of batches compared to the naive algorithm
  2. Sorting decreases the number of batches for both the naive and quadratic algorithms

The first point shows the impact of using quadratic batching. A typical response from the Microsoft Translator has a latency of 150 milliseconds [2]. If we consider the smallest gap between the naive and quadratic algorithms at 300 documents (quadratic batching unsorted vs. naive batching sorted), we have a difference of 40 batches. This means that in effect the quadratic algorithm (in the worst case) saves us 6 seconds on requests alone. Comparing this with the time complexity of the batching algorithm, we find that the O(n²) of the quadratic algorithm has a negligible impact on overall translation speed. The other hidden advantage, of course, is that we have 40 fewer chances of request failures!

The second point is interesting, for it empirically supports my theory that sorting decreases the number of batches. Unfortunately I don’t have a mathematical proof for this, so the empirical results will have to suffice for now.

  • How Does Sorting Affect The Average Batch Size?

Naturally, following the number-of-batches trend, we expect the quadratic algorithms to have a higher average batch size than the naive ones, and sorting to increase the batch size for each algorithm respectively. This is reflected empirically in the figure below. What’s interesting to note, though, is that the average batch size seems to converge for all cases. We can also see that the sorted quadratic algorithm converges to the max character limit as the number of documents approaches infinity.

Enhancement 2 — Resolving Partial Translations

Previously, we automatically failed the entire translation process on encountering a single failure. However, this can be wasteful in cases where most translation requests have already succeeded, or in cases where there is no hard requirement on having a 100% success translation rate. The behaviour of the translate function should therefore be defined by the user. Here are a few notable cases that we should cover:

  • Case 1: Fail the entire process if there is any failure
  • Case 2: Ignore failures from the output, and remove partial translations
  • Case 3: Ignore failures from the output, but keep partial translations

To achieve these cases, we add two boolean parameters: raise_error_on_translation_failure and include_partials_in_output. The first will raise an error any time there is a failure (e.g. at batch level, or at a language level) if set to True (if False, we still log the error!). This is to cover Case 1. The second is only relevant when raise_error_on_translation_failure=False, and it will keep partial translations if set to True, and remove them if set to False. The code for this is shown below:

CHARACTER_LIMITS = 50000

def translate_text(
        text: Union[str, list],
        target_language: Union[str, list],
        source_language: Optional[str] = None,
        api_version: str = '3.0',
        raise_error_on_translation_failure=True,
        include_partials_in_output=False) -> tuple:
    """translates text using the Microsoft Translate API

    :param text: text to be translated. Either single or multiple (stored in a list)
    :param target_language: ISO format of target translation languages
    :param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
    :param api_version: api version to use, defaults to "3.0"
    :param raise_error_on_translation_failure: if `True`, raises errors on translation failure. If `False` ignores failed translations in output
    :param include_partials_in_output: if `True` includes partially translated texts in output, otherwise ignores them
    :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
    """

    url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

    if isinstance(target_language, str):
        target_language = [target_language]

    if source_language:
        url = f'{url}&from={source_language}'

    if isinstance(text, str):
        text = [text]

    n_target_langs = len(target_language)

    sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
    batched_texts = batch_by_size_min_buckets(sizes_dict, CHARACTER_LIMITS)

    translation_outputs = []
    for batch in batched_texts:
        translation_output = []
        doc_ids = batch['idx']
        total_size = batch['total_size']

        if total_size > CHARACTER_LIMITS:
            # single document in batch; recover its raw character length
            doc_size = sizes_dict[doc_ids[0]] // n_target_langs
            batch_size = CHARACTER_LIMITS // doc_size

            if not batch_size:
                msg = f'Text `{doc_ids[0]}` too large to be translated'
                # -- new code ------------------------------------------
                if raise_error_on_translation_failure:
                    raise Exception(msg)
                LOGGER.error(msg)
                # -- end of new code ------------------------------------------

            else:
                _translation_output = dict(translations=[])
                batch_range = range(0, n_target_langs, batch_size)
                n_batches = len(batch_range)

                _translation_failed = False  # to track translations at language batching level
                for batch_id, start_lang_idx in enumerate(batch_range):
                    end_lang_idx = start_lang_idx + batch_size
                    target_languages_ = target_language[start_lang_idx: end_lang_idx]

                    url_ = add_array_api_parameters(url, param_name='to', param_values=target_languages_)
                    body = [{'text': text[doc_ids[0]]}]

                    LOGGER.info(f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_ids[0]}. Target languages: {target_languages_}')
                    resp = requests.post(url_, headers=HEADERS, json=body)
                    status_code = resp.status_code
                    if not is_request_valid(status_code):
                        # -- new code ------------------------------------------
                        msg = f'Partial translation of text `{doc_ids[0]}` to languages {target_languages_} failed.'
                        if raise_error_on_translation_failure:
                            raise Exception(msg)
                        LOGGER.error(msg)
                        _translation_failed = True
                        if not include_partials_in_output:
                            break
                        # -- end of new code-------------------------------------
                    else:
                        partial_translation_output = resp.json()
                        # single text in body, so take the first element of the response
                        _translation_output['translations'] += partial_translation_output[0]['translations']

                # -- new code -------------------------------------------
                if not _translation_failed or include_partials_in_output:
                    translation_output.append(_translation_output)
                # -- end of new code ------------------------------------

        else:
            batch_texts = [text[doc_id] for doc_id in doc_ids]
            body = [{'text': text_} for text_ in batch_texts]
            LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
            # rebuild url for all languages
            url_ = add_array_api_parameters(
                url,
                param_name='to',
                param_values=target_language
            )
            resp = requests.post(url_, headers=HEADERS, json=body)
            status_code = resp.status_code
            if not is_request_valid(status_code):
                # -- new code -----------------------------------
                msg = f'Translation failed for texts {doc_ids}. Reason: {resp.text}'
                if raise_error_on_translation_failure:
                    raise Exception(msg)
                LOGGER.error(msg)
                # -- end of new code --------------------------------
            else:
                translation_output += resp.json()

        translation_outputs += translation_output

    return translation_outputs, status_code
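
As a quick illustration of the three cases (a hypothetical call, assuming the same setup as before):

texts = ["First document ...", "Second document ..."]
langs = ["de", "ru", "fr"]

# Case 1: any failure raises immediately
outputs, code = translate_text(texts, langs, raise_error_on_translation_failure=True)

# Case 2: log failures and drop partial translations from the output
outputs, code = translate_text(
    texts, langs,
    raise_error_on_translation_failure=False,
    include_partials_in_output=False,
)

# Case 3: log failures, but keep whatever was partially translated
outputs, code = translate_text(
    texts, langs,
    raise_error_on_translation_failure=False,
    include_partials_in_output=True,
)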

Enhancement 3 — Cleaning Up Code

It’s now time to clean up the code and add useful comments. However, since there is a lot of reusability and the code logic is complicated, it makes sense to re-write the translate function as a class. This also has the added benefit of allowing us to store the IDs of documents/batches whose translation fails (immensely beneficial if you want to re-run those specific translations as opposed to finding them through the logs!). I also decided to change raise_error_on_translation_failure to ignore_on_translation_failure, and instead simply return the error and status code if a failure occurs and ignore_on_translation_failure=False. I made this change because I don’t want the API function to raise errors (errors should be captured by status codes).

The final code looks as follows:

CHARACTER_LIMITS = 50000
MAX_CHARACTER_LIMITS_PER_HOUR = 2000000


class MicrosoftTranslator:
    """Class for translating text using the Microsoft Translate API

    :param api_version: api version to use, defaults to "3.0"
    :param ignore_on_translation_failure: if `False`, returns failed translations with error and status code. If `True`, ignores failed translations in output, defaults to False
    :param include_partials_in_output: if `True` includes partially translated texts in output, otherwise ignores them, defaults to False
    """

    def __init__(
            self,
            api_version: str = '3.0',
            ignore_on_translation_failure: bool = False,
            include_partials_in_output: bool = False
    ):

        base_url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

        self.base_url = base_url
        self.api_version = api_version
        self.ignore_on_translation_failure = ignore_on_translation_failure
        self.include_partials_in_output = include_partials_in_output

    def translate_text(
            self,
            texts: Union[str, list],
            target_languages: Union[str, list],
            source_language: Optional[str] = None
    ) -> tuple:
        """translates text using the Microsoft Translate API

        :param texts: text(s) to be translated. Can either be a single text (str) or multiple (list)
        :param target_languages: ISO format of target translation language(s). Can be single lang (str) or multiple (list)
        :param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
        :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
        """

        # -- create storage for translation failures and flag for failed translation
        self._set_request_default

        # add source language to url
        if source_language:
            base_url = f'{self.base_url}&from={source_language}'
        else:
            base_url = self.base_url

        # standardise target_languages and texts types
        if isinstance(target_languages, str):
            target_languages = [target_languages]

        if isinstance(texts, str):
            texts = [texts]

        # batch texts for translation, based on doc_size = len(doc)*n_target_langs
        n_target_langs = len(target_languages)
        sizes_dict = {i: len(text)*n_target_langs for i, text in enumerate(texts)}

        profile_texts = self._profile_texts(sizes_dict)
        if profile_texts:
            return profile_texts

        batched_texts = batch_by_size_min_buckets(sizes_dict, CHARACTER_LIMITS, sort_docs=True)

        translation_outputs = []  # variable to store all translation outputs

        for batch in batched_texts:
            batch_translation_output = []  # variable to store translation output for single batch
            doc_ids = batch['idx']
            total_size = batch['total_size']
            texts_in_batch = [texts[doc_id] for doc_id in doc_ids]

            # -- case when batch exceeds character limits
            if total_size > CHARACTER_LIMITS:
                assert len(doc_ids) == 1, 'Critical error: batching function is generating batches that exceed max limit with more than 1 text. Revisit the function and fix this.'

                doc_id = doc_ids[0]
                doc_size = sizes_dict[doc_id] // n_target_langs
                batch_size = CHARACTER_LIMITS // doc_size

                # -- case when a single doc is too large to be translated
                if not batch_size:
                    process_error = self._process_error(f'Text idx={doc_id} too large to be translated', 'Max character limit for request', 400, doc_ids, target_languages)
                    if process_error:
                        return process_error

                # -- case when single doc too big to be translated to all languages, but small enough that it can be translated to some languages
                else:
                    _translation_output = dict()  # variable to store translations for single text but different language batches
                    _translation_failed = False  # variable to track if translation is partial

                    batch_range = range(0, n_target_langs, batch_size)
                    n_batches = len(batch_range)

                    # batch by target languages
                    for batch_id, start_lang_idx in enumerate(batch_range):
                        end_lang_idx = start_lang_idx + batch_size
                        target_languages_ = target_languages[start_lang_idx: end_lang_idx]
                        total_size_ = doc_size * len(target_languages_)

                        response_output, status_code = self._post_request(
                            f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_id}. Target languages: {target_languages_}',
                            base_url, target_languages_, texts_in_batch, total_size_
                        )
                        if not is_request_valid(status_code):
                            process_error = self._process_error(
                                f'Partial translation of text idx={doc_id} to languages {target_languages_} failed. Reason: {response_output}',
                                response_output, status_code, doc_ids, target_languages_)
                            if process_error:
                                return process_error

                            # failure indicates translation is partial. Break loop if we don't care about partials in output
                            _translation_failed = True
                            if not self.include_partials_in_output:
                                break
                        else:
                            self._update_partial_translation(response_output, _translation_output, source_language)

                    if not _translation_failed or self.include_partials_in_output:
                        if _translation_output:
                            batch_translation_output.append(_translation_output)

            # -- case when batch does not exceed character limits
            else:
                response_output, status_code = self._post_request(
                    f'Translating {len(texts)} texts to {len(target_languages)} languages',
                    base_url, target_languages, texts_in_batch, total_size
                )

                if not is_request_valid(status_code):
                    process_error = self._process_error(
                        f'Translation failed for texts {doc_ids}. Reason: {response_output}',
                        response_output, status_code, doc_ids, target_languages
                    )
                    if process_error:
                        return process_error
                else:
                    batch_translation_output += response_output

            translation_outputs += batch_translation_output

        if self.no_failures:
            status_code = 200
        # case when all translations failed, so return translation errors instead
        elif not translation_outputs:
            translation_outputs = self.translation_errors
            status_code = 400
        # case when translations partially failed, modify status code
        else:
            status_code = 206

        return translation_outputs, status_code

    @property
    def _set_request_default(self):
        """Function for resetting translation errors and no_failures flag
        """
        self.translation_errors = {}
        self.no_failures = True

    @property
    def _set_no_failures_to_false(self):
        """Function to explicitly set no_failures to False
        """
        self.no_failures = False

    @property
    def _set_success_request_time(self):
        """Function to set the time a request is made
        """
        self.time_of_last_success_request = time.time()

    def _update_partial_translation(self, response_output, partial_translation_output, source_language):

        # concatenate outputs in correct format
        if 'translations' not in partial_translation_output:
            partial_translation_output['translations'] = response_output['translations']
        else:
            partial_translation_output['translations'] += response_output['translations']

        if not source_language:
            if 'detectedLanguage' not in partial_translation_output:
                partial_translation_output['detectedLanguage'] = response_output['detectedLanguage']
            else:
                partial_translation_output['detectedLanguage'] += response_output['detectedLanguage']

        return partial_translation_output

    def _update_translation_errors(self, response_text: str, status_code: int, doc_ids: list, target_languages: list):
        """Add failed translation to errors dictionary

        :param response_text: response text from failed request (status_code not beginning with 2)
        :param status_code: status code from failed request
        :param doc_ids: documents that were to be translated
        :param target_languages: target languages used in request
        """
        doc_ids = tuple(doc_ids)
        if doc_ids not in self.translation_errors:
            self.translation_errors[doc_ids] = dict(
                reason=response_text,
                status_code=status_code,
                target_languages=target_languages
            )
        else:
            self.translation_errors[doc_ids]['target_languages'] += target_languages
            self.translation_errors[doc_ids]['status_code'] = status_code
            self.translation_errors[doc_ids]['reason'] = response_text

    def _process_error(self, msg: str, response_text: str, status_code: int, doc_ids: list, target_languages: list):
        """Processes failed request based on `ignore_on_translation_failure` strategy

        :param msg: message to return or log depending on `ignore_on_translation_failure` strategy
        :param response_text: response text from failed request (status_code not beginning with 2)
        :param status_code: status code from failed request
        :param doc_ids: documents that were to be translated
        :param target_languages: target languages used in request
        """

        self._set_no_failures_to_false

        self._update_translation_errors(response_text, status_code, doc_ids, target_languages)

        if not self.ignore_on_translation_failure:
            return response_text, status_code

        LOGGER.error(msg)

    def _profile_texts(self, sizes: Dict[Union[int, str], int]):
        """Profiles texts to see if the request can be translated

        :param sizes: size mapping for each document, {doc_id_1: size, ...}
        """
        num_texts = len(sizes)
        total_request_size = sum(sizes.values())
        if total_request_size > MAX_CHARACTER_LIMITS_PER_HOUR:
            return 'Your texts exceed max character limits per hour', 400

        LOGGER.info(f'Detected `{num_texts}` texts with total request size `{total_request_size}`')

    def _sleep(self, request_size: int):
        """Function to sleep prior to requests being made, based on the size of the request and the time the last successful request was made, in order to avoid overloading Microsoft servers

        :param request_size: size of the request being made
        """
        if hasattr(self, 'time_of_last_success_request'):
            time_diff_since_first_request = time.time() - self.time_of_last_success_request
            time_diff_needed_for_next_request = request_size / (MAX_CHARACTER_LIMITS_PER_HOUR / 3600)
            sleep_time = time_diff_needed_for_next_request - time_diff_since_first_request
            if sleep_time > 0:
                LOGGER.debug(f'Sleeping {sleep_time:.3g} seconds...')
                time.sleep(sleep_time)

    def _post_request(self, msg: str, base_url: str, target_languages: list, texts: List[str], request_size: Optional[int] = None) -> Tuple[Union[dict, str], int]:
        """Internal function to post requests to the Microsoft API

        :param msg: message to log
        :param base_url: base url for making the request
        :param target_languages: list of target languages to translate text to
        :param texts: texts to translate
        :param request_size: size of the request being made for calculating sleep period, defaults to None

        :return: for a successful response, ([{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...], status_code)
        """
        if not request_size:
            n_target_langs = len(target_languages)
            request_size = sum([len(text)*n_target_langs for text in texts])

        self._sleep(request_size)

        url = add_array_api_parameters(base_url, param_name='to', param_values=target_languages)
        LOGGER.info(msg)
        body = [{'text': text} for text in texts]
        resp = requests.post(url, headers=HEADERS, json=body)
        status_code = resp.status_code
        if is_request_valid(status_code):
            self._set_success_request_time
            return resp.json(), status_code
        return resp.text, status_code
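
Usage then looks something like this (a sketch, assuming MICROSOFT_TRANSLATE_URL and HEADERS carry your endpoint and subscription key):

translator = MicrosoftTranslator(
    ignore_on_translation_failure=True,   # log failures instead of returning early
    include_partials_in_output=False,     # drop partially translated texts
)

outputs, status_code = translator.translate_text(
    texts=["Hi this is a dummy text", "Oh no dummy text"],
    target_languages=["de", "ru"],
)

if status_code != 200:
    # failed document ids, reasons and target languages are stored on the
    # instance, which makes targeted re-runs straightforward
    print(translator.translation_errors)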

For access to the full working code, please visit the repository.

The Microsoft Translate API is a versatile tool for translation and very easy to set up. However, it has request and character limits that make using it effectively difficult and costly. In this article we discussed methods for making the best use of the translate API by automatically batching inputs.

Some final notes and comments on further direction:

  • Speed: the biggest bottleneck to speed is by far the wait time between requests in order to avoid the max request limits imposed by Microsoft. However, just for an academic exercise, the batching algorithm can be made better. Of course, this has no practical use since any benefits we might gain from batching faster would become relevant when the number of documents approaches infinity, but by virtue of the max character limit we can never exceed 50000 (e.g. 50000 documents of length 1, being translated to a single language). If you are interested though, I would recommend reading about the bin packing problem.
  • Assurance: at the moment, the code is designed to avoid most request limit problems. However, should there be a failure, it will not keep re-trying the failed requests until all translations have been completed. For this, a worker architecture is needed that keeps re-trying the failed translations (perhaps with exponential backoff) until all of them have been successfully completed. Of course, whether this is practical is also a question worth considering, since the current translator class stores failed translations, making it a relatively trivial task to re-translate them manually (a rough sketch of this follows below).
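
To make that last point concrete, here is a rough sketch of what a manual retry could look like. retry_failed_translations is a hypothetical helper built on top of the class above, not part of the article's code.

import time

def retry_failed_translations(translator, texts, delay_seconds=60):
    """Hypothetical helper: re-run only the documents whose translation failed."""
    recovered = []
    # snapshot the stored failures, since translate_text resets them on every call
    failed = dict(translator.translation_errors)
    for doc_ids, error in failed.items():
        time.sleep(delay_seconds)  # naive backoff before each retry
        outputs, status_code = translator.translate_text(
            [texts[i] for i in doc_ids],
            error['target_languages'],
        )
        if status_code == 200:
            recovered += outputs
    return recovered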

Author’s Note

If you liked this article or learned something new, please consider getting a membership using my referral link:

This gives you unrestricted access to all of Medium, while helping me produce more content at no extra cost to you.

If you are interested in in-depth tutorials on Software Engineering and Machine Learning, then join my email list to get notified whenever I release a new article:

Happy learning and till next time!


Photo from Unsplash courtesy of Edurne.

The Microsoft Translator API [1] is one of the easiest translator services to set up, and it’s quite powerful, giving you access to translators for a multitude of low and high resource languages for free. However, it has a 50000 max character limit per request [2] and a 2 million max character limit per hour of use (on the free version). So while the service itself is very easy to setup and use, using it reliably is quite difficult.

In this article, I will explore methods for ensuring that:

  • Any arbitrary number of texts can be translated to any number of target languages while adhering to the max character limit by autobatching inputs
  • Consecutive requests are delayed such that they adhere to the max character limit per hour

At worst (free subscription), these methods will decrease wasted characters on partial translations and at best (paid subscription) they’ll save you cash.

The tutorial is targeted towards those with working knowledge of the Microsoft Translator API. If you are unfamiliar and would like a beginner friendly-guide on setting it up, check my introduction article below:

The original translation code I wrote was simply a wrapper over requests that took a list of texts, a list of target languages, and optionally a source language as parameters, and called the translate endpoint of the translator API. The code snippet of the translation function is shown below:

def translate_text(
text: Union[str, list],
target_language: Union[str, list],
source_language: Optional[str] = None,
api_version: str = '3.0') -> tuple:
"""translates txt using the microsoft translate API

:param text: text to be translated. Either single or multiple (stored in a list)
:param target_language: ISO format of target translation languages
:param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
:param api_version: api version to use, defaults to "3.0"
:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""

url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

if isinstance(target_language, str):
target_language = [target_language]

# extend url with array parameters, e.g. f"{url}&to=de&to=ru"
url = add_array_api_parameters(
url,
param_name='to',
param_values=target_language
)

if source_language:
url = f'{url}&from={source_language}'

if isinstance(text, str):
text = [text]

body = [{'text': text_} for text_ in text]

LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
resp = requests.post(url, headers=HEADERS, json=body)
status_code = resp.status_code

if is_request_valid(status_code):
return resp.json(), status_code

return resp.text, status_code

Upon use, I ran into the following challenges that caused my translation requests to fail:

  • Challenge 1: The total size of the texts exceeded the max character limit when translated to all target languages
  • Challenge 2: At least one of my texts exceeded the max character limit when translated to all target languages
  • Challenge 3: At least one of my texts exceeded the max character limit
  • Challenge 4: The total size of my texts was negligible but because of very frequent calls to the endpoint I was hitting a max number of requests limit

The next section will run through a first implementation of a translation function to deal with the 4 challenges above.

Challenge 1 — Too Many Texts, Too Many Languages

The first issue occurs when you have a list of texts such that no single text exceeds the max character limit when translated to its target languages, but that the texts together do.

To put it concretely, this means that you can’t send a single request of n texts whose total size S is greater than 50000. From Microsoft’s documentation [2], we know that the size of a text is calculated as the number of characters in the text multiplied by the number of languages it’s being translated to. In Python:

texts = ["Hi this is a dummy text", "Oh no dummy text"]
target_languages = ["de", "ru"]
n_target_langs = len(target_languages)
text_sizes = [len(text) for text in texts]
total_size = sum(text_sizes) * n_target_langs
assert total_size <= 50000

As such, we need a function to bucket all input texts into batches whose total size does not exceed 50000. My first implementation resulted in the following code:

from typing import Dict, List

def batch_by_size(sizes: Dict[int, int], limit: int, sort_docs: bool = False) -> List[Dict[str, Union[int, List[int]]]]:
"""Given a size mapping such {document_id: size_of_document}, batches documents such that the total size of a batch of documents does not exceed pre-specified limit

:param sizes: mapping that gives document size for each document_id
:param limit: size limit for each batch
:sort_doc: if True sorts `sizes` in descending order
:return: [{'idx': [ids_for_batch], 'total_size': total_size_of_documents_in_batch}, ...]

Example:
>>> documents = ['Joe Smith is cool', 'Django', 'Hi']
>>> sizes = {i: len(doc) for i, doc in enumerate(documents)}
>>> limit = 10
>>> batch_by_size(sizes, limit)
[{'idx': [0], 'total_size': 17}, {'idx': [1, 2], 'total_size': 8}]
"""
if sort_docs:
sizes = {key: size for key, size in sorted(sizes.items(), key=lambda x: x[1], reverse=True)}

batched_items = []
sizes_iter = iter(sizes)
key = next(sizes_iter)
while key is not None:
if not batched_items:
batched_items.append({
'idx': [key],
'total_size': sizes[key]
})
else:
size = sizes[key]
if size > limit:
LOGGER.warning(f'Document {key} exceeds max limit size: {size}>{limit}')
total_size = batched_items[-1]['total_size'] + size
if total_size > limit:
batched_items.append({
'idx': [key],
'total_size': size
})
else:
batched_items[-1]['idx'].append(key)
batched_items[-1]['total_size'] = total_size
key = next(sizes_iter, None)

return batched_items

The translate_text function from above can now be modified to include this functionality:

MAX_CHARACTER_LIMITS = 50000
def translate_text(
text: Union[str, list],
target_language: Union[str, list],
source_language: Optional[str] = None,
api_version: str = '3.0') -> tuple:
"""translates txt using the microsoft translate API

:param text: text to be translated. Either single or multiple (stored in a list)
:param target_language: ISO format of target translation languages
:param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
:param api_version: api version to use, defaults to "3.0"
:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""

url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

if isinstance(target_language, str):
target_language = [target_language]

# extend url with array parameters, e.g. f"{url}&to=de&to=ru"
url = add_array_api_parameters(
url,
param_name='to',
param_values=target_language
)

if source_language:
url = f'{url}&from={source_language}'

if isinstance(text, str):
text = [text]

### start of new code --------------------------------------------
n_target_langs = len(target_language)

# get maximum size of each document

sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
batched_texts = batch_by_size(sizes_dict, MAX_CHARACTER_LIMITS)

# for each batch, translate the texts. If successful append to list
translation_outputs = []
for batch in batched_texts:
doc_ids = batch['idx']
batch_texts = [text[doc_id] for doc_id in doc_ids]
body = [{'text': text_} for text_ in batch_texts]
LOGGER.info(f'Translating {len(text)} texts to {len(target_language)} languages')
resp = requests.post(url, headers=HEADERS, json=body)
status_code = resp.status_code
if not is_request_valid(status_code):
raise Exception(f'Translation failed for texts {doc_ids}')

translation_output = resp.json()
translation_outouts += translation_output

return translation_outouts, status_code
### end of new code ----------------------------------------------

Note some key considerations that were made:

  • A failure in translating a single batch would fail the whole request
  • The batching function is O(n)
  • The translation loop is O(n) in the worst case (e.g. each text size hits the max limit). Realistically it will be less, however the algorithm will not try to minimise the number of batches, since it only linearly loops through the texts

Challenge 2 — Single Text, Too Many Languages

While the above code works for most cases, you will inevitably run into a case where you have a single text that alone will not exceed the max character limit, but would do if translated to all. In Python:

max_limit = 10
text = "Hello!"
text_size = len(text) # 6
target_languages = ["de", "ru"]
n_target_langs = len(target_languages)
total_size = text_size * n_target_langs # 12
assert total_size <= max_limit # 12 <= 10, raises error!

In this case, the problematic text must be translated separately for a batch of target languages, and at worst, for each language separately. At this point though, we realise that there are two batching strategies that we can consider:

  • Batching where size is defined by len(text) * n_target_langs : batch items with the solution for Challenge 1. For texts whose size S = len(text) * n_target_langs > max_limit , further batch by target language
  • Batching where size is defined by the optimal combination of text lengths and target languages: batch texts such that each batch can have an arbitrary number of languages associated with it, provided that sum(texts_in_batch) * n_target_langs_for_batch <= max_limit

Below is a pictorial representation of an example where the second method leads to solution with less batches overall:

While the second batching strategy is an interesting algorithm design problem, in this use case problems outweigh any possible benefits:

  • Firstly, Microsoft’s limits are based on the total characters consumed, and not on the number of requests. So the only theoretical benefit from decreasing the number of requests is speed
  • Secondly, it is hard to design an efficient algorithm for it so any speed gains that you could get from shaving the number of requests are negligible, it at all existent
  • Thirdly, and most importantly, breaking up texts by target language makes mapping them back very difficult. If there is a failure during the translation process you will have partially translated texts*

*Note: While you will also run into this problem if you use the first batching strategy, the frequency of such failures will be minimal. For most cases that I can think of, it’s more important to have complete translations (e.g. texts translated to all target languages) while dropping a finite number of failed ones than having partial translations (e.g. texts translated to a subset of the target languages) for all texts. It is also easier to re-translate the failed examples because you can extract the IDs of the failed texts, whereas filling gaps in partial translations is much more difficult.

As such, I continued with the first batching strategy. This resulted in some modifications of the translation function:

MAX_CHARACTER_LIMITS = 50000
def translate_text(
text: Union[str, list],
target_language: Union[str, list],
source_language: Optional[str] = None,
api_version: str = '3.0') -> tuple:
"""translates txt using the microsoft translate API

:param text: text to be translated. Either single or multiple (stored in a list)
:param target_language: ISO format of target translation languages
:param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
:param api_version: api version to use, defaults to "3.0"
:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""

url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

if isinstance(target_language, str):
target_language = [target_language]

if source_language:
url = f'{url}&from={source_language}'

if isinstance(text, str):
text = [text]

n_target_langs = len(target_language)

# get maximum size of each document

sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
batched_texts = batch_by_size(sizes_dict, MAX_CHARACTER_LIMITS)

# for each batch, translate the texts. If successful append to list
translation_outputs = []
for batch in batched_texts:
doc_ids = batch['idx']
total_size = batch['total_size']
### start of new code --------------------------------------------

# case where single doc too big to be translated to all languages, but small enough that it can be translated to some languages
if total_size > MAX_CHARACTER_LIMITS:
translation_output = [dict(translations=[])]
doc_size = sizes_dict[doc_ids[0]] // n_target_langs # per-text character count; necessarily only single document in batch
batch_size = MAX_CHARACTER_LIMITS // doc_size # number of target languages that fit in one request
batch_range = range(0, n_target_langs, batch_size)
n_batches = len(batch_range)

# batch by target languages
for batch_id, start_lang_idx in enumerate(batch_range):
end_lang_idx = start_lang_idx + batch_size
target_languages_ = target_language[start_lang_idx: end_lang_idx]

# rebuild the url for subset of langs
url_ = add_array_api_parameters(
url,
param_name='to',
param_values=target_languages_
)
body = [{'text': text[doc_ids[0]]}]
LOGGER.debug(f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_ids[0]}. Target languages: {target_languages_}')
resp = requests.post(url_, headers=HEADERS, json=body)
status_code = resp.status_code
if not is_request_valid(status_code):
raise Exception(f'Translation failed for text {doc_ids[0]}')
partial_translation_output = resp.json() # list with one item per input text; here a single text
# concatenate outputs in correct format
translation_output[0]['translations'] += partial_translation_output[0]['translations']

else:
# -- code as before, except translation_output now part of else
batch_texts = [text[doc_id] for doc_id in doc_ids]
body = [{'text': text_} for text_ in batch_texts]
LOGGER.info(f'Translating {len(batch_texts)} texts to {len(target_language)} languages')
# rebuild url for all languages
url_ = add_array_api_parameters(
url,
param_name='to',
param_values=target_language
)
resp = requests.post(url_, headers=HEADERS, json=body)
status_code = resp.status_code
if not is_request_valid(status_code):
raise Exception(f'Translation failed for texts {doc_ids}')

translation_output = resp.json()
### end of new code ----------------------------------------------
translation_outputs += translation_output

return translation_outputs, status_code

As before, there are some key considerations here:

  • We are still failing the entire translation if any single request fails. This is partially motivated by the fact that we don't know how to resolve the status codes of partial translations, as our output expects a single status code.
  • The batching function is still O(n)
  • The translation loop has theoretically increased in computational complexity: in the worst case we now have O(n*k), where k is the number of target languages. However, unlike before, we can now deal with cases where a text is too large to be translated to all target languages in a single request, but small enough that it can be translated to subsets of the target languages in separate requests (a usage sketch follows below)
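For reference, a hypothetical call to this version of the function could look as follows; the translated strings in the comment are purely illustrative:

# Hypothetical usage of the Challenge 2 version of translate_text
translations, status_code = translate_text(
    text=["Hello world!", "How are you?"],
    target_language=["de", "fr"],
)

# On success, `translations` mirrors the API response shape, e.g.:
# [{"translations": [{"text": "Hallo Welt!", "to": "de"},
#                    {"text": "Bonjour le monde !", "to": "fr"}]},
#  {"translations": [...]}]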

Challenge 3 — Single Large Text

We now reach the inevitable edge case: a single text that has more than 50000 characters. In this case, none of our previous batching methods work. The only thing we can consider doing is automatically splitting the text prior to translation. However, I decided to avoid this for the following reasons:

  • Single Responsibility Principle: the purpose of the translation function is to act as a wrapper over the Microsoft Translator API. In my view, any extra processing should be done prior to translation.
  • Increased Code Complexity: adding support for sentence splitting requires the use of sentence splitting libraries, which add unnecessary dependencies for a function whose role is to be a wrapper for the Microsoft API. It would also require lots of code for ensuring sentence splitting methods all return the same output format, ground truth labels prior to sentence splitting are preserved, and to enable split sentences to be mapped back together after translation.
  • Increased Output Ambiguity: the choice of sentence splitter could vary a lot. These could be naïve, classical or neural methods. The choice of sentence splitting method would inevitably impact the quality of the translations, and would raise ambiguities in mapping translated split sentences back together. In principle, sentence splitting should happen outside the translation function, be evaluated using some metric to determine split quality, and only then be passed to translation.

Due to the above, for now I decided to log an error whenever such a text is encountered. Later we'll look at how this fits in with the rest of the code when resolving the ambiguity around partially failed translations.

        if total_size > MAX_CHARACTER_LIMITS:
# -- start of new code
doc_size = sizes_dict[doc_ids[0]] // n_target_langs # per-text character count; necessarily only single document in batch
batch_size = MAX_CHARACTER_LIMITS // doc_size # number of target languages that fit in one request
if not batch_size:
msg = f'Text `{doc_ids[0]}` too large to be translated'
if raise_error_on_translation_failure:
raise Exception(msg)
LOGGER.error(msg)
else:
batch_range = range(0, n_target_langs, batch_size)
translation_output = [dict(translations=[])]
# -- end of new code
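If you do need to translate such texts, the splitting can live outside the wrapper, in line with the reasoning above. Below is a naive caller-side sketch (the helper naive_split is hypothetical, and a real pipeline would use a proper sentence splitter and keep track of how chunks map back to the original text):

import re
from typing import List

def naive_split(text: str, max_chars: int) -> List[str]:
    """Greedily pack sentence-like fragments into chunks of at most `max_chars`.
    Fragments longer than `max_chars` are returned as-is."""
    fragments = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ''
    for fragment in fragments:
        if current and len(current) + 1 + len(fragment) > max_chars:
            chunks.append(current)
            current = fragment
        else:
            current = f'{current} {fragment}'.strip() if current else fragment
    if current:
        chunks.append(current)
    return chunks

# the resulting chunks can then be passed to translate_text like any other texts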

Challenge 4 — Too Many Requests

Finally, most of the requests you make will not be edge cases: they will fit within the max character limit. The issue you will run into instead is calling the translate API too many times in a short amount of time.

To avoid this, Microsoft recommends that your requests not exceed 2 million characters per hour, i.e. roughly 33,300 characters per minute. We can estimate whether our request will pass the threshold using the following function:

MAX_CHARACTER_LIMITS_PER_HOUR = 2000000
def _sleep(time_of_last_success, request_size):
if time_of_last_success:
time_diff_since_first_request = time.time() - time_of_last_success
time_diff_needed_for_next_request = request_size / (MAX_CHARACTER_LIMITS_PER_HOUR / 3600)
sleep_time = time_diff_needed_for_next_request - time_diff_since_first_request
if sleep_time > 0:
LOGGER.debug(f'Sleeping {sleep_time:.3g} seconds...')
time.sleep(sleep_time)

Each time we make a request, we need to update the value of time_of_last_success (which is initially set to None). The request_size is the size of the request we're currently making, i.e. the total characters in the batch multiplied by the number of target languages. In my experience, this method helps avoid overloading the Microsoft API, but it is not optimal: I have managed to run translations without sleeping and without overloading the server. Without knowing exactly how Microsoft calculates the request limits, however, it is difficult to design a solution that is optimal for speed. A sketch of how the sleep fits around the request loop is shown below.
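A minimal sketch of how this fits around the request loop, assuming the same helper names as in the earlier snippets:

# Hypothetical sketch: wrap each request in the translation loop with _sleep
time_of_last_success = None

for batch in batched_texts:
    request_size = batch['total_size']  # already includes the n_target_langs factor
    _sleep(time_of_last_success, request_size)
    resp = requests.post(url, headers=HEADERS, json=body)
    if is_request_valid(resp.status_code):
        # only successful requests count towards the consumed character budget
        time_of_last_success = time.time()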

In the previous section, we designed a Proof-of-Concept solution that improved over our simple translation wrapper. We can now automatically:

  • Batch documents such that the total size of each batch does not exceed the max limit
  • Batch documents by language for cases where a single document is too large to be translated to all target languages
  • Get the IDs of documents that cannot be translated because they are larger than the max limit
  • Avoid max number of requests limits by slowing down requests based on the total size of requests we’ve made in a single session

However, we still had some unanswered questions and inefficient methods. In this section, we will introduce three enhancements to finalise the translation function:

  • Enhancement 1: decreasing the number of requests made by improving the batching functionality
  • Enhancement 2: defining a strategy for resolving partial translations
  • Enhancement 3: refactoring our code into reusable code blocks

Enhancement 1 — Decreasing the Number of Requests Made

Our previous method of batching the data was fast, but not the best at minimising the number of batches. Instead, we modify it so that each document is assigned to the batch that minimises the difference between that batch's total size and the max character limit.

def batch_by_size_min_buckets(sizes: Dict[Union[int, str], int], limit: int, sort_docs: bool = True) -> List[Dict[str, Union[int, List[int]]]]:
"""Given dictionary of documents and their sizes {doc_id: doc_size}, batch documents such that the total size of each batch <= limit. Algorithm designed to decrease number of batches, but does not guarantee that it will be an optimal fit

:param sizes: mapping that gives document size for each document_id, {doc_1: 10, doc_2: 20, ...}
:param limit: size limit for each batch
:param sort_docs: if True sorts `sizes` in descending order
:return: [{'idx': [ids_for_batch], 'total_size': total_size_of_documents_in_batch}, ...]

Example:
>>> documents = ['Joe Smith is cool', 'Django', 'Hi']
>>> sizes = {i: len(doc) for i, doc in enumerate(documents)}
>>> limit = 10
>>> batch_by_size_min_buckets(sizes, limit)
[{'idx': [0], 'total_size': 17}, {'idx': [1, 2], 'total_size': 8}]
"""
if sort_docs:
sizes = {key: size for key, size in sorted(sizes.items(), key=lambda x: x[1], reverse=True)}

batched_items = []
sizes_iter = iter(sizes)
key = next(sizes_iter, None) # doc_id; None when there are no documents

# -- helpers
def _add_doc(key):
batched_items.append({
'idx': [key],
'total_size': sizes[key]
})

def _append_doc_to_batch(batch_id, key):
batched_items[batch_id]['idx'].append(key)
batched_items[batch_id]['total_size'] += sizes[key]

while key is not None:

# initial condition
if not batched_items:
_add_doc(key)
else:
size = sizes[key]

if size > limit:
LOGGER.warning(f'Document {key} exceeds max limit size: {size}>{limit}')
_add_doc(key)
else:
# find the batch that fits the current doc best
batch_id = -1
total_capacity = limit - size # how much we can still fit
min_capacity = total_capacity
for i, batched_item in enumerate(batched_items):
total_size = batched_item['total_size']
remaining_capacity = total_capacity - total_size # we want to minimise this

# current batch too large for doc, go to next batch
if remaining_capacity < 0:
continue
# current batch is a better fit for doc, save batch_id
elif remaining_capacity < min_capacity:
min_capacity = remaining_capacity
batch_id = i

# if perfect fit, break loop
if remaining_capacity == 0:
break

if batch_id == -1:
_add_doc(key)
else:
_append_doc_to_batch(batch_id, key)

key = next(sizes_iter, None)
return batched_items

Note some comparisons with the naïve method:

  • The batching function is now O(n²) at worst. While this may sound alarming, when we consider the whole system the main bottleneck in terms of time comes from requests, so we can justify O(n²) batching if it yields a substantial decrease in the number of batches
  • While this is a substantial improvement over the naïve method for decreasing the number of batches, it is still an approximate solution
  • I added an optional sorting parameter sort_docs to sort the sizes in descending order. My hunch is that setting sort_docs=True should decrease the number of batches.

For an illustration of why this is still an approximate solution, see the figure below:

It’s worth noting that a random ordering can in theory result in the optimal solution. In this case, if the texts were organised in the order [“Hello World”, “Bye”, “Hi”, “Hi World”, “Cry”, “Halo”] (as an example), then the algorithm without sorting would have found the optimal solution. However, we will see that on average, sorting leads to solutions with fewer batches!

To get a better understanding of how sorting affects the algorithm, I ran a few simulations, here are some results:

  • How Does Sorting Affect Time Complexity?

The figure below shows the average time taken (over 50 iterations) for the four batching strategies we have (quadratic and naive, each sorted and unsorted). As expected, sorting slows down batching slightly. However, as argued earlier in the article, our main bottleneck in terms of time comes from requests, not batching, and since the difference in time between the methods is small, we can justify sorting if it decreases the number of batches.

  • How Does Sorting Affect The Number of Batches?

The figure below shows two important things:

  1. The quadratic batching algorithm substantially decreases the number of batches compared to the naive algorithm
  2. Sorting decreases the number of batches for both the naive and quadratic algorithms

The first point shows the impact of using quadratic batching. A typical response from the Microsoft Translator has a latency of 150 milliseconds [2]. If we take the smallest gap between the naive and quadratic algorithms at 300 documents (quadratic batching unsorted vs. naive batching sorted), we have a difference of 40 batches. In effect, the quadratic algorithm therefore saves us (in the worst case) 6 seconds on requests alone. Comparing this with the time complexity of the batching algorithm, we find that the O(n²) cost of the quadratic algorithm has a negligible impact on overall translation speed. The other hidden advantage, of course, is that we have 40 fewer chances of request failures!

The second point is interesting, as it empirically supports my hunch that sorting decreases the number of batches. Unfortunately I don’t have a mathematical proof for this, so the empirical results will have to suffice for now.

  • How Does Sorting Affect The Average Batch Size?

Naturally, following the trend in the number of batches, we expect the quadratic algorithms to have a higher average batch size than the naive ones, and sorting to increase the average batch size of each algorithm respectively. This is reflected empirically in the figure below. What’s interesting to note, though, is that the average batch size seems to converge in all cases. We can also see that the sorted quadratic algorithm converges to the max character limit as the number of documents grows.
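A rough sketch of the kind of comparison behind these figures, using random document sizes (this is not the original simulation code):

import random

def compare_sorting(n_docs: int = 300, limit: int = 50000, n_trials: int = 50) -> None:
    """Compare the average number of batches produced with and without sorting."""
    totals = {'sorted': 0, 'unsorted': 0}
    for _ in range(n_trials):
        sizes = {i: random.randint(1, limit) for i in range(n_docs)}
        totals['sorted'] += len(batch_by_size_min_buckets(sizes, limit, sort_docs=True))
        totals['unsorted'] += len(batch_by_size_min_buckets(sizes, limit, sort_docs=False))
    for name, total in totals.items():
        print(f'{name}: {total / n_trials:.1f} batches on average')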

Enhancement 2 — Resolving Partial Translations

Previously, we automatically failed the entire translation process on encountering a single failure. However, this can be wasteful in cases where most translation requests have already succeeded, or in cases where there is no hard requirement on having a 100% success translation rate. The behaviour of the translate function should therefore be defined by the user. Here are a few notable cases that we should cover:

  • Case 1: Fail the entire process if there is any failure
  • Case 2: Ignore failures from the output, and remove partial translations
  • Case 3: Ignore failures from the output, but keep partial translations

To achieve these cases, we add two boolean parameters: raise_error_on_translation_failure and include_partials_in_output . The first will raise an error anytime there is a failure (e.g. at batch level, or at a language level) if set to True (if False, we still log the error!). This is to cover Case 1. The second is only relevant when raise_error_on_translation_failure=False , and it will keep partial translations if set to True , and remove them if set to False . The code for this is shown below:

def translate_text(
text: Union[str, list],
target_language: Union[str, list],
source_language: Optional[str] = None,
api_version: str = '3.0',
raise_error_on_translation_failure: bool = True,
include_partials_in_output: bool = False) -> tuple:
"""translates txt using the microsoft translate API

:param text: text to be translated. Either single or multiple (stored in a list)
:param target_language: ISO format of target translation languages
:param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
:param api_version: api version to use, defaults to "3.0"
:param raise_error_on_translation_failure: if `True`, raises errors on translation failure. If `False` ignores failed translations in output
:param include_partials_in_output: if `True` includes partially translated texts in output, otherwise ignores them
:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""

url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

if isinstance(target_language, str):
target_language = [target_language]

if source_language:
url = f'{url}&from={source_language}'

if isinstance(text, str):
text = [text]

n_target_langs = len(target_language)

sizes_dict = {i: len(text_)*n_target_langs for i, text_ in enumerate(text)}
batched_texts = batch_by_size_min_buckets(sizes_dict, MAX_CHARACTER_LIMITS)

translation_outputs = []
for batch in batched_texts:
translation_output = []
doc_ids = batch['idx']
total_size = batch['total_size']

if total_size > MAX_CHARACTER_LIMITS:
doc_size = sizes_dict[doc_ids[0]] // n_target_langs # per-text character count
batch_size = MAX_CHARACTER_LIMITS // doc_size # number of target languages that fit in one request

if not batch_size:
msg = f'Text `{doc_ids[0]}` too large to be translated'
# -- new code ------------------------------------------
if raise_error_on_translation_failure:
raise Exception(msg)
LOGGER.error(msg)
# -- end of new code ------------------------------------------

else:
_translation_output = dict(translations=[])
batch_range = range(0, n_target_langs, batch_size)
n_batches = len(batch_range)

_translation_failed = False # to track translations at language batching level
for batch_id, start_lang_idx in enumerate(batch_range):
end_lang_idx = start_lang_idx + batch_size
target_languages_ = target_language[start_lang_idx: end_lang_idx]

url_ = add_array_api_parameters(url, param_name='to', param_values=target_languages_)
body = [{'text': text[doc_ids[0]]}]

LOGGER.info(f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_ids[0]}. Target languages: {target_languages_}')
resp = requests.post(url_, headers=HEADERS, json=body)
status_code = resp.status_code
if not is_request_valid(status_code):
# -- new code ------------------------------------------
msg = f'Partial translation of text `{doc_ids[0]}` to languages {target_languages_} failed.'
if raise_error_on_translation_failure:
raise Exception(msg)
LOGGER.error(msg)
_translation_failed = True
if not include_partials_in_output:
break
continue # skip the failed language batch and move on to the next one
# -- end of new code-------------------------------------

partial_translation_output = resp.json() # list with one item per input text; here a single text
_translation_output['translations'] += partial_translation_output[0]['translations']

# -- new code -------------------------------------------
if not _translation_failed or include_partials_in_output:
translation_output.append(_translation_output)
# -- end of new code ------------------------------------

else:
batch_texts = [text[doc_id] for doc_id in doc_ids]
body = [{'text': text_} for text_ in batch_texts]
LOGGER.info(f'Translating {len(batch_texts)} texts to {len(target_language)} languages')
# rebuild url for all languages
url_ = add_array_api_parameters(
url,
param_name='to',
param_values=target_language
)
resp = requests.post(url_, headers=HEADERS, json=body)
status_code = resp.status_code
if not is_request_valid(status_code):
# -- new code -----------------------------------
msg = f'Translation failed for texts {doc_ids}. Reason: {resp.text}'
if raise_error_on_translation_failure:
raise Exception(msg)
LOGGER.error(msg)
# -- end of new code --------------------------------
else:
translation_output += resp.json()

translation_outputs += translation_output

return translation_outputs, status_code
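To make the three cases concrete, the calls would look roughly like this (hypothetical inputs):

texts = ["Hello world!", "How are you?"]
langs = ["de", "fr", "ru"]

# Case 1: fail the entire process on any failure (default behaviour)
outputs, status = translate_text(texts, langs, raise_error_on_translation_failure=True)

# Case 2: log failures and drop partial translations from the output
outputs, status = translate_text(texts, langs,
                                 raise_error_on_translation_failure=False,
                                 include_partials_in_output=False)

# Case 3: log failures, but keep whatever partial translations succeeded
outputs, status = translate_text(texts, langs,
                                 raise_error_on_translation_failure=False,
                                 include_partials_in_output=True)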

Enhancement 3 — Cleaning Up Code

It’s now time to clean up the code and add useful comments. However, since there is a lot of reusable logic and the code has become complicated, it makes sense to rewrite the translate function as a class. This also has the added benefit of allowing us to store the IDs of documents/batches whose translation fails (immensely useful if you want to re-run those specific translations rather than dig them out of the logs!). I also decided to change raise_error_on_translation_failure to ignore_on_translation_failure , and to simply return the error and status code if a failure occurs and ignore_on_translation_failure=False . I made this change because I don’t want the API function to raise errors (errors should be captured by status codes).

The final code looks as follows:

CHARACTER_LIMITS = 50000
MAX_CHARACTER_LIMITS_PER_HOUR = 2000000

class MicrosoftTranslator:
"""Class for translating text using the Microsoft Translate API

:param api_version: api version to use, defaults to "3.0"
:param ignore_on_translation_failure: if `False`, returns the error and status code on the first failed translation. If `True`, ignores failed translations in the output, defaults to False
:param include_partials_in_output: if `True` includes partially translated texts in output, otherwise ignores them, defaults to False

"""
def __init__(
self,
api_version: str = '3.0',
ignore_on_translation_failure: bool = False,
include_partials_in_output: bool = False
):

base_url = f'{MICROSOFT_TRANSLATE_URL}/translate?api-version={api_version}'

self.base_url = base_url
self.api_version = api_version
self.ignore_on_translation_failure = ignore_on_translation_failure
self.include_partials_in_output = include_partials_in_output

def translate_text(
self,
texts: Union[str, list],
target_languages: Union[str, list],
source_language: Optional[str] = None
) -> tuple:
"""translates txt using the microsoft translate API

:param texts: text(s) to be translated. Can either be a single text (str) or multiple (list)
:param target_languages: ISO format of target translation language(s). Can be single lang (str) or multiple (list)
:param source_language: ISO format of source language. If not provided is inferred by the translator, defaults to None
:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""

# -- create storage for translation failures and flag for failed translation
self._set_request_default

# add source language to url
if source_language:
base_url = f'{self.base_url}&from={source_language}'
else:
base_url = self.base_url

# standardise target_languages and texts types
if isinstance(target_languages, str):
target_languages = [target_languages]

if isinstance(texts, str):
texts = [texts]

# batch texts for translation, based on doc_size = len(doc)*n_target_langs
n_target_langs = len(target_languages)
sizes_dict = {i: len(text)*n_target_langs for i, text in enumerate(texts)}

profile_texts = self._profile_texts(sizes_dict)
if profile_texts:
return profile_texts

batched_texts = batch_by_size_min_buckets(sizes_dict, CHARACTER_LIMITS, sort_docs=True)

translation_outputs = [] # variable to store all translation outputs

for batch in batched_texts:
batch_translation_output = [] # variable to store translation output for single batch
doc_ids = batch['idx']
total_size = batch['total_size']
texts_in_batch = [texts[doc_id] for doc_id in doc_ids]
# -- case when batch exceeds character limits
if total_size > CHARACTER_LIMITS:
assert len(doc_ids) == 1, 'Critical error: batching function is generating batches that exceed max limit with more than 1 text. Revisit the function and fix this.'

doc_id = doc_ids[0]
doc_size = sizes_dict[doc_id] // n_target_langs
batch_size = CHARACTER_LIMITS // doc_size

# -- case when a single doc is too large to be translated
if not batch_size:
process_error = self._process_error(f'Text idx={doc_id} too large to be translated', 'Max character limit for request', 400, doc_ids, target_languages)
if process_error:
return process_error

# -- case when single doc too big to be translated to all language, but small enough that it can be translated to some languages
else:
_translation_output = dict() # variable to store translations for single text but different language batches
_translation_failed = False # variable to track if translation is partial

batch_range = range(0, n_target_langs, batch_size)
n_batches = len(batch_range)

# batch by target languages
for batch_id, start_lang_idx in enumerate(batch_range):
end_lang_idx = start_lang_idx + batch_size
target_languages_ = target_languages[start_lang_idx: end_lang_idx]
total_size_ = doc_size * len(target_languages_)

response_output, status_code = self._post_request(
f'Translating batch {batch_id+1}/{n_batches} of text with idx={doc_id}. Target languages: {target_languages_}',
base_url, target_languages_, texts_in_batch, total_size_
)
if not is_request_valid(status_code):
process_error = self._process_error(
f'Partial translation of text idx={doc_id} to languages {target_languages_} failed. Reason: {response_output}',
response_output, status_code, doc_ids, target_languages_)
if process_error:
return process_error

# failure indicates translation is partial. Break loop if we don't care about partials in output
_translation_failed = True
if not self.include_partials_in_output:
break
else:
self._update_partial_translation(response_output, _translation_output, source_language)

if not _translation_failed or self.include_partials_in_output:
if _translation_output:
batch_translation_output.append(_translation_output)

# -- case when batch does not exceed character limits
else:
response_output, status_code = self._post_request(
f'Translating {len(texts_in_batch)} texts to {len(target_languages)} languages',
base_url, target_languages, texts_in_batch, total_size
)

if not is_request_valid(status_code):
process_error = self._process_error(
f'Translation failed for texts {doc_ids}. Reason: {response_output}',
response_output, status_code, doc_ids, target_languages
)
if process_error:
return process_error
else:
batch_translation_output += response_output

translation_outputs += batch_translation_output

if self.no_failures:
status_code = 200
# case when all translations failed, so return translation errors instead
elif not translation_outputs:
translation_outputs = self.translation_errors
status_code = 400
# case when translations partially failed, modify status code
else:
status_code = 206

return translation_outputs, status_code

@property
def _set_request_default(self):
"""Function for resetting translation errors and no_failures flag
"""

self.translation_errors = {}
self.no_failures = True

@property
def _set_no_failures_to_false(self):
"""Function to explicitly set no failures to False
"""
self.no_failures = False

@property
def _set_success_request_time(self):
"""Function to set the time a request is made
"""
self.time_of_last_success_request = time.time()

def _update_partial_translation(self, response_output, partial_translation_output, source_language):

# resp.json() returns a list with one item per input text; here there is a single text
response_item = response_output[0]

# concatenate outputs in correct format
if 'translations' not in partial_translation_output:
partial_translation_output['translations'] = response_item['translations']
else:
partial_translation_output['translations'] += response_item['translations']

# the detected language refers to the same source text in every language batch,
# so keeping the first detection is enough
if not source_language and 'detectedLanguage' not in partial_translation_output:
partial_translation_output['detectedLanguage'] = response_item['detectedLanguage']

return partial_translation_output

def _update_translation_errors(self, response_text: str, status_code: int, doc_ids: list, target_languages: list):
"""Add failed translation to errors dictionary

:param response_text: response text from failed request (status_code not beginning with 2)
:param status_code: status code from failed request
:param doc_ids: documents that were to be translated
:param target_languages: target languages used in request
"""
doc_ids = tuple(doc_ids)
if doc_ids not in self.translation_errors:
self.translation_errors[doc_ids] = dict(
reason=response_text,
status_code=status_code,
target_languages=target_languages
)
else:
self.translation_errors[doc_ids]['target_languages'] += target_languages
self.translation_errors[doc_ids]['status_code'] = status_code
self.translation_errors[doc_ids]['reason'] = response_text

def _process_error(self, msg: str, response_text: str, status_code: int, doc_ids: list, target_languages: list):
"""Processes failed request based on `ignore_on_translation_failure` strategy

:param msg: message to return or log depending on `ignore_on_translation_failure` strategy
:param response_text: response text from failed request (status_code not beginning with 2)
:param status_code: status code from failed request
:param doc_ids: documents that were to be translated
:param target_languages: target languages used in request
"""

self._set_no_failures_to_false

self._update_translation_errors(response_text, status_code, doc_ids, target_languages)

if not self.ignore_on_translation_failure:
return response_text, status_code

LOGGER.error(msg)

def _profile_texts(self, sizes: Dict[Union[int, str], int]):
"""Profiles texts to see if the request can be translated

:param sizes: size mapping for each document, {doc_id_1: size, ...}
"""
num_texts = len(sizes)
total_request_size = sum(sizes.values())
if total_request_size > MAX_CHARACTER_LIMITS_PER_HOUR:
return 'Your texts exceed max character limits per hour', 400

LOGGER.info(f'Detected `{num_texts}` texts with total request size `{total_request_size}`')

def _sleep(self, request_size: int):
"""Function to sleep prior to requests being made based on the size of the request and the time last successful request was made in order to avoid overloading Microsoft servers

:param request_size: size of the request being made
"""
if hasattr(self, 'time_of_last_success_request'):
time_diff_since_first_request = time.time() - self.time_of_last_success_request
time_diff_needed_for_next_request = request_size / (MAX_CHARACTER_LIMITS_PER_HOUR / 3600)
sleep_time = time_diff_needed_for_next_request - time_diff_since_first_request
if sleep_time > 0:
LOGGER.debug(f'Sleeping {sleep_time:.3g} seconds...')
time.sleep(sleep_time)

def _post_request(self, msg: str, base_url: str, target_languages: list, texts: List[str], request_size: Optional[int] = None) -> Tuple[Union[dict, str], int]:
"""Internal function to post requests to microsoft API

:param msg: message to log
:param base_url: base url for making the request
:param target_languages: list of target languages to translate text to
:param texts: texts to translate
:param request_size: size of the request being made for calculating sleep period, defaults to None

:return: for successful response, (status_code, [{"translations": [{"text": translated_text_1, "to": lang_1}, ...]}, ...]))
"""
if not request_size:
n_target_langs = len(target_languages)
request_size = sum([len(text)*n_target_langs for text in texts])

self._sleep(request_size)

url = add_array_api_parameters(base_url, param_name='to', param_values=target_languages)
LOGGER.info(msg)
body = [{'text': text} for text in texts]
resp = requests.post(url, headers=HEADERS, json=body)
status_code = resp.status_code
if is_request_valid(status_code):
self._set_success_request_time
return resp.json(), status_code
return resp.text, status_code
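A typical usage sketch of the class (hypothetical inputs; MICROSOFT_TRANSLATE_URL and HEADERS are assumed to be configured as in the earlier snippets):

translator = MicrosoftTranslator(
    ignore_on_translation_failure=True,   # log failures instead of stopping
    include_partials_in_output=False,     # drop partially translated texts
)

outputs, status_code = translator.translate_text(
    texts=["Hello world!", "How are you?"],
    target_languages=["de", "fr"],
)

if status_code in (200, 206):
    print(outputs)                        # list of translation dicts
if translator.translation_errors:
    print(translator.translation_errors)  # failed doc ids -> reason/status/languages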

For access to the full working code, please visit the repository.

The Microsoft Translate API is a versatile tool for translation and very easy to set up. However, it has request and character limits that make using it effectively difficult and potentially costly. In this article we discussed methods for making the best use of the translate API by automatically batching inputs.

Some final notes and comments on further direction:

  • Speed: the biggest bottleneck to speed is by far the wait time between requests needed to respect the request limits imposed by Microsoft. Purely as an academic exercise, though, the batching algorithm could be improved further. This has little practical value, since any benefits from faster batching would only become relevant as the number of documents grows very large, and by virtue of the max character limit a single request can never contain more than 50000 documents (e.g. 50000 documents of length 1, translated to a single language). If you are interested, I would recommend reading about the bin packing problem.
  • Assurance: at the moment, the code is designed to avoid most request limit problems. However, if a request does fail, the code will not re-try it automatically. For that, a worker architecture is needed that keeps re-trying the failed translations (perhaps with exponential backoff) until all of them have completed successfully. Whether this is practical is also worth considering, since the current translator class already stores failed translations, making it relatively trivial to re-translate them manually.

Author’s Note

If you liked this article or learned something new, please consider getting a membership using my referral link:

This gives you unrestricted access to all of Medium, while helping me produce more content at no extra cost to you.

If you are interested in in-depth tutorials on Software Engineering and Machine Learning, then join my email list to get notified whenever I release a new article:

Happy learning and till next time!
