utils – Various utility functions
Various general utility functions.
- class gensim.utils.ClippedCorpus(corpus, max_docs=None)¶
-
Bases:
SaveLoad
Wrap a corpus and return at most max_docs documents from it.
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Input corpus.
max_docs (int) – Maximum number of documents in the wrapped corpus.
Warning
Any documents after max_docs are ignored. This effectively limits the length of the returned corpus to <= max_docs. Set max_docs=None for “no limit”, effectively wrapping the entire input corpus.
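The clipping behaviour described above can be sketched in plain Python. This is a minimal illustration, not Gensim's implementation: the class name ClippedCorpusSketch is hypothetical, and the use of itertools.islice is an assumption about one reasonable way to do it.

```python
from itertools import islice


class ClippedCorpusSketch:
    """Hypothetical stand-in for ClippedCorpus: yield at most max_docs documents."""

    def __init__(self, corpus, max_docs=None):
        self.corpus = corpus
        self.max_docs = max_docs

    def __iter__(self):
        # islice(iterable, None) imposes no limit, which matches max_docs=None.
        return islice(self.corpus, self.max_docs)


corpus = [[(0, 1.0)], [(1, 2.0)], [], [(2, 1.0)]]
clipped = list(ClippedCorpusSketch(corpus, max_docs=2))
unclipped = list(ClippedCorpusSketch(corpus))
```

Here clipped holds only the first two documents, while unclipped holds all four.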
- add_lifecycle_event(event_name, log_level=20, **event)¶
-
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
-
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)¶
-
Load an object previously saved using save() from a file.
- Parameters
-
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
-
save()
-
Save object to file.
- Returns
-
Object loaded from fname.
- Return type
-
object
- Raises
-
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
-
Save the object to a file.
- Parameters
-
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
-
load()
-
Load object from file.
- class gensim.utils.FakeDict(num_terms)¶
-
Bases:
object
Objects of this class act as dictionaries that map integer->str(integer), for the half-open range of integers [0, num_terms).
This is meant to avoid allocating real dictionaries when num_terms is huge, which is a waste of memory.
- Parameters
-
num_terms (int) – Number of terms.
- get(val, default=None)¶
- iteritems()¶
-
Iterate over all keys and values.
- Yields
-
(int, str) – Pair of (id, token).
- keys()¶
-
Override the dict.keys(), which is used to determine the maximum internal id of a corpus, i.e. the vocabulary dimensionality.
- Returns
-
Highest id, packed in list.
- Return type
-
list of int
Notes
To avoid materializing the whole range(0, self.num_terms), this returns the highest id = [self.num_terms - 1] only.
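The idea behind FakeDict can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the class name FakeDictSketch is hypothetical, and only a few dictionary methods are shown.

```python
class FakeDictSketch:
    """Hypothetical stand-in for FakeDict: map i -> str(i) for 0 <= i < num_terms."""

    def __init__(self, num_terms):
        self.num_terms = num_terms

    def __getitem__(self, val):
        # Compute the value on demand instead of storing num_terms entries.
        if 0 <= val < self.num_terms:
            return str(val)
        raise KeyError(val)

    def __contains__(self, val):
        return 0 <= val < self.num_terms

    def __len__(self):
        return self.num_terms

    def get(self, val, default=None):
        return str(val) if 0 <= val < self.num_terms else default

    def keys(self):
        # Only the highest id is returned, so the full range is never materialized.
        return [self.num_terms - 1]


id2word = FakeDictSketch(10**9)  # a real dict of this size would be very large
```

Lookups such as id2word[42] work without allocating a billion entries, and max(id2word.keys()) still gives the highest id.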
- class gensim.utils.InputQueue(q, corpus, chunksize, maxsize, as_numpy)¶
-
Bases:
Process
Populate a queue of input chunks from a streamed corpus.
Useful for reading and chunking corpora in the background, in a separate process, so that workers that use the queue are not starved for input chunks.
- Parameters
-
q (multiprocessing.Queue) – Enqueue chunks into this queue.
corpus (iterable of iterable of (int, numeric)) – Corpus to read and split into “chunksize”-ed groups
chunksize (int) – Split corpus into chunks of this size.
as_numpy (bool, optional) – Enqueue chunks as numpy.ndarray instead of lists.
- property authkey¶
- close()¶
-
Close the Process object.
This method releases resources held by the Process object. It is an error to call this method if the child process is still running.
- property daemon¶
-
Return whether process is a daemon
- property exitcode¶
-
Return exit code of process or None if it has yet to stop
- property ident¶
-
Return identifier (PID) of process or None if it has yet to start
- is_alive()¶
-
Return whether process is alive
- join(timeout=None)¶
-
Wait until child process terminates
- kill()¶
-
Terminate process; sends SIGKILL signal or uses TerminateProcess()
- property name¶
- property pid¶
-
Return identifier (PID) of process or None if it has yet to start
- run()¶
-
Method to be run in sub-process; can be overridden in sub-class
- property sentinel¶
-
Return a file descriptor (Unix) or handle (Windows) suitable for waiting for process termination.
- start()¶
-
Start child process
- terminate()¶
-
Terminate process; sends SIGTERM signal or uses TerminateProcess()
- gensim.utils.NO_CYTHON = RuntimeError("Compiled extensions are unavailable. If you've installed from a package, ask the package maintainer to include compiled extensions. If you're building Gensim from source yourself, install Cython and a C compiler, and then run `python setup.py build_ext --inplace` to retry. ")¶
-
An exception that gensim code raises when Cython extensions are unavailable.
- class gensim.utils.RepeatCorpus(corpus, reps)¶
-
Bases:
SaveLoad
Wrap a corpus as another corpus of length reps. This is achieved by repeating documents from corpus over and over again, until the requested length len(result) == reps is reached. Repetition is done on the fly and efficiently, via itertools.
Examples
>>> from gensim.utils import RepeatCorpus
>>>
>>> corpus = [[(1, 2)], []]  # 2 documents
>>> list(RepeatCorpus(corpus, 5))  # repeat 2.5 times to get 5 documents
[[(1, 2)], [], [(1, 2)], [], [(1, 2)]]
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Input corpus.
reps (int) – Number of repeats for documents from corpus.
- add_lifecycle_event(event_name, log_level=20, **event)¶
-
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
-
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)¶
-
Load an object previously saved using save() from a file.
- Parameters
-
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
-
save()
-
Save object to file.
- Returns
-
Object loaded from fname.
- Return type
-
object
- Raises
-
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
-
Save the object to a file.
- Parameters
-
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
-
load()
-
Load object from file.
- class gensim.utils.RepeatCorpusNTimes(corpus, n)¶
-
Bases:
SaveLoad
Wrap a corpus and repeat it n times.
Examples
>>> from gensim.utils import RepeatCorpusNTimes
>>>
>>> corpus = [[(1, 0.5)], []]
>>> list(RepeatCorpusNTimes(corpus, 3))  # repeat 3 times
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)], []]
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Input corpus.
n (int) – Number of repeats for corpus.
- add_lifecycle_event(event_name, log_level=20, **event)¶
-
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
-
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)¶
-
Load an object previously saved using save() from a file.
- Parameters
-
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
-
save()
-
Save object to file.
- Returns
-
Object loaded from fname.
- Return type
-
object
- Raises
-
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
-
Save the object to a file.
- Parameters
-
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
-
load()
-
Load object from file.
- class gensim.utils.SaveLoad¶
-
Bases:
object
Serialize/deserialize objects from disk, by equipping them with the save() / load() methods.
Warning
This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes such as lambda functions etc.
- add_lifecycle_event(event_name, log_level=20, **event)¶
-
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
-
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
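The event-recording behaviour described above can be sketched without Gensim. This is a minimal illustration under stated assumptions: the class LifecycleSketch and the logger name are hypothetical, and the gensim version key is left out because the sketch does not import Gensim.

```python
import logging
import platform
import sys
from datetime import datetime

logger = logging.getLogger("lifecycle_sketch")  # hypothetical logger name


class LifecycleSketch:
    """Hypothetical stand-in that mimics the documented add_lifecycle_event()."""

    def __init__(self):
        self.lifecycle_events = []

    def add_lifecycle_event(self, event_name, log_level=logging.INFO, **event):
        # Keys added automatically, following the parameter description above.
        # The "gensim" key is omitted here since this sketch does not import gensim.
        event["datetime"] = datetime.now().isoformat()
        event["python"] = sys.version
        event["platform"] = platform.platform()
        event["event"] = event_name

        # Setting lifecycle_events to None disables recording.
        if self.lifecycle_events is not None:
            self.lifecycle_events.append(event)

        # A falsy log_level (e.g. False) skips logging entirely.
        if log_level:
            logger.log(log_level, "%s lifecycle event %s", type(self).__name__, event)


model = LifecycleSketch()
model.add_lifecycle_event("created", log_level=False, msg="built from 3 documents")
```

The default log_level=20 in the signature corresponds to logging.INFO.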
- classmethod load(fname, mmap=None)¶
-
Load an object previously saved using save() from a file.
- Parameters
-
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
-
save()
-
Save object to file.
- Returns
-
Object loaded from fname.
- Return type
-
object
- Raises
-
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
-
Save the object to a file.
- Parameters
-
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
-
load()
-
Load object from file.
- class gensim.utils.SlicedCorpus(corpus, slice_)¶
-
Bases:
SaveLoad
Wrap corpus and return a slice of it.
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Input corpus.
slice_ (slice or iterable) – Slice for corpus.
Notes
Negative slicing can only be used if the corpus is indexable, otherwise, the corpus will be iterated over. Slice can also be a np.ndarray to support fancy indexing.
Calculating the size of a SlicedCorpus is expensive when using a slice as the corpus has to be iterated over once. Using a list or np.ndarray does not have this drawback, but consumes more memory.
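The difference between slicing an indexable corpus and a streamed one can be sketched as follows. This is an illustration only: the helper slice_corpus is hypothetical, and it assumes itertools.islice for the streamed case, which rejects negative indices.

```python
from itertools import islice


def slice_corpus(corpus, slice_):
    """Hypothetical helper: slice an indexable corpus directly, a stream via islice."""
    if hasattr(corpus, "__getitem__"):
        # Indexable corpora support negative indices.
        return list(corpus[slice_])
    # Streams can only be consumed front to back; islice raises ValueError
    # for negative start/stop/step values.
    return list(islice(corpus, slice_.start, slice_.stop, slice_.step))


docs = [[(i, 1.0)] for i in range(10)]

from_list = slice_corpus(docs, slice(-2, None))                        # last two documents
from_stream = slice_corpus((doc for doc in docs), slice(2, 8, 2))      # documents 2, 4, 6
```

Passing slice(-2, None) together with a generator raises ValueError in this sketch, which mirrors the restriction on negative slicing noted above.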
- add_lifecycle_event(event_name, log_level=20, **event)¶
-
Append an event into the lifecycle_events attribute of this object, and also optionally log the event at log_level.
Events are important moments during the object’s life, such as “model created”, “model saved”, “model loaded”, etc.
The lifecycle_events attribute is persisted across the object’s save() and load() operations. It has no impact on the use of the model, but is useful during debugging and support.
Set self.lifecycle_events = None to disable this behaviour. Calls to add_lifecycle_event() will not record events into self.lifecycle_events then.
- Parameters
-
event_name (str) – Name of the event. Can be any label, e.g. “created”, “stored” etc.
event (dict) –
Key-value mapping to append to self.lifecycle_events. Should be JSON-serializable, so keep it simple. Can be empty.
This method will automatically add the following key-values to event, so you don’t have to specify them:
datetime: the current date & time
gensim: the current Gensim version
python: the current Python version
platform: the current platform
event: the name of this event
log_level (int) – Also log the complete event dict, at the specified log level. Set to False to not log at all.
- classmethod load(fname, mmap=None)¶
-
Load an object previously saved using save() from a file.
- Parameters
-
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
-
save()
-
Save object to file.
- Returns
-
Object loaded from fname.
- Return type
-
object
- Raises
-
AttributeError – When called on an object instance instead of class (this is a class method).
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=4)¶
-
Save the object to a file.
- Parameters
-
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
-
load()
-
Load object from file.
- gensim.utils.any2unicode(text, encoding='utf8', errors='strict')¶
-
Convert text (bytestring in given encoding or unicode) to unicode.
- Parameters
-
text (str) – Input text.
errors (str, optional) – Error handling behaviour if text is a bytestring.
encoding (str, optional) – Encoding of text if it is a bytestring.
- Returns
-
Unicode version of text.
- Return type
-
str
- gensim.utils.any2utf8(text, errors='strict', encoding='utf8')¶
-
Convert a unicode or bytes string in the given encoding into a utf8 bytestring.
- Parameters
-
text (str) – Input text.
errors (str, optional) – Error handling behaviour if text is a bytestring.
encoding (str, optional) – Encoding of text if it is a bytestring.
- Returns
-
Bytestring in utf8.
- Return type
-
bytes
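The two conversions above can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3 semantics where text is str and bytestrings are bytes; the function names to_unicode and to_utf8 are hypothetical.

```python
def to_unicode(text, encoding="utf8", errors="strict"):
    """Hypothetical counterpart of any2unicode: return text as str."""
    if isinstance(text, str):
        return text
    # Decode a bytestring using the given source encoding.
    return str(text, encoding, errors=errors)


def to_utf8(text, errors="strict", encoding="utf8"):
    """Hypothetical counterpart of any2utf8: return text as UTF-8 bytes."""
    if isinstance(text, str):
        return text.encode("utf8")
    # Decode from the source encoding first, then re-encode as UTF-8.
    return str(text, encoding, errors=errors).encode("utf8")


latin1_bytes = "café".encode("latin1")
as_text = to_unicode(latin1_bytes, encoding="latin1")
as_utf8 = to_utf8(latin1_bytes, encoding="latin1")
```

Note that encoding describes the input bytestring; the output of to_utf8 is always UTF-8.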
- gensim.utils.call_on_class_only(*args, **kwargs)¶
-
Helper to raise AttributeError if a class method is called on an instance. Used internally.
- Parameters
-
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.
- Raises
-
AttributeError – If a class method is called on an instance.
- gensim.utils.check_output(stdout=-1, *popenargs, **kwargs)¶
-
Run OS command with the given arguments and return its output as a byte string.
Backported from Python 2.7 with a few minor modifications. Used in word2vec/glove2word2vec tests. Behaves very similarly to https://docs.python.org/2/library/subprocess.html#subprocess.check_output.
Examples
>>> from gensim.utils import check_output
>>> check_output(args=['echo', '1'])
'1\n'
- Raises
-
KeyboardInterrupt – If Ctrl+C pressed.
- gensim.utils.chunkize(corpus, chunksize, maxsize=0, as_numpy=False)¶
-
Split corpus into fixed-sized chunks, using chunkize_serial().
- Parameters
-
corpus (iterable of object) – An iterable.
chunksize (int) – Split corpus into chunks of this size.
maxsize (int, optional) – If > 0, prepare chunks in a background process, filling a chunk queue of size at most maxsize.
as_numpy (bool, optional) – Yield chunks as np.ndarray instead of lists?
- Yields
-
list OR np.ndarray – “chunksize”-ed chunks of elements from corpus.
Notes
Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok, chunking is done efficiently via itertools.
If maxsize > 0, don’t wait idly in between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate process, and is meant to reduce I/O delays, which can be significant when corpus comes from a slow medium like HDD, database or network.
If maxsize == 0, don’t fool around with parallelism and simply yield the chunks via chunkize_serial() (no I/O optimizations).
- Yields
-
list of object OR np.ndarray – Groups based on iterable
- gensim.utils.chunkize_serial(iterable, chunksize, as_numpy=False, dtype=<class 'numpy.float32'>)¶
-
Yield elements from iterable in “chunksize”-ed groups.
The last returned element may be smaller if the length of collection is not divisible by chunksize.
- Parameters
-
iterable (iterable of object) – An iterable.
chunksize (int) – Split iterable into chunks of this size.
as_numpy (bool, optional) – Yield chunks as np.ndarray instead of lists.
- Yields
-
list OR np.ndarray – “chunksize”-ed chunks of elements from iterable.
Examples
>>> from gensim.utils import chunkize_serial
>>> print(list(chunkize_serial(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
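Serial chunking of a once-only stream can be sketched with itertools.islice. This is a minimal illustration rather than the library's code; the function name chunks is hypothetical and the as_numpy option is left out.

```python
from itertools import islice


def chunks(iterable, chunksize):
    """Hypothetical helper: yield lists of up to chunksize items from iterable."""
    it = iter(iterable)
    while True:
        # Pull at most chunksize items; an empty list means the stream is exhausted.
        chunk = list(islice(it, chunksize))
        if not chunk:
            break
        yield chunk


from_range = list(chunks(range(10), 3))
from_generator = list(chunks((x * x for x in range(5)), 2))
```

Because the input is consumed through a single iterator, the sketch works on generators as well as on lists.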
- gensim.utils.copytree_hardlink(source, dest)¶
-
Recursively copy a directory à la shutil.copytree, but hardlink files instead of copying.
- Parameters
-
source (str) – Path to source directory
dest (str) – Path to destination directory
Warning
Available on UNIX systems only.
- gensim.utils.deaccent(text)¶
-
Remove letter accents from the given string.
- Parameters
-
text (str) – Input string.
- Returns
-
Unicode string without accents.
- Return type
-
str
Examples
>>> from gensim.utils import deaccent
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'
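One common way to strip accents is Unicode decomposition followed by removal of combining marks. The sketch below illustrates that technique with the standard unicodedata module; the function name strip_accents is hypothetical, and whether deaccent() works exactly this way is an assumption.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: remove combining accent marks from text."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever remains into canonical form.
    return unicodedata.normalize("NFC", stripped)


plain = strip_accents("Šéf chomutovských komunistů dostal poštou bílý prášek")
```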
- gensim.utils.decode_htmlentities(text)¶
-
Decode all HTML entities in text that are encoded as hex, decimal or named entities. Adapted from python-twitter-ircbot/html_decode.py.
- Parameters
-
text (str) – Input HTML.
Examples
>>> from gensim.utils import decode_htmlentities
>>>
>>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
>>> print(decode_htmlentities(u).encode('UTF-8'))
E tu vivrai nel terrore - L'aldilà (1981)
>>> print(decode_htmlentities("l&#39;eau"))
l'eau
>>> print(decode_htmlentities("foo &lt; bar"))
foo < bar
- gensim.utils.default_prng = Generator(PCG64) at 0x7F71ED884F20¶
-
A default, shared numpy-Generator-based PRNG for any/all uses that don’t require seeding
- gensim.utils.deprecated(reason)¶
-
Decorator to mark functions as deprecated.
Calling a decorated function will result in a warning being emitted, using warnings.warn. Adapted from https://stackoverflow.com/a/40301488/8001386.
- Parameters
-
reason (str) – Reason of deprecation.
- Returns
-
Decorated function
- Return type
-
function
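A deprecation decorator of this kind can be sketched as follows. This is an illustration under stated assumptions: the name deprecated_sketch, the message format and the example functions are hypothetical, not Gensim's own.

```python
import functools
import warnings


def deprecated_sketch(reason):
    """Hypothetical decorator factory: warn whenever the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)  # keep the original name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"Call to deprecated `{func.__name__}` ({reason}).",
                category=DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_sketch("use new_sum instead")
def old_sum(a, b):
    return a + b
```

Calling old_sum(1, 2) still returns 3, but emits a DeprecationWarning that carries the given reason.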
- gensim.utils.dict_from_corpus(corpus)¶
-
Scan corpus for all word ids that appear in it, then construct a mapping which maps each word_id -> str(word_id).
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Collection of texts in BoW format.
- Returns
-
id2word – “Fake” mapping which maps each word_id -> str(word_id).
- Return type
-
FakeDict
Warning
This function is used whenever words need to be displayed (as opposed to just their ids) but no word_id -> word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest word_id found.
- gensim.utils.effective_n_jobs(n_jobs)¶
-
Determine the number of jobs that can run in parallel.
Just like in sklearn, passing n_jobs=-1 means using all available CPU cores.
- Parameters
-
n_jobs (int) – Number of workers requested by caller.
- Returns
-
Number of effective jobs.
- Return type
-
int
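The n_jobs convention can be sketched with the standard library. This is an assumption-laden illustration: the function name resolve_n_jobs is hypothetical, and the handling of 0, None and values below -1 follows the joblib/sklearn convention rather than anything stated above.

```python
import os


def resolve_n_jobs(n_jobs):
    """Hypothetical helper: turn a requested n_jobs into an effective worker count."""
    if n_jobs == 0:
        # Assumption: zero workers is treated as an error.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs is None:
        # Assumption: None means a single job.
        return 1
    if n_jobs < 0:
        # -1 means all cores, -2 all but one, and so on, never dropping below 1.
        n_cpus = os.cpu_count() or 1
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


all_cores = resolve_n_jobs(-1)
```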
- gensim.utils.file_or_filename(input)¶
-
Open a filename for reading with smart_open, or seek to the beginning if input is an already open file.
- Parameters
-
input (str or file-like) – Filename or file-like object.
- Returns
-
An open file, positioned at the beginning.
- Return type
-
file-like object
- gensim.utils.flatten(nested_list)¶
-
Recursively flatten a nested sequence of elements.
- Parameters
-
nested_list (iterable) – Possibly nested sequence of elements to flatten.
- Returns
-
Flattened version of nested_list where any elements that are an iterable (collections.abc.Iterable) have been unpacked into the top-level list, in a recursive fashion.
- Return type
-
list
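Recursive flattening can be sketched with a generator. This is a minimal illustration; the function names are hypothetical, and treating str and bytes as atomic elements is an assumption made so that strings are not split into characters.

```python
from collections.abc import Iterable


def lazy_flatten_sketch(nested):
    """Hypothetical helper: yield the leaf elements of a nested iterable."""
    for element in nested:
        # Assumption: str and bytes count as leaves, not as iterables to unpack.
        if isinstance(element, Iterable) and not isinstance(element, (str, bytes)):
            yield from lazy_flatten_sketch(element)
        else:
            yield element


def flatten_sketch(nested):
    """Eager variant: collect the leaves into a list."""
    return list(lazy_flatten_sketch(nested))


flat = flatten_sketch([1, [2, (3, 4)], [[5], "ab"]])
```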
- gensim.utils.getNS(host=None, port=None, broadcast=True, hmac_key=None)¶
-
Get a Pyro4 name server proxy.
- Parameters
-
host (str, optional) – Name server hostname.
port (int, optional) – Name server port.
broadcast (bool, optional) – Use broadcast mechanism? (i.e. reach out to all Pyro nodes in the network)
hmac_key (str, optional) – Private key.
- Raises
-
RuntimeError – When Pyro name server is not found.
- Returns
-
Proxy from Pyro4.
- Return type
-
Pyro4.core.Proxy
- gensim.utils.get_max_id(corpus)¶
-
Get the highest feature id that appears in the corpus.
- Parameters
-
corpus (iterable of iterable of (int, numeric)) – Collection of texts in BoW format.
- Returns
-
Highest feature id.
- Return type
-
int
Notes
For empty corpus return -1.
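Finding the highest feature id amounts to a single pass over the corpus. The sketch below is an illustration only, with a hypothetical function name.

```python
def max_feature_id(corpus):
    """Hypothetical helper: highest feature id in a BoW corpus, or -1 if there is none."""
    max_id = -1
    for document in corpus:
        if document:  # skip empty documents
            max_id = max(max_id, max(feature_id for feature_id, _ in document))
    return max_id


highest = max_feature_id([[(1, 1.0)], [], [(7, 0.5), (3, 2)]])
```

Returning -1 for an empty corpus means that highest + 1 gives the number of features (zero) without a special case.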
- gensim.utils.get_my_ip()¶
-
Try to obtain our external ip (from the Pyro4 nameserver’s point of view)
- Returns
-
IP address.
- Return type
-
str
Warning
This tries to sidestep the issue of bogus /etc/hosts entries and other local misconfiguration, which often mess up hostname resolution. If all else fails, fall back to simple socket.gethostbyname() lookup.
- gensim.utils.get_random_state(seed)¶
-
Generate numpy.random.RandomState based on input seed.
- Parameters
-
seed ({None, int, array_like}) – Seed for random state.
- Returns
-
Random state.
- Return type
-
numpy.random.RandomState
- Raises
-
AttributeError – If seed is not {None, int, array_like}.
Notes
Method originally from maciejkula/glove-python and written by @joshloyal.
- gensim.utils.grouper(iterable, chunksize, as_numpy=False, dtype=<class 'numpy.float32'>)¶
-
Yield elements from iterable in “chunksize”-ed groups.
The last returned element may be smaller if the length of collection is not divisible by chunksize.
- Parameters
-
iterable (iterable of object) – An iterable.
chunksize (int) – Split iterable into chunks of this size.
as_numpy (bool, optional) – Yield chunks as np.ndarray instead of lists.
- Yields
-
list OR np.ndarray – “chunksize”-ed chunks of elements from iterable.
Examples
>>> from gensim.utils import grouper
>>> print(list(grouper(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
- gensim.utils.identity(p)¶
-
Identity function, for flows that don’t accept lambdas (pickling etc.).
- Parameters
-
p (object) – Input parameter.
- Returns
-
Same as p.
- Return type
-
object
- gensim.utils.ignore_deprecation_warning()¶
-
Contextmanager for ignoring DeprecationWarning.
- gensim.utils.is_corpus(obj)¶
-
Check whether obj is a corpus, by peeking at its first element. Works even on streamed generators. The peeked element is put back into an object returned by this function, so always use that returned object instead of the original obj.
- Parameters
-
obj (object) – An iterable of iterable that contains (int, numeric).
- Returns
-
Pair of (is obj a corpus, obj with peeked element restored)
- Return type
-
(bool, object)
Examples
>>> from gensim.utils import is_corpus
>>> corpus = [[(1, 1.0)], [(2, -0.3), (3, 0.12)]]
>>> corpus_or_not, corpus = is_corpus(corpus)
Warning
An “empty” corpus (empty input sequence) is ambiguous, so in this case the result is forcefully defined as (False, obj).
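The peek-and-restore pattern can be sketched with itertools.chain. This is an illustration under stated assumptions: the function name peek_is_corpus and the exact validity check on the first document are hypothetical, not the library's logic.

```python
from itertools import chain
from numbers import Real


def peek_is_corpus(obj):
    """Hypothetical helper: return (looks_like_corpus, iterable_with_first_doc_restored)."""
    it = iter(obj)
    try:
        first_doc = next(it)
    except StopIteration:
        # Empty input is ambiguous, so the answer is defined as False.
        return False, obj
    # Put the consumed document back in front of the remaining stream.
    restored = chain([first_doc], it)
    try:
        # Assumption: a corpus document is a sequence of (int, number) pairs.
        ok = all(isinstance(i, int) and isinstance(v, Real) for i, v in first_doc)
    except (TypeError, ValueError):
        ok = False
    return ok, restored


stream = (doc for doc in [[(1, 1.0)], [(2, -0.3), (3, 0.12)]])
ok, stream = peek_is_corpus(stream)
documents = list(stream)
```

Rebinding stream to the returned object is what keeps the first document from being lost.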
- gensim.utils.is_empty(corpus)¶
-
Is the corpus (an iterable or a scipy.sparse array) empty?
- gensim.utils.iter_windows(texts, window_size, copy=False, ignore_below_size=True, include_doc_num=False)¶
-
Produce a generator over the given texts using a sliding window of window_size.
The windows produced are views of some subsequence of a text. To use deep copies instead, pass copy=True.
- Parameters
-
texts (list of str) – List of string sentences.
window_size (int) – Size of sliding window.
copy (bool, optional) – Produce deep copies.
ignore_below_size (bool, optional) – Ignore documents that are not at least window_size in length?
include_doc_num (bool, optional) – Yield the text position with texts along with each window?
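The sliding-window behaviour for a single tokenized text can be sketched as follows. This is a simplified illustration: the function name sliding_windows is hypothetical, it handles one text at a time, and list slices are copies rather than views.

```python
def sliding_windows(tokens, window_size, ignore_below_size=True):
    """Hypothetical helper: yield consecutive windows of window_size tokens."""
    if len(tokens) < window_size:
        # A text shorter than the window is either skipped or yielded whole.
        if not ignore_below_size:
            yield tokens
        return
    for start in range(len(tokens) - window_size + 1):
        # Note: a list slice is a copy, unlike the views mentioned above.
        yield tokens[start:start + window_size]


windows = list(sliding_windows(["a", "b", "c", "d"], 3))
```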
- gensim.utils.keep_vocab_item(word, count, min_count, trim_rule=None)¶
-
Should we keep word in the vocab or remove it?
- Parameters
-
word (str) – Input word.
count (int) – Number of times that word appeared in a corpus.
min_count (int) – Discard words with frequency smaller than this.
trim_rule (function, optional) – Custom function to decide whether to keep or discard this word. If a custom trim_rule is not specified, the default behaviour is simply count >= min_count.
- Returns
-
True if word should stay, False otherwise.
- Return type
-
bool
- gensim.utils.lazy_flatten(nested_list)¶
-
Lazy version of flatten().
- Parameters
-
nested_list (list) – Possibly nested list.
- Yields
-
object – Element of list
- gensim.utils.merge_counts(dict1, dict2)¶
-
Merge dict1 of (word, freq1) and dict2 of (word, freq2) into dict1 of (word, freq1+freq2).
- Parameters
-
dict1 (dict of (str, int)) – First dictionary.
dict2 (dict of (str, int)) – Second dictionary.
- Returns
-
result – Merged dictionary with sum of frequencies as values.
- Return type
-
dict
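Merging the counts in place can be sketched in a few lines. This is an illustration with a hypothetical function name, not the library's code.

```python
def merge_counts_sketch(dict1, dict2):
    """Hypothetical helper: add the frequencies of dict2 into dict1 and return dict1."""
    for word, freq in dict2.items():
        dict1[word] = dict1.get(word, 0) + freq
    return dict1


counts_a = {"cat": 2, "dog": 1}
counts_b = {"dog": 4, "fish": 3}
merged = merge_counts_sketch(counts_a, counts_b)
```

Since dict1 is modified in place, merged and counts_a refer to the same dictionary afterwards.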
- gensim.utils.mock_data(n_items=1000, dim=1000, prob_nnz=0.5, lam=1.0)¶
-
Create a random Gensim-style corpus (BoW), using mock_data_row().
- Parameters
-
n_items (int) – Size of corpus.
dim (int) – Dimension of vector, used for mock_data_row().
prob_nnz (float, optional) – Probability that each coordinate is nonzero; the nonzero values are drawn from a Poisson distribution. Used for mock_data_row().
lam (float, optional) – Parameter for the Poisson distribution, used for mock_data_row().
- Returns
-
Gensim-style corpus.
- Return type
-
list of list of (int, float)
- gensim.utils.mock_data_row(dim=1000, prob_nnz=0.5, lam=1.0)¶
-
Create a random gensim BoW vector, with the feature counts following the Poisson distribution.
- Parameters
-
dim (int, optional) – Dimension of vector.
prob_nnz (float, optional) – Probability that each coordinate is nonzero; the nonzero values are drawn from the Poisson distribution.
lam (float, optional) – Lambda parameter for the Poisson distribution.
- Returns
-
Vector in BoW format.
- Return type
-
list of (int, float)
- gensim.utils.open_file(input)¶
-
Provide “with-like” behaviour without closing the file object.
- Parameters
-
input (str or file-like) – Filename or file-like object.
- Yields
-
file – File-like object based on input (or input itself if it is already file-like).
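The idea can be sketched with contextlib.contextmanager. This is not gensim's code: the name open_path_or_file is hypothetical, and the sketch assumes that a file opened from a path is closed on exit while a file-like object passed in by the caller is left open.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_path_or_file(source, mode="rb"):
    """Yield a file object for `source`, which is a path or a file-like object."""
    if isinstance(source, str):
        # The helper opened the file, so the helper closes it.
        with open(source, mode) as handle:
            yield handle
    else:
        # The caller owns this object, so it is left open on exit.
        yield source


buffer = io.BytesIO(b"hello")
with open_path_or_file(buffer) as fin:
    data = fin.read()

print(data, buffer.closed)
```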
- gensim.utils.pickle(obj, fname, protocol=4)¶
-
Pickle object obj to file fname, using smart_open so that fname can be on S3, HDFS, compressed etc.
- Parameters
-
obj (object) – Any python object.
fname (str) – Path to pickle file.
protocol (int, optional) – Pickle protocol number.
- gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None)¶
-
Remove all entries from the vocab dictionary with count smaller than min_reduce.
Modifies vocab in place, returns the sum of all counts that were pruned.
- Parameters
-
vocab (dict) – Input dictionary.
min_reduce (int) – Frequency threshold for tokens in vocab.
trim_rule (function, optional) – Function for trimming entities from vocab, default behaviour is vocab[w] <= min_reduce.
- Returns
-
result – Sum of all counts that were pruned.
- Return type
-
int
- gensim.utils.pyro_daemon(name, obj, random_suffix=False, ip=None, port=None, ns_conf=None)¶
-
Register an object with the Pyro name server.
Start the name server if not running yet and block until the daemon is terminated. The object is registered under name, or name + some random suffix if random_suffix is set.
- gensim.utils.qsize(queue)¶
-
Get the (approximate) queue size where available.
- Parameters
-
queue (queue.Queue) – Input queue.
- Returns
-
Queue size, -1 if qsize method isn’t implemented (OS X).
- Return type
-
int
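The fallback can be sketched as a guarded call. The sketch assumes that an unsupported platform signals this by raising NotImplementedError from qsize(); the names approximate_qsize and NoSizeQueue are hypothetical.

```python
import queue


def approximate_qsize(q):
    """Return q.qsize(), or -1 when the platform does not implement it."""
    try:
        return q.qsize()
    except NotImplementedError:
        return -1


class NoSizeQueue:
    """Hypothetical queue type that mimics a platform without qsize support."""

    def qsize(self):
        raise NotImplementedError


q = queue.Queue()
q.put("a")
q.put("b")
print(approximate_qsize(q), approximate_qsize(NoSizeQueue()))
```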
- gensim.utils.randfname(prefix='gensim')¶
-
Generate a random filename in temp.
- Parameters
-
prefix (str) – Prefix of filename.
- Returns
-
Full path in the system’s temporary folder, ending in a random filename.
- Return type
-
str
- gensim.utils.revdict(d)¶
-
Reverse a dictionary mapping, i.e. {1: 2, 3: 4} -> {2: 1, 4: 3}.
- Parameters
-
d (dict) – Input dictionary.
- Returns
-
Reversed dictionary mapping.
- Return type
-
dict
Notes
When two keys map to the same value, only one of them will be kept in the result (which one is kept is arbitrary).
Examples
>>> from gensim.utils import revdict
>>> d = {1: 2, 3: 4}
>>> revdict(d)
{2: 1, 4: 3}
- gensim.utils.safe_unichr(intval)¶
-
Create a unicode character from its integer value. In case unichr fails, render the character as an escaped \U<8-digit hex value of intval> string.
- Parameters
-
intval (int) – Integer code of character
- Returns
-
Unicode string of character
- Return type
-
string
- gensim.utils.sample_dict(d, n=10, use_random=True)¶
-
Select n (possibly random) items from the dictionary d.
- Parameters
-
d (dict) – Input dictionary.
n (int, optional) – Number of items to select.
use_random (bool, optional) – Select items randomly (without replacement), instead of by the natural dict iteration order?
- Returns
-
Selected items from dictionary, as a list.
- Return type
-
list of (object, object)
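The two selection modes can be sketched with the standard library. The name sample_items is hypothetical, and capping n at the dictionary size is an assumption of this sketch.

```python
import itertools
import random


def sample_items(d, n=10, use_random=True):
    """Return up to n (key, value) pairs from d as a list."""
    n = min(n, len(d))  # assumption: never ask for more items than exist
    if use_random:
        # Sampling without replacement from the full item list.
        return random.sample(list(d.items()), n)
    # Otherwise take the first n items in natural iteration order.
    return list(itertools.islice(d.items(), n))


vocab = {"a": 1, "b": 2, "c": 3}
print(sample_items(vocab, n=2, use_random=False))
```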
- gensim.utils.save_as_line_sentence(corpus, filename)¶
-
Save the corpus in LineSentence format, i.e. one sentence per line, with tokens separated by spaces.
- Parameters
-
corpus (iterable of iterables of strings) –
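The file layout can be illustrated with a few lines of plain Python. This sketch uses the built-in open() and UTF-8 encoding, both assumptions; the name write_line_sentences is hypothetical.

```python
import os
import tempfile


def write_line_sentences(corpus, filename):
    """Write one sentence per line, with tokens separated by a single space."""
    with open(filename, "w", encoding="utf8") as fout:
        for sentence in corpus:
            fout.write(" ".join(sentence) + "\n")


corpus = [["hello", "world"], ["gensim", "utils"]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_line_sentences(corpus, path)

with open(path, encoding="utf8") as fin:
    lines = fin.read().splitlines()
print(lines)
```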
- gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)¶
-
Convert a document into a list of lowercase tokens, ignoring tokens that are too short or too long.
Uses tokenize() internally.
- Parameters
-
doc (str) – Input document.
deacc (bool, optional) – Remove accent marks from tokens using deaccent()?
min_len (int, optional) – Minimum length of token (inclusive). Shorter tokens are discarded.
max_len (int, optional) – Maximum length of token in result (inclusive). Longer tokens are discarded.
- Returns
-
Tokens extracted from doc.
- Return type
-
list of str
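The length filtering can be sketched as below. The regular expression is an assumption standing in for the library's tokenizer (it matches runs of letters and excludes digits and underscores), accent removal is omitted, and the names ALPHABETIC_RUN and preprocess_sketch are hypothetical.

```python
import re

# Assumption: a token is a run of letters; digits and underscores split tokens.
ALPHABETIC_RUN = re.compile(r"[^\W\d_]+")


def preprocess_sketch(doc, min_len=2, max_len=15):
    """Lowercase doc and keep tokens with min_len <= length <= max_len."""
    return [
        token
        for token in ALPHABETIC_RUN.findall(doc.lower())
        if min_len <= len(token) <= max_len
    ]


print(preprocess_sketch("A quick brown fox, 42 times!"))
```

Both bounds are inclusive, so a two-letter token survives the default min_len=2.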
- gensim.utils.simple_tokenize(text)¶
-
Tokenize input text using gensim.utils.PAT_ALPHABETIC.
- Parameters
-
text (str) – Input text.
- Yields
-
str – Tokens from text.
- gensim.utils.smart_extension(fname, ext)¶
-
Append a file extension ext to fname, while keeping compressed extensions like .bz2 or .gz (if any) at the end.
- Parameters
-
fname (str) – Filename or full path.
ext (str) – Extension to append before any compression extensions.
- Returns
-
New path to file with ext appended.
- Return type
-
str
Examples
>>> from gensim.utils import smart_extension
>>> smart_extension("my_file.pkl.gz", ".vectors")
'my_file.pkl.vectors.gz'
- gensim.utils.strided_windows(ndarray, window_size)¶
-
Produce a numpy.ndarray of windows, as from a sliding window.
- Parameters
-
ndarray (numpy.ndarray) – Input array
window_size (int) – Sliding window size.
- Returns
-
Subsequences produced by sliding a window of the given size over the ndarray. Since this uses striding, the individual arrays are views rather than copies of ndarray. Changes to one view modify the others and the original.
- Return type
-
numpy.ndarray
Examples
>>> import numpy as np
>>> from gensim.utils import strided_windows
>>> strided_windows(np.arange(5), 2)
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4]])
>>> strided_windows(np.arange(10), 5)
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8],
       [5, 6, 7, 8, 9]])
- gensim.utils.synchronous(tlockname)¶
-
A decorator to place an instance-based lock around a method.
Notes
Adapted from http://code.activestate.com/recipes/577105-synchronization-decorator-for-class-methods/.
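An instance-based locking decorator of this kind can be sketched as follows. The sketch assumes the decorator argument is the name of an attribute holding a lock on the instance; the names synchronized, Counter and worker are hypothetical.

```python
import threading
from functools import wraps


def synchronized(lock_name):
    """Decorator factory: run the method while holding self.<lock_name>."""
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            # Each instance carries its own lock, so instances do not block each other.
            with getattr(self, lock_name):
                return method(self, *args, **kwargs)
        return wrapper
    return decorator


class Counter:
    def __init__(self):
        self.lock = threading.RLock()
        self.value = 0

    @synchronized("lock")
    def increment(self):
        self.value += 1


def worker(counter, repeats):
    for _ in range(repeats):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter.value)
```

A reentrant lock (RLock) is used so that one synchronized method can call another on the same instance without deadlocking.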
- gensim.utils.to_unicode(text, encoding='utf8', errors='strict')¶
-
Convert text (bytestring in given encoding or unicode) to unicode.
- Parameters
-
text (str) – Input text.
errors (str, optional) – Error handling behaviour if text is a bytestring.
encoding (str, optional) – Encoding of text if it is a bytestring.
- Returns
-
Unicode version of text.
- Return type
-
str
- gensim.utils.to_utf8(text, errors='strict', encoding='utf8')¶
-
Convert a unicode or bytes string in the given encoding into a utf8 bytestring.
- Parameters
-
text (str) – Input text.
errors (str, optional) – Error handling behaviour if text is a bytestring.
encoding (str, optional) – Encoding of text if it is a bytestring.
- Returns
-
Bytestring in utf8.
- Return type
-
bytes
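The two conversions mirror each other and can be sketched in a few lines. This is an illustrative sketch assuming Python 3, where text is str and encoded data is bytes; the names to_text and to_utf8_bytes are hypothetical.

```python
def to_text(text, encoding="utf8", errors="strict"):
    """Return text as str, decoding it first if it is a bytestring."""
    if isinstance(text, str):
        return text
    return str(text, encoding, errors=errors)


def to_utf8_bytes(text, errors="strict", encoding="utf8"):
    """Return text as a UTF-8 bytestring, re-encoding from `encoding` if needed."""
    if isinstance(text, str):
        return text.encode("utf8")
    # Decode from the source encoding first, then encode as UTF-8.
    return str(text, encoding, errors=errors).encode("utf8")


latin1_bytes = "café".encode("latin1")
print(to_text(b"caf\xc3\xa9"), to_utf8_bytes(latin1_bytes, encoding="latin1"))
```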
- gensim.utils.tokenize(text, lowercase=False, deacc=False, encoding='utf8', errors='strict', to_lower=False, lower=False)¶
-
Iteratively yield tokens as unicode strings, optionally removing accent marks and lowercasing them.
- Parameters
-
text (str or bytes) – Input string.
deacc (bool, optional) – Remove accentuation using deaccent()?
encoding (str, optional) – Encoding of input string, used as parameter for to_unicode().
errors (str, optional) – Error handling behaviour, used as parameter for to_unicode().
lowercase (bool, optional) – Lowercase the input string?
to_lower (bool, optional) – Same as lowercase. Convenience alias.
lower (bool, optional) – Same as lowercase. Convenience alias.
- Yields
-
str – Contiguous sequences of alphabetic characters (no digits!), using simple_tokenize().
Examples
>>> from gensim.utils import tokenize
>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
['Nic', 'nemuze', 'letet', 'rychlosti', 'vyssi', 'nez', 'tisic', 'kilometru', 'za', 'sekundu']
- gensim.utils.toptexts(query, texts, index, n=10)¶
-
Debug function to help inspect the top n most similar documents (according to a similarity index index), to see if they are actually related to the query.
- Parameters
-
query ({list of (int, number), numpy.ndarray}) – Vector OR BoW (list of tuples).
texts (str) – Object that can return something insightful for each document via texts[docid], such as its fulltext or snippet.
index (any) – An instance from gensim.similarities.docsim.
- Returns
-
a list of 3-tuples (docid, doc’s similarity to the query, texts[docid])
- Return type
-
list
- gensim.utils.trim_vocab_by_freq(vocab, topk, trim_rule=None)¶
-
Retain topk most frequent words in vocab. If there are more words with the same frequency as topk-th one, they will be kept. Modifies vocab in place, returns nothing.
- Parameters
-
vocab (dict) – Input dictionary.
topk (int) – Number of words with highest frequencies to keep.
trim_rule (function, optional) – Function for trimming entities from vocab, default behaviour is vocab[w] <= min_count.
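The tie-keeping behaviour can be sketched by finding the count of the topk-th most frequent word and removing everything below it. The sketch ignores trim_rule and assumes topk >= 1; the name keep_top_k is hypothetical.

```python
import heapq


def keep_top_k(vocab, topk):
    """Keep the topk most frequent words in vocab, in place, including ties."""
    if topk >= len(vocab):
        return  # nothing to trim
    # Count of the topk-th most frequent word; ties at this count are kept.
    threshold = heapq.nlargest(topk, vocab.values())[-1]
    for word in [w for w, count in vocab.items() if count < threshold]:
        del vocab[word]


vocab = {"a": 5, "b": 3, "c": 3, "d": 1}
keep_top_k(vocab, 2)
print(vocab)
```

Here three words survive even though topk is 2, because "b" and "c" share the threshold count.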
- gensim.utils.unpickle(fname)¶
-
Load object from fname, using smart_open so that fname can be on S3, HDFS, compressed etc.
- Parameters
-
fname (str) – Path to pickle file.
- Returns
-
Python object loaded from fname.
- Return type
-
object
- gensim.utils.upload_chunked(server, docs, chunksize=1000, preprocess=None)¶
-
Memory-friendly upload of documents to a SimServer (or Pyro SimServer proxy).
Notes
Use this function to train or index large collections; it avoids sending the entire corpus over the wire as a single Pyro in-memory object. The documents will be sent in smaller chunks, of chunksize documents each.
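The chunking step can be sketched with itertools.islice. This sketch covers only the splitting of a document stream into chunks; how each chunk is sent to the server is not shown, and the name iter_chunks is hypothetical.

```python
from itertools import islice


def iter_chunks(docs, chunksize=1000):
    """Yield successive lists of at most chunksize documents from docs."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


for chunk in iter_chunks(range(5), chunksize=2):
    # A real caller would send each chunk to the server here.
    print(chunk)
```

Only one chunk is held in memory at a time, so docs can be a generator over a corpus that does not fit in RAM.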